[BiO BB] problem with LWP::Simple
    DMUTANTZ at aol.com 
       
    Sun Jun 26 13:27:35 EDT 2005
    
    
  
Hello
 
I would be grateful for any help with this.
 
I want to pull an ID number (a UniProt protein accession number) out of a file
using a regex. This works OK.
I then want to use that number as part of a URL to fetch the relevant page,
so that I can parse some information about the protein from it.
The code is very basic.
 
My Perl script:
 
#!/usr/bin/perl
# A script to pull out an id number from a file using a regex.
# The id number(s) are put into an array @accnumber.
# The file I read in is html_test2.txt (attached to this mail).
# Then use the id number as part of a url to get and store a webpage.
# In this case, to simplify things, I just want to take the first
# element of the @accnumber array and use that in the url.
use LWP::Simple;
$a = 0;

# ask for the file name
print "please enter file name", "\n";

# open and read the file
$filename1 = <>;

open fileone, "$filename1"
    or die;

while (!eof(fileone))
{
    my $line = <fileone>;

    if ( $line =~ /UNIPROT:?\w+\s(\w{6})\s/ )
    {
        @accnumber[$a] = $1."\n";
        $a++;
    }
}

close fileone;

$query_number = @accnumber[0];

# as a sanity check I print the number to STDOUT
print $query_number;

# I call the subroutine to return the webpage
get_page($query_number);

sub get_page {
    my $address = $_[0];

    my $url = 'http://www.ebi.uniprot.org/uniprot-srv/xmlView.do?proteinId='
        .$address
        .'_ORYSA&pager.offset=0';

    my $html_file = 'page.html';
    my $status = getstore($url, $html_file);
    die "No _URL::Error_ (:Error) " unless is_success($status);
}
exit;
 
And here is the text file I run the regex against:
 
BLASTP 2.0MP-WashU [13-Dec-2004] [decunix5.0a-ev6-IP32LF64 2004-12-15T17:03:39]

Copyright (C) 1996-2004 Washington University, Saint Louis, Missouri USA.
All Rights Reserved.

Reference:  Gish, W. (1996-2004) http://blast.wustl.edu

Query=  24061 17154533 emb|CAC80823.1 (AJ251791) putative IAA1 protein [Oryza
        sativa] 1e-130 235 236 99.5% top hit
        (237 letters; record 1)

Database:  uniprot
           1,880,849 sequences; 604,459,357 total letters.
Searching....10....20....30....40....50....60....70....80....90....100% done
 
                                                                     Smallest
                                                                        Sum
                                                              High  Probability
Sequences producing High-scoring Segment Pairs:              Score  P(N)      N

UNIPROT:Q75KX3_ORYSA Q84PD9 Putative auxin-responsive pro...  1203  1.2e-121  1
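
To separate the regex step from the download, the extraction and the URL
construction can be checked offline against the hit line above. The following
is only a sketch: build_url is a hypothetical helper that copies the string
concatenation from get_page, and the second pattern is an assumed alternative
that the original script does not use. Nothing is fetched from the network.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample hit line copied from the BLAST report above.
my $line = "UNIPROT:Q75KX3_ORYSA Q84PD9 Putative auxin-responsive pro...  1203  1.2e-121  1\n";

# build_url() is a hypothetical helper that mirrors the string
# concatenation in get_page(); it makes no network request.
sub build_url {
    my ($accession) = @_;
    return 'http://www.ebi.uniprot.org/uniprot-srv/xmlView.do?proteinId='
         . $accession
         . '_ORYSA&pager.offset=0';
}

# Same pattern as the script: the capture group is the six-character
# word that follows the first whitespace after the UNIPROT: entry name.
my ($captured) = $line =~ /UNIPROT:?\w+\s(\w{6})\s/;

# Assumed alternative: capture the six characters directly after "UNIPROT:".
my ($primary) = $line =~ /UNIPROT:(\w{6})_/;

# The script stores $1 . "\n", so that newline is carried into the URL.
my $url_as_posted = build_url($captured . "\n");
my $url_stripped  = build_url($captured);

print "captured by posted regex: $captured\n";
print "captured by alternative:  $primary\n";
print "URL contains a newline:   ", ($url_as_posted =~ /\n/ ? "yes" : "no"), "\n";
print "URL without the newline:  $url_stripped\n";
```

If the stored page is not the expected entry, comparing the two captured
accessions and checking the URL for embedded whitespace may be a reasonable
first step before looking at LWP::Simple itself.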
 
 
 
 
 
Thanks for any help.
 
 
 