<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=Content-Type content="text/html; charset=US-ASCII">
<META content="MSHTML 6.00.2900.2604" name=GENERATOR></HEAD>
<BODY id=role_body style="FONT-SIZE: 10pt; COLOR: #000000; FONT-FAMILY: Arial"
bottomMargin=7 leftMargin=7 topMargin=7 rightMargin=7><FONT id=role_document
face=Arial color=#000000 size=2>
<DIV>Hello</DIV>
<DIV> </DIV>
<DIV>I would be grateful for any help with this.</DIV>
<DIV> </DIV>
<DIV>I want to pull an ID number (a UniProt protein accession number) from a file
using a regex. This works OK.</DIV>
<DIV>I then want to use the number as part of a URL to fetch the relevant page,
so I can parse some information about the protein from that page.</DIV>
<DIV>The code is very basic.</DIV>
<DIV> </DIV>
<DIV>My Perl script:</DIV>
<DIV> </DIV>
<DIV>#!/usr/bin/perl</DIV>
<DIV><BR># A script to pull out an id number from a file using a regex.<BR># The
id number(s) are put into an array @accnumber.<BR># The file I read in is
html_test2.txt (attached to this mail).<BR># Then use the id number as part of a
URL to get and store a webpage.<BR># In this case, to simplify things, I just want
to take the first<BR># element of the @accnumber array and use that in the
URL.</DIV>
<DIV><BR>use LWP::Simple;</DIV>
<DIV><BR>$a = 0;</DIV>
<DIV> </DIV>
<DIV> #ask for the file name </DIV>
<DIV> </DIV>
<DIV>print "please enter file name", "\n"; </DIV>
<DIV> </DIV>
<DIV> #open and read the file</DIV>
<DIV> </DIV>
<DIV><BR>$filename1 = &lt;&gt;;</DIV>
<DIV> </DIV>
<DIV>open fileone, "$filename1"<BR> or die;</DIV>
<DIV> </DIV>
<DIV>while (!eof(fileone))</DIV>
<DIV> </DIV>
<DIV> {</DIV>
<DIV> </DIV>
<DIV>my $line = &lt;fileone&gt;;</DIV>
<DIV> </DIV>
<DIV><BR>if ( $line =~/UNIPROT:?\w+\s(\w{6})\s/)</DIV>
<DIV> </DIV>
<DIV>{</DIV>
<DIV> </DIV>
<DIV>@accnumber[$a]= $1."\n";<BR>$a++;</DIV>
<DIV> </DIV>
<DIV>}</DIV>
<DIV> </DIV>
<DIV> }</DIV>
<DIV> </DIV>
<DIV><BR>close fileone;</DIV>
<DIV> </DIV>
<DIV><BR>$query_number = @accnumber[0]; <BR> <BR> #as
a sanity check I print the number to STDOUT</DIV>
<DIV> </DIV>
<DIV>print $query_number;</DIV>
<DIV> </DIV>
<DIV> #I call the subroutine to return the webpage</DIV>
<DIV> </DIV>
<DIV>get_page($query_number);</DIV>
<DIV> </DIV>
<DIV><BR>sub get_page {</DIV>
<DIV> </DIV>
<DIV>my $address = $_[0];</DIV>
<DIV> </DIV>
<DIV><BR>my $url =
'http://www.ebi.uniprot.org/uniprot-srv/xmlView.do?proteinId='<BR>.$address<BR>.'_ORYSA&pager.offset=0';</DIV>
<DIV> </DIV>
<DIV> </DIV>
<DIV> </DIV>
<DIV>my $html_file = 'page.html';<BR>my $status = getstore($url,
$html_file);<BR>die "No URL::Error" unless
is_success($status);</DIV>
<DIV> </DIV>
<DIV> }</DIV>
<DIV><BR>exit;</DIV>
<DIV> </DIV>
<DIV>And here is the text file I run the regex against:</DIV>
<DIV> </DIV>
<DIV>BLASTP 2.0MP-WashU [13-Dec-2004] [decunix5.0a-ev6-IP32LF64
2004-12-15T17:03:39]</DIV>
<DIV> </DIV>
<DIV>Copyright (C) 1996-2004 Washington University, Saint Louis, Missouri
USA.<BR>All Rights Reserved.</DIV>
<DIV> </DIV>
<DIV>Reference: Gish, W. (1996-2004) <A
href="http://blast.wustl.edu">http://blast.wustl.edu</A></DIV>
<DIV> </DIV>
<DIV>Query= 24061 17154533 emb|CAC80823.1 (AJ251791) putative IAA1
protein [Oryza<BR> sativa]
1e-130 235 236 99.5% top
hit<BR> (237 letters; record 1)</DIV>
<DIV> </DIV>
<DIV>Database:
uniprot<BR>
1,880,849 sequences; 604,459,357 total
letters.<BR>Searching....10....20....30....40....50....60....70....80....90....100%
done</DIV>
<DIV> </DIV>
<DIV>
Smallest<BR>
Sum<BR>
High Probability<BR>Sequences producing High-scoring Segment
Pairs:
Score P(N) N</DIV>
<DIV> </DIV>
<DIV>UNIPROT:Q75KX3_ORYSA Q84PD9 Putative auxin-responsive pro...
1203 1.2e-121 1</DIV>
<DIV> </DIV>
<DIV> </DIV>
<DIV> </DIV>
<DIV><BR>All Rights Reserved.</DIV>
<DIV> </DIV>
<DIV>Reference: Gish, W. (1996-2004) </DIV>
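<DIV>&nbsp;</DIV>
<DIV>To make the two steps concrete, here is a minimal standalone sketch of what I
expect to happen, with no file or network access. The sub names extract_accession
and build_url are made up for this illustration; the regex and the URL pieces are
copied from my script, and the hit line is the UNIPROT line from the file above.
Note that this sketch keeps the accession without a trailing newline, which I
assume matters once the value is embedded in a URL.</DIV>
<PRE>
```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the six-character accession from a BLAST hit line, or undef
# if the line does not match. Same pattern as in my script; the
# captured value has no trailing newline.
sub extract_accession {
    my ($line) = @_;
    return $line =~ /UNIPROT:?\w+\s(\w{6})\s/ ? $1 : undef;
}

# Build the xmlView.do URL from an accession
# (URL pieces copied from my script, '_ORYSA' suffix included).
sub build_url {
    my ($acc) = @_;
    return 'http://www.ebi.uniprot.org/uniprot-srv/xmlView.do?proteinId='
         . $acc
         . '_ORYSA&pager.offset=0';
}

# Sample hit line taken from html_test2.txt.
my $hit = 'UNIPROT:Q75KX3_ORYSA Q84PD9 Putative auxin-responsive pro...';
my $acc = extract_accession($hit);
my $url = defined $acc ? build_url($acc) : 'no accession found';
print "$url\n";
```
</PRE>
<DIV>Printing the URL before calling getstore is the same kind of sanity check as
printing $query_number in my script, but it shows the whole string that would
actually be requested.</DIV>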
<DIV>&nbsp;</DIV>
<DIV>Thanks for any help.</DIV>
<DIV> </DIV>
<DIV> </DIV>
<DIV> </DIV></FONT></BODY></HTML>