<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

<HTML><HEAD>

<META http-equiv=Content-Type content="text/html; charset=US-ASCII">

<META content="MSHTML 6.00.2900.2604" name=GENERATOR></HEAD>

<BODY id=role_body style="FONT-SIZE: 10pt; COLOR: #000000; FONT-FAMILY: Arial" 

bottomMargin=7 leftMargin=7 topMargin=7 rightMargin=7><FONT id=role_document 

face=Arial color=#000000 size=2>

<DIV>Hello</DIV>

<DIV> </DIV>

<DIV>I would be grateful for any help with this.</DIV>

<DIV> </DIV>

<DIV>I want to pull an id number (UniProt protein accession number) from a file 

using a regex.  This works OK.</DIV>

<DIV>I then want to use the number as part of a URL to fetch the relevant page, 

so I can parse some information about the protein from that page.</DIV>

<DIV>The code is very basic.</DIV>

<DIV> </DIV>

<DIV>My Perl script:</DIV>

<DIV> </DIV>

<DIV>#!/usr/bin/perl</DIV>

<DIV><BR># A script to pull out an id number from a file using a regex.<BR>#The 

id number(s) are put into an array @accnumber.<BR>#The file I read in is 

html_test2.txt (attached to this mail).<BR>#Then use the id number as part of a 

url to get and store a webpage.<BR>#In this case to simplify things I just want 

to take the first <BR>#element of the @accnumber array and use that in the 

url</DIV>

<DIV><BR>use LWP::Simple;</DIV>

<DIV><BR>$a = 0;</DIV>

<DIV> </DIV>

<DIV>    #ask for the file name </DIV>

<DIV> </DIV>

<DIV>print "please enter file name", "\n"; </DIV>

<DIV> </DIV>

<DIV>    #open and read the file</DIV>

<DIV> </DIV>

<DIV><BR>$filename1 = <>;</DIV>

<DIV> </DIV>

<DIV>open fileone,  "$filename1"<BR> or die;</DIV>

<DIV> </DIV>

<DIV>while (!eof(fileone))</DIV>

<DIV> </DIV>

<DIV> {</DIV>

<DIV> </DIV>

<DIV>my $line = <fileone>;</DIV>

<DIV> </DIV>

<DIV><BR>if ( $line =~/UNIPROT:?\w+\s(\w{6})\s/)</DIV>

<DIV> </DIV>

<DIV>{</DIV>

<DIV> </DIV>

<DIV>@accnumber[$a]= $1."\n";<BR>$a++;</DIV>

<DIV> </DIV>

<DIV>}</DIV>

<DIV> </DIV>

<DIV> }</DIV>

<DIV> </DIV>

<DIV><BR>close fileone;</DIV>

<DIV> </DIV>

<DIV><BR>$query_number = @accnumber[0]; <BR> <BR>   #as 

a sanity check I print the number to STDOUT</DIV>

<DIV> </DIV>

<DIV>print $query_number;</DIV>

<DIV> </DIV>

<DIV>   #I call the subroutine to return the webpage</DIV>

<DIV> </DIV>

<DIV>get_page($query_number);</DIV>

<DIV> </DIV>

<DIV><BR>sub get_page {</DIV>

<DIV> </DIV>

<DIV>my $address = $_[0];</DIV>

<DIV> </DIV>

<DIV><BR>my $url = 

'http://www.ebi.uniprot.org/uniprot-srv/xmlView.do?proteinId='<BR>.$address<BR>.'_ORYSA&pager.offset=0';</DIV>

<DIV> </DIV>

<DIV> </DIV>

<DIV> </DIV>

<DIV>my $html_file = 'page.html';<BR>my $status = getstore($url, 

$html_file);<BR>die "No URL::Error" unless 

is_success($status);</DIV>

<DIV> </DIV>

<DIV> }</DIV>

<DIV><BR>exit;</DIV>

<DIV> </DIV>

<DIV>And the text file I parse with the regex:</DIV>

<DIV> </DIV>

<DIV>BLASTP 2.0MP-WashU [13-Dec-2004] [decunix5.0a-ev6-IP32LF64 

2004-12-15T17:03:39]</DIV>

<DIV> </DIV>

<DIV>Copyright (C) 1996-2004 Washington University, Saint Louis, Missouri 

USA.<BR>All Rights Reserved.</DIV>

<DIV> </DIV>

<DIV>Reference:  Gish, W. (1996-2004) <A 

href="http://blast.wustl.edu">http://blast.wustl.edu</A></DIV>

<DIV> </DIV>

<DIV>Query=  24061  17154533 emb|CAC80823.1 (AJ251791) putative IAA1 

protein [Oryza<BR>    sativa] 

 1e-130 235 236 99.5% top 

hit<BR>        (237 letters; record 1)</DIV>

<DIV> </DIV>

<DIV>Database:  

uniprot<BR>           

1,880,849 sequences; 604,459,357 total 

letters.<BR>Searching....10....20....30....40....50....60....70....80....90....100% 

done</DIV>

<DIV> </DIV>

<DIV>                                                                     

Smallest<BR>                                                                       

Sum<BR>                                                              

High  Probability<BR>Sequences producing High-scoring Segment 

Pairs:              

Score  P(N)      N</DIV>

<DIV> </DIV>

<DIV>UNIPROT:Q75KX3_ORYSA Q84PD9 Putative auxin-responsive pro...  

1203  1.2e-121  1</DIV>
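<DIV>For reference, the two steps described above (capture the accession, then build the URL) can be sketched in isolation against the sample hit line. This is only a minimal sketch, not a fix confirmed against the server: the helper names extract_accession and build_url are hypothetical, the regex and URL pattern are copied unchanged from the script, and it assumes the accession should reach the URL without the trailing "\n" that the script appends when it stores the match.</DIV>

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the six-character accession that follows the
# UNIPROT entry name on a BLAST hit line, or undef if the line does not match.
# The regex is the one used in the script above.
sub extract_accession {
    my ($line) = @_;
    return $line =~ /UNIPROT:?\w+\s(\w{6})\s/ ? $1 : undef;
}

# Hypothetical helper: build the query URL from an accession.
# Leading and trailing whitespace is stripped first, so a stored "\n"
# cannot end up in the middle of the URL.
sub build_url {
    my ($accession) = @_;
    $accession =~ s/^\s+|\s+$//g;
    return 'http://www.ebi.uniprot.org/uniprot-srv/xmlView.do?proteinId='
         . $accession
         . '_ORYSA&pager.offset=0';
}

# The hit line from the sample file.
my $sample = 'UNIPROT:Q75KX3_ORYSA Q84PD9 Putative auxin-responsive pro...  1203  1.2e-121  1';

my $accession = extract_accession($sample);
defined $accession or die "No accession found in sample line\n";

# Print both values so the URL can be checked by eye before fetching it.
print "$accession\n";
print build_url($accession), "\n";
```

<DIV>Printing the finished URL before passing it to getstore makes it easy to see whether it is a single unbroken line.</DIV>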

<DIV> </DIV>

<DIV> </DIV>

<DIV> </DIV>


<DIV>Thanks for any help.</DIV>

<DIV> </DIV>

<DIV> </DIV>

<DIV> </DIV></FONT></BODY></HTML>