Here is a piece of a perl module that we have used for retrieving taxonomy information:

#
# get_docsum_from_ncbi($query_gi_list)
#
# Queries the NCBI database to retrieve the document summary for each
# gi ID in a comma delimited list ($query_gi_list).  This document
# summary will be parsed and its values placed into the global hash
# variable, $xml_docsum.  This hash is keyed on the accession number in
# each sequence ID (this is not the same as the gi ID used to retrieve
# the summary).  Each entry in the hash contains a sub-hash with
# name/value pairs from the document summary.  The main value we're
# interested in is 'TaxId', which is the taxonomy ID of the organism
# the sequence comes from.
#
sub get_docsum_from_ncbi($) {
    my ($query_gi_list) = @_;

    # NCBI will return its document summary in XML format.  Here, we
    # set up an XML::Parser object with callback functions that will
    # be used to parse the XML.  See the callback functions themselves
    # for more information.
    # Note that if you call new XML::Parser(Style => 'Debug'), you
    # can see the full structure of the XML document printed to
    # STDOUT.  If you do this, you will need to comment out the
    # setHandlers call below, since only the default handlers will
    # output the debug info.
    my $xml_parser = new XML::Parser();
    $xml_parser->setHandlers(Start => \&xml_start_handler,
                             Char  => \&xml_char_handler,
                             End   => \&xml_end_handler);

    # NCBI provides tools designed for batch programs to interface
    # with their databases.  They call them "Entrez Programming
    # Utilities", or EUtils for short, and you can find their
    # documentation here:
    # http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html
    #
    # One EUtil is called ESearch, and the following code can use
    # ESearch to query NCBI with a list of accession numbers (in the
    # form 'term' => '<accession>+OR+<accession>') and return a list
    # of gi IDs to be used with other EUtils that only take gi IDs as
    # input.  Unfortunately, the search term is limited in length, and
    # you can only fit about 20 accession numbers per query.  NCBI's
    # usage rules say you can only perform one query every 3 seconds,
    # so it would take about 25 minutes for 10,000 records.  Therefore,
    # I had to drop this idea as a solution, but I leave the code here
    # in comments for future reference.
    #
    #$query_url = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi";
    #$response  = $browser->post($query_url,
    #               [ 'db'         => 'protein',
    #                 'term'       => $$query_accession_nums_ref, # 'NP_828860.1+OR+P05040',
    #                 'usehistory' => 'n',
    #                 'tool'       => NCBI_TOOL,
    #                 'email'      => NCBI_EMAIL,
    #               ]);
    #if (!$response->is_success) {
    #    print STDERR "Error '" . $response->status_line .
    #                 "' retrieving url '" . $query_url . "'.\n";
    #    return "";
    #}
    #
    #$xml_parser->parse($response->content);
    #chop($query_gi_list);   # Cut off trailing comma
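    # For reference, a rough, untested sketch of how the ESearch route
    # above could be batched if it were ever revived: send roughly 20
    # accession numbers per query and sleep 3 seconds between queries
    # to stay within NCBI's usage rules.  The variable names here
    # (@accession_nums, $term, @batch) are placeholders for
    # illustration only, not variables used elsewhere in this module.
    #
    #my @accession_nums = ('NP_828860.1', 'P05040');   # placeholder: the full list of accessions to look up
    #while (my @batch = splice(@accession_nums, 0, 20)) {
    #    my $term = join('+OR+', @batch);
    #    my $response = $browser->post($query_url,
    #                     [ 'db'    => 'protein',
    #                       'term'  => $term,
    #                       'tool'  => NCBI_TOOL,
    #                       'email' => NCBI_EMAIL,
    #                     ]);
    #    $xml_parser->parse($response->content) if $response->is_success;
    #    sleep(3);   # NCBI asks for at most one query every 3 seconds
    #}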
    # Another way to convert accession numbers to gi ids is by using
    # "batch entrez".  This part of NCBI's web site is made to be used
    # by humans, so scripting it and parsing the output is a lot
    # harder.  However, the advantage of batch entrez over ESearch is
    # that it can return up to 10,000 results at a time (at least
    # according to my interpretation of the documentation - I never
    # tested with that many).  Nevertheless, to avoid straining NCBI's
    # systems and parsing result data from a human interface that
    # might change format, we decided it was better to just use the gi
    # ids stored in each a2m file, even though they can become out of
    # date over time.
    #
    # The following is code that can be adapted to getting gi ids
    # using batch Entrez.  The code was originally written as an
    # attempt to find taxonomy ids, before I knew about EUtils.  Note
    # that $$batch_accession_nums_ref needs to be set to a \n
    # delimited list of accession numbers.
    # One warning: if you give batch entrez an accession number it
    # can't find, it may return an error and no results.  I didn't get
    # around to handling that.
    # Also note that if you need to display a next page of results, or
    # results in a different format, extract "WebEnv" and "query_num"
    # to pass on to the next page.  Each new page seems to include new
    # values for these two things.  In other words, one WebEnv can't
    # be used to page through one set of results - each page of the
    # results gives you a new WebEnv.  The EUtils docs say WebEnv is a
    # key into a server side cache that can hold up to 10,000 records.
    #
    # Use LWP's multipart/form-data POST method to upload a batch
    # of accession numbers to look up.
    # Documentation for this is found in question 10 of the LWP FAQ,
    # found here:
    # http://groups.google.com/groups?q=perl+lwp+upload+file&hl=en&lr=lang_en&ie=UTF-8&oe=UTF-8&safe=off&selm=8ek1n2%24ej1%241%40cherry.rt.ru&rnum=2
    # Better documentation here:
    # http://www.perldoc.com/perl5.8.0/lib/HTTP/Request/Common.html
    # but even those docs are not clear on how to upload file content
    # without a file.  The key was found in this post:
    # http://groups.google.com/groups?q=lwp+form-data+undef+content&hl=en&lr=lang_en&ie=UTF-8&oe=UTF-8&safe=off&selm=u6594vsb9c2l1bpp6c1rg6tpf1tcs8o8r9%404ax.com&rnum=2
    #my $request = POST (
    #    'http://www.ncbi.nlm.nih.gov:80/entrez/batchentrez.cgi',
    #    Content_Type => 'form-data',
    #    Content =>
    #    [
    #        cmd_current => '',
    #        cmd         => 'Retrieve',
    #        db          => 'protein',
    #        orig_db     => 'nucleotide',  # This is probably not needed
    #        dispmax     => '10000',       # Untested, but this should make it return up to 10,000 results
    #        # To fake the upload of a file without writing one out requires
    #        # simulating a bunch of extra headers.  To upload a real file is
    #        # much simpler, requiring just one value of ['./batch.entrez'].
    #        file => [undef, 'batch.entrez',
    #                 'Content-Length' => length($$batch_accession_nums_ref),
    #                 'Content-Type'   => 'text/plain',
    #                 'Content'        => $$batch_accession_nums_ref],
    #    ],
    #);
    #$response = $browser->request($request);

    # To find the taxonomy ID for each gi ID, we use the NCBI ESummary
    # utility.  This utility takes a comma delimited list of gi IDs and
    # returns summary data about each one in XML format.  This summary
    # information includes the taxonomy ID, the gi ID, the sequence ID,
    # and some other things we don't care to know.  To see what
    # ESummary returns, try visiting this URL:
    # http://www.ncbi.nih.gov/entrez/eutils/esummary.fcgi?db=protein&id=29837495,114475
    #
    # Documentation for ESummary is here:
    # http://eutils.ncbi.nlm.nih.gov/entrez/query/static/esummary_help.html
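    # The XML that comes back has roughly the shape sketched below.
    # This is paraphrased rather than captured output, so treat the
    # exact set of Item elements as approximate; the values shown are
    # placeholders.  What matters for us is that each <DocSum> carries
    # an <Item Name="TaxId"> element.
    #
    #   <eSummaryResult>
    #     <DocSum>
    #       <Id>29837495</Id>
    #       <Item Name="Caption" Type="String">...</Item>
    #       <Item Name="Gi" Type="Integer">29837495</Item>
    #       <Item Name="TaxId" Type="Integer">...</Item>
    #       ...
    #     </DocSum>
    #     <DocSum>
    #       ...
    #     </DocSum>
    #   </eSummaryResult>
    #
    # The xml_*_handler callbacks (defined elsewhere in this module)
    # walk this structure and record each document summary's name/value
    # pairs in the docsum hash described at the top of this sub.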
    my $query_url = "http://www.ncbi.nih.gov/entrez/eutils/esummary.fcgi";

    # Older versions of LWP (like the one on the Condor cluster) don't
    # support the post() method, so use the older request(...POST())
    # method instead.
    #my $response = $browser->post($query_url,
    my $response = $browser->request(HTTP::Request::Common::POST($query_url,
        [ 'db'    => 'protein',     # Interestingly, the "protein" and "taxonomy" databases both return the same results
          'id'    => $query_gi_list,
          'tool'  => NCBI_TOOL,     # Let NCBI know what script is accessing ESummary
          'email' => NCBI_EMAIL,    # Let NCBI know who to e-mail if the script is causing problems on their servers
        ]));
    if (!$response->is_success) {
        print STDERR "get_docsum_from_ncbi: Error '" . $response->status_line .
                     "' retrieving url '" . $query_url . "'.\n";
        return;
    }

    # Parse the XML response to our ESummary web query.  The callback
    # functions like xml_start_handler will take care of putting all
    # the data into the global $xml_docsum hash.
    $xml_parser->parse($response->content);

    # ESummary is not the only way to get the taxonomy IDs we're
    # looking for.  You can also use EFetch through a URL like this:
    # http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=11036339,1078661&rettype=native&retmode=xml
    # where "11036339,1078661" is the comma separated list of gi ids
    # (although it seems you can also use accession numbers).  The
    # problem with this method is that the amount of information it
    # returns is enormous, and the time it takes to process and return
    # it to you in XML is quite noticeable even on just two records.
    # It goes faster using the default non-xml format like this:
    # http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=11036339,1078661
    # but it still takes a while.  However, if you ever need more data
    # than the taxonomy ID, this may be the only way to get it.
    #
    # I also discovered that you can request multiple ids to be
    # displayed at the same time through the human "query.fcgi"
    # interface using a url like this:
    # http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=protein&list_uids=11036339,1078661&dopt=GenPept
    # Again, I don't trust that this output will remain as stable as
    # the EUtils output, I don't know how many ids you can request at
    # once, and this output is probably based on the EFetch backend
    # data anyway (so you're probably not saving their servers any
    # work), but it's worth looking into.  E-mail info at
    # ncbi.nlm.nih.gov if you really want to be sure what is most
    # efficient on their end.
}

------------------------------------------------------------
This code is not complete---you may need some of the following packages:
------------------------------------------------------------

# LWP (Library for WWW in Perl) is used to retrieve documents
# from web servers.  See
# http://www.perl.com/pub/a/2002/08/20/perlandlwp.html for documentation.
use HTTP::Request::Common;
use LWP::UserAgent;

# There are a bewildering number of modules written to parse XML,
# but I think the following advice found on a newsgroup made the
# most sense:
#
#   Go to http://search.cpan.org/ and search for "XML".
#   Have a look at XML::DOM, XML::Grove, XML::Twig, and maybe
#   XML::Simple, if you don't mind the uglyish data structures it
#   exposes to the user.  XML::QL might be of interest, even though
#   it is still immature.
#   If you are not interested in a tree, but want to just pull out
#   elements while parsing the file, try XML::Node, or XML::Parser
#   directly.
#
# Based on this advice, I have chosen XML::Parser as a standard
# module which works efficiently for what we need here.  We
# don't need to convert the whole document into a memory based
# tree structure, we just need to pull out the few values we need.
# If you do need a tree structure, I would lean towards XML::Grove
# from the list above, though I have little to base that opinion on.
# Note that I'm not too happy with the hoops I had to jump through to
# use XML::Parser's event driven methodology.  The code is not so easy
# to understand.
use XML::Parser;
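Since the callback functions themselves are not shown above, here is a minimal, untested sketch of what xml_start_handler, xml_char_handler, and xml_end_handler might look like, along with an example of reading a taxonomy ID back out afterwards. The helper variable names and the use of the Caption item as the hash key are illustrative assumptions, not the module's actual internals (the real module keys the hash on the accession number parsed out of each sequence ID):

# Sketch only: module-level state shared by the handlers.  $xml_docsum
# is treated here as a hash reference; the other names are placeholders.
our $xml_docsum = {};     # accession (or Caption) => { item name => value, ... }
my  %current_docsum;      # name/value pairs seen so far in the current <DocSum>
my  $current_item = "";   # Name attribute of the <Item> currently being read
my  $current_text = "";   # character data accumulated for that item

# Called by XML::Parser at each opening tag.
sub xml_start_handler {
    my ($expat, $element, %attrs) = @_;
    if ($element eq 'DocSum') {
        %current_docsum = ();
    } elsif ($element eq 'Item') {
        $current_item = $attrs{'Name'};
        $current_text = "";
    }
}

# Called by XML::Parser with each run of character data.
sub xml_char_handler {
    my ($expat, $text) = @_;
    $current_text .= $text if $current_item ne "";
}

# Called by XML::Parser at each closing tag.
sub xml_end_handler {
    my ($expat, $element) = @_;
    if ($element eq 'Item') {
        $current_docsum{$current_item} = $current_text;
        $current_item = "";
    } elsif ($element eq 'DocSum') {
        my $key = $current_docsum{'Caption'};   # stand-in for the accession number
        $xml_docsum->{$key} = { %current_docsum } if defined $key;
    }
}

# Example use, once the handlers and get_docsum_from_ncbi are in place:
#
#   get_docsum_from_ncbi("29837495,114475");
#   foreach my $acc (keys %$xml_docsum) {
#       print "$acc\t" . $xml_docsum->{$acc}{'TaxId'} . "\n";
#   }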