- Code to work with GenBank
- http://www.ncbi.nlm.nih.gov/
Classes:
Iterator Iterate through a file of GenBank entries
Dictionary Access a GenBank file using a dictionary interface.
ErrorParser Catch errors caused during parsing.
FeatureParser Parse GenBank data in Seq and SeqFeature objects.
RecordParser Parse GenBank data into a Record object.
NCBIDictionary Access GenBank using a dictionary interface.
_BaseGenBankConsumer A base class for GenBank consumers that implements
some helpful functions that are common to all
consumers.
_FeatureConsumer Create SeqFeature objects from info generated by
the Scanner
_RecordConsumer Create a GenBank record object from Scanner info.
_PrintingConsumer A debugging consumer.
_Scanner Set up a Martel based GenBank parser to parse a record.
ParserFailureError Exception indicating a failure in the parser (i.e.,
scanner or consumer)
LocationParserError Exception indicating a problem with the spark based
location parser.
Functions:
index_file Get a GenBank file ready to be used as a Dictionary.
search_for Do a query against GenBank.
download_many Download many GenBank records.
Imported modules
from Bio import Alphabet, File, Index, SeqFeature
from Bio.Alphabet import IUPAC
from Bio.GenBank import LocationParser
from Bio.ParserSupport import AbstractConsumer, EventGenerator
from Bio.Seq import Seq
from Bio.SeqFeature import Reference
from Bio.SeqRecord import SeqRecord
from Bio.WWW import NCBI, RequestLimiter
import Martel
from Martel import RecordReader
import Record
import genbank_format
import os
import re
import sgmllib
import string
import urlparse
import utils
from xml.sax import handler
Functions
|
|
_strip_and_combine
download_many
index_file
index_file_db
search_for
|
|
_strip_and_combine
|
_strip_and_combine ( line_list )
Combine multiple lines of content separated by spaces.
This function is used by the EventGenerator callback function to
combine multiple lines of information. The lines are first
stripped to remove whitespace, and then combined so they are separated
by a space. This is a simple-minded way to combine lines, but should
work for most cases.
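The strip-then-join behaviour described above can be sketched as follows. This is a minimal standalone illustration in modern Python, not the module's own source; the name strip_and_combine (without the leading underscore) is hypothetical.

```python
def strip_and_combine(line_list):
    """Strip whitespace from each line, then join the lines with single spaces."""
    # Assumption: empty lines are kept, so they contribute an extra space.
    stripped = [line.strip() for line in line_list]
    return " ".join(stripped)


lines = ["  Homo sapiens breast cancer 1  \n", "\t(BRCA1) mRNA.\n"]
print(strip_and_combine(lines))  # Homo sapiens breast cancer 1 (BRCA1) mRNA.
```

Only leading and trailing whitespace is removed; runs of spaces inside a line are left as they are, which is one reason the approach is described as simple-minded.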
|
|
download_many
|
download_many (
gis,
callback_fn,
broken_fn=None,
db='Nucleotide',
delay=127.0,
batchsize=500,
parser=None,
)
download_many(gis, callback_fn[, delay][, batchsize])
Download many records from GenBank. gis is a list of GenBank
gi's. Each time a record is downloaded, callback_fn is called
with the text of the record. delay is the number of seconds to
wait between requests; the default is 127 seconds. batchsize
is the number of records to request each time. The default is 500
records, which is the maximum NCBI can handle.
This does not check to make sure all gi's are returned. The
client must make sure that the gi's are valid. This may be
implemented in the future.
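The batching and callback pattern described above might look roughly like the sketch below. It is not the module's implementation: download_in_batches, fetch_batch and the sleep parameter are hypothetical names introduced here so the example runs without network access, and waiting only between requests (not before the first one) is an assumption.

```python
import time


def download_in_batches(gis, fetch_batch, callback_fn,
                        delay=127.0, batchsize=500, sleep=time.sleep):
    """Fetch gis in slices of at most batchsize, calling callback_fn per record.

    fetch_batch is a hypothetical callable that takes a list of gi's and
    returns the text of each record it could retrieve.
    """
    for start in range(0, len(gis), batchsize):
        if start > 0:
            sleep(delay)  # assumption: pause between requests only
        batch = gis[start:start + batchsize]
        for record_text in fetch_batch(batch):
            callback_fn(record_text)


# Usage with a stand-in fetcher and no real waiting.
received = []
download_in_batches(
    ["101", "102", "103"],
    fetch_batch=lambda batch: ["record for gi %s" % gi for gi in batch],
    callback_fn=received.append,
    batchsize=2,
    sleep=lambda seconds: None,
)
```

As in the description, nothing here checks that every requested gi came back; a caller who needs that guarantee has to compare the requested list with the records received.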
|
|
index_file
|
index_file (
genbank_file,
index_file,
rec_to_key=None,
)
Index a GenBank file to prepare it for use as a dictionary.
Arguments:
o genbank_file - The name of the GenBank file to be indexed.
o index_file - The name of the index file which will be created.
o rec_to_key - A function object which, when called with a GenBank
record object, will return a key to be used for the record. If no
function is specified, then the accession numbers will be used as
the keys.
Exceptions
|
|
KeyError( "Duplicate key %s found" % key )
KeyError( "Empty sequence key produced" )
ValueError( "%s does not exist" % genbank_file )
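The role of rec_to_key and the two KeyError cases can be illustrated with a small in-memory sketch. It is an assumption-laden stand-in, not the module's code: build_index is a hypothetical name, records are plain dicts with an "accession" entry rather than GenBank Record objects, and the index maps each key to a list position instead of a file offset. The ValueError for a missing file is not covered because no file is read.

```python
def build_index(records, rec_to_key=None):
    """Map a key for each record to its position in records."""
    if rec_to_key is None:
        # Assumption: the default key is the record's accession.
        rec_to_key = lambda rec: rec["accession"]
    index = {}
    for position, rec in enumerate(records):
        key = rec_to_key(rec)
        if not key:
            raise KeyError("Empty sequence key produced")
        if key in index:
            raise KeyError("Duplicate key %s found" % key)
        index[key] = position
    return index


records = [
    {"accession": "U00001", "locus": "LOCUS_A"},
    {"accession": "U00002", "locus": "LOCUS_B"},
]
by_accession = build_index(records)
by_locus = build_index(records, rec_to_key=lambda rec: rec["locus"])
```

Supplying a different rec_to_key changes only how keys are derived; the duplicate and empty-key checks apply either way.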
|
|
|
index_file_db
|
index_file_db (
genbank_file,
db_name,
db_directory,
identifier="locus",
aliases=[ "accession" ],
keywords=[],
always_index=0,
)
Index a GenBank file into a database for quick loading.
WARNING: This is very experimental and subject to change.
It requires the use of Andrew Dalke's mindy.
This is very similar to index_file, but uses a database instead
of a flat file to store the information about the genbank_file.
Arguments:
genbank_file - The GenBank formatted file that we want to index.
db_name - The name of the database to create. This name will allow you
to retrieve the file later.
db_directory - The directory where the database information should be
stored.
identifier - The primary identifier under which records are stored in the
file. This will be used for retrieving them later.
aliases - Secondary identifiers that point to the record. These can
be used for searching if a primary identifier is not found. This is
useful for GenBank since we'll index by a single identifier (the LOCUS
identifier by default) but might want to search by some other
identifier.
keywords - More advanced Mindy features that I'm not positive
how to make full use of right now.
always_index - A flag indicating whether or not to index a file even
if the file appears not to have changed. By default, the function will
try to skip indexing if it thinks the file hasn't changed.
Exceptions
|
|
SystemExit( "You must have mindy installed:\n" + "http://www.biopython.org/~dalke/mindy-0.1.tar.gz" )
|
|
|
search_for
|
search_for (
search,
database='Nucleotide',
max_ids=500,
)
search_for(search[, database][, max_ids])
Search GenBank and return a list of GenBank identifiers (gi's).
search is the search string used to search the database. database
should be either Nucleotide or Protein. max_ids is the maximum
number of ids to retrieve (default 500).
Exceptions
|
|
ValueError, "database must be 'Nucleotide' or 'Protein'"
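The argument checking and the max_ids limit can be sketched as below. This is an illustration only: search_ids and run_query are hypothetical names, run_query stands in for whatever actually contacts NCBI, and treating the database name as case-sensitive is an assumption.

```python
def search_ids(search, run_query, database="Nucleotide", max_ids=500):
    """Validate database, run the query, and return at most max_ids ids."""
    if database not in ("Nucleotide", "Protein"):
        raise ValueError("database must be 'Nucleotide' or 'Protein'")
    # run_query is a hypothetical callable returning an iterable of gi's.
    ids = run_query(search, database)
    return list(ids)[:max_ids]


# Usage with a stand-in query function that returns five fake ids.
fake_query = lambda search, database: ["1", "2", "3", "4", "5"]
first_three = search_ids("BRCA1", fake_query, max_ids=3)
```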
|
|
Classes
Dictionary Allow a GenBank file to be accessed using a dictionary interface.
ErrorParser Parse GenBank files and attempt to catch errors.
FeatureParser Parse GenBank files into Seq + Feature objects.
Iterator Iterator interface to move over a file of GenBank entries one at a time.
LocationParserError Could not properly parse out a location from a GenBank file.
MindyDictionary Access a GenBank file using a dictionary interface, through a Mindy DB.
NCBIDictionary Access GenBank using a read-only dictionary interface.
ParserFailureError Failure caused by some kind of problem in the parser.
RecordParser Parse GenBank files into Record objects.
_BaseGenBankConsumer Abstract GenBank consumer providing useful general functions.
_FeatureConsumer Create a SeqRecord object with Features to return.
_RecordConsumer Create a GenBank Record object from scanner generated information.
_Scanner Start up Martel to do the scanning of the file.