Chapter 4 Advanced
4.1 Sequence Class
4.2 Regression Testing Framework
Biopython has a regression testing framework written Andrew Dalke and ported to PyUnit by Brad Chapman which helps us make sure the code is as bug-free as possible before going out.
4.2.1 Writing a Regression Test
Every module that goes into Biopython should have a test (and should also have documentation!). Let's say you've written a new module called Biospam -- here is what you should do to make a regression test:
-
Write a script called
test_Biospam.py
This script should live in the Tests directory
- The script should test all of the important functionality of the module (the more you test the better your test is, of course!).
- If the script requires files to do the testing, these should go in
the directory Tests/Biospam.
- Write out the test output and verify the output to be correct.
There are two ways to do this:
-
The long way:
Run the script and write its output to a file. On UNIX machines,
you would do something like: python test_Biospam.py > test_Biospam
which would write the output to the file test_Biospam
.
- Manually look at the file
test_Biospam
to make sure the output is correct. When you are sure it is all right and there are no bugs, you need to quickly edit the test_Biospam
file so that the first line is: 'test_Biospam
' (no quotes).
- copy the
test_Biospam
file to the directory Tests/output
- The quick way:
-
Run
python run_tests.py -g test_Biospam.py
. The
regression testing framework is nifty enough that it'll put
the output in the right place in just the way it likes it.
- Go to the output (which should be in
Tests/output/test_Biospam
) and double check the output to make sure it is all correct.
- Now change to the Tests directory and run the regression tests
with
python run_tests.py
. This will run all of the tests, and
you should see your test run (and pass!).
- That's it! Now you've got a nice test for your module.
Congratulations!
4.3 Parser Design
4.3.1 Design Overview
Parsers are built around an event-oriented design that includes
Scanner and Consumer objects.
Scanners take input from a data source and analyze it line by line,
sending off an event whenever it recognizes some information in the
data. For example, if the data includes information about an organism
name, the scanner may generate an organism_name
event whenever it
encounters a line containing the name.
Consumers are objects that receive the events generated by Scanners.
Following the previous example, the consumer receives the
organism_name
event, and the processes it in whatever manner
necessary in the current application.
4.3.2 Events
There are two types of events: info events that tag the location of
information within a data stream, and section events that mark
sections within a stream. Info events are associated with specific
lines within the data, while section events are not.
Section event names must be in the format start_EVENTNAME
and
end_EVENTNAME
where EVENTNAME
is the name of the event.
For example, a FASTA-formatted sequence scanner may generate the
following events:
EVENT NAME ORIGINAL INPUT
begin_sequence
title >gi|132871|sp|P19947|RL30_BACSU 50S RIBOSOMAL PROTEIN L30 (BL27
sequence MAKLEITLKRSVIGRPEDQRVTVRTLGLKKTNQTVVHEDNAAIRGMINKVSHLVSVKEQ
end_sequence
begin_sequence
title >gi|132679|sp|P19946|RL15_BACSU 50S RIBOSOMAL PROTEIN L15
sequence MKLHELKPSEGSRKTRNRVGRGIGSGNGKTAGKGHKGQNARSGGGVRPGFEGGQMPLFQRLPK
sequence RKEYAVVNLDKLNGFAEGTEVTPELLLETGVISKLNAGVKILGNGKLEKKLTVKANKFSASAK
sequence GTAEVI
end_sequence
[...]
(I cut the lines shorter so they'd look nicer in my editor).
The FASTA scanner generated the following events: title
, sequence
,
begin_sequence
, and end_sequence
. Note that the begin_sequence
and end_sequence
events are not associated with any line in the
original input. They are used to delineate separate sequences within
the file.
The events a scanner can send must be specifically defined for each
data format.
4.3.3 'noevent' EVENT
A data file can contain lines that have no meaningful information,
such as blank lines. By convention, a scanner should generate the
"noevent" event for these lines.
4.3.4 Scanners
class Scanner:
def feed(self, handle, consumer):
# Implementation
Scanners should implement a method named 'feed' that takes a file
handle and a consumer. The scanner should read data from the file
handle and generate appropriate events for the consumer.
4.3.5 Consumers
class Consumer:
# event handlers
Consumers contain methods that handle events. The name of the method
is the event that it handles. Info events are passed the line of the
data containing the information, and section events are passed
nothing.
You are free to ignore events that are not interesting for your
application. You should just not implement methods for those events.
All consumers should be derived from the base Consumer class.
An example:
class FASTAConsumer(Consumer):
def title(self, line):
# do something with the title
def sequence(self, line):
# do something with the sequence
def begin_sequence(self):
# a new sequence starts
def end_sequence(self):
# a sequence ends
4.3.6 BLAST
BLAST Scanners produce the following events:
header
version
reference
query_info
database_info
descriptions
description_header
round psi blast
model_sequences psi blast
nonmodel_sequences psi blast
converged psi blast
description
no_hits
alignment
multalign master-slave
title pairwise
length pairwise
hsp
score pairwise
identities pairwise
strand pairwise, blastn
frame pairwise, blastx, tblastn, tblastx
query pairwise
align pairwise
sbjct pairwise
database_report
database
posted_date
num_letters_in_database
num_sequences_in_database
num_letters_searched RESERVED. Currently unused. I've never
num_sequences_searched RESERVED. seen it, but it's in blastool.c..
ka_params
gapped not blastp
ka_params_gap gapped mode (not tblastx)
parameters
matrix
gap_penalties gapped mode (not tblastx)
num_hits
num_sequences
num_extends
num_good_extends
num_seqs_better_e
hsps_no_gap gapped (not tblastx) and not blastn
hsps_prelim_gapped gapped (not tblastx) and not blastn
hsps_prelim_gap_attempted gapped (not tblastx) and not blastn
hsps_gapped gapped (not tblastx) and not blastn
query_length
database_length
effective_hsp_length
effective_query_length
effective_database_length
effective_search_space
effective_search_space_used
frameshift blastx or tblastn or tblastx
threshold
window_size
dropoff_1st_pass
gap_x_dropoff
gap_x_dropoff_final gapped (not tblastx) and not blastn
gap_trigger
blast_cutoff
4.3.7 Enzyme
The Enzyme Scanner produces the following events:
record
identification
description
alternate_name
catalytic_activity
cofactor
comment
disease
prosite_reference
databank_reference
terminator
4.3.8 Fasta
The Fasta Scanner produces the following events:
sequence
title
sequence
4.3.9 Medline
The Online Services Reference Manual documents the MEDLINE format at:
http://www.nlm.nih.gov/pubs/osrm_nlm.html
The Medline scanner produces the following events:
record
undefined
abstract_author
abstract
address
author
call_number
comments
class_update_date
country
entry_date
publication_date
english_abstract
entry_month
gene_symbol
identification
issue_part_supplement
issn
journal_title_code
language
special_list
last_revision_date
mesh_heading
mesh_tree_number
major_revision_date
no_author
substance_name
pagination
personal_name_as_subject
publication_type
number_of_references
cas_registry_number
record_originator
journal_subset
subheadings
secondary_source_id
source
title_abbreviation
title
transliterated_title
unique_identifier
volume_issue
year
pubmed_id
undefined is a special event that is called for every line with a
qualifier not defined in the specification.
4.3.10 Prosite
The Prosite scanner produces the following events:
copyrights
copyright
record
identification
accession
date
description
pattern
matrix
rule
numerical_results
comment
database_reference
pdb_reference
documentation
terminator
The PRODOC scanner produces the following events:
record
accession
prosite_reference
text
reference
4.3.11 SWISS-PROT
The SProt Scanner produces the following events:
record
identification
accession
date
description
gene_name
organism_species
organelle
organism_classification
reference_number
reference_position
reference_comment
reference_cross_reference
reference_author
reference_title
reference_location
comment
database_cross_reference
keyword
feature_table
sequence_header
sequence_data
terminator
The KeyWList scanner produces the following events:
header
keywords
keyword
footer
copyright
4.4 Substitution Matrices
4.4.1 SubsMat
This module provides a class and a few routines for generating substitution matrices, similar to BLOSUM or PAM matrices, but based on user-provided data.
Additionally, you may select a matrix from MatrixInfo.py, a collection of established substitution matrices.
class SeqMat(UserDict.UserDict)
-
Attributes
-
self.data
: a dictionary in the form of {(i1,j1):n1, (i1,j2):n2,...,(ik,jk):nk}
where i, j are alphabet letters, and n is a value.
-
self.alphabet
: a class as defined in Bio.Alphabet
-
self.ab_list
: a list of the alphabet's letters, sorted. Needed mainly for internal purposes
-
self.sum_letters
: a dictionary. {i1: s1, i2: s2,...,in:sn}
where:
-
i: an alphabet letter;
- s: sum of all values in a half-matrix for that letter;
- n: number of letters in alphabet.
- Methods
__init__(self,data=None,alphabet=None,
mat_type=NOTYPE,mat_name='',build_later=0):
-
data
: can be either a dictionary, or another SeqMat instance.
-
alphabet
: a Bio.Alphabet instance. If not provided, construct an alphabet from data.
-
mat_type
: type of matrix generated. One of the following:
-
NOTYPE
- No type defined
- ACCREP
- Accepted Replacements Matrix
- OBSFREQ
- Observed Frequency Matrix
- EXPFREQ
- Expsected Frequency Matrix
- SUBS
- Substitution Matrix
- LO
- Log Odds Matrix
mat_type
is provided automatically by some of SubsMat's functions.
-
mat_name
: matrix name, such as "BLOSUM62" or "PAM250"
-
build_later
: default false. If true, user may supply only alphabet and empty dictionary, if intending to build the matrix later. this skips the sanity check of alphabet size vs. matrix size.
-
entropy(self,obs_freq_mat)
-
obs_freq_mat
: an observed frequency matrix. Returns the matrix's entropy, based on the frequency in obs_freq_mat
. The matrix instance should be LO or SUBS.
-
letter_sum(self,letter)
Returns the sum of all values in the matrix, for the provided letter
-
all_letters_sum(self)
Fills the dictionary attribute self.sum_letters
with the sum of values for each letter in the matrix's alphabet.
-
print_mat(self,f,format="%4d",bottomformat="%4s",alphabet=None)
prints the matrix to file handle f. format
is the format field for the matrix values; bottomformat
is the format field for the bottom row, containing matrix letters. Example output for a 3-letter alphabet matrix:
A 23
B 12 34
C 7 22 27
A B C
The alphabet
optional argument is a string of all characters in the alphabet. If supplied, the order of letters along the axes is taken from the string, rather than by alphabetical order.
- Usage
The following section is layed out in the order by which most people wish to generate a log-odds matrix. Of course, interim matrices can be generated and
investigated. Most people just want a log-odds matrix, that's all.
Generating an Accepted Replacement Matrix
Initially, you should generate an accepted replacement matrix (ARM) from your data. The values in ARM are the counted number of replacements according to your data. The data could be a set of pairs or multiple alignments. So for instance if Alanine was replaced by Cysteine 10 times, and Cysteine by Alanine 12 times, the corresponding ARM entries would be:
('A','C'): 10, ('C','A'): 12
as order doesn't matter, user can already provide only one entry:
('A','C'): 22
A SeqMat instance may be initialized with either a full (first method of counting: 10, 12) or half (the latter method, 22) matrices. A full protein
alphabet matrix would be of the size 20x20 = 400. A half matrix of that alphabet would be 20x20/2 + 20/2 = 210. That is because same-letter entries don't
change. (The matrix diagonal). Given an alphabet size of N:
-
Full matrix size:N*N
- Half matrix size: N(N+1)/2
The SeqMat constructor automatically generates a half-matrix, if a full matrix is passed. If a half matrix is passed, letters in the key should be provided in alphabetical order: ('A','C') and not ('C',A').
At this point, if all you wish to do is generate a log-odds matrix, please go to the section titled Example of Use. The following text describes the nitty-gritty of internal functions, to be used by people who wish to investigate their nucleotide/amino-acid frequency data more thoroughly.
- Generating the observed frequency matrix (OFM)
Use:
OFM = SubsMat._build_obs_freq_mat(ARM)
The OFM is generated from the ARM, only instead of replacement counts, it contains replacement frequencies.
- Generating an expected frequency matrix (EFM)
Use:
EFM = SubsMat._build_exp_freq_mat(OFM,exp_freq_table)
-
exp_freq_table
: should be a FreqTable instance. See section 4.4.2 for detailed information on FreqTable. Briefly, the expected frequency table has the frequencies of appearance for each member of the alphabet. It is
implemented as a dictionary with the alphabet letters as keys, and each letter's frequency as a value. Values sum to 1.
The expected frequency table can (and generally should) be generated from the observed frequency matrix. So in most cases you will generate exp_freq_table
using:
>>> exp_freq_table = SubsMat._exp_freq_table_from_obs_freq(OFM)
>>> EFM = SubsMat._build_exp_freq_mat(OFM,exp_freq_table)
But you can supply your own exp_freq_table
, if you wish
- Generating a substitution frequency matrix (SFM)
Use:
SFM = SubsMat._build_subs_mat(OFM,EFM)
Accepts an OFM, EFM. Provides the division product of the corresponding values.
- Generating a log-odds matrix (LOM)
Use:
LOM=SubsMat._build_log_odds_mat(SFM[,logbase=10,factor=10.0,round_digit=1])
-
Accepts an SFM.
-
logbase
: base of the logarithm used to generate the log-odds values.
-
factor
: factor used to multiply the log-odds values. Each entry is generated by log(LOM[key])*factor And rounded to the round_digit
place after the decimal point, if required.
- Example of use
As most people would want to generate a log-odds matrix, with minimum hassle, SubsMat provides one function which does it all:
make_log_odds_matrix(acc_rep_mat,exp_freq_table=None,logbase=10,
factor=10.0,round_digit=0):
-
acc_rep_mat
: user provided accepted replacements matrix
-
exp_freq_table
: expected frequencies table. Used if provided, if not, generated from the acc_rep_mat
.
-
logbase
: base of logarithm for the log-odds matrix. Default base 10.
-
round_digit
: number after decimal digit to which result should be rounded. Default zero.
4.4.2 FreqTable
FreqTable.FreqTable(UserDict.UserDict)
Attributes:
-
alphabet
: A Bio.Alphabet instance.
-
data
: frequency dictionary
-
count
: count dictionary (in case counts are provided).
- Functions:
-
read_count(f)
: read a count file from stream f. Then convert to frequencies
-
read_freq(f)
: read a frequency data file from stream f. Of course, we then don't have the counts, but it is usually the letter frquencies which are interesting.
- Example of use:
The expected count of the residues in the database is sitting in a file, whitespace delimited, in the following format (example given for a 3-letter alphabet):
A 35
B 65
C 100
And will be read using the FreqTable.read_count(file_handle)
function.
An equivalent frequency file:
A 0.175
B 0.325
C 0.5
Conversely, the residue frequencies or counts can be passed as a dictionary.
Example of a count dictionary (3-letter alphabet):
{'A': 35, 'B': 65, 'C': 100}
Which means that an expected data count would give a 0.5 frequency
for 'C', a 0.325 probability of 'B' and a 0.175 probability of 'A'
out of 200 total, sum of A, B and C)
A frequency dictionary for the same data would be:
{'A': 0.175, 'B': 0.325, 'C': 0.5}
Summing up to 1.
When passing a dictionary as an argument, you should indicate whether it is a count or a frequency dictionary. Therefore the FreqTable class constructor requires two arguments: the dictionary itself, and FreqTable.COUNT or FreqTable.FREQ indicating counts or frequencies, respectively.
Read expected counts. readCount will already generate the frequencies
Any one of the following may be done to geerate the frequency table (ftab):
>>> from SubsMat import *
>>> ftab = FreqTable.FreqTable(my_frequency_dictionary,FreqTable.FREQ)
>>> ftab = FreqTable.FreqTable(my_count_dictionary,FreqTable.COUNT)
>>> ftab = FreqTable.read_count(open('myCountFile'))
>>> ftab = FreqTable.read_frequency(open('myFrequencyFile'))