[ssml] Finding Matches using N-term & C-term sequences

Tue Dec 9 23:09:32 EST 2003

With 15 N-terminal residues and 3 internal peptides of unspecified
length, you may or may not be able to identify homologs.  You don't
want to compute E-values for the 4 small searches separately---almost
nothing will come up as significant.  You want to combine the scores
from the separate searches.  You can get a rough approximation by
saying that the p-value for finding the same sequence from query A and
query B is roughly the product of the p-values (this isn't quite
right, but is probably close enough for your purposes).  The E-value
is just the p-value times the effective size of the database being
searched.

Note that the product of E-values has to be scaled down by the
effective size of the database, so two queries that each hit the same
sequence with E-value 1 give a combined E-value not of 1, but of
1/(database size).  Getting two or three hits on the same sequence with
your different queries will quickly become significant, even if the
individual E-values are around 100.
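
As a rough illustration of the arithmetic above, here is a minimal
Python sketch.  It assumes the approximation used in this message
(p-value = E-value / effective database size, and independent queries
so that p-values multiply); the function name and the database size in
the example are hypothetical, not taken from any search tool.

```python
from math import prod


def combined_evalue(evalues, db_size):
    """Approximate E-value for one target sequence hit by every query.

    Rough rule only: each p-value is taken as E / db_size, the combined
    p-value is the product of the individual p-values, and the combined
    E-value is that product times db_size.
    """
    if not evalues:
        raise ValueError("need at least one E-value")
    if db_size <= 0:
        raise ValueError("db_size must be positive")
    p_combined = prod(e / db_size for e in evalues)
    return p_combined * db_size


if __name__ == "__main__":
    db_size = 1e6  # assumed effective database size, for illustration only
    for hits in ([1, 1], [100, 100], [100, 100, 100]):
        print(hits, combined_evalue(hits, db_size))
```

With an assumed effective database size of 10^6, two hits at E-value
100 on the same sequence come out at a combined E-value of roughly
0.01, and a third such hit lowers it by a further factor of about
10^4.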

If you know the order of the internal peptides, you can build a
profile HMM for the sequence with high-probability inserts between the
fragments, and use HMM tools to look for homologs.  If you don't know
the order, there are only 6 possible orders of the 3 internal peptides
(the N-terminal fragment always comes first), so you could just try
all 6.
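
Enumerating the candidate orders is easy to script.  The sketch below
only lists the fragment orders, with the N-terminal fragment fixed in
first position; the fragment labels are placeholders, and building and
searching with the profile HMM for each order is left to whatever HMM
package you use.

```python
from itertools import permutations


def candidate_orders(n_term, internal):
    """Yield each fragment order: N-terminal fragment first, then one
    permutation of the internal peptides."""
    for perm in permutations(internal):
        yield (n_term, *perm)


# Placeholder labels, not real sequences.
orders = list(candidate_orders("Nterm", ["pepA", "pepB", "pepC"]))

if __name__ == "__main__":
    for order in orders:
        print(" - ".join(order))
```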

Kevin Karplus 	karplus at soe.ucsc.edu	http://www.soe.ucsc.edu/~karplus
Professor of Computer Engineering, University of California, Santa Cruz
Undergraduate and Graduate Director, Bioinformatics
Affiliations for identification only.