[ssml] Finding Matches using N-term & C-term sequences
Kevin Karplus
karplus at soe.ucsc.edu
Tue Dec 9 23:09:32 EST 2003
With 15 N-terminal residues and 3 internal peptides of unspecified
length, you may or may not be able to identify homologs. You don't
want to compute E-values for the 4 small searches separately, since
almost nothing will come up as significant. You want to combine the
scores
from the separate searches. You can get a rough approximation by
saying that the p-value for finding the same sequence from query A and
query B is roughly the product of the p-values (this isn't quite
right, but is probably close enough for your purposes). The E-value
is just the p-value times the effective size of the database being
searched.
Note that the product of E-values would have to be scaled down by the
effective size of the database, so two queries each hitting the same
sequence with E-value 1 give a combined E-value not of 1, but of
1/(database size). Getting two or three hits on the same sequence with
your different queries will quickly become significant, even if the
individual E-values are around 100.
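
As a rough sketch of that arithmetic: if each query's p-value is taken
to be about E/N (N being the effective database size) and the p-values
are simply multiplied, then k hits on the same sequence combine to the
product of the E-values divided by N^(k-1). The function name and the
database size of one million below are made up for illustration, and
the result is only as good as the product-of-p-values approximation.

```python
from math import prod

def combined_evalue(evalues, db_size):
    """Approximate E-value for one sequence hit by several separate queries.

    Assumes p_i ~ E_i / db_size and that the combined p-value is roughly
    the product of the p_i (the rough approximation described above).
    """
    k = len(evalues)
    # p_combined = prod(E_i) / db_size**k, and E = p * db_size
    return prod(evalues) / db_size ** (k - 1)

# Hypothetical effective database size, for illustration only.
N = 1_000_000

two_weak_hits = combined_evalue([1, 1], N)            # 1 / N
three_poor_hits = combined_evalue([100, 100, 100], N)  # 100**3 / N**2
print(two_weak_hits, three_poor_hits)
```

With that made-up N, three hits at E-value 100 each come out no worse
than two hits at E-value 1 each.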
If you know the order of the internal peptides, you can build a
profile HMM for the sequence with high-probability inserts between the
fragments, and use HMM tools to look for homologs. If you don't know
the order, there are only 6 possible orders of the 3 internal
peptides (3! = 6), so you could just try all 6.
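
Enumerating the orders is easy to script. This is only a sketch of the
bookkeeping: the fragment names are placeholders rather than real
peptides, and it assumes the N-terminal fragment always comes first.
Building and searching with the profile HMM for each order is left to
whatever HMM tools you use.

```python
from itertools import permutations

def candidate_orders(n_term, internal_peptides):
    """Yield each fragment order, keeping the N-terminal fragment first."""
    for order in permutations(internal_peptides):
        yield (n_term, *order)

# Placeholder labels standing in for the real fragment sequences.
orders = list(candidate_orders("NTERM", ["PEP1", "PEP2", "PEP3"]))
for order in orders:
    print(" ... ".join(order))  # "..." marks an insert region between fragments
```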
Kevin Karplus karplus at soe.ucsc.edu http://www.soe.ucsc.edu/~karplus
Professor of Computer Engineering, University of California, Santa Cruz
Undergraduate and Graduate Director, Bioinformatics
Affiliations for identification only.