[ssml] Redundancy in MSA for building HMM

Kevin Karplus karplus at soe.ucsc.edu
Wed Oct 6 12:51:37 EDT 2004


Dan Bolser asked

> The thing is, should you ever build a model to find highly similar sequences,
> or should you just use blast to do that?


If you want only highly similar sequences, BLAST does a pretty good
job and is *much* faster than an HMM.  If you want sequences in the same
superfamily, BLAST often misses a lot and the HMM does a much better job.

If you care about the alignments, an HMM will often produce
better multiple alignments even if it finds the same sequences as
BLAST.

The original question asked about models for "protein domain families
(as defined in SCOP)," which may mean family-level models, or
superfamily, or even fold, depending on how precisely Manisha Goel was
using the term "families".  If one wants to build a model that
recognizes only one family and not other families in the same
superfamily, the usual HMM methods will usually generalize too far.
So far as I know, the best technique for family-level classification
is to build an SVM classifier that uses an HMM to produce the input
vectors for the SVM. (See, for example, Rachel Karchin's Master's
thesis, or her paper

@article{karchin-karplus-haussler02,
         author="Karchin, R. and Karplus, K. and Haussler, D.",
         title="Classifying G-protein Coupled Receptors with support vector machines",
         journal="Bioinformatics",
         year=2002,
         volume=18,
         pages="147-159"
}

).
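
As a rough illustration of the HMM-plus-SVM idea above, here is a toy
Python sketch.  It is not the method from the thesis or the paper.
Assumptions: a gapless per-column profile stands in for a real profile
HMM, the four-letter alphabet and the six sequences are made up, the
input vectors are Fisher-score-style gradients with respect to the
emission probabilities (one common choice; the text above does not say
which vectors to use), and the SVM is a plain linear one trained by
batch subgradient descent.  All function names are hypothetical.

```python
# Toy sketch only: a gapless profile stands in for a real profile HMM (assumption).
ALPHABET = "ACDE"  # hypothetical 4-letter alphabet to keep the vectors small


def build_profile(seqs, pseudocount=1.0):
    """Per-column emission probabilities from aligned, gapless sequences."""
    profile = []
    for j in range(len(seqs[0])):
        counts = {a: pseudocount for a in ALPHABET}
        for s in seqs:
            counts[s[j]] += 1.0
        total = sum(counts.values())
        profile.append({a: counts[a] / total for a in ALPHABET})
    return profile


def score_vector(profile, seq):
    """Fisher-score-style features: indicator / theta[j][a], minus 1.

    The -1 term reflects that each column's probabilities sum to one.
    """
    vec = []
    for j, column in enumerate(profile):
        for a in ALPHABET:
            vec.append((1.0 / column[a] if seq[j] == a else 0.0) - 1.0)
    return vec


def train_linear_svm(xs, ys, lam=0.01, lr=0.1, epochs=200):
    """Batch subgradient descent on the L2-regularized hinge loss."""
    w = [0.0] * len(xs[0])
    b = 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = [lam * wk for wk in w]
        grad_b = 0.0
        for x, y in zip(xs, ys):
            margin = y * (sum(wk * xk for wk, xk in zip(w, x)) + b)
            if margin < 1.0:  # only margin violators contribute to the hinge term
                for k, xk in enumerate(x):
                    grad_w[k] -= y * xk / n
                grad_b -= y / n
        w = [wk - lr * gk for wk, gk in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 or -1 according to the side of the separating hyperplane."""
    return 1 if sum(wk * xk for wk, xk in zip(w, x)) + b >= 0 else -1


# Two made-up "families" in one "superfamily"; they differ only in column 0.
family_a = ["ADEA", "AEDC", "ADDE"]
family_b = ["CDEA", "CEDC", "CDDE"]

# One profile for the whole superfamily supplies the feature vectors.
profile = build_profile(family_a + family_b)
xs = [score_vector(profile, s) for s in family_a + family_b]
ys = [1] * len(family_a) + [-1] * len(family_b)
w, b = train_linear_svm(xs, ys)
```

The profile is built from both families, so on its own it would accept
members of either one; the family-level decision comes from the SVM
trained on the vectors the profile produces.  Calling
predict(w, b, score_vector(profile, seq)) on a new sequence of the same
length gives +1 for the first family and -1 for the second.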


