[ssml] Re: Redundancy in MSA for building HMM

Mensur Dlakic mdlakic at montana.edu
Wed Oct 6 13:19:25 EDT 2004


At 10:51 AM 10/6/2004, you wrote:
>If you care about the alignments, an HMM model will often produce
>better multiple alignments even if it finds the same sequences as
>BLAST.

Very true, with one caveat: an HMM will only produce a better alignment 
if the initial alignment used to train it was good to begin with. For 
closely related sequences, this is usually not a problem.

>The original question asked about models for "protein domain families
>(as defined in SCOP)," which may mean family-level models, or
>superfamily, or even fold, depending on how precisely Manisha Goel was
>using the term "families".  If one wants to build a model that
>recognizes only one family and not other families in the same
>superfamily, the usual HMM methods will generally generalize too far.
>So far as I know, the best technique for family-level classification
>is to build an SVM classifier that uses an HMM to produce the input
>vectors for the SVM. (See, for example, Rachel Karchin's Master's
>thesis, or her paper


For those interested in this general subject, I suggest this paper:

Nucleic Acids Res. 2002 Apr 1;30(7):1575-84
An efficient algorithm for large-scale detection of protein families
Enright AJ, Van Dongen S, Ouzounis CA
http://nar.oupjournals.org/cgi/content/full/30/7/1575

and this software:

http://micans.org/mcl/

In general, once you have detected a group of proteins (which may be at the 
superfamily level if one is not inclined to tune the search parameters), 
this algorithm takes the results of an all-against-all BLAST search of that 
group (even for hundreds of proteins this doesn't take very long) and 
clusters the proteins into groups based on various tunable parameters. The 
most useful parameter is probably the inflation value (see the description 
on the page above); high values of this parameter are likely to generate 
clusters that are pretty close to family classifications. This is probably 
not as sophisticated as the method Kevin suggested, but it works very well 
in most cases and is extremely fast.
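
To make the role of the inflation value a bit more concrete, here is a toy 
Python sketch of the expansion/inflation iteration behind this kind of 
clustering. It is only an illustration under my own assumptions, not the 
code of the mcl program: the function names, the (query, hit, score) edge 
format, the self-loop weighting and the made-up scores are all 
hypothetical, and a real run would read the scores from the BLAST output.

```python
def normalize_columns(m):
    """Scale every column so it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    for j in range(n):
        total = sum(m[i][j] for i in range(n))
        if total > 0:
            for i in range(n):
                m[i][j] /= total
    return m


def expand(m):
    """Expansion step: square the matrix (flow along two-step walks)."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, r):
    """Inflation step: raise entries to the power r, then renormalize.
    Larger r favours strong links more, giving finer clusters."""
    return normalize_columns([[x ** r for x in row] for row in m])


def toy_mcl(edges, inflation=2.0, iterations=50, eps=1e-6):
    """Cluster proteins from (query, hit, score) triples.
    A fixed iteration count is used for simplicity (an assumption;
    no convergence test is done)."""
    nodes = sorted({a for a, _, _ in edges} | {b for _, b, _ in edges})
    idx = {name: i for i, name in enumerate(nodes)}
    n = len(nodes)
    m = [[0.0] * n for _ in range(n)]
    for a, b, score in edges:
        i, j = idx[a], idx[b]
        # Symmetrize: keep the best score seen for either direction.
        m[i][j] = m[j][i] = max(m[i][j], score)
    for i in range(n):
        # Self-loop so each protein keeps some flow to itself.
        m[i][i] = max(max(m[i]), 1.0)
    m = normalize_columns(m)
    for _ in range(iterations):
        m = inflate(expand(m), inflation)
    # Rows that still carry flow are "attractors"; the columns they
    # attract form one cluster. Identical member sets are merged.
    clusters = set()
    for i in range(n):
        members = frozenset(nodes[j] for j in range(n) if m[i][j] > eps)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Made-up similarity scores: two tight triangles joined by one weak hit.
edges = [
    ("a", "b", 100), ("a", "c", 100), ("b", "c", 100),
    ("d", "e", 100), ("d", "f", 100), ("e", "f", 100),
    ("c", "d", 1),
]
print(toy_mcl(edges, inflation=2.0))
```

On this toy input the weak c-d link is squeezed out and the two triangles 
come back as separate groups. Raising the inflation argument makes the 
squeezing harsher, which is the sense in which higher values tend toward 
family-sized clusters.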

Cheers,
Mensur

==========================================================================
| Mensur Dlakic, PhD                | Tel: (406) 994-6576                |
| Department of Microbiology        | Fax: (406) 994-4926                |
| Montana State University          |                                    |
| 109 Lewis Hall, P.O. Box 173520   | http://myprofile.cos.com/mensur    |
| Bozeman, MT 59717-3520            | E-mail: mdlakic at montana.edu        |
==========================================================================



