[ssml] Re: Redundancy in MSA for building HMM
Mensur Dlakic
mdlakic at montana.edu
Wed Oct 6 13:19:25 EDT 2004
At 10:51 AM 10/6/2004, you wrote:
>If you care about the alignments, an HMM model will often produce
>better multiple alignments even if it finds the same sequences as
>BLAST.
Very true, with one caveat: an HMM will produce a better alignment only
if the initial alignment used to train it was good to begin with. For
closely related sequences, this is usually not a problem.
>The original question asked about models for "protein domain families
>(as defined in SCOP)," which may mean family-level models, or
>superfamily, or even fold, depending on how precisely Manisha Goel was
>using the term "families". If one wants to build a model that
>recognizes only one family and not other families in the same
>superfamily, the usual HMM methods will generally generalize too far.
>So far as I know, the best technique for family-level classification
>is to build an SVM classifier that uses an HMM to produce the input
>vectors for the SVM. (See, for example, Rachel Karchin's Master's
>thesis, or her paper
For those interested in this general subject, I suggest this paper:
Enright AJ, Van Dongen S, Ouzounis CA. An efficient algorithm for
large-scale detection of protein families. Nucleic Acids Res. 2002 Apr
1;30(7):1575-84.
http://nar.oupjournals.org/cgi/content/full/30/7/1575
and this software:
http://micans.org/mcl/
In general, once you have detected a group of proteins (which may be at
the superfamily level if one is not inclined to tune the search
parameters), this algorithm takes the results of an all-against-all BLAST
search of that group (even for hundreds of proteins this doesn't take very
long) and clusters them into groups according to several tunable
parameters. The most useful parameter is probably the inflation value (see
the description on the page above); high values of this parameter are
likely to produce clusters that are pretty close to family
classifications. This is probably not as sophisticated as the method Kevin
suggested, but it works very well in most cases and is extremely fast.
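To make the role of the inflation value concrete, here is a toy sketch of the expansion/inflation loop that Markov clustering is built on. It is not the mcl program itself, just a minimal pure-Python illustration under stated assumptions: the function names (mcl, normalize_columns), the protein labels and the similarity weights are all made up for the example, and converting real BLAST scores or E-values into edge weights is left out entirely. In the real mcl program, inflation is set with the -I option.

```python
def normalize_columns(m):
    """Scale each column so it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def mcl(edges, inflation=2.0, max_iter=100, tol=1e-9):
    """Toy Markov clustering.

    edges: dict mapping (name_a, name_b) -> similarity weight
           (hypothetical values standing in for BLAST-derived scores).
    Returns a sorted list of clusters, each a sorted list of names.
    """
    nodes = sorted({name for pair in edges for name in pair})
    index = {name: i for i, name in enumerate(nodes)}
    n = len(nodes)

    # Symmetric similarity matrix with a self-loop on every node.
    m = [[0.0] * n for _ in range(n)]
    for (a, b), w in edges.items():
        m[index[a]][index[b]] = m[index[b]][index[a]] = float(w)
    for i in range(n):
        m[i][i] = max(m[i]) or 1.0
    m = normalize_columns(m)

    for _ in range(max_iter):
        # Expansion: square the matrix (random walks of length two).
        expanded = [[sum(m[i][k] * m[k][j] for k in range(n))
                     for j in range(n)] for i in range(n)]
        # Inflation: raise entries to a power and renormalize, which
        # strengthens strong flows and weakens weak ones.
        inflated = normalize_columns(
            [[x ** inflation for x in row] for row in expanded])
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break

    # Each row that still carries flow is an attractor; the columns it
    # reaches form one cluster.
    clusters = {frozenset(nodes[j] for j in range(n) if row[j] > 1e-6)
                for row in m}
    return sorted(sorted(c) for c in clusters if c)


# Two tight groups joined by one weak hit (made-up weights).
edges = {
    ("a", "b"): 100, ("a", "c"): 100, ("b", "c"): 100,
    ("d", "e"): 100, ("d", "f"): 100, ("e", "f"): 100,
    ("c", "d"): 1,
}
print(mcl(edges, inflation=2.0))
```

Raising the inflation argument makes the inflation step more aggressive, which is why higher values tend to split the graph into smaller, tighter groups.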
Cheers,
Mensur
==========================================================================
| Mensur Dlakic, PhD | Tel: (406) 994-6576 |
| Department of Microbiology | Fax: (406) 994-4926 |
| Montana State University | |
| 109 Lewis Hall, P.O. Box 173520 | http://myprofile.cos.com/mensur |
| Bozeman, MT 59717-3520 | E-mail: mdlakic at montana.edu |
==========================================================================