[ssml] HMM weighting?

Fri Jan 16 15:36:53 EST 2004

"Dan Bolser" <dmb at mrc-dunn.cam.ac.uk> asked
> I would like to ask if the objective (direct or indirect) of
> weighting during model building is to make the model score every
> true hit equally? 

Anders Krogh tried that approach.  It did not work very well, because
it amplified the noise too much.  One incorrect or misaligned sequence
then does a lot of damage: it needs a huge weight to score as well as
the rest, and that weight grossly distorts the model.

I believe that PSI-BLAST does do some sequence weighting, and the SAM
T99 and T2K scripts certainly do.  In fact, if you don't do some sort
of sequence weighting when you use Dirichlet mixtures you get very bad
results, because almost all training sets have many similar sequences.
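
To make the redundancy problem concrete, here is a minimal sketch of one
well-known scheme, Henikoff position-based weighting.  It is only an
illustration, and I am not claiming it is what PSI-BLAST or the SAM
scripts do.  The function name and the toy alignment are made up for
the example, and gap characters are assumed to be treated like any
other symbol.

```python
from collections import Counter


def henikoff_weights(alignment):
    """Position-based weights for a list of equal-length aligned sequences.

    Illustrative sketch only: gaps are treated as ordinary symbols.
    """
    n_cols = len(alignment[0])
    weights = [0.0] * len(alignment)
    for col in range(n_cols):
        residues = [seq[col] for seq in alignment]
        counts = Counter(residues)
        k = len(counts)  # number of distinct symbols in this column
        for i, r in enumerate(residues):
            # Each distinct symbol gets 1/k of the column's vote,
            # split evenly among the sequences that carry it.
            weights[i] += 1.0 / (k * counts[r])
    total = sum(weights)
    return [w / total for w in weights]  # normalize to sum to 1


# Hypothetical toy alignment: three identical sequences and one outlier.
toy_alignment = ["ACDE", "ACDE", "ACDE", "GHIK"]
weights = henikoff_weights(toy_alignment)
print(weights)
```

In this toy case the three identical sequences share half of the total
weight and the lone outlier gets the other half, so the duplicates no
longer dominate the counts.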

The main problem with Dirichlet mixtures or pseudocounts is setting the
total weight of the data (how much you believe the data rather than
the prior).  How you allocate that weight among the individual sequences
is much less important, though methods that give somewhat more weight to
the outliers generally generalize better than flat weighting.  There
isn't a clean mathematical optimization here, since
we are dealing with noisy data with unknown but strong sampling biases.
About all we can do is to experiment with weighting schemes and see
what works.  What works well in one application (like fold recognition)
might not work well in another (like identifying subfamilies), since
different applications need different levels of generalization.
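
The effect of the total weight can be seen in a small sketch.  It uses
a single Dirichlet prior (plain pseudocounts, the one-component case of
a mixture) and a toy four-letter alphabet, both chosen only to keep the
example short.  The function name, the counts, and the two total-weight
values are assumptions for illustration, not settings from any real
tool.

```python
def posterior_mean(counts, alpha, total_weight):
    """Estimate column probabilities from rescaled counts plus pseudocounts.

    counts       -- weighted observed counts per symbol (hypothetical data)
    alpha        -- Dirichlet pseudocounts per symbol (the prior)
    total_weight -- what the observed counts are rescaled to sum to
    """
    observed = sum(counts.values())
    scale = total_weight / observed if observed > 0 else 0.0
    denom = scale * observed + sum(alpha.values())
    return {a: (scale * counts.get(a, 0.0) + alpha[a]) / denom for a in alpha}


# Toy column: mostly "A", a little "C", with a flat prior.
counts = {"A": 3.0, "C": 1.0}
alpha = {"A": 1.0, "C": 1.0, "G": 1.0, "T": 1.0}

low = posterior_mean(counts, alpha, total_weight=0.5)    # prior dominates
high = posterior_mean(counts, alpha, total_weight=20.0)  # data dominates
print(low["A"], high["A"])
```

With the same relative counts, the estimate for "A" stays close to the
flat prior at a small total weight and moves toward the observed
frequency at a large one, which is why the choice of total weight
matters more than how it is split among the sequences.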

Kevin Karplus