[ssml] Re: Sequence defined domains?

Fri Nov 28 07:28:01 EST 2003

I don't use similarity matrices for multiple alignment---I use
Dirichlet mixture priors.  Similarity matrices are great for
sequence-sequence alignment, since they do a very good job of
estimating a probability distribution from a single sample.  They're
terrible, though, at estimating probability distributions from samples
with more than one amino acid in them.
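
For readers unfamiliar with the technique, here is a minimal Python
sketch of the usual posterior-mean estimate under a Dirichlet mixture
prior.  It is an illustration only, not the code or the mixture
actually used here: the three-letter alphabet and both components are
made up, and the function names are hypothetical.

```python
from math import exp, lgamma, log

def log_beta(v):
    """Log of the multivariate Beta function: sum(lgamma(v_i)) - lgamma(sum(v))."""
    return sum(lgamma(x) for x in v) - lgamma(sum(v))

def dirichlet_mixture_estimate(counts, mix_coeffs, alphas):
    """Posterior-mean letter probabilities for one alignment column.

    counts     : observed count of each letter in the column
    mix_coeffs : prior weight q_j of each mixture component (sums to 1)
    alphas     : one Dirichlet parameter vector alpha_j per component
    """
    # Posterior weight of component j is proportional to
    #   q_j * B(counts + alpha_j) / B(alpha_j)
    log_w = [
        log(q) + log_beta([n + a for n, a in zip(counts, alpha)]) - log_beta(alpha)
        for q, alpha in zip(mix_coeffs, alphas)
    ]
    biggest = max(log_w)                      # subtract max for numerical stability
    w = [exp(x - biggest) for x in log_w]
    total_w = sum(w)
    posterior = [x / total_w for x in w]

    # Each component contributes its own pseudocount estimate,
    # weighted by how well that component explains the counts.
    n_total = sum(counts)
    return [
        sum(p * (counts[i] + alpha[i]) / (n_total + sum(alpha))
            for p, alpha in zip(posterior, alphas))
        for i in range(len(counts))
    ]

# Toy usage: made-up 3-letter alphabet, two made-up components.
toy_q = [0.5, 0.5]
toy_alphas = [[1.0, 1.0, 1.0], [10.0, 1.0, 1.0]]
print(dirichlet_mixture_estimate([5, 0, 0], toy_q, toy_alphas))
```

With all counts zero the estimate reduces to the prior mean of the
mixture; as counts accumulate, the observed frequencies dominate.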

I do iterative search to build my multiple alignments, starting with a
seed (usually a single sequence, but can be a hand-generated
alignment), and using gradually looser thresholds on the search.
This is similar in spirit to the PSI-BLAST iteration, though it was
developed independently at about the same time.

My iterations cycle through
	multiple alignment -> HMM
	search for similar sequences
	thin resulting alignment
	retrain HMM on thinned set of sequences
	realign all found sequences using HMM
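
The cycle above can be sketched roughly as the following Python loop.
This is only a skeleton under stated assumptions: build_hmm, search,
thin, and align are hypothetical callables standing in for the real
tools, and the toy stand-ins at the bottom (a Hamming-distance
"search" over a four-string database) exist just to make the sketch
runnable.

```python
def iterative_alignment(seed_alignment, thresholds, build_hmm, search, thin, align):
    """Run one search/retrain cycle per threshold, strictest first.

    All four callables are hypothetical placeholders:
      build_hmm(alignment)   -> model
      search(model, thresh)  -> sequences found at that threshold
      thin(sequences)        -> reduced set of sequences
      align(model, seqs)     -> multiple alignment of seqs to the model
    """
    alignment = seed_alignment
    for threshold in thresholds:            # gradually looser thresholds
        hmm = build_hmm(alignment)          # multiple alignment -> HMM
        found = search(hmm, threshold)      # search for similar sequences
        thinned = thin(found)               # thin resulting alignment
        hmm = build_hmm(thinned)            # retrain HMM on thinned set
        alignment = align(hmm, found)       # realign all found sequences
    return alignment

# --- Toy stand-ins, for illustration only (not real HMM code) ---
TOY_DATABASE = ["AAAA", "AAAB", "AABB", "ZZZZ"]

def toy_build_hmm(alignment):
    return list(alignment)                  # the "model" is just the sequences

def toy_search(model, max_mismatches):
    # A sequence is a hit if it is within max_mismatches of any model sequence.
    return [s for s in TOY_DATABASE
            if any(sum(a != b for a, b in zip(s, m)) <= max_mismatches
                   for m in model)]

def toy_thin(sequences):
    return sorted(set(sequences))           # drop exact duplicates

def toy_align(model, sequences):
    return list(sequences)                  # equal-length strings: nothing to do

result = iterative_alignment(["AAAA"], [0, 1],
                             toy_build_hmm, toy_search, toy_thin, toy_align)
print(result)
```

Passing the thresholds as an explicit strict-to-loose list keeps the
schedule visible in one place, which matters when the loop runs
unattended.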

The method is fairly robust to changes in parameters, as long as the
search threshold is never set so loose that it lets in unrelated
sequences.  Since I usually use this method in a fully automated way
(I've run it on at least 15,000 seeds), I can't rely on eyeballing the
results to decide when contamination happens, so I've set the default
values fairly strictly.  If you set the thresholds too tight, though,
you miss some homologs, so I have occasionally loosened them for
specific cases that I was working on by hand.

Since I usually start with a single sequence, the question "How does
the quality of the initial multiple alignment affect the later
development of the HMM on a given database?" is not one I can easily
answer.  Obviously a bad seed alignment is going to cause some
problems.  A good seed alignment can help, but we (and others who have
tried) have not gotten better fold recognition by starting with a
structural alignment (FSSP with Z-score >= 7) as a seed for HMMs.
It seems best to have multiple HMMs, each of which is somewhat
more specific.

Kevin Karplus