[ssml] MLE

Fri Aug 27 13:24:09 EDT 2004

l x yi said
> It can be assumed that the sequences
> in the databank are independent sequences, but if we
> are using the same sequence as query each
> time, wouldn't the scores obtained be dependent?

Actually the first assumption is bad---databases are full of repeated
and nearly repeated sequences.  They are far from independent draws
from the usually assumed null models.
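
To make the effect of redundancy concrete, here is a small simulation
sketch in Python.  Everything in it is an assumption for illustration:
the database is modeled as families of exact duplicates, each family
exceeds some score threshold with probability p_hit under the null,
and the names count_hits and simulate are hypothetical.

```python
import random
import statistics

def count_hits(rng, n_families, copies, p_hit):
    """Count database entries that score above a threshold by chance.

    Each family exceeds the threshold with probability p_hit, and every
    copy in a family shares that outcome (exact duplicates).
    """
    return sum(copies for _ in range(n_families) if rng.random() < p_hit)

def simulate(n_entries=1000, copies=10, p_hit=0.01, trials=2000, seed=0):
    """Return hit counts per trial for an independent and a redundant database."""
    rng = random.Random(seed)
    # Independent case: every entry is its own family.
    independent = [count_hits(rng, n_entries, 1, p_hit) for _ in range(trials)]
    # Redundant case: same number of entries, but in families of `copies`.
    redundant = [count_hits(rng, n_entries // copies, copies, p_hit)
                 for _ in range(trials)]
    return independent, redundant

if __name__ == "__main__":
    ind, red = simulate()
    print("independent: mean", statistics.mean(ind),
          "variance", statistics.variance(ind))
    print("redundant:   mean", statistics.mean(red),
          "variance", statistics.variance(red))
```

In this toy model the average number of chance hits is about the same
in both cases, but the redundant database has a much larger variance,
because chance hits arrive in clusters rather than one at a time.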

There are some query dependences that have been studied---I know that
the authors of BLAST have published and implemented length corrections
to their calibration for short queries.

Statisticians like to assume independence, because without it the math
often becomes intractable.  Independence is rarely really
present---the question is how much error the independence assumption
introduces, and whether computing E-values gives a better or worse
view of the results than not computing them.
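
As a rough illustration of where the assumptions enter an E-value
computation, here is a minimal sketch in Python.  It assumes that
chance scores follow a Gumbel (extreme-value) distribution with
location mu and scale parameter lam, a common choice for local
alignment scores that is not stated above.  The function names and
any parameter values you plug in are hypothetical.

```python
import math

def gumbel_tail(score, mu, lam):
    """P(S >= score) for one comparison under an assumed Gumbel null."""
    # -expm1(-x) computes 1 - exp(-x) without losing precision for tiny x.
    return -math.expm1(-math.exp(-lam * (score - mu)))

def e_value(score, mu, lam, n_sequences):
    """Expected number of database sequences scoring >= score by chance,
    assuming every sequence is a draw from the same null distribution."""
    return n_sequences * gumbel_tail(score, mu, lam)

def p_at_least_one(e):
    """P(at least one chance hit), assuming hits are independent so that
    their count is approximately Poisson with mean e."""
    return -math.expm1(-e)
```

The independence assumption matters most in the last step: if chance
hits come in correlated groups, the Poisson conversion from an
expected count to a probability is no longer justified.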

In the case of sequence alignments, the independence assumption is not
too terrible, and calibration does improve the interpretability of
results.  There are known artifacts (such as composition bias and
over-sensitivity to low-entropy sequences), which reflect weaknesses
in the null model used.

------------------------------
Kevin Karplus 	karplus at soe.ucsc.edu	http://www.soe.ucsc.edu/~karplus
Senior member, IEEE	Board of Directors, ISCB (starting Jan 2005)
Professor of Biomolecular Engineering, University of California, Santa Cruz
Undergraduate and Graduate Director, Bioinformatics
Affiliations for identification only.