On Fri, 27 Aug 2004, Kevin Karplus wrote:

l x yi said
It can be assumed that the sequences
in the databank are independent sequences, but if we
are using the same sequence as query each
time, wouldn't the scores obtained be dependent?

Actually the first assumption is bad---databases are full of repeated
and nearly repeated sequences. They are far from independent draws
from the usually assumed null models.

There are some query dependences that have been studied---I know that
the authors of BLAST have published and implemented length corrections
to their calibration for short queries.

Statisticians like to assume independence, because without it the math
often becomes intractable. Independence is rarely really
present---the question is how much error gets introduced by the
independence assumption, and does the computation of E-values provide
a better or worse view of the results than not computing the E-values.

A very interesting homology search algorithm is Family Pairwise Search (FPS).
http://fps.sdsc.edu/

It starts from the assumption that multiple hits from sequences in the same family to a target sequence can accumulate into a 'family p-value' for the target sequence. If the hits are independent, the family p-value is simply the product of the individual hits' p-values. As this is rarely the case (sequences from the same family are probably related), they develop the notion of 'effective family size', that is, the effective number of 'independent' sequences in the family. If all the sequences are the same, this is 1; if they are all truly independent, the effective family size is the same as the family size.
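
To make the two extremes concrete, here is a rough Python sketch. It is not the FPS authors' actual formula, which I have not checked. The function name family_p_value and the interpolation rule (scaling the log of the product by effective_size / n) are my own assumptions, chosen only because they reproduce the two limiting cases described above: the plain product when every hit is independent, and a single hit's p-value when all the family members are identical.

```python
import math


def family_p_value(p_values, effective_size=None):
    """Combine per-hit p-values into one family p-value (illustrative only).

    ASSUMPTION: the interpolation below is a hypothetical stand-in, not the
    published FPS method. It takes the product of the p-values and raises it
    to the power effective_size / n.
    """
    n = len(p_values)
    if n == 0:
        raise ValueError("need at least one p-value")
    if any(p <= 0.0 or p > 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    if effective_size is None:
        effective_size = n  # treat every hit as an independent draw
    if not 1 <= effective_size <= n:
        raise ValueError("effective_size must be between 1 and len(p_values)")

    # Sum logs rather than multiplying, so many small p-values do not underflow.
    log_product = sum(math.log(p) for p in p_values)

    # effective_size == n gives the plain product;
    # effective_size == 1 gives the geometric mean of the p-values.
    return math.exp(log_product * effective_size / n)
```

With three hits at p = 0.01 each, the default (fully independent) case gives about 1e-6, while effective_size=1 gives back 0.01, i.e. the three identical hits count as one.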

However, I don't think they address the question below.

In the case of sequence alignments, the independence assumption is not
too terrible, and calibration does improve the interpretability of
results. There are known artifacts (such as composition bias and
over-sensitivity to low-entropy sequences), which reflect weaknesses
in the null model used.
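
As a small illustration of what "low-entropy" means here, the following Python sketch computes the Shannon entropy of a sequence's residue composition. The function name shannon_entropy is hypothetical, and this is only a crude whole-sequence measure, not the masking procedure any particular search tool uses.

```python
import math
from collections import Counter


def shannon_entropy(sequence):
    """Shannon entropy (bits per residue) of the residue composition.

    Illustrative only: a homopolymer run scores 0 bits, and a sequence
    using k residue types equally often scores log2(k) bits.
    """
    if not sequence:
        raise ValueError("sequence must be non-empty")
    total = len(sequence)
    counts = Counter(sequence)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A sequence whose entropy is far below that of typical database sequences is a candidate for the kind of spurious high score mentioned above, since a null model with average composition describes it poorly.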