[ssml] MLE

Dan Bolser dmb at mrc-dunn.cam.ac.uk
Fri Aug 27 13:41:14 EDT 2004


On Fri, 27 Aug 2004, Kevin Karplus wrote:

>
>l x yi said
>> It can be assumed that the sequences
>> in the databank are independent sequences, but if we
>> are using the same sequence as query each
>> time,wouldn't the scores obtained be dependent?
>
>Actually the first assumption is bad---databases are full of repeated
>and nearly repeated sequences.  They are far from independent draws
>from the usually assumed null models.
>
>There are some query dependences that have been studied---I know that
>the authors of BLAST have published and implemented length corrections
>to their calibration for short queries.
>
>Statisticians like to assume independence, because without it the math
>often becomes intractable.  Independence is rarely really
>present---the question is how much error gets introduced by the
>independence assumption, and does the computation of E-values provide
>a better or worse view of the results than not computing the E-values.


One very interesting homology search algorithm is Family Pairwise Search
(FPS).

http://fps.sdsc.edu/

It starts from the assumption that hits from several sequences in the
same family against one target sequence can be accumulated into a
'family p-value' for that target.

If each hit is independent, the family p-value is simply the product of
the individual hits' p-values. As this is rarely the case (sequences from
the same family are probably related), they develop the notion of
'effective family size', that is, the effective number of 'independent'
sequences in the family. If all the sequences are the same this is 1; if
they are all truly independent, the effective family size is the same as
the family size.
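
To make the two limiting cases concrete, here is a rough Python sketch.
It is not how FPS is actually implemented: the function name
family_p_value and the rule of scaling the summed log p-values by
effective_size / n are my own assumptions, chosen only because they
reproduce the two limits described above (effective size 1 gives back a
single hit's p-value, effective size n gives the plain product).

```python
import math


def family_p_value(p_values, effective_size=None):
    """Combine per-hit p-values into one 'family p-value'.

    ASSUMPTION: this scaling rule is illustrative only and is not taken
    from the FPS implementation. The summed log p-values are scaled by
    effective_size / n, so that:
      - effective_size == n gives the plain product (independent hits)
      - effective_size == 1 gives the geometric mean (redundant hits)
    """
    n = len(p_values)
    if n == 0:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    if effective_size is None:
        effective_size = n  # treat every hit as independent
    if not (1 <= effective_size <= n):
        raise ValueError("effective_size must lie between 1 and n")

    # Work in log space so that many small p-values do not underflow.
    log_sum = sum(math.log(p) for p in p_values)
    return math.exp(log_sum * effective_size / n)


if __name__ == "__main__":
    hits = [0.01, 0.01, 0.01]
    print(family_p_value(hits, effective_size=1))  # redundant family
    print(family_p_value(hits, effective_size=3))  # independent family
```

With three identical hits of p = 0.01, an effective family size of 1
leaves the family p-value at about 0.01, while an effective family size
of 3 gives about 1e-6, the plain product.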

However, I don't think they address the question below.


>
>In the case of sequence alignments, the independence assumption is not
>too terrible, and calibration does improve the interpretability of
>results.  There are known artifacts (such as composition bias and
>over-sensitivity to low-entropy sequences), which reflect weaknesses
>in the null model used.
>
>------------------------------
>Kevin Karplus 	karplus at soe.ucsc.edu	http://www.soe.ucsc.edu/~karplus
>Senior member, IEEE	Board of Directors, ISCB (starting Jan 2005)
>Professor of Biomolecular Engineering, University of California, Santa Cruz
>Undergraduate and Graduate Director, Bioinformatics
>Affiliations for identification only.
>_______________________________________________
>ssml-general mailing list
>ssml-general at bioinformatics.org
>https://bioinformatics.org/mailman/listinfo/ssml-general
>



