On Fri, 27 Aug 2004, Kevin Karplus wrote:
>
>l x yi said
>> It can be assumed that the sequences
>> in the databank are independent sequences, but if we
>> are using the same sequence as query each
>> time, wouldn't the scores obtained be dependent?
>
>Actually the first assumption is bad---databases are full of repeated
>and nearly repeated sequences. They are far from independent draws
>from the usually assumed null models.
>
>There are some query dependences that have been studied---I know that
>the authors of BLAST have published and implemented length corrections
>to their calibration for short queries.
>
>Statisticians like to assume independence, because without it the math
>often becomes intractable. Independence is rarely really
>present---the question is how much error gets introduced by the
>independence assumption, and does the computation of E-values provide
>a better or worse view of the results than not computing the E-values.

A very interesting homology search algorithm is the Family Pairwise
Search (FPS).

http://fps.sdsc.edu/

It starts from the assumption that multiple hits from sequences in the
same family to a target sequence can be accumulated into a 'family
p-value' for the target sequence. If each hit is independent, the
family p-value is simply the product of the individual hits' p-values.
As this is rarely the case (sequences from the same family are probably
related), they develop the notion of 'effective family size', that is,
the effective number of 'independent' sequences in the family. If all
the sequences are the same, this is 1; if they are all truly
independent, the effective family size is the same as the family size.

However, I don't think they address the question below.

>
>In the case of sequence alignments, the independence assumption is not
>too terrible, and calibration does improve the interpretability of
>results.
>There are known artifacts (such as composition bias and
>over-sensitivity to low-entropy sequences), which reflect weaknesses
>in the null model used.
>
>------------------------------
>Kevin Karplus karplus at soe.ucsc.edu http://www.soe.ucsc.edu/~karplus
>Senior member, IEEE Board of Directors, ISCB (starting Jan 2005)
>Professor of Biomolecular Engineering, University of California, Santa Cruz
>Undergraduate and Graduate Director, Bioinformatics
>Affiliations for identification only.
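
P.S. To make the 'effective family size' idea from my FPS comment
concrete, here is a rough sketch in Python. It is not the FPS
implementation, and I have not checked it against their code. The
function name combined_family_p and the interpolation rule (geometric
mean of the hit p-values raised to the effective family size) are my
own assumptions, chosen only because they reproduce the two limiting
cases: the plain product when all hits are independent, and a single
hit's p-value when all family members are identical.

```python
import math


def combined_family_p(p_values, effective_size=None):
    """Combine per-hit p-values into one family-level value.

    Hypothetical helper for illustration only, not the FPS method.
    effective_size is assumed to lie between 1 (all family members
    identical) and len(p_values) (all members independent).
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")

    n = len(p_values)
    if effective_size is None:
        effective_size = n  # treat every hit as independent
    if not (1.0 <= effective_size <= n):
        raise ValueError("effective_size must be between 1 and the family size")

    # Work in log space so that many small p-values do not underflow.
    mean_log_p = sum(math.log(p) for p in p_values) / n

    # Geometric mean raised to the effective number of independent hits.
    return math.exp(effective_size * mean_log_p)


hits = [1e-3, 1e-3, 1e-3, 1e-3]
print(combined_family_p(hits))                    # independent: plain product
print(combined_family_p(hits, effective_size=1))  # identical: one hit's p-value
print(combined_family_p(hits, effective_size=2))  # somewhere in between
```

With effective_size equal to the family size this is just the product
of the p-values; with effective_size equal to 1 the four identical hits
count as a single hit. How the effective family size itself should be
estimated is the hard part, and this sketch does not attempt it.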