On 19/01/07, Michael Nuhn <nuhn at rhrk.uni-kl.de> wrote:
> The link you sent shows how the bit score (S') is derived from the raw score
> (S):
>
> S' = ( lambda * S - ln K) / ln2
>
> Where the value of lambda is only derived from the scoring matrix and K is a
> constant that I don't understand.
>
> Where does the background distribution of the amino acid (or in my case DNA)
> sequence of the query come in?

Hi Michael and everyone,

I don't know how things work out for DNA sequences, but for proteins the
background frequencies enter through the raw score S.

The raw score S is the sum of the scores of all HSPs (High-scoring Segment
Pairs) between the query and the sequence under consideration. The score of
an HSP is the sum of the pairwise scores of all aligned AAs in that HSP. The
pairwise scores come from a substitution matrix such as BLOSUM or PAM.

The pairwise score Spw_ij between AAs i and j is in turn computed as the
log-odds ratio of the target frequency to the background frequencies:

    Spw_ij = ln( Q_ij / (P(i) P(j)) ) / λ

where Q_ij is the target frequency derived from the respective substitution
model (PAM, BLOSUM etc.) and P(i) and P(j) are the overall background
frequencies of AAs i and j.

For λ the equation

    sum_i,j P(i) P(j) exp(λ Spw_ij) = 1

must hold, which is the same as requiring that the target frequencies Q_ij
sum to 1. So λ depends on the background frequencies as well as on the
scoring matrix.

The above can be found here

    http://blast.wustl.edu/doc/infotheory.html

and for BLAST specifically

    http://blast.wustl.edu/doc/infotheory.html#KAStats

hth
Martin

--
+ gpg : http://user.cs.tu-berlin.de/~mhe/pub/martin.gpg
+ gpg fp: 4844 71B5 B4E4 3892 69CA 6EA5 6598 61BE 0021 94A2
+ http://ni.cs.tu-berlin.de/
+ In the beginning was the WORD, and the WORD was UNSIGNED,
+ and the main(){} was without form and void
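
PS: As a rough illustration of the relations above, here is a minimal Python sketch. It is not taken from the BLAST sources. It solves the condition on λ numerically for a given score matrix and background distribution, and then applies the bit-score formula from the quoted mail. The function names (solve_lambda, bit_score), the +1/-3 DNA scoring scheme, the uniform background of 0.25 per base, and the value of K are all assumptions made for the example. K in particular is simply passed in, since its derivation is not covered here.

```python
import math


def solve_lambda(scores, freqs, tol=1e-12):
    """Find the positive lambda with sum_ij P(i) P(j) exp(lambda * s_ij) = 1.

    scores: dict of dicts, scores[i][j] is the pairwise score s_ij
    freqs:  dict, freqs[i] is the background frequency P(i)
    (the same background is assumed for both sequences)
    """
    letters = list(freqs)

    def f(lam):
        total = sum(freqs[a] * freqs[b] * math.exp(lam * scores[a][b])
                    for a in letters for b in letters)
        return total - 1.0

    # A positive root exists only if the expected score is negative
    # and at least one score is positive.
    expected = sum(freqs[a] * freqs[b] * scores[a][b]
                   for a in letters for b in letters)
    best = max(scores[a][b] for a in letters for b in letters)
    if expected >= 0 or best <= 0:
        raise ValueError("need negative expected score and a positive score")

    # Bracket the root: f < 0 just above 0, f > 0 for large lambda.
    hi = 1.0
    while f(hi) <= 0:
        hi *= 2.0
    lo = hi
    while f(lo) >= 0:
        lo /= 2.0

    # Plain bisection on the bracket [lo, hi].
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def bit_score(raw_score, lam, k):
    """S' = (lambda * S - ln K) / ln 2, as in the quoted formula."""
    return (lam * raw_score - math.log(k)) / math.log(2)


# Assumed example: DNA, match = +1, mismatch = -3, uniform background.
dna_freqs = {base: 0.25 for base in "ACGT"}
dna_scores = {a: {b: (1 if a == b else -3) for b in "ACGT"} for a in "ACGT"}

lam = solve_lambda(dna_scores, dna_freqs)
assumed_k = 0.711  # illustrative value only, not derived in this sketch
print("lambda =", lam)
print("bit score for S = 30:", bit_score(30, lam, assumed_k))
```

Changing dna_freqs to a skewed base composition and solving again gives a different λ, which shows where the background distribution enters even though the score matrix stays the same.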