Aaron Darling wrote:
> Malay wrote:
>
>> In a recent post I mentioned that "pre-splitting" the database screws up
>> BLAST statistics. Aaron Darling pointed out that mpiBLAST version 1.3.0
>> gets the statistics just right. I apologise for my ignorance. But I am
>> curious how they do it. Can anyone point me to any information?
>>
> I guess I would be the most qualified person to answer that :)
>
> A BLAST e-value is the expected number of chance alignments scoring at
> least as well as a particular alignment, given a database and a query of
> particular lengths. Rather than use raw sequence lengths, BLAST calculates
> effective sequence lengths, which are adjusted to account for edge
> effects. Karlin and Altschul have a few PNAS papers describing the
> statistics behind edge effects. In order to calculate accurate e-values,
> the effective query and database lengths need to be used.
>
> Immediately after startup, the rank 0 mpiBLAST process uses the NCBI
> Toolbox code to calculate the effective query and database lengths for
> each query. It then tree-broadcasts these values to all other mpiBLAST
> processes. During the search, the workers report hits using the
> effective query and database lengths to calculate the e-values.
>
> If you're interested in the gory details of the code, I'll refer you to
> the small NCBI Toolbox patch included with mpiBLAST. The patch allows
> mpiBLAST to cull effective query and db lengths, and later, set them
> during the search process. It's called ncbi_Oct2004_evalue.patch
>

Fantastic achievement indeed! My congratulations! I stand corrected. I
will certainly look into it. Thanks a lot, Aaron.

Malay
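
For anyone following the thread, here is a minimal sketch of why the effective lengths matter, assuming the standard Karlin-Altschul form E = K * m' * n' * exp(-lambda * S). It is not mpiBLAST or NCBI Toolbox code. The function name, the lambda and K values, the lengths, and the score are all hypothetical, and the fragment's effective length is crudely approximated as an equal share of the whole database. The effective lengths are taken as given inputs, since computing them is the job of the NCBI Toolbox.

```python
import math


def evalue(raw_score, eff_query_len, eff_db_len, lam, k):
    """Karlin-Altschul expected number of chance hits.

    E = K * m' * n' * exp(-lambda * S), where m' and n' are the
    effective query and database lengths.
    """
    return k * eff_query_len * eff_db_len * math.exp(-lam * raw_score)


# Illustrative numbers only (assumptions, not taken from mpiBLAST).
LAM, K = 0.267, 0.041
EFF_QUERY_LEN = 250
EFF_DB_LEN = 1_000_000_000  # whole database, computed once (rank 0 in the post)
N_FRAGMENTS = 4
# Crude stand-in for what a worker would derive from its own fragment alone.
EFF_FRAGMENT_LEN = EFF_DB_LEN // N_FRAGMENTS
RAW_SCORE = 120

# A worker that only knows its fragment underestimates the search space.
naive = evalue(RAW_SCORE, EFF_QUERY_LEN, EFF_FRAGMENT_LEN, LAM, K)

# A worker that was sent the whole-database effective length reports the
# same e-value an unsplit search would.
corrected = evalue(RAW_SCORE, EFF_QUERY_LEN, EFF_DB_LEN, LAM, K)

print(f"fragment-only e-value:   {naive:.3e}")
print(f"whole-database e-value:  {corrected:.3e}")
print(f"ratio: {corrected / naive:.1f}")
```

Because the e-value is linear in the effective database length, a worker that uses only its own fragment's length reports e-values that are too small by roughly the number of fragments. Supplying every worker with the same whole-database values removes that dependence on how the database was split.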