[Bioclusters] mpiBLAST statistics

Wed Jan 5 13:16:36 EST 2005

Malay wrote:

> In a recent post I mentioned that "pre-splitting" database screws up 
> BLAST statistics. Aaron Darling pointed out the mpiBLAST version 1.3.0 
> gets the statistics just right. I apologise for my ignorance. But I am 
> curious though how they do it. Can anyone point me to any information?
>
I guess I would be the most qualified person to answer that :)

BLAST e-values estimate how many alignments scoring at least as well as 
a given hit would be expected by chance, for a query and a database of 
particular lengths.  Rather than use raw sequence lengths, BLAST 
calculates effective sequence lengths, which are adjusted to account for 
edge effects.  Karlin and Altschul have a few PNAS papers describing the 
statistics behind edge effects.  To calculate accurate e-values, the 
effective query and database lengths for the whole database need to be 
used, not those of a single database fragment.
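
As a rough illustration of why the lengths matter, here is a minimal 
Python sketch of the Karlin-Altschul e-value formula, 
E = K * m' * n' * exp(-lambda * S), where m' and n' are the effective 
query and database lengths.  The function name, the lambda and K values, 
and the fixed length adjustment are all assumptions made for the example. 
The real NCBI code derives the length adjustment itself, and nothing here 
is taken from the mpiBLAST or NCBI Toolbox source.

```python
import math


def evalue(raw_score, query_len, db_len, num_seqs, length_adj,
           lam=0.267, k=0.041):
    """E = K * m' * n' * exp(-lambda * S) using effective lengths.

    lam and k are illustrative Karlin-Altschul parameters and
    length_adj is an assumed, fixed length adjustment.
    """
    # Effective lengths: trim the adjustment from the query once and
    # from the database once per sequence, never dropping below 1.
    eff_query = max(query_len - length_adj, 1)
    eff_db = max(db_len - num_seqs * length_adj, 1)
    return k * eff_query * eff_db * math.exp(-lam * raw_score)


# Hypothetical numbers: the same hit scored against the whole database
# and against one of four equal fragments of it.
whole = evalue(100, 300, 1_000_000, 3000, 30)
fragment = evalue(100, 300, 250_000, 750, 30)

# The fragment-based e-value is too small by the number of fragments.
print(whole / fragment)
```

With the length adjustment held fixed, the e-value computed against a 
quarter of the database is a quarter of the whole-database value, which 
is the kind of distortion pre-splitting introduces when each fragment is 
scored on its own.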

Immediately after startup, the rank 0 mpiblast process uses the NCBI 
Toolbox code to calculate the effective query and database lengths for 
each query.  It then tree-broadcasts these values to all other mpiblast 
processes.  During the search, each worker calculates the e-values of 
the hits it reports from these broadcast lengths rather than from the 
lengths of its own database fragment.
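
The flow above can be sketched in plain Python without MPI.  This is 
only a simulation under stated assumptions: tree_broadcast, the rank 
count, and the per-query length values are hypothetical, and the 
binomial-tree pattern is one common way to implement a tree broadcast, 
not necessarily the one mpiBLAST uses.

```python
def tree_broadcast(value, num_ranks):
    """Simulate a binomial-tree broadcast from rank 0.

    Returns a dict mapping each rank to the value it received and the
    number of communication rounds that were needed.
    """
    received = {0: value}
    rounds = 0
    step = 1
    while step < num_ranks:
        # Every rank that already holds the value forwards it to the
        # rank `step` positions away, doubling the holders each round.
        for src in list(received):
            dst = src + step
            if dst < num_ranks:
                received[dst] = received[src]
        step *= 2
        rounds += 1
    return received, rounds


# Hypothetical effective lengths computed once on rank 0, keyed by
# query name: (effective query length, effective database length).
effective_lengths = {"query1": (270, 910_000), "query2": (455, 910_000)}

per_rank, rounds = tree_broadcast(effective_lengths, num_ranks=8)

# Every worker now holds the same whole-database values.
print(rounds, sorted(per_rank))
```

The point of the tree is that the number of rounds grows with the 
logarithm of the number of processes rather than linearly, while every 
worker still ends up with identical lengths to plug into its e-value 
calculation.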

If you're interested in the gory details of the code I'll refer you to 
the small NCBI toolbox patch included with mpiBLAST.  The patch lets 
mpiblast extract the effective query and db lengths and later set them 
during the search process.  It's called ncbi_Oct2004_evalue.patch

-Aaron