[Bioclusters] mpiBLAST statistics

Wed Jan 5 13:22:32 EST 2005

Aaron Darling wrote:
> Malay wrote:
> 
>> In a recent post I mentioned that "pre-splitting" the database screws up 
>> BLAST statistics. Aaron Darling pointed out that mpiBLAST version 1.3.0 
>> gets the statistics just right. I apologise for my ignorance. I am 
>> curious, though, how they do it. Can anyone point me to any information?
>>
> I guess I would be the most qualified person to answer that :)
> 
> A BLAST e-value is the number of alignments scoring at least as well 
> as a given alignment that one would expect to see by chance between a 
> database and a query of particular lengths.  Rather than use raw 
> sequence lengths, BLAST calculates effective sequence lengths, which are 
> adjusted to account for edge effects.  Karlin and Altschul have a few 
> PNAS papers describing the statistics behind edge effects.  To calculate 
> accurate e-values, the effective query and database lengths must be used.
> 
> Immediately after startup, the rank 0 mpiblast process uses the NCBI 
> Toolbox code to calculate the effective query and database lengths for 
> each query.  It then tree-broadcasts these values to all other mpiblast 
> processes.  During the search, the workers use the broadcast effective 
> query and database lengths to calculate the e-values of the hits they 
> report.
> 
> If you're interested in the gory details of the code, I'll refer you to 
> the small NCBI Toolbox patch included with mpiBLAST.  The patch allows 
> mpiblast to extract the effective query and db lengths and, later, to 
> set them during the search process.  It's called 
> ncbi_Oct2004_evalue.patch.
> 
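
To make the point about lengths concrete, here is a minimal sketch of the Karlin-Altschul e-value formula, E = K * m' * n' * exp(-lambda * S), where m' and n' are the effective query and database lengths. The function name, the lambda and K values, and the lengths are illustrative assumptions, not values taken from mpiBLAST or from this thread.

```python
import math


def evalue(raw_score, eff_query_len, eff_db_len, lam, k):
    """Expected number of chance alignments scoring >= raw_score."""
    return k * eff_query_len * eff_db_len * math.exp(-lam * raw_score)


# Illustrative statistical parameters (assumed for this example only).
LAMBDA, K = 0.267, 0.041

# Same hit, same score: once against the whole database length,
# once against the length of a single fragment of a 4-way split.
whole_db = evalue(80, 300, 1_000_000_000, LAMBDA, K)
fragment = evalue(80, 300, 250_000_000, LAMBDA, K)

print(whole_db, fragment)
```

Because E scales linearly with the database length, a worker that only knows its own fragment's length would report an e-value about four times too small in this example.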

Fantastic achievement indeed! My congratulations! I stand corrected. I 
will certainly look into it. Thanks a lot, Aaron.

Malay
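
The workflow Aaron describes (compute the effective lengths once, share them, and have every worker use the shared values) can be sketched without MPI as follows. This is only an illustration under stated assumptions: the length adjustment is a simplified iteration, not the NCBI Toolbox code; the list comprehension stands in for the broadcast; and all names, parameters, and sizes are hypothetical.

```python
import math

# Illustrative Karlin-Altschul parameters (assumed, not from the thread).
LAMBDA, K, H = 0.267, 0.041, 0.14


def effective_lengths(query_len, db_len, num_seqs, iterations=20):
    """Simplified edge-effect correction: iterate l = ln(K * m' * n') / H.

    An approximation for illustration; it is not the NCBI Toolbox code.
    """
    adj = 0.0
    for _ in range(iterations):
        eff_q = max(query_len - adj, 1.0)
        eff_db = max(db_len - num_seqs * adj, 1.0)
        adj = math.log(K * eff_q * eff_db) / H
    return max(query_len - adj, 1.0), max(db_len - num_seqs * adj, 1.0)


def worker_evalue(raw_score, eff_q, eff_db):
    """E-value a worker reports for a hit, given the lengths it was handed."""
    return K * eff_q * eff_db * math.exp(-LAMBDA * raw_score)


QUERY_LEN = 300
DB_LEN, DB_SEQS = 1_000_000_000, 2_000_000
NUM_FRAGMENTS = 4

# "Rank 0": compute the effective lengths once, for the whole database.
shared = effective_lengths(QUERY_LEN, DB_LEN, DB_SEQS)

# Stand-in for the broadcast: every worker receives the same values.
reported = [worker_evalue(80, *shared) for _ in range(NUM_FRAGMENTS)]

# Naive pre-split: a worker derives lengths from its own fragment only.
frag = effective_lengths(QUERY_LEN, DB_LEN // NUM_FRAGMENTS,
                         DB_SEQS // NUM_FRAGMENTS)
naive = worker_evalue(80, *frag)

print(reported[0], naive)
```

With shared lengths, every worker reports the same e-value for the same score regardless of which fragment the hit came from; the naive per-fragment calculation gives a smaller, overly optimistic value.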