[Bioclusters] mpiBLAST statistics
Malay
mbasu at mail.nih.gov
Wed Jan 5 13:22:32 EST 2005
Aaron Darling wrote:
> Malay wrote:
>
>> In a recent post I mentioned that "pre-splitting" the database screws
>> up BLAST statistics. Aaron Darling pointed out that mpiBLAST version
>> 1.3.0 gets the statistics just right. I apologise for my ignorance,
>> but I am curious how they do it. Can anyone point me to any
>> information?
>>
> I guess I would be the most qualified person to answer that :)
>
> BLAST e-value statistics give the expected number of chance alignments
> scoring at least as well as a particular alignment, for a database and
> a query of particular lengths. Rather than use raw sequence lengths,
> BLAST calculates effective sequence lengths, which are adjusted to
> account for edge effects. Karlin and Altschul have a few PNAS papers
> describing the statistics behind edge effects. To calculate accurate
> e-values, the effective query and database lengths must be used.
>
> Immediately after startup, the rank 0 mpiblast process uses the NCBI
> Toolbox code to calculate the effective query and database lengths for
> each query. It then tree-broadcasts these values to all other mpiblast
> processes. During the search, the workers use these broadcast
> effective query and database lengths to calculate the e-values of the
> hits they report.
>
> If you're interested in the gory details of the code I'll refer you to
> the small NCBI toolbox patch included with mpiBLAST. The patch allows
> mpiblast to cull effective query and db lengths, and later, set them
> during the search process. It's called ncbi_Oct2004_evalue.patch
>
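For anyone following the thread, here is a minimal sketch of why the lengths matter. It is not mpiBLAST or NCBI Toolbox code. It assumes the standard Karlin-Altschul form E = K * m' * n' * exp(-lambda * S), illustrative parameter values for LAMBDA, K and H, and a simplified one-step length adjustment (the real toolbox computation is more involved). All function names are hypothetical.

```python
import math

# Illustrative Karlin-Altschul parameters (assumed values, not from the thread).
LAMBDA = 0.267
K = 0.041
H = 0.14


def length_adjustment(query_len, db_len):
    """Simplified one-step estimate of the edge-effect correction."""
    return math.log(K * query_len * db_len) / H


def effective_lengths(query_len, db_len, num_seqs):
    """Return (effective query length, effective database length)."""
    adj = length_adjustment(query_len, db_len)
    eff_query = max(query_len - adj, 1.0)
    # Every database sequence loses the same adjustment at its edges.
    eff_db = max(db_len - num_seqs * adj, 1.0)
    return eff_query, eff_db


def evalue(raw_score, eff_query, eff_db):
    """Expected number of chance alignments scoring at least raw_score."""
    return K * eff_query * eff_db * math.exp(-LAMBDA * raw_score)


if __name__ == "__main__":
    query_len = 300
    db_len, num_seqs = 100_000_000, 250_000  # hypothetical whole database

    # What the rank 0 process would compute once and broadcast.
    global_q, global_db = effective_lengths(query_len, db_len, num_seqs)

    # What a worker would get from its own quarter-size fragment alone.
    local_q, local_db = effective_lengths(query_len, db_len // 4, num_seqs // 4)

    score = 100
    print("e-value from fragment lengths: ", evalue(score, local_q, local_db))
    print("e-value from broadcast lengths:", evalue(score, global_q, global_db))
```

A worker that derives the lengths from its own fragment reports a smaller e-value for the same hit than one using the whole-database values, which is the distortion that broadcasting the lengths avoids.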
Fantastic achievement indeed! My congratulations! I stand corrected. I
will certainly look into it. Thanks a lot, Aaron.
Malay