[Bioclusters] Parallel blast
Ivo Grosse
bioclusters@bioinformatics.org
Fri, 07 Jun 2002 11:27:51 -0400
Joe Landman <landman@scientificappliance.com> wrote on Fri, 7 Jun 2002:
> I do not know precisely what Paracel's code does.
I don't know *precisely* what the code does either. I only know
*vaguely* that
- it can fragment the query sequence and also the database, and
- it recomputes the final P and E values from the set of P and E
values obtained for the query-sequence / database fragments.
Paracel is proud of the fact that their final P and E values are
identical (plus/minus epsilon) to the P and E values that would have
been obtained by running NCBI Blast on the non-fragmented
query sequence and the non-fragmented database.
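How Paracel performs this recomputation is not described here. As a minimal sketch only, assuming the standard Karlin-Altschul relations E = K*m*n*exp(-lambda*S) and P = 1 - exp(-E), an E value obtained against one database fragment can be rescaled to the full database because E is proportional to the database length n. The function names and the example numbers are hypothetical, and effective-length (edge-effect) corrections are ignored:

```python
import math

def rescale_evalue(e_fragment, fragment_residues, full_db_residues):
    """Rescale an E value from one database fragment to the full database.

    Assumes E grows linearly with database length and ignores
    effective-length corrections, so this is only an approximation.
    """
    return e_fragment * (full_db_residues / fragment_residues)

def evalue_to_pvalue(e_value):
    """P = 1 - exp(-E); expm1 keeps precision when E is very small."""
    return -math.expm1(-e_value)

# Hypothetical hit found in one of four equal-sized database fragments.
e_full = rescale_evalue(1e-5, fragment_residues=250_000,
                        full_db_residues=1_000_000)
p_full = evalue_to_pvalue(e_full)
```

For small E the P value is nearly equal to E, which is why the two are often quoted interchangeably for strong hits.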
> pathological case (e.g. worst case) was something Ivo Grosse suggested
> with Chr21 vs pufferfish, where I was getting about 8x speedup on 16
> CPUs.
If I remember correctly, Paracel's Blast had almost exactly the same
speed, so a parallel efficiency of about 50% (8x speedup on 16 CPUs)
seemed normal for programs that also fragment the database.
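The arithmetic behind that figure, as a small sketch (the function name is just for illustration):

```python
def parallel_efficiency(speedup, n_cpus):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / n_cpus

# Numbers quoted above: about 8x speedup on 16 CPUs.
eff = parallel_efficiency(speedup=8, n_cpus=16)
print(f"efficiency = {eff:.0%}")  # prints "efficiency = 50%"
```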
> work by segmenting the input query sequences, optionally segmenting the
> databases (this isnt always a performance win though),
Exactly. I guess that when fragmenting only the query sequence, but not
the database, the Blast throughput should scale almost linearly with the
number of nodes, until the fileserver can no longer handle the output.
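A rough way to picture that guess is a throughput curve that grows linearly and then flattens at the fileserver limit. This is a toy model under that assumption only; the function name, the rates, and the cap are made-up illustration values, not measurements:

```python
def cluster_throughput(n_nodes, per_node_rate, fileserver_cap):
    """Query-only fragmentation: linear scaling until the fileserver saturates.

    All rates are in the same arbitrary unit (e.g. queries per hour).
    """
    return min(n_nodes * per_node_rate, fileserver_cap)

# Hypothetical numbers: 10 units per node, fileserver tops out at 100 units.
for n in (1, 4, 8, 16):
    print(n, cluster_throughput(n, per_node_rate=10, fileserver_cap=100))
```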
The only problem with not fragmenting the database is that
- the database may not fit into memory, or
- you may need to buy more memory for *each* of the compute nodes. If
you spent that money on additional nodes instead, a Blast program that
can split the database might run faster on the larger cluster than a
Blast program that cannot split the database runs on the smaller
cluster.
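The trade-off can be put into a back-of-the-envelope comparison. This is only a sketch: the prices are hypothetical, near-linear scaling is assumed for query-only fragmentation, and the roughly 50% efficiency mentioned earlier in the thread is assumed for database splitting. Throughput is expressed in units of one node's throughput:

```python
def compare_options(n_nodes, budget, mem_upgrade_per_node, node_price,
                    eff_query_only=1.0, eff_db_split=0.5):
    """Return (throughput_a, throughput_b) for two ways to spend a budget.

    Option A: buy memory for every node so the whole database fits, then
              fragment only the queries (unusable if the budget is too small).
    Option B: buy extra nodes and use a Blast that also splits the database.
    """
    can_upgrade = budget >= n_nodes * mem_upgrade_per_node
    throughput_a = n_nodes * eff_query_only if can_upgrade else 0.0
    extra_nodes = int(budget // node_price)
    throughput_b = (n_nodes + extra_nodes) * eff_db_split
    return throughput_a, throughput_b

# Hypothetical prices: memory upgrade 1000 per node, new node 2000.
print(compare_options(16, budget=16000, mem_upgrade_per_node=1000, node_price=2000))
print(compare_options(16, budget=8000, mem_upgrade_per_node=1000, node_price=2000))
```

With these made-up prices the memory upgrade wins when it is affordable for all nodes, and the database-splitting program wins when it is not; different prices or efficiencies shift the balance.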
Ivo