[Bioclusters] mpiBLAST Performance

Wed, 14 May 2003 14:29:21 -0600

"Osborne, John" wrote:
> 
> >Each node will get one piece. If you fragment the database using the command
> >mpiblast -N 20 ...
> 
> I thought fragmenting was done only by mpiformatdb?

You are correct! My bad -- I mistyped (it should have been
"mpiformatdb -N ..."). Sorry about that.

> >you will get 21 fragments. When you run mpiblast, however, you should
> >provide 22 machines (21 workers + 1 master). If you specify fewer than
> >22 nodes, at least one node will have to process more than one fragment
> >(with the associated cost of copying the needed database fragment and
> >the accumulation of multiple database fragments on multiple nodes).
> >
> I'm not sure how you provide the master node exactly, I have just included
> mine making it node 0.  Why do you provide 22 machines for 21 fragments?

mpiblast will attempt to create one worker process (running on its own
node) for each database fragment, plus a single master process (also
running on its own node) that is responsible for distributing work to
the worker nodes and assembling the final output. Hence 21 fragments
call for 22 machines: 21 workers + 1 master.
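
To make the arithmetic concrete, here is a rough Python sketch of the
process count. The function names are hypothetical (they are not part of
mpiblast), and the even spread of fragments over workers is a simplifying
assumption rather than a description of how mpiblast actually assigns
fragments.

```python
import math


def processes_needed(num_fragments: int) -> int:
    """One worker per database fragment, plus one master."""
    return num_fragments + 1


def max_fragments_per_worker(num_fragments: int, num_processes: int) -> int:
    """Most fragments any one worker must search.

    Assumes (hypothetically) that fragments are spread as evenly as
    possible over the workers.
    """
    workers = num_processes - 1  # one process is reserved for the master
    if workers < 1:
        raise ValueError("need at least one worker besides the master")
    return math.ceil(num_fragments / workers)


print(processes_needed(21))              # 22
print(max_fragments_per_worker(21, 22))  # 1
print(max_fragments_per_worker(21, 16))  # 2 (only 15 workers)
```

With fewer than 22 processes, at least one worker ends up with two or
more fragments, which is where the extra copying cost comes from.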

> >Also, while not a factor when blasting against the nr database, shuffling
> >the nt database yields a substantial speed increase in blast searches (I
> >have obtained a 28% decrease in wall clock time for certain nucleotide
> >queries).
> >
> Shuffling?
> 

I should have been clearer here. By "shuffling" I mean randomizing the
order of the sequences in the database file in order to improve load
balancing. The running time of mpiblast is limited by the time it takes
for the slowest worker to finish its task (assuming one fragment per
worker). Since sequences in a particular database (nt, for instance) may
be ordered according to biological relevance (e.g. pathway, sequence
similarity, ...), a query sequence may generate a lot of hits against a
cluster of sequences in the database. This will slow down the worker
node that happens to have this cluster in its database fragment (and
limit the overall speed of mpiblast). To prevent clusters of similar
sequences from showing up in the same fragment, one can randomize the
order of sequences in the database.
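
As an illustration only, the shuffling step could look something like the
Python sketch below, run on the FASTA file before it is fragmented. This
is not the script I used; read_fasta and shuffle_fasta are hypothetical
helper names, and the sketch holds every record in memory, which is a
simplification that will not scale to a database the size of nt without
a lot of RAM.

```python
import random


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, seq_lines = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(seq_lines)))
            header, seq_lines = line, []
        elif line.strip() and header is not None:
            seq_lines.append(line.strip())
    if header is not None:
        records.append((header, "".join(seq_lines)))
    return records


def shuffle_fasta(text, seed=None, width=60):
    """Return FASTA text with the records in random order.

    Each sequence stays attached to its own header; only the order of
    the records changes. Sequences are re-wrapped at `width` columns.
    """
    records = read_fasta(text)
    random.Random(seed).shuffle(records)
    out = []
    for header, seq in records:
        out.append(header)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    demo = ">seq1\nACGTACGT\n>seq2\nGGGGCCCC\n>seq3\nTTTTAAAA\n"
    print(shuffle_fasta(demo, seed=42), end="")
```

Passing a fixed seed makes the shuffle reproducible, so the same
randomized database can be regenerated later if needed.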

Regards,

Jason

B-1 Div
Los Alamos National Lab