[Bioclusters] Re: Parallel BLAST

Glen Otero bioclusters@bioinformatics.org
09 Jun 2002 20:23:04 -0700


Sorry 'bout the late post, my messages were being returned...

There are several commercial parallel BLASTs out there:

1) Blackstone's PowerBLAST (part of PowerCloud),

(Sorry if this sounds too commercial, but I work for Blackstone)

PowerBLAST uses data parallelization: it automatically splits the
databases being queried into smaller chunks and spreads the chunks out
over the cluster nodes' local disks.  Each node then searches only its
own chunk, and since all the chunks are searched at the same time the
whole query finishes a lot sooner.  PowerBLAST also automates the
merging of BLAST results and uses disk caching and scheduling techniques
to speed up future queries of the same datasets.
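
For anyone who wants to see the general split-and-merge idea in code,
here is a minimal Python sketch.  It is an illustration only and not
PowerBLAST's implementation: the names split_records and merge_hits, the
greedy size balancing, and the (subject_id, bit_score) hit tuples are
all assumptions.  Hits are merged on bit score because, unlike E-values,
bit scores do not depend on the size of the chunk that was searched.

```python
import heapq


def split_records(records, n_chunks):
    """Split (name, sequence) records into n_chunks of similar total length.

    Hypothetical helper: longest sequences are placed first, each into
    whichever chunk is currently smallest.
    """
    chunks = [[] for _ in range(n_chunks)]
    heap = [(0, i) for i in range(n_chunks)]  # (total residues, chunk index)
    heapq.heapify(heap)
    for name, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        size, i = heapq.heappop(heap)
        chunks[i].append((name, seq))
        heapq.heappush(heap, (size + len(seq), i))
    return chunks


def merge_hits(per_chunk_hits, top_n=None):
    """Merge per-chunk hit lists of (subject_id, bit_score), best score first."""
    merged = [hit for hits in per_chunk_hits for hit in hits]
    merged.sort(key=lambda hit: hit[1], reverse=True)
    return merged if top_n is None else merged[:top_n]
```

In a real setup each chunk would be formatted as its own BLAST database
on a node's local disk, and the per-chunk hit lists would come from
parsing each node's BLAST output.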

2) TurboGenomics' TurboBLAST (more of a grid-like BLAST than a cluster
BLAST),

TurboBLAST is Java based and an extension of the Linda technology.
It breaks up the database and the query into slices, distributes them
over the nodes in a cluster, and does the merge for you.
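
Slicing both the query set and the database turns one big search into a
grid of small independent jobs.  A rough Python sketch of that
bookkeeping follows.  It is not TurboBLAST's code and says nothing about
how Linda schedules work: slice_list, make_tasks, assign_round_robin and
the round-robin placement are hypothetical names and choices made for
illustration.

```python
from collections import defaultdict
from itertools import product


def slice_list(items, n):
    """Split items into n contiguous slices of near-equal length."""
    base, extra = divmod(len(items), n)
    slices, start = [], 0
    for i in range(n):
        end = start + base + (1 if i < extra else 0)
        slices.append(items[start:end])
        start = end
    return slices


def make_tasks(queries, db_seqs, n_query_slices, n_db_slices):
    """Pair every query slice with every database slice."""
    return list(product(slice_list(queries, n_query_slices),
                        slice_list(db_seqs, n_db_slices)))


def assign_round_robin(tasks, nodes):
    """Hand tasks to nodes in turn (a deliberately naive placement)."""
    plan = defaultdict(list)
    for i, task in enumerate(tasks):
        plan[nodes[i % len(nodes)]].append(task)
    return dict(plan)
```

With 2 query slices and 3 database slices this yields 6 independent
jobs, and the results for each query then have to be merged across the
3 database slices.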


3) Paracel's BLAST Machine
  
Paracel actually got inside BLAST and parallelized the code.  Other than
SGI, they are the only folks I know of who have done this.  They post
impressive speedup numbers, and the statistics should be the same as
those of an unaltered BLAST query.

*******
In the words of Bill Pearson (author of FASTA), taken from a post to the
Beowulf list in response to a question about why there are no MPI or PVM
parallelized versions of BLAST:

I suspect that BLAST is not available for MPI/PVM because (1) it is
too fast, and (2) there is not much demand for it.  

95% of the time, BLAST is almost an in-memory grep (the other 5% of
the time it is working on the things it is looking for).  Sequence
comparison is embarrassingly parallel, and very easily threaded.
Distributing the sequence databases and collecting results has more
overhead (there probably aren't many distributed grep programs
either).  FASTA is 5 - 10X slower than BLAST, and Smith-Waterman is
another 5-20X slower than FASTA.  Here, the communications overhead is
low, and distributed systems work OK for FASTA, and great for
Smith-Waterman (where the overhead fraction is very small).

Of course, it is a lot easier to compile a threaded program, and just
run it, than it is to install and configure the MPI or PVM environment
and the programs to run in it.  Bioinformatics software is often run
by computer savvy biologists, not high-performance computing folks,
and not having to install and configure PVM/MPI is a big advantage.
The NCBI probably does not make a PVM/MPI parallel BLAST because there
is very little demand for it, and it does not meet their computational
needs.
*********
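To put rough numbers on the overhead argument quoted above, here is a
back-of-the-envelope Python sketch.  The relative costs come straight
from the quote (FASTA 5-10X slower than BLAST, Smith-Waterman another
5-20X slower than FASTA, so 25-200X slower than BLAST).  The assumptions
are that BLAST's compute time is 1 unit and that the distribution
overhead is the same fixed amount for all three programs, here also 1
unit, which is a made-up figure.

```python
# Compute cost relative to BLAST as (low, high), taken from the quoted ratios.
RELATIVE_COST = {
    "BLAST": (1, 1),
    "FASTA": (5, 10),
    "Smith-Waterman": (5 * 5, 10 * 20),
}


def overhead_fraction(compute, overhead):
    """Share of total run time spent on distribution overhead."""
    return overhead / (overhead + compute)


def fractions(overhead=1.0):
    """Return {program: (best_case, worst_case)} overhead fractions.

    overhead is an assumed fixed cost in units of BLAST compute time.
    """
    return {
        name: (overhead_fraction(high, overhead), overhead_fraction(low, overhead))
        for name, (low, high) in RELATIVE_COST.items()
    }


if __name__ == "__main__":
    for name, (best, worst) in fractions().items():
        print(f"{name}: {best:.1%} to {worst:.1%} of run time is overhead")
```

Under those assumptions the overhead share falls from half the run time
for BLAST to a few percent or less for Smith-Waterman, which is the
pattern Pearson describes.
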
Hope that helps.

Glen

-- 
Glen Otero, Ph.D.
Senior Life Science Consultant
Blackstone Computing
Phone:619.917.1772