[Bioclusters] blastall and SGE
Chris Dwan
bioclusters@bioinformatics.org
Wed, 29 Sep 2004 16:04:00 -0400
On Sep 29, 2004, at 3:22 PM, Juan Carlos Perin wrote:
> This is very disappointing considering a single G5 can search the NT
> database in under 3 minutes, while running on multiple nodes actually
> takes well over ten minutes.
This seems like a great opportunity to bring up the old parallel
computing saw:
Parallelizing a computational task adds overhead. In using multiple
CPUs on a single problem, you almost always end up doing more work than
you would have, had you just run the task on a single processor. The
parallel cost can include time spent in the scheduler, time spent
reading files from a shared fileserver, time spent partitioning the
target set, and the time of merging the results back together. At
least in BLAST, there is little to no interprocess communication to
slow things down, thank goodness.
The classic formulation is due to Gene Amdahl (1967). With an extra
term for the parallel overhead described above, it reads:
  Time to run on one CPU = serial_portion + parallelizable_portion
  Time to run on N CPUs  = serial_portion + (parallelizable_portion / N)
                           + parallel_cost(N)
Total work done increases, but the time to complete any single job
drops, as long as parallel_cost(N) stays small compared to the time
saved. Speedup is limited by the non-parallelizable portion of the
code, in this case partitioning the target and merging the results.
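To make the formula concrete, here is a minimal Python sketch of it. The function names, the linear form of parallel_cost(N), and all of the numbers are my own assumptions for illustration; they are not measurements from any real BLAST run.

```python
def parallel_time(serial, parallelizable, n_cpus, cost_per_cpu=0.0):
    """Wall-clock time on n_cpus under the model above.

    ASSUMPTION: parallel_cost(N) = cost_per_cpu * N for N > 1, and 0 for
    N = 1. Real overhead curves depend on the scheduler and fileserver.
    """
    if n_cpus < 1:
        raise ValueError("n_cpus must be at least 1")
    parallel_cost = cost_per_cpu * n_cpus if n_cpus > 1 else 0.0
    return serial + parallelizable / n_cpus + parallel_cost


def speedup(serial, parallelizable, n_cpus, cost_per_cpu=0.0):
    """Single-CPU time divided by the time on n_cpus."""
    one_cpu = parallel_time(serial, parallelizable, 1)
    return one_cpu / parallel_time(serial, parallelizable, n_cpus, cost_per_cpu)


if __name__ == "__main__":
    # Hypothetical job: 0.5 min serial, 2.5 min parallelizable,
    # 0.25 min of overhead per CPU used.
    for n in (1, 2, 4, 8, 16):
        t = parallel_time(0.5, 2.5, n, cost_per_cpu=0.25)
        s = speedup(0.5, 2.5, n, cost_per_cpu=0.25)
        print(f"N={n:2d}  time={t:.3f} min  speedup={s:.2f}x")
```

With these made-up inputs the run time bottoms out around N=4 and is worse than the single-CPU time by N=16, which is the shape of the problem described in the quoted message.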
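There are lots of exceptions to this rule. The big ones all occur at
points where performance as a function of problem size is
discontinuous. This usually happens when the memory requirements cross
a hardware boundary: Cache -> RAM -> Disk. For example, splitting a
target database into pieces that each fit in RAM can give a
better-than-linear speedup.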
Any time that tasks are trivially parallel (a large batch of input
files to be searched against the same target, for example) it will
almost always be more efficient (in terms of CPU-minutes spent on the
problem as a whole) to run each job as a single thread on a single CPU.
This is easier to implement (submit a bunch of jobs to the queuing
system), easier to tune (tune once, run everywhere), and easier to
debug.
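A rough Python sketch of the CPU-minutes argument, using the same timing model as earlier in this message. Everything here is assumed for illustration: the function names, the linear overhead term, the idea that jobs run in lockstep "waves" with no per-job scheduler delay, and the example numbers.

```python
import math


def job_wall_minutes(serial, parallelizable, cpus_per_job, cost_per_cpu):
    """Wall-clock minutes for one job spread over cpus_per_job CPUs.

    ASSUMPTION: overhead grows linearly with the CPUs used, and is zero
    for a single-CPU job.
    """
    cost = cost_per_cpu * cpus_per_job if cpus_per_job > 1 else 0.0
    return serial + parallelizable / cpus_per_job + cost


def batch_cost(n_queries, total_cpus, cpus_per_job,
               serial=0.5, parallelizable=2.5, cost_per_cpu=0.25):
    """Return (total CPU-minutes, batch wall-clock minutes).

    ASSUMPTION: jobs run in lockstep waves that fill the cluster, and the
    queuing system itself adds no delay between jobs.
    """
    if not 1 <= cpus_per_job <= total_cpus:
        raise ValueError("cpus_per_job must be between 1 and total_cpus")
    wall = job_wall_minutes(serial, parallelizable, cpus_per_job, cost_per_cpu)
    concurrent_jobs = total_cpus // cpus_per_job
    waves = math.ceil(n_queries / concurrent_jobs)
    cpu_minutes = n_queries * wall * cpus_per_job
    return cpu_minutes, waves * wall


if __name__ == "__main__":
    # Hypothetical batch: 1000 queries on a 16-CPU cluster.
    for cpus_per_job in (1, 4):
        cpu_min, wall_min = batch_cost(1000, 16, cpus_per_job)
        print(f"{cpus_per_job} CPU(s)/job: {cpu_min:.0f} CPU-min, "
              f"{wall_min:.2f} min to finish the batch")
```

Under these assumptions, one CPU per job finishes the batch sooner and burns fewer CPU-minutes, even though each individual 4-CPU job returns faster than a single-CPU one.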
The vast majority of the users of BLAST farms are more interested in
throughput than response time. They have thousands of query sequences,
and they want results for all of those queries.
There are some users who really want fast response time from BLAST. Most
users of the NCBI BLAST server fall in this category. Parallelized
BLAST is for these folks. The process of tuning a cluster to run a
single BLAST job as fast as it possibly can is non-trivial, as lots of
people on this list know.
So the question really comes down to "what do your users want, batch
throughput or response time?"
Chris Dwan
The BioTeam