[Bioclusters] Condor cluster and BLAST
Christopher Dwan
cdwan at bioteam.net
Wed Jan 25 09:51:40 EST 2006
Sandy,
First the nitty gritty:
--------------------------
* I recommend using the binary downloads at
ftp://ftp.ncbi.nih.gov/blast/. If computation is ever the limiting
factor on your system then switch over to custom binaries.
* Follow the README with regard to .ncbirc files and the location of
the substitution matrices (a minimal example follows this list).
* I usually install binaries and BLAST targets in an NFS-shared
directory. This saves me the trouble of updating binaries on all the
nodes if anything changes. When access to the datasets becomes the
performance-limiting factor (if it ever does), I rig a system as
described below.
* Pull down a couple of pre-formatted targets from NCBI
(ftp://ftp.ncbi.nih.gov/blast/db) to demonstrate functionality (see
the fetch sketch after this list). Then schedule a conversation with
your users about what target sets they actually want.
* If response time on single queries is ever the limiting factor on
your system, there are many parallel BLAST solutions available. If
it becomes something that people are willing to spend money on, there
are also some really impressive hardware accelerators out there.
Don't worry about either of these unless you have a demonstrated need
for them.
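
To make the .ncbirc point concrete: for the legacy blastall binaries,
a minimal .ncbirc in each user's home directory just needs to point
at the directory that holds the substitution matrices. The path below
is a placeholder; check the README for the exact keys your release
expects.

  [NCBI]
  Data=/shared/blast/data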
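
And here is a rough sketch (in Python) of pulling one pre-formatted
target down from the NCBI FTP site and unpacking it into a shared
directory. The archive name and destination path are placeholders,
not recommendations; substitute whatever your users settle on.

#!/usr/bin/env python
# Rough sketch: fetch one pre-formatted BLAST target from the NCBI FTP
# site and unpack it into a shared directory.  The archive name and
# the destination path are placeholders.

import ftplib
import os
import tarfile

FTP_HOST = "ftp.ncbi.nih.gov"
FTP_DIR = "blast/db"
ARCHIVE = "swissprot.tar.gz"      # placeholder target
DEST_DIR = "/shared/blast/db"     # assumed NFS-shared directory

def fetch_and_unpack():
    local_archive = os.path.join(DEST_DIR, ARCHIVE)

    # Pull the archive down over anonymous FTP, in binary mode.
    ftp = ftplib.FTP(FTP_HOST)
    ftp.login()
    ftp.cwd(FTP_DIR)
    with open(local_archive, "wb") as out:
        ftp.retrbinary("RETR " + ARCHIVE, out.write)
    ftp.quit()

    # Unpack the formatted files (.phr/.pin/.psq and friends) in place.
    tar = tarfile.open(local_archive, "r:gz")
    tar.extractall(DEST_DIR)
    tar.close()

if __name__ == "__main__":
    fetch_and_unpack()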
More detail:
-----------------
Installing and tuning BLAST is a very broad topic with lots of
history and strong opinions surrounding it. Here are some general
thoughts:
BLAST is I/O bound on large target sets. The very best thing you can
do to improve BLAST performance is to make sure that you have
sufficient RAM on each compute node to hold the index files for your
target sets. Second to that, fast local disk on the nodes is a big
help. I've had great luck with software RAID across two internal disks.
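
If you want a quick sanity check on the RAM point, something along
these lines compares the on-disk size of a formatted target against a
node's physical memory. The paths and target name are placeholders,
and the memory check assumes a Linux node.

#!/usr/bin/env python
# Rough sketch: compare the on-disk size of a formatted BLAST target
# against a node's physical RAM, as a sanity check that the target can
# stay resident in the page cache.  Paths and target name are
# placeholders; the MemTotal parsing assumes a Linux node.

import glob
import os

DB_DIR = "/shared/blast/db"    # assumed location of formatted targets
TARGET = "nr"                  # assumed target name

def target_bytes(db_dir, target):
    # Sum every file belonging to the target (nr.phr, nr.pin, nr.psq, ...).
    return sum(os.path.getsize(f)
               for f in glob.glob(os.path.join(db_dir, target + ".*")))

def node_ram_bytes():
    # MemTotal is reported in kB in /proc/meminfo.
    for line in open("/proc/meminfo"):
        if line.startswith("MemTotal:"):
            return int(line.split()[1]) * 1024
    raise RuntimeError("could not read MemTotal")

if __name__ == "__main__":
    db = target_bytes(DB_DIR, TARGET)
    ram = node_ram_bytes()
    print("target: %.1f GB, node RAM: %.1f GB" % (db / 1e9, ram / 1e9))
    if db > ram:
        print("target will not fit in RAM; expect to be I/O bound")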
Once the above are met, the next bottleneck will be getting the
target set from shared storage out to the nodes. Most people who are
building a serious BLAST farm set up some way to synchronize the
commonly used targets out to the local disk on the nodes. That
raises the question of how to push a data update without disrupting
running jobs. For small installations, this is most simply handled
with sociology rather than technology.
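
For what it's worth, one simple way to stage an update is to copy the
new version into a datestamped directory on the node's local disk and
then atomically repoint a "current" symlink, so jobs that already
have the old files open keep their handles while new jobs pick up the
new copy. A sketch, with placeholder paths:

#!/usr/bin/env python
# Rough sketch: push a target update to a node's local disk without
# pulling files out from under running jobs.  Copy the new version
# into a datestamped directory, then atomically repoint a "current"
# symlink.  Paths are placeholders.

import os
import shutil
import time

SHARED = "/shared/blast/db"    # assumed NFS master copy
LOCAL = "/scratch/blast"       # assumed local scratch on the node

def push_update():
    stamp = time.strftime("%Y%m%d-%H%M%S")
    staging = os.path.join(LOCAL, "db-" + stamp)

    # Copy the whole target directory into a fresh, versioned directory.
    shutil.copytree(SHARED, staging)

    # Atomically swap the "current" pointer over to the new copy.
    tmp_link = os.path.join(LOCAL, ".current.tmp")
    current = os.path.join(LOCAL, "current")
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(staging, tmp_link)
    os.rename(tmp_link, current)   # rename over a symlink is atomic on POSIX

if __name__ == "__main__":
    push_update()

BLAST command lines would then reference /scratch/blast/current/...
rather than a particular datestamped copy.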
You're unlikely to achieve super-linear speedups by parallelizing
BLAST across the nodes. Parallel BLAST solutions are great for
improving response time on a single query, but most users
(particularly those with command line access to a cluster) are not
interested in just running a single query. The most common use case
is the user with thousands of independent queries all to be run as a
batch. The most effective way to get this sort of job done is
one-job-per-CPU. As many folks have pointed out, this is
"high-throughput" computing rather than "high-performance" per se.
BLAST targets need to be freshened and updated on a regular basis.
This requires some sort of agreement with the users as to their
expectations. If nobody plans to use the WGS dataset, that's around
56GB of disk space and network bandwidth that can be saved. Some
datasets (NR, NT, etc) are published every few months with daily
updates in between. Others have different schedules. Figure out
what your users need before building a system to try to support it.
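
If it helps that conversation, it is easy to total up what a given
set would cost to mirror before anyone commits to it. This sketch
just sums the sizes of matching archives under blast/db on the NCBI
FTP site; the pattern is a placeholder.

#!/usr/bin/env python
# Rough sketch: total up what a given dataset would cost to mirror
# before committing the disk space and bandwidth.  Lists matching
# archives under blast/db on the NCBI FTP site and sums their sizes;
# the pattern is a placeholder.

import ftplib

def remote_gb(pattern):
    ftp = ftplib.FTP("ftp.ncbi.nih.gov")
    ftp.login()
    ftp.cwd("blast/db")
    ftp.voidcmd("TYPE I")   # SIZE generally wants binary mode
    names = [n for n in ftp.nlst() if pattern in n]
    total = sum(ftp.size(n) or 0 for n in names)
    ftp.quit()
    return total / 1e9

if __name__ == "__main__":
    print("approx download for 'wgs': %.1f GB" % remote_gb("wgs"))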
Good luck!
-Chris Dwan