[Bioclusters] Grid BLAST

Sat Sep 24 14:40:03 EDT 2005

> Hi, I'm the administrator of the bioinformatics laboratory at Université
> du Québec à Montréal.  I have a room filled with dual P4 3GHz
> workstations.  The boxen are dual booted with Windows and GNU/Linux but
> they spend most of their time on GNU/Linux.  Each box has 2 GB of RAM
> so I expected decent performance with local BLAST jobs but the sad
> truth is that my jobs run about 4 times slower with blast2 than
> with blastcl3 with the same parameters.  The hard drive is IDE so I
> suspect a bottleneck here.
Make sure the IDE drivers are configured to use DMA I/O.  If repeat
searches of a database are just as slow as the first search, though,
experience indicates that the problem is too little free memory to cache
the database files.  Database file caching is a tremendous benefit for
blastn searches.  If your jobs use too much heap memory, no memory may be
left for file caching.
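
One way to run the repeat-search test is to time the same command twice in
a row.  The Python sketch below does only that; the helper names are made up
for illustration, and the command you pass in is your own usual local BLAST
command line.

```python
import subprocess
import time


def time_repeated_runs(command, repeats=2):
    """Run `command` (a list of arguments) `repeats` times.

    Returns the elapsed wall-clock seconds of each run, in order.
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        # Discard the program's output; only the elapsed time matters here.
        subprocess.run(command, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, check=True)
        timings.append(time.perf_counter() - start)
    return timings


def speedup_of_repeat(timings):
    """First-run time divided by the fastest later run.

    A value near 1.0 suggests the database files were not cached
    between runs.
    """
    return timings[0] / min(timings[1:])
```

If the ratio stays close to 1.0 on an otherwise idle machine, the database
is probably being re-read from disk on every search.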

>  Strangely if I set the number of threads
> to 2 or 3 with -a my jobs run slower.
Use of more threads requires more working (heap) memory for the search,
making less memory available to cache database files.  If the database files
aren't cached, more threads means more terribly slow disk head seeking as
the different threads request different pieces of the database.  If heap
memory expands beyond the physical memory available, the system will thrash.
With WU-BLAST, multiple threads are used by default, but if memory is found
to be limiting, the program automatically reduces the number of threads
employed, to avoid thrashing.
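
The trade-off between threads and file cache can be put as rough
arithmetic.  The sketch below assumes heap use grows linearly with the
thread count, which is a simplification and not how WU-BLAST itself decides;
the function name and the numbers in the comments are purely illustrative,
and memory used by the OS and by other processes is ignored.

```python
def max_threads_keeping_cache(physical_mb, database_mb, heap_per_thread_mb, cpus):
    """Largest thread count whose heap still leaves room to cache the database.

    Assumes (hypothetically) that each thread needs `heap_per_thread_mb` of
    heap.  Always returns at least 1, since a search needs one thread.
    """
    if heap_per_thread_mb <= 0:
        raise ValueError("heap_per_thread_mb must be positive")
    # Memory left over once the database files are held in the file cache.
    spare_mb = physical_mb - database_mb
    affordable = int(spare_mb // heap_per_thread_mb)
    return max(1, min(cpus, affordable))


# Illustrative only: 2048 MB of RAM, a 1500 MB database, 400 MB per thread.
# spare = 548 MB, so only one thread fits without evicting cached files.
```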

For WU-BLAST, the nucleotide sequence database files that are most important
to cache are the compressed sequence file and the table file, which have
extensions .xns and .xnt.
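
To see whether those two files could fit in the file cache, add up their
sizes and compare the total with the memory you expect to be free during a
search.  This is a minimal sketch that assumes a single-volume database
whose files share one path prefix; the function names are hypothetical, and
the free-memory figure is something you supply yourself.

```python
import os

# The two WU-BLAST nucleotide database files that matter most for caching.
CACHE_CRITICAL_EXTENSIONS = (".xns", ".xnt")


def cache_critical_bytes(db_prefix):
    """Total size in bytes of db_prefix.xns and db_prefix.xnt.

    Files that do not exist are skipped rather than treated as errors.
    """
    total = 0
    for ext in CACHE_CRITICAL_EXTENSIONS:
        path = db_prefix + ext
        if os.path.exists(path):
            total += os.path.getsize(path)
    return total


def fits_in_cache(db_prefix, free_bytes):
    """True if the cache-critical files are no larger than `free_bytes`."""
    return cache_critical_bytes(db_prefix) <= free_bytes
```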

> Do you think there is a way to link those machines together in order
> to get better performance?  I think that I can't use something like
> mpiBlast because there is always a risk that a node will be rebooted
> into Windows, making a part of the database suddenly unavailable.

If the system is not in a thrashing state due to its heap memory
requirements (or the heap memory requirements of other concurrent
processes), segmentation of the database can permit nodes to cache their
assigned portion of the database.  Only the first search is slow then.
Subsequent searches utilize the cached copy of the database segment.  With
WU-BLAST, the database can be segmented dynamically at run time for each
compute node, using the dbslice option.  (See
http://blast.wustl.edu/blast/parameters.html#dbslice).  By distributing each
job across multiple nodes and assigning the same slice to each node for
every job, you'll be able to take advantage of file caching.  If nodes come
and go from the cluster, just re-assign slices -- no need to re-format the
database.
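
The slice bookkeeping described above could be sketched as follows.  This is
one possible scheme and not part of WU-BLAST: the database is split into a
fixed number of slices, each slice stays on the node that already has it
cached, and slices whose node has disappeared go to the least-loaded
survivor.  The function names are hypothetical, and the `dbslice=i/n` option
form is my assumption, so check it against the dbslice documentation linked
above before relying on it.

```python
def assign_slices(n_slices, nodes, previous=None):
    """Map slice numbers 1..n_slices to node names.

    Slices keep their node from `previous` when that node is still in
    `nodes`, so its cached copy is reused.  Remaining slices are given to
    whichever node currently holds the fewest slices.
    """
    alive = list(nodes)
    if not alive:
        raise ValueError("at least one node is required")
    previous = previous or {}
    load = {node: 0 for node in alive}
    assignment = {}
    orphans = []
    for slice_number in range(1, n_slices + 1):
        node = previous.get(slice_number)
        if node in load:
            assignment[slice_number] = node
            load[node] += 1
        else:
            orphans.append(slice_number)
    for slice_number in orphans:
        # Ties go to the node listed first in `nodes`.
        node = min(alive, key=lambda name: load[name])
        assignment[slice_number] = node
        load[node] += 1
    return assignment


def dbslice_option(slice_number, n_slices):
    """Format the command-line option for one slice (assumed i/n form)."""
    return f"dbslice={slice_number}/{n_slices}"
```

A node that inherits extra slices will search them slowly the first time,
and it only benefits afterwards if it has enough free memory to cache all
of the slices it now holds.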

More information about BLAST memory use is available at
http://blast.wustl.edu/blast/Memory.html.

--Warren