[Bioclusters] Re: When are diskless compute nodes inappropriate?

Joseph Landman bioclusters@bioinformatics.org
15 Jul 2003 13:11:25 -0400


On Tue, 2003-07-15 at 12:27, Nicholas Henke wrote:

[...]

> The one 'practical' situation we see here is on our Genomics cluster,
> where they are running BLAST on very large data sets. It makes an
> extremely large difference to copy the data to a local drive and use
> that rather than to access the data via NFS.

One thing you can do is segment the database (use the -v switch on
formatdb), or, if you don't care about the absolute E-values being
correct relative to your real database size, you could pre-segment the
database with a tool such as our segment.pl at
http://scalableinformatics.com/downloads/segment.pl .  Most of the
disk-access cost of large BLAST jobs comes from the way BLAST mmaps the
indices: if the indices overflow available memory, you spend your time
in disk IO paging them back in as you walk through them, which lowers
your overall performance.
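
For example, a formatdb run along these lines splits the database into
volumes and writes an alias file that blastall picks up automatically
(the database name and the 2000-million-letter volume size here are
just for illustration; pick a volume size that fits comfortably in
node RAM):

  # split nr into ~2-Gletter volumes; formatdb writes nr.00.*, nr.01.*,
  # ... plus an nr.pal alias file that blastall can be pointed at
  formatdb -i nr -p T -v 2000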

Regardless of the segmentation, it is rarely a good idea (except for
very small databases) to serve the databases over NFS during the
computation.  Even if they are small, you will run into network
congestion very quickly with any reasonable number of compute nodes.
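
One common pattern (sketched here with made-up host and path names) is
to stage the formatted database onto local scratch in the job's wrapper
script and then point BLAST at the local copy:

  # copy the formatted database files to local scratch once per node
  rsync -a dbserver:/export/blastdb/nr.* /scratch/blastdb/
  # run against the local copy instead of the NFS mount
  blastall -p blastp -d /scratch/blastdb/nr -i query.fa -o query.out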


Of course this gets into the problem of moving the databases out to the
compute nodes.  We are working on a neat solution to that data motion
problem.  To avoid annoying everyone, please take it off-list if you
want to talk to us about it.  Email/phone in the .sig.
 
-- 
Joseph Landman, Ph.D
Scalable Informatics LLC
email: landman@scalableinformatics.com
  web: http://scalableinformatics.com
phone: +1 734 612 4615