Duzlevski, Ognen wrote:
> Hi all,
>
> we have a 40 node cluster (2 CPUs each) and a cluster master that has
> attached storage over fibre, pretty much a standard thingie.
>
> All of the nodes get their shared space from the cluster master over
> NFS. I have a user who has set up an experiment that fragmented a
> database into 200,000 files which are then being blasted against the
> standard NCBI databases, which reside on the same shared space on the
> cluster master and are visible on the nodes (he basically rsh-es into
> all the nodes in a loop and starts jobs). He could probably go about
> his business in a better way, but for the sake of optimizing the
> setup I am actually glad that testing is being done the way it is.
>
> I noticed that the cluster master itself is under heavy load (it is a
> 2 CPU machine), and most of the load comes from the nfsd threads
> (kernel-space NFS is used).
>
> Are there any usual tricks or setup models used in setting up
> clusters? For example, all of my nodes mount the shared space with
> rw,async,rsize=8192,wsize=8192 options. How many nfsd threads usually
> run on a master node? Any advice as to the location of the NCBI
> databases vs. shared space? How would one go about
> measuring/observing for the bottlenecks?

Hi Ognen,

There are many people on this list who have similar setups and have worked around NFS-related bottlenecks in various ways, depending on the complexity of their needs.

One easy way to avoid NFS bottlenecks is to realize that BLAST is _always_ going to be performance-bound by IO speed, and that access to local disk is generally going to be far faster than your NFS connection. Done right, local IDE drives in a software RAID configuration can get you better speeds than a direct GigE connection to a NetApp filer or Fibre Channel SAN.
Another way to put this: you will NEVER (well, not without exotic storage hardware) be able to build an NFS fileserver that cannot be swamped by lots of cheap compute nodes doing long sequential reads against network-mounted BLAST databases. You need to engineer around the NFS bottleneck that is slowing you down.

All you need to do is have enough local disk in each of your compute nodes to hold all (or some) of your BLAST datasets. The idea is to use the NFS-mounted BLAST databases only as a 'staging area' for rsync'ing or copying your files to scratch or temp space on the compute nodes. Given the low cost of 40-80GB IDE disk drives, this is a quick and easy way to get around NFS-related bottlenecks. Each search can then run against local disk on each compute node rather than all nodes hitting the NFS fileserver and beating it to death... This is what most BLAST farm operators do as a "first pass" approach. It works very well and is pretty much standard practice these days.

The "second pass" approach is more complicated. It involves splitting your BLAST datasets into RAM-sized chunks, distributing them across the nodes in your cluster, and then multiplexing each query across all the nodes to get faster throughput. This is harder to implement and is useful only for long queries against big databases, since there is a certain amount of overhead in merging the multiplexed query results back into one human- or machine-parsable file. People implement the 'second pass' approach only when they really need to, usually in shops where pipelines constantly repeat the same big searches over and over again.

My $.02 of course

-Chris
www.bioteam.net
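The 'staging area' idea above can be sketched as a per-node script. The paths, the "nr" database name, and the blastall invocation are illustrative assumptions, and the sketch fakes the NFS side with mktemp so it runs end to end; on a real node you would point NFS_DB at your shared mount and SCRATCH at a local disk:

```shell
#!/bin/sh
# Sketch: stage a BLAST database from NFS-mounted shared space to local
# scratch, then search against the local copy instead of the NFS server.
# Illustrative assumptions: on a real node you would set something like
#   NFS_DB=/shared/blastdb   SCRATCH=/scratch/blastdb   DB=nr
NFS_DB=$(mktemp -d)            # stand-in for the NFS-mounted staging area
SCRATCH=$(mktemp -d)/blastdb   # stand-in for local IDE/RAID scratch space
DB=nr

# Fake formatdb output files so this sketch is runnable end to end.
touch "$NFS_DB/$DB.nhr" "$NFS_DB/$DB.nin" "$NFS_DB/$DB.nsq"

mkdir -p "$SCRATCH"

# rsync only copies what changed, so re-staging after a database update
# is cheap; fall back to a plain cp if rsync is not installed.
rsync -a "$NFS_DB/$DB".* "$SCRATCH/" 2>/dev/null \
    || cp "$NFS_DB/$DB".* "$SCRATCH/"

# The actual search then hits local disk, not the NFS fileserver:
#   blastall -p blastn -d "$SCRATCH/$DB" -i query.fa -o query.out
echo "staged $DB to $SCRATCH"
```

Run once per node (or out of your rsh loop) before kicking off the searches; repeated runs are nearly free because rsync skips files that have not changed.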
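The 'second pass' split-and-merge idea can be sketched the same way. The awk round-robin split, the two-chunk count, and the faked tabular hit files below are illustrative assumptions standing in for real per-node formatdb/blastall runs:

```shell
#!/bin/sh
# Sketch: split one big FASTA database into chunks, search each chunk on
# its own node, then merge the per-chunk hits into one ranked result.
WORK=$(mktemp -d)
cd "$WORK"

# Toy database standing in for a big NFS-hosted FASTA file.
cat > db.fa <<'EOF'
>seq1
ACGT
>seq2
GGCC
>seq3
TTAA
>seq4
CCGG
EOF

# Round-robin the FASTA records across two chunk files (chunk0.fa, chunk1.fa).
awk '/^>/ { n++ } { print > ("chunk" (n % 2) ".fa") }' db.fa

# On a real cluster each chunk would be formatdb'd and searched on a
# different node, e.g.:
#   blastall -p blastn -d chunk0 -i query.fa -m 8 -o chunk0.hits
# Faked tabular hits (subject, bit score) stand in for those runs:
printf 'seq2\t75.0\nseq4\t8.5\n'  > chunk0.hits
printf 'seq1\t50.0\nseq3\t12.0\n' > chunk1.hits

# The merge step: concatenate and re-rank by score so the multiplexed
# results come back as one machine-parsable file.
sort -k2,2nr chunk*.hits > merged.hits
head -1 merged.hits
```

Note this merge only re-sorts by score; a production pipeline would also need to correct E-values for the full database size, which is part of the overhead that makes the second pass worth doing only for big, repeated searches.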