[Bioclusters] blast and nfs

Joseph Landman bioclusters@bioinformatics.org
21 Apr 2003 15:55:52 -0400


Hi Ognen:

On Mon, 2003-04-21 at 14:58, Duzlevski, Ognen wrote:
> Hi all,
> 
> we have a 40 node cluster (2 cpus each) and a cluster master that has
> attached storage over fibre, pretty much a standard thingie.

Bottleneck #1...

> All of the nodes get their shared space from the cluster master over
> nfs. I have a user who has set-up an experiment that fragmented a

Bottleneck #2.

[...]

> Are there any usual tricks or setup models utilized in setting up
> clusters? For example, all of my nodes mount the shared space with
> rw/async/rsize=8192,wsize=8192 options. How many nfsd threads usually
> run on a master node? Any advice as to the locations of NCBI databases
> vs. shared space? How would one go about measuring/observing for the
> bottlenecks?

Local disk space via the local IDE channels on a 40-node cluster has a
higher aggregate bandwidth than the NAS.  Assuming old, slow disks
streaming at 15 MB/s, 40 of these running in parallel give you 600 MB/s
of IO capacity (non-blocking at that).  Your NAS device gives you in
theory 200 MB/s, though in practice it is likely to be less.  If you
use more modern IDE drives that can stream at 30 MB/s, you will have
about a 6:1 bandwidth advantage for local IO against the theoretical
NAS number (more like 12:1 against what the NAS will likely deliver)
...

... and that would be true if the NAS were the weakest link in the
chain.  It is not; the network is.  If you are lucky and using a
gigabit network, then you have about a 100 MB/s connection to the head
node.  No matter how fast that disk array is, you still have only 100
MB/s into the head node.  As you increase the size of the cluster, the
share of that pipe available to each node drops as 1/N (N = number of
nodes).  This is the 1/N effect I occasionally talk about.  In this
model your scalability drops as you increase the number of nodes or the
remote disk utilization.
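
A quick back-of-the-envelope check on the numbers above, as a few lines
of Python (the figures are the rough assumptions in this post, not
measurements from your cluster):

    # All bandwidth figures in MB/s, per the assumptions above
    nodes = 40
    disk  = 15       # old IDE disk, streaming
    nas   = 200      # theoretical array bandwidth at the head node
    gige  = 100      # practical gigabit link into the head node

    local_aggregate = nodes * disk          # 600 MB/s across the cluster
    per_node_share  = gige / float(nodes)   # ~2.5 MB/s each: the 1/N effect

    print("aggregate local IO : %d MB/s" % local_aggregate)
    print("per-node NFS share : %.1f MB/s" % per_node_share)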

OK, so now that we know where the problem is, what can be done?

1) pre-distribute the databases to local disk on each node,
2) do only local IO during the run, and
3) combine results at the end of the run, or of batches of runs
   (a rough sketch follows below).
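
A minimal sketch of that pattern, assuming rsync over ssh to each node
and the standard NCBI blastall; the node names, paths, and database
below are placeholders for illustration, not a prescription for your
cluster:

    # Pre-distribute the database, run locally, gather results.
    # All names and paths are illustrative assumptions.
    import subprocess

    nodes  = ["node%02d" % i for i in range(1, 41)]
    shared = "/shared/db/"       # NFS-exported copy on the head node
    local  = "/scratch/db/"      # local IDE disk on each compute node

    # 1) push the database to local disk, once, outside the runs
    for n in nodes:
        subprocess.call(["rsync", "-a", shared, "%s:%s" % (n, local)])

    # 2) each node then runs against its local copy, e.g.
    #      blastall -p blastp -d /scratch/db/nr -i query.fa -o query.out
    #    so no NFS traffic is generated while the jobs run.

    # 3) pull the (comparatively small) result files back afterwards
    for n in nodes:
        subprocess.call(["rsync", "-a",
                         "%s:/scratch/results/" % n,
                         "/shared/results/%s/" % n])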

Increasing the number of nfsd threads probably will not help.
Increasing the read and write sizes (rsize/wsize) may help in a few
cases, but not likely this one.
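
For the few cases where larger transfer sizes do help, and assuming the
server and clients both speak NFSv3 over TCP, a mount line on a compute
node might look something like this (the hostname is illustrative):

    head:/shared  /shared  nfs  rw,hard,intr,tcp,rsize=32768,wsize=32768  0 0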

You might look at dividing the network access to the head node across
multiple network adapters.  You will need different IPs for them, and a
specific network fabric design to enable this, not to mention somewhat
more complex mounting/routing setup.  But it can be done.  What you are
doing here is postponing the problem rather than solving it: the
bottleneck moves back onto the PCI bus of the head node.  You will
still be network bound, and the machines will still run sluggishly, but
likely less so than before.
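
As a sketch of what that split might look like, with two gigabit NICs
in the head node on separate switch fabrics (all addresses and node
ranges are made up for illustration):

    # head node: eth1 = 10.0.1.1 (nodes 1-20), eth2 = 10.0.2.1 (nodes 21-40)
    #
    # /etc/fstab on nodes 1-20:
    10.0.1.1:/shared  /shared  nfs  rw,rsize=8192,wsize=8192  0 0
    # /etc/fstab on nodes 21-40:
    10.0.2.1:/shared  /shared  nfs  rw,rsize=8192,wsize=8192  0 0

Each half of the cluster then talks to a different interface, so the
two gigabit links share the load, at least until the head node's PCI
bus becomes the limit, as noted above.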

Joe

-- 
Joseph Landman, Ph.D
Scalable Informatics LLC
email: landman@scalableinformatics.com
  web: http://scalableinformatics.com
phone: +1 734 612 4615