[Bioclusters] NFS / SMC switches / GFS

Sun Aug 28 21:10:51 EDT 2005

Hello, we have recently added nodes to our cluster, which
is now 63 nodes, all of which are Sun v20zs with dual Opterons,
2 GB RAM, and SCSI disks. The master is a v40z with 8 GB RAM,
quad Opterons, an LSI controller, two 146 GB Ultra 320 SCSI drives,
and an additional three 73 GB Ultra 320 drives. We are using N1 Grid
Engine 6 to dole out work and have /home (ext3) exported over NFS to
the worker nodes. We have two SMC8648Ts as managed switches. Red Hat
AS 3 is the OS. We are using the tg3 driver for the Broadcom NICs and
have an extra 4 NICs courtesy of an expansion card (right now unused).

/home is ext3 on a single Ultra 320 146 GB drive attached to an LSI
controller, and we will move to a striped volume as soon as we get
the LSI BIOS config to give us something other than RAID 1 or 1E
(LSI's "extended RAID", basically a mirror with multiple drives).

The primary apps are Genesis (synaptic simulation) and
some homegrown C++ code. I/O is about 4-5 MB per job, with
about 100 jobs running at any given time and an average
run time of about 2 hours per job. We expect to add BLAST
and some rendering jobs later on. I have users use local spool
where possible to avoid writes and reads to the NFS export, but
there is no guarantee that they always will.

Now that we have all 63 up and running, it looks like we are
getting performance issues with NFS, much in the same way
that others have reported here. Even moderate job loads
produce trouble: nfsstat -c shows lots of retransmissions, and
Grid Engine execds don't report back in, so qhost shows nodes as not
responding, though eventually they return. On occasion one of
the switches stops and that whole "side" of the cluster disappears,
so we reboot the switch and are back in action. Anyway, here are my
questions (thanks for your patience in reading through this).
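
For anyone who wants to compare numbers, the retransmission rate can be
worked out from the client RPC counters roughly as in the Python sketch
below. It assumes the nfsstat -c layout in which a "calls retrans ..."
header line is followed by a line of counters; the function name and
the sample figures are made up for illustration and are not output
from our cluster.

```python
def retrans_rate(nfsstat_text):
    """Return client RPC retransmissions as a fraction of total calls."""
    # Split every non-empty line into whitespace-separated fields.
    lines = [ln.split() for ln in nfsstat_text.splitlines() if ln.strip()]
    # Look for the header row and read the counters from the row after it.
    for header, values in zip(lines, lines[1:]):
        if header[:2] == ["calls", "retrans"]:
            calls, retrans = int(values[0]), int(values[1])
            return retrans / calls if calls else 0.0
    raise ValueError("no 'calls retrans' header found")


# Hypothetical sample, shaped like the client RPC section of nfsstat -c.
SAMPLE = """\
Client rpc stats:
calls      retrans    authrefrsh
200000     5000       0
"""

print(f"{retrans_rate(SAMPLE):.1%}")  # prints 2.5%
```

Running this on each worker before and after a tuning change gives a
single number to compare, rather than eyeballing raw counters.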

Has anyone had similar problems with these SMC switches?
I'm not accustomed to having switches die like this.

In terms of improving NFS performance, I've already
put the SGE spool onto the local nodes to try to improve things,
but that only helps a little. There are various NFS tuning
documents with respect to clusters (using tcp, atime, rsize,
wsize, etc. options to mount). I've experimented with a few of
these (rsize, wsize), though with only very marginal positive impact.
For those with larger clusters and similar issues, have you found
a subset of these options to be more influential than others?
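
For concreteness, here is a sketch of how the options named above
combine into a client /etc/fstab entry. The helper fstab_line is
hypothetical, and the hostname "master" and the 32768 values are
placeholders for illustration only. They are not our actual settings
and not a recommendation.

```python
def fstab_line(server, export, mountpoint, **opts):
    """Build an NFS /etc/fstab entry from keyword mount options."""
    # Boolean options become bare flags (tcp); others become key=value.
    flags = [k if v is True else f"{k}={v}" for k, v in opts.items()]
    return f"{server}:{export} {mountpoint} nfs {','.join(flags)} 0 0"


# Placeholder host and values; only the option names come from the post.
print(fstab_line("master", "/home", "/home",
                 tcp=True, noatime=True, rsize=32768, wsize=32768))
# prints: master:/home /home nfs tcp,noatime,rsize=32768,wsize=32768 0 0
```

Generating the line this way makes it easy to try one option at a time
and keep a record of which combination each test run used.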

One scenario that has been discussed is bonding two NICs
on the v40z in conjunction with switch trunking. Does anyone
have any opinions or ideas on this? Lastly, is it even worth
it to keep messing with NFS, or should we maybe go for GFS?