Hello, we have recently added nodes to our cluster, which is now 63 nodes, all Sun v20zs: dual Opterons, 2 GB RAM, and SCSI disks. The master is a v40z with 8 GB RAM, quad Opterons, an LSI controller, two 146 GB Ultra320 SCSI drives, and an additional three 73 GB Ultra320s. We are using N1 Grid Engine 6 to dole out work and have /home (ext3) exported over NFS to the worker nodes. We have two SMC8648Ts for managed switches. Red Hat AS 3 is the OS. We are using the tg3 driver for the Broadcom NICs and have an extra four NICs courtesy of an expansion card (currently unused). /home is ext3 on a single 146 GB Ultra320 drive on the LSI controller; we will move to a striped volume as soon as we can get the LSI BIOS config to give us something other than RAID 1 or 1E (LSI's "extended RAID", basically a mirror across multiple drives).

The primary apps are Genesis (synaptic simulation) and some homegrown C++ code. I/O is about 4-5 MB per job, with about 100 jobs running at any given time and an average run time of about 2 hours per job. We expect to add BLAST and some rendering jobs later on. I have users use local spool where possible to avoid reads and writes to the NFS export, but there's no guarantee they always do.

Now that we have all 63 nodes up and running, we are seeing NFS performance issues much like others have reported here. Even moderate job loads produce trouble: nfsstat -c shows lots of retransmissions, and Grid Engine execds don't report back in, so qhost shows nodes as not responding, though they eventually return. On occasion one of the switches stops and that whole "side" of the cluster disappears; we reboot the switch and are back in action.

Anyway, here are my questions (thanks for your patience in reading through this):

1. Has anyone had similar problems with these SMC switches? I'm not accustomed to having switches die like this.

2. In terms of improving NFS performance, I've already put the SGE spool onto the local nodes, but that only helps a little. There are various NFS tuning documents for clusters (using the tcp, atime, rsize, wsize, etc. options to mount). I've experimented with a few of these (rsize, wsize), though with only marginal positive impact; a sketch of the sort of mount line I mean is below. For those with larger clusters and similar issues, have you found a subset of these options to be more key or influential than others?

3. One scenario that has been discussed is bonding two NICs on the v40z in conjunction with switch trunking (rough sketch below as well). Does anyone have any opinions or ideas on this?

4. Lastly, is it even worth it to keep messing with NFS, or should we just go for GFS?
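
For question 2, this is roughly the kind of client mount line I mean; the values are placeholders rather than anything we've settled on, and "master" is just a stand-in for the v40z's hostname:

    # /etc/fstab on a worker node (values illustrative, not tested recommendations)
    # tcp        - avoid UDP retransmission storms under load
    # hard,intr  - don't drop I/O when the server stalls, but stay interruptible
    # rsize/wsize - 32768 is, as far as I know, the ceiling on these 2.4 kernels
    master:/home  /home  nfs  tcp,hard,intr,rsize=32768,wsize=32768  0 0

(noatime arguably belongs on the server's ext3 mount of /home rather than on the clients, since the server is the one maintaining atimes.)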
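Also under question 2: one server-side knob I didn't list above is the number of nfsd threads. If I understand the Red Hat init scripts correctly, it's set like this (32 is just a number to try, not something we've validated for ~100 concurrent clients):

    # /etc/sysconfig/nfs on the v40z (read by the Red Hat nfs init script)
    # default is 8 threads, which may be low for this many clients
    RPCNFSDCOUNT=32

    # then: service nfs restart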
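For question 3, my understanding of what channel bonding would look like on Red Hat AS 3 is sketched below. The address is a placeholder, and mode=4 (802.3ad) assumes the SMC8648T can be configured with a matching LACP/trunk group on those two ports, which I haven't verified; mode=6 (balance-alb) supposedly needs no switch-side config and might be a fallback.

    # /etc/modules.conf
    alias bond0 bonding
    options bond0 mode=4 miimon=100   # mode=4 is 802.3ad; needs LACP on the switch

    # /etc/sysconfig/network-scripts/ifcfg-bond0
    DEVICE=bond0
    IPADDR=192.168.1.1                # placeholder address
    NETMASK=255.255.255.0
    ONBOOT=yes
    BOOTPROTO=none

    # /etc/sysconfig/network-scripts/ifcfg-eth0 (and likewise ifcfg-eth1)
    DEVICE=eth0
    MASTER=bond0
    SLAVE=yes
    ONBOOT=yes
    BOOTPROTO=none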