On 4 Feb 2005, at 6:46 am, Michael Gutteridge wrote:

> I don't believe this problem to be specific to PVM, but could be an
> issue with any parallel machine using large node sets. I'm curious as
> to strategies anyone else has used to mitigate the problem I've
> described, especially for circumstances such as this, where the slave
> nodes are merely compute donors.

Most very large clusters in the HPC world don't allow NFS at all, or minimise it. Our 1000-node cluster does allow some NFS, but only to scratch directories, and *not*, in general, to all users' home directories. Even then, we are in the process of replacing our NFS scratch directories with true cluster filesystems (GPFS and/or Lustre), largely for performance reasons. NFS really does suck, and NFS abuse by users is the primary cause of cluster failure here.

But to answer your question: it sounds like you're automounting your users' home directories. We rapidly found that automount really doesn't work on clusters. Although it's easy to administer, you get exactly the behaviour you're seeing: large numbers of simultaneous mount requests, which overwhelm the NFS server.

Consequently, the few NFS filesystems we allow our farm nodes to see, we mount statically in /etc/fstab. We don't automount anything. You still get the multiple-mount-request problem when you switch the cluster on (say, after a power failure), so on the rare occasions we have to power-cycle the whole cluster, we have to be careful to switch on only a few dozen machines at a time until they're all up.

Tim

--
Dr Tim Cutts
Informatics Systems Group, Wellcome Trust Sanger Institute
GPG: 1024D/E3134233  FE3D 6C73 BBD6 726A  A3F5 860B 3CDD 3F56  E313 4233