On 13 Jul 2005, at 7:01 pm, M. Michael Barmada wrote:

> Hi Carlos,
>
> If it's any help, we also had similar problems with our cluster. Our
> solution was to train the users to include code in their scripts that
> would create local directories (on the compute node - in /tmp) and copy
> the files they needed to those directories, then do their computing
> locally and copy back the results.

Absolutely. And preferably do the copying with something other than NFS
too - rcp or rsync work well, or the scheduler's built-in mechanism.
Most batch schedulers have built-in facilities for this - LSF certainly
does, in the form of lsrcp and various options to bsub. I don't know
about SGE - I'm not familiar with it - but I imagine the same sort of
features are available.

It really is quite amazing how badly NFS scales. I remember having
serious problems with it on the first Linux cluster I built at Incyte's
UK office about 6 years ago, and that was just 7 dual-CPU nodes talking
to a Sun E3000 NFS server. It didn't crash, but it got *really* slow -
and that was with the data being deliberately cached locally (I wrote
wrapper scripts around blastall and other applications to cache the
databases locally, blowing them away on a least-recently-used basis if
there wasn't room).

Sanger's current 1100-node cluster still has NFS in places, and it
regularly causes us grief. Our medium-term aim is to remove pretty much
all NFS from the cluster altogether, with the possible exception of
automounted home directories, and to use cluster filesystems like
Lustre for shared data.

Tim

--
Dr Tim Cutts
Informatics Systems Group, Wellcome Trust Sanger Institute
GPG: 1024D/E3134233 FE3D 6C73 BBD6 726A A3F5 860B 3CDD 3F56 E313 4233
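
P.S. In case a concrete starting point is useful, the staging pattern
Michael describes boils down to something like the script below. It's
only a rough sketch - the hostname, paths and database name are made up
for illustration, and $LSB_JOBID is LSF's job ID variable (SGE users
would substitute $JOB_ID):

    #!/bin/sh
    # Stage data to node-local scratch, compute there, copy results back.
    set -e

    SCRATCH=/tmp/blastjob.$LSB_JOBID     # LSF sets LSB_JOBID; SGE sets JOB_ID
    mkdir -p "$SCRATCH"
    trap 'rm -rf "$SCRATCH"' EXIT        # always remove the local copy on exit

    # Pull the formatted BLAST database over rsync rather than reading it
    # repeatedly across NFS; small files like the query can stay on the
    # automounted home directory.
    rsync -a 'fileserver:/data/blastdb/mydb.n*' "$SCRATCH/"

    cd "$SCRATCH"
    blastall -p blastn -d mydb -i "$HOME/query.fa" -o results.out

    # Push only the results back to shared storage.
    rsync -a results.out fileserver:/data/results/

Under LSF you can also let the scheduler do the staging for you: for
example, bsub -f "/data/blastdb/mydb.nin > mydb.nin" copies the file
from the submission host to the execution host before the job starts.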
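
The least-recently-used eviction from those old Incyte wrappers is also
only a few lines of shell. Again just a sketch with made-up paths - the
idea is to keep deleting whichever cached database was used longest ago
until the new one fits:

    CACHE=/tmp/blastdb-cache
    need_kb=500000                       # space required, in KB (illustrative)

    free_kb() { df -Pk "$CACHE" | awk 'NR==2 {print $4}'; }

    while [ "$(free_kb)" -lt "$need_kb" ]; do
        # ls -tu sorts by access time, newest first, so the last entry
        # is the least recently used file in the cache.
        victim=$(ls -tu "$CACHE" | tail -1)
        [ -n "$victim" ] || break        # cache is empty and still no room
        rm -rf "$CACHE/$victim"
    done

One caveat: this relies on atime, so it only behaves as true LRU if the
local filesystem isn't mounted noatime.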