Hi Juan:

A good rule of thumb is that, in aggregate, local I/O is almost always
fastest (with very few exceptions). If you can move your data set off
the NFS server to local disk as part of the job's staging step, you will
usually get better performance per node. This may not always be easy or
practical. You simply need to treat local disk not as a storage medium
but as a cache medium, and then handle loading and clearing the cache
before and after your run.

Joe

Juan Carlos Perin wrote:
>
> I had a question to see if anyone had any knowledge of a problem we've
> been encountering. It seems our Apple cluster is crashing due to NFS.
> When we run large batch jobs that frequently access an NFS mount, the
> system ends up accumulating 'stuck' processes. If the job is able to
> finish it eventually cleans the 'stuck' processes, and all is well.
> But, if the job continues to allow accumulation of these stuck
> processes, if a given job runs long enough, the system slowly
> deteriorates and becomes less and less responsive, eventually freezing
> up and not allowing anything to function at all.
>
> We started the maximum number of NFS servers (20) and this improved
> things, but didn't fix them. We also limited the jobs to 10 nodes (20
> processors) to theoretically allow one node to access one NFS pipeline
> at any given time. I'm not sure if anyone has run into this before, or
> if anyone has ideas on how to approach fixing this problem. The only
> errors we're seeing otherwise are in the system log, complaining about
> PasswordService not matching the client's response.
>
> We're still running OS X 10.3.8 and our jobs are running through SGE
> 5.3. And we've got a 16-node (32-processor) G5 system with at least
> 2 GB RAM per node. The programs running are a mixture of text mining
> algorithms in both Perl and Java. Both require frequent reads on
> large .txt files residing on NFS shared directories.
>
> Thanks in advance, for any ideas or suggestions.
>
> Juan Perin
> Children's Hospital of Philadelphia
> _______________________________________________
> Bioclusters maillist - Bioclusters at bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/bioclusters

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452
cell : +1 734 612 4615
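
For illustration, the "local disk as cache" idea in the reply at the top
of this message could be sketched roughly as follows. This is only a
minimal sketch in Python, not something from the original thread: the
names local_cache and count_lines are hypothetical, and it assumes each
node has enough local scratch space to hold a full copy of the data set.
The job reads from NFS once in bulk, does all its frequent reads against
the local copy, and removes the copy when it finishes.

```python
import contextlib
import shutil
import tempfile
from pathlib import Path


@contextlib.contextmanager
def local_cache(nfs_dir, scratch_root=None):
    """Copy nfs_dir to node-local scratch, yield the copy, then delete it.

    Hypothetical helper: scratch_root is whatever local directory the
    node provides (None falls back to the system temp directory).
    """
    # A fresh directory per job keeps concurrent jobs on one node apart.
    workdir = Path(tempfile.mkdtemp(prefix="stage_", dir=scratch_root))
    try:
        local_copy = workdir / "data"
        # Load the cache: one bulk read from NFS instead of many small ones.
        shutil.copytree(nfs_dir, local_copy)
        yield local_copy
    finally:
        # Clear the cache, whether the job succeeded or failed.
        shutil.rmtree(workdir, ignore_errors=True)


def count_lines(data_dir):
    """Stand-in for the real analysis: count lines in every .txt file."""
    total = 0
    for path in sorted(Path(data_dir).glob("*.txt")):
        with open(path, encoding="utf-8", errors="replace") as handle:
            total += sum(1 for _ in handle)
    return total
```

A job script might then wrap its work as
"with local_cache(shared_dir, scratch_dir) as data: count_lines(data)",
where shared_dir and scratch_dir are placeholders for the site's NFS
mount and local scratch paths. Existing Perl or Java programs could be
launched inside the same block and pointed at the local copy.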