[Bioclusters] OS X and NFS

Wed Jul 13 11:22:27 EDT 2005

Has anyone seen a problem like the one we've been encountering?  Our
Apple cluster appears to be crashing because of NFS.  When we run large
batch jobs that frequently access an NFS mount, the system accumulates
'stuck' processes.  If the job manages to finish, the stuck processes
are eventually cleaned up and all is well.  But if the job runs long
enough and the stuck processes keep accumulating, the system slowly
deteriorates, becoming less and less responsive until it freezes
completely and nothing works at all.
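
One way to quantify the buildup described above is to take periodic
snapshots of the process table and count the processes that look stuck.
The sketch below is only an illustration, not something from our setup:
it assumes that "stuck" means uninterruptible wait (state 'U' in OS X/BSD
ps, 'D' in Linux ps), which has not been confirmed here, and the names
find_stuck and snapshot are hypothetical.

```python
import subprocess

# Assumption: "stuck" processes are in uninterruptible wait.
# OS X/BSD ps reports this as 'U'; Linux ps reports it as 'D'.
UNINTERRUPTIBLE = ("U", "D")


def find_stuck(ps_output):
    """Return (pid, state, command) for each process in uninterruptible wait.

    ps_output is the text printed by `ps axo pid,stat,command`,
    including its header row.
    """
    stuck = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)          # pid, state, rest of the line
        if len(parts) < 3:
            continue
        pid, state, command = parts
        if state.startswith(UNINTERRUPTIBLE):
            stuck.append((int(pid), state, command))
    return stuck


def snapshot():
    """Run ps once on this node and return its stuck processes."""
    out = subprocess.run(
        ["ps", "axo", "pid,stat,command"],
        capture_output=True, text=True, check=True,
    ).stdout
    return find_stuck(out)
```

Calling snapshot() from cron on each node and logging len(snapshot())
would show whether the count climbs steadily over the life of a job.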

We started the maximum number of NFS servers (20), which improved
things but didn't fix them.  We also limited the jobs to 10 nodes (20
processors) so that, in theory, one node would access one NFS pipeline
at any given time.  I'm not sure whether anyone has run into this
before, or whether anyone has ideas on how to approach fixing it.  The
only other errors we're seeing are in the system log, complaining about
PasswordService not matching the client's response.

We're still running OS X 10.3.8, and our jobs run through SGE 5.3.  We
have a 16-node (32-processor) G5 system with at least 2 GB of RAM per
node.  The programs are a mixture of text-mining algorithms in Perl and
Java, both of which require frequent reads of large .txt files residing
on NFS-shared directories.

Thanks in advance for any ideas or suggestions.

Juan Perin
Children's Hospital of Philadelphia