[Bioclusters] Mount problems

23 Jul 2003 09:15:38 -0400

Hi David:

  See http://nfs.sourceforge.net/.  This could be a network card issue,
a driver issue, or a physical network issue.  In my experience you
typically run into this when the NFS server is connected on the same
speed link as the rest of the net (e.g. all 100 base T), when the NFS
server machine isn't particularly fast, or when you have a buggy driver.
As you are using RH 7.1, I recommend you at least update the kernel (if
it is not a recent one) and the NFS tools.
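
  Before updating anything, it may help to record how often the error
occurs and what is currently installed.  The following is only a rough
sketch: the function names are made up for illustration, the log path
is the /var/log/messages file you quoted, and nfs-utils is assumed to
be the name of the NFS tools package on your Red Hat release.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of any NFS tooling.

# Count kernel messages reporting an exhausted NFS request-slot table.
# $1 is the log file to scan; defaults to /var/log/messages.
count_slot_errors() {
    # grep -c prints the number of matching lines. It exits non-zero when
    # that number is 0, so "|| true" keeps the function's status clean.
    grep -c "can't get a request slot" "${1:-/var/log/messages}" || true
}

# Print the running kernel and, where rpm is available, the NFS tools package.
show_versions() {
    echo "kernel: $(uname -r)"
    if command -v rpm >/dev/null 2>&1; then
        rpm -q nfs-utils   # assumed package name for the NFS tools
    fi
}
```

Running count_slot_errors on each node (as a user who can read the log)
and comparing the counts may show whether the problem follows particular
machines, which would point at a card or driver rather than the server.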

  I would also suggest changing the soft mount to a hard mount.  All
hard does is force an infinite number of retries, while soft gives up
after its retry limit and fails silently.  The silent failure does not
pass information back to the application properly, so the application
blocks on IO, consumes cycles, and eventually causes problems as it is
unkillable.  Also make sure you use intr as an option.
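
  As a minimal sketch of that change, the helper below rewrites a
comma-separated option list so that soft becomes hard and intr is
present.  The name harden_opts is hypothetical, and the script only
transforms a string; it does not touch fstab or remount anything.

```shell
#!/bin/sh
# Sketch only: harden_opts is an illustrative helper, not a standard tool.

# Turn an NFS option list into its hard,intr form.
# $1 is the comma-separated option field from the fstab entry.
harden_opts() {
    # Wrap the list in commas so "soft" is matched only as a whole option,
    # swap it for "hard", then strip the wrapping commas again.
    opts=$(printf ',%s,\n' "$1" | sed -e 's/,soft,/,hard,/' -e 's/^,//' -e 's/,$//')
    # Append intr only if it is not already in the list.
    case ",$opts," in
        *,intr,*) ;;
        *) opts="$opts,intr" ;;
    esac
    printf '%s\n' "$opts"
}
```

For example, harden_opts "rw,bg,soft,intr" prints rw,bg,hard,intr.  Put
the result in the options field of the fstab entry, then unmount and
mount the share again on each node so the new options take effect.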

Joe

On Wed, 2003-07-23 at 08:37, david speed (RI) wrote:
> Hi All,
> 
> We have installed SGE onto our 15-node Linux (Red Hat 7.1) cluster (30 Intel CPUs). There is an NFS export mounted from the head node to each slave node solely to contain the SGE tools and directories.  We have installed the NCBI BLAST tools, and the databases to be blasted against, locally on each node.
> 
> When running test batches of blasts on Grid Engine, (random) nodes will go into an error state due to (we think) the node being unable to access the SGE mount. The running job process remains in RW status until the machine is rebooted (by pulling the plug; the shutdown command fails).  The process runs at 99.9 %cpu, and the sge_shepherd process has S< status.
> 
> Running the mount command lists the SGE mount as normal and we can cd into the SGE mount as normal; however, df causes the shell to hang (it outputs info on the other mounts but hangs just as it should output the SGE mount info).
> 
> The options we have used in fstab for the SGE mount are nfs	exec,dev,suid,rw,bg,soft,intr 0 0
> 
> The /var/log/messages file has entries similar to
> 
> kernel: nfs: task 3077 can't get a request slot
> 
> Does anyone have any idea what the problem is?
> 
> David
> 
> 
> David Speed
> Programmer
> Roslin Institute
> Bioinformatics Group
> Roslin, 
> Midlothian, 
> EH25 9PS, 
> UK
> Telephone: +44 (0)131 527 4200 (switchboard) 
> Fax: +44 (0)131 440 0434
> 
-- 
Joseph Landman, Ph.D
Scalable Informatics LLC
email: landman@scalableinformatics.com
  web: http://scalableinformatics.com
phone: +1 734 612 4615