[Bioclusters] Mount problems

Wed, 23 Jul 2003 10:06:21 -0700

David-

I've had many problems in the past similar to this.  Almost without 
exception, I've been able to solve that particular error by adding the 
"tcp" option to the NFS mounts.  In busy or congested environments 
(where you will see dropped packets) the UDP reassembly mechanism (in 
NFS) seems to fail, whereas the TCP reassembly (done by the stack) is 
spot-on.
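
As a rough illustration of that change, here is a minimal Python sketch 
that takes an fstab option string and returns it with "tcp" requested.  
The function name with_tcp is made up for this example, and it assumes 
the client's NFS code accepts the "tcp" mount option at all, which may 
not hold on an older kernel.

```python
def with_tcp(options: str) -> str:
    """Return an fstab option string that asks for NFS over TCP.

    Hypothetical helper for illustration only: it drops any explicit
    UDP request and appends "tcp" if no TCP option is present.
    """
    # Split the comma-separated list and discard explicit UDP requests.
    opts = [o for o in options.split(",") if o and o not in ("udp", "proto=udp")]
    # Append "tcp" only when TCP has not already been requested.
    if "tcp" not in opts and "proto=tcp" not in opts:
        opts.append("tcp")
    return ",".join(opts)


print(with_tcp("rw,bg,hard,intr"))  # prints rw,bg,hard,intr,tcp
```

The resulting string goes in the fourth field of the fstab entry, and 
the export needs to be unmounted and mounted again before it takes 
effect.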

However, I don't know if that option will be available in 7.1.

Oh, and I've never seen soft mounts work.  Much better to go with hard 
mounts as Mr. Landman suggests.
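
To find which nodes still carry soft mounts, something like the 
following Python sketch could be run against each node's fstab.  The 
sample entries, the host name "headnode" and the paths are invented for 
the example; only the field layout (device, mount point, type, options) 
is standard fstab.

```python
def soft_nfs_mounts(fstab_text):
    """Return the mount points of NFS entries that use the 'soft' option."""
    flagged = []
    for line in fstab_text.splitlines():
        # Strip comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        fields = line.split()
        if len(fields) < 4:
            continue  # blank, comment-only, or malformed line
        device, mount_point, fs_type, options = fields[:4]
        if fs_type.startswith("nfs") and "soft" in options.split(","):
            flagged.append(mount_point)
    return flagged


# Hypothetical fstab contents, for illustration only.
SAMPLE = """\
# sample entries
/dev/hda1              /               ext2  defaults         1 1
headnode:/export/sge   /usr/local/sge  nfs   rw,bg,soft,intr  0 0
headnode:/export/home  /home           nfs   rw,bg,hard,intr  0 0
"""

print(soft_nfs_mounts(SAMPLE))  # prints ['/usr/local/sge']
```

On a real node the text would come from open("/etc/fstab").read() 
rather than the sample string.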

Happy hunting.

Michael
Michael Gutteridge                  Fred Hutchinson Cancer
Unix System Administrator                  Research Center
mgutteri(a)fhcrc.org              Research Computing Support
- - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Views expressed do not necessarily represent those of my
employer. Caveat emptor, your mileage may vary, warranty
not valid if seal is broken.  Share and enjoy.

On Wednesday, Jul 23, 2003, at 06:15 US/Pacific, Joseph Landman wrote:

> Hi David:
>
>   See http://nfs.sourceforge.net/.  This could be a network card issue,
> a driver issue, a physical network issue.  My experience with running
> into this is that you typically get these when you have the NFS server
> connected on the same speed link as the rest of the net (e.g. all 100
> base T), the NFS server machine isn't particularly fast, or you have a
> buggy driver.  As you are using 7.1 RH, I might recommend you at least
> update the kernel (if it is not a late model) and the NFS tools.
>
>   I might also suggest changing the soft to a hard mount.  All this will
> do is force an infinite number of retries, while soft will silently
> fail.  The silent failure does not pass information back to the
> application properly, so the application blocks on IO, consumes cycles,
> and eventually causes problems as it is unkillable.  Also make sure you
> use intr as an option.
>
> Joe
>
> On Wed, 2003-07-23 at 08:37, david speed (RI) wrote:
>> Hi All,
>>
>> We have installed SGE onto our 15-node Linux (Red Hat 7.1) cluster 
>> (30 Intel CPUs). There is an NFS export mounted from the head node to 
>> each slave node solely to contain the SGE tools and directories.  We 
>> have installed the ncbi blast tools and the databases to be blasted 
>> against locally on each node.
>>
>> When running test batches of blasts on Grid Engine, (random) nodes 
>> will go into an error state due to (we think) the node being unable 
>> to access the SGE mount. The running job process remains in a RW 
>> status till the machine is rebooted (by pulling the plug; the 
>> shutdown command fails).  The process is running at 99.9 %cpu, and the 
>> sge_shepherd process has S< status.
>>
>> Running the mount command lists the SGE mount as normal and we can cd 
>> into the SGE mount as normal; however, df causes the shell to hang (it 
>> outputs info on the other mounts but hangs just as it should output 
>> the SGE mount info).
>>
>> The options we have used in fstab for the SGE mount are:
>> nfs  exec,dev,suid,rw,bg,soft,intr  0 0
>>
>> The /var/log/messages file has entries similar to
>>
>> kernel: nfs: task 3077 can't get a request slot
>>
>> Does anyone have any idea what the problem is?
>>
>> David
>>
>>
>> David Speed
>> Programmer
>> Roslin Institute
>> Bioinformatics Group
>> Roslin,
>> Midlothian,
>> EH25 9PS,
>> UK
>> Telephone: +44 (0)131 527 4200 (switchboard)
>> Fax: +44 (0)131 440 0434
>>
>> The information contained in this e-mail (including any attachments) 
>> is confidential and is intended for the use of the addressee only. 
>> The opinions expressed within this e-mail (including any attachments) 
>> are the opinions of the sender and do not necessarily constitute 
>> those of Roslin Institute (Edinburgh) ("the Institute") unless 
>> specifically stated by a sender who is duly authorised to do so on 
>> behalf of the Institute.
>>
>>
>> _______________________________________________
>> Bioclusters maillist  -  Bioclusters@bioinformatics.org
>> https://bioinformatics.org/mailman/listinfo/bioclusters
> -- 
> Joseph Landman, Ph.D
> Scalable Informatics LLC
> email: landman@scalableinformatics.com
>   web: http://scalableinformatics.com
> phone: +1 734 612 4615