[Bioclusters] gpfs overload on ibm bladecenter cluster

Kathleen kathleen at massivelyparallel.com
Thu Jan 26 08:07:34 EST 2006


Hi Hershel:

We run an alternate GPFS setup on SuSE and tweak the hardware a bit,
because we find that the number of channels (not necessarily the type, but
the number) can significantly improve performance through better data
scheduling/routing without adding overhead.  Just curious: what apps are you
trying to run on these small clusters, and how many nodes/processors/channels?

Cheers,

Kathleen

-----Original Message-----
From: Guy Coates [mailto:gmpc at sanger.ac.uk] 
Sent: Thursday, January 26, 2006 2:28 AM
To: Clustering, compute farming & distributed computing in life science
informatics
Subject: Re: [Bioclusters] gpfs overload on ibm bladecenter cluster

On Thu, 26 Jan 2006, Hershel Safer wrote:

> We're running two small IBM BladeCenter clusters under SuSE, with GPFS 
> for (we hope) fast file I/O. It seems to us that when user processes 
> on a blade are particularly memory intensive, and GPFS needs to 
> compete for a resource (memory in this case), GPFS most likely won't
> survive the competition and will die.

Recent kernels have an entry in

/proc/<PID>/oom_adj

If you echo a low number into it (google for sensible values), it will
protect processes (e.g. the GPFS ones) from being zapped by the
out-of-memory killer.

You can also put a high number in there for user processes, so those are the
first against the wall, come the revolution.
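
As a rough sketch (the values are illustrative; on 2.6-era kernels oom_adj
accepts -17 through +15, where -17 disables the OOM killer for that process
and +15 makes it the preferred victim, so check your kernel documentation):

  # protect the GPFS daemon from the OOM killer
  for pid in $(pidof mmfsd); do
      echo -17 > /proc/$pid/oom_adj
  done

  # make a given user job the preferred OOM victim
  echo 15 > /proc/<job PID>/oom_adj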

You can also enforce per-process memory limits via /etc/security/limits.conf,
or with your job scheduler if you run one.
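
For limits.conf, a minimal example might look like this (the wildcard domain
and the 4 GB figure are placeholders; the "as" item is the address-space
limit in KB):

  # /etc/security/limits.conf
  # cap each process at roughly 4 GB of address space
  *    hard    as    4194304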

You might also consider not running jobs on the machines which are GPFS NSD
servers.


We primarily use job-scheduler-enforced limits, which seem to work well for
us.
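
For instance, with LSF (just an assumption; other schedulers have
equivalents) you could submit with a per-process memory limit (my_job here
is a placeholder for your own job script):

  # ask LSF to enforce a ~4 GB memory limit on the job
  # (units are KB by default; check your site's configuration)
  bsub -M 4194304 ./my_job

Grid Engine users can get much the same effect with qsub -l h_vmem=4G.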

Cheers,

Guy




> This may happen on one or more nodes of the cluster. The GPFS daemon
> 'mmfsd' will lose its connection to other members of the cluster and
> lose its GPFS filesystem mounts, and consequently any services that
> reside on GPFS will fail. The blade will not necessarily crash after that;
> it may stay afloat and may even be accessible via ssh.
>
> Have others encountered this situation? How can we prevent this 
> behavior? More generally, what kinds of limits do you impose on 
> consumption of resources such as memory and CPU? Thanks,
>
> Hershel
>
>
> _______________________________________________________________________________
> Hershel M. Safer, Ph.D.
> Chair, 5th European Conference on Computational Biology (ECCB '06) 
> Head, Bioinformatics Core Facility
> Weizmann Institute of Science
> PO Box 26, Rehovot 76100, Israel
> tel: +972-8-934-3456 | fax: +972-8-934-6006
> e-mail: hershel.safer at weizmann.ac.il | hsafer at alum.mit.edu
> url: http://bioportal.weizmann.ac.il
>
> ***************************************************
> Plan now for ECCB '06!
> 5th European Conference on Computational Biology
> Eilat, Israel, Sept 10 -- 13, 2006
> Visit www.eccb06.org for details
>

-- 
Dr. Guy Coates,  Informatics System Group
The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK
Tel: +44 (0)1223 834244 x 6925
Fax: +44 (0)1223 494919

_______________________________________________
Bioclusters maillist  -  Bioclusters at bioinformatics.org
https://bioinformatics.org/mailman/listinfo/bioclusters





