[Bioclusters] gpfs overload on ibm bladecenter cluster
Guy Coates
gmpc at sanger.ac.uk
Thu Jan 26 04:28:03 EST 2006
On Thu, 26 Jan 2006, Hershel Safer wrote:
> We're running two small IBM BladeCenter clusters under SuSE, with GPFS for (we hope) fast file
> I/O. It seems to us that when user processes on a blade are particularly memory intensive, and
> GPFS needs to compete for a resource (memory in this case), GPFS most likely won't survive the
> competition and will die.
Recent kernels have an entry in
/proc/<PID>/oom_adj
If you echo a low number in there (google for sensible values), it will
protect processes (e.g. the GPFS ones) from being zapped by the
out-of-memory killer.
You can also put a high number in there for user processes, so those are
the first against the wall, come the revolution.
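
A minimal Python sketch of the oom_adj adjustment described above, for
anyone who would rather script it than echo by hand. The helper name
set_oom_adj and its proc_root parameter are made up for illustration.
The -17..15 range (with -17 exempting a process from the OOM killer) is
my understanding of the legacy oom_adj interface; newer kernels prefer
oom_score_adj, so check what your kernel actually exposes.

```python
import os

# Assumed range of the legacy oom_adj interface; -17 is taken to mean
# "never pick this process". Verify against your kernel's documentation.
OOM_ADJ_MIN = -17
OOM_ADJ_MAX = 15


def set_oom_adj(pid, value, proc_root="/proc"):
    """Write an OOM adjustment for one process and return the path written.

    proc_root is a hypothetical knob so the function can be tried against
    a scratch directory instead of the real /proc.
    """
    if not OOM_ADJ_MIN <= value <= OOM_ADJ_MAX:
        raise ValueError(f"oom_adj must be in {OOM_ADJ_MIN}..{OOM_ADJ_MAX}")
    path = os.path.join(proc_root, str(pid), "oom_adj")
    # Equivalent to: echo <value> > /proc/<pid>/oom_adj
    with open(path, "w") as handle:
        handle.write(f"{value}\n")
    return path
```

You would call this as root with a low value for each mmfsd PID and a
high value for user job PIDs. Lowering the value normally needs root
privileges.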
You can also enforce per-process memory limits, either in
/etc/security/limits.conf or through your job scheduler, if you run one.
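
As a rough illustration of a per-process memory limit, here is a Python
sketch that starts a command with a capped address space. It uses
setrlimit(RLIMIT_AS), which I believe is the same kind of limit that the
"as" item in limits.conf sets at login. The wrapper name
run_with_memory_limit is hypothetical, and this is not how any particular
scheduler implements its limits.

```python
import resource
import subprocess


def run_with_memory_limit(cmd, max_bytes, **run_kwargs):
    """Run cmd with its address space capped at max_bytes.

    Extra keyword arguments are passed straight to subprocess.run.
    """
    def apply_limit():
        # Runs in the child just before exec, so only the job is limited,
        # not the process that launched it.
        resource.setrlimit(resource.RLIMIT_AS, (max_bytes, max_bytes))

    return subprocess.run(cmd, preexec_fn=apply_limit, **run_kwargs)
```

A job that tries to grow past the cap gets allocation failures instead of
pushing the node into the OOM killer. Note that the limit applies per
process, so a job that forks many workers can still exceed it in total.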
You might also consider not running jobs on the machines which are GPFS
NSD servers.
We primarily use job-scheduler-enforced limits, which seem to work well
for us.
Cheers,
Guy
> This may happen on one or more nodes of the cluster. The GPFS daemon
> 'mmfsd' will lose its connection to other members of the cluster and lose its GPFS filesystem
> mounts, and consequently any services that reside on GPFS will fail. The blade will not
> necessarily crash after that; it may stay afloat and may even be accessible via ssh.
>
> Have others encountered this situation? How can we prevent this behavior? More generally, what
> kinds of limits do you impose on consumption of resources such as memory and CPU? Thanks,
>
> Hershel
>
>
> _______________________________________________________________________________________________________
> Hershel M. Safer, Ph.D.
> Chair, 5th European Conference on Computational Biology (ECCB '06)
> Head, Bioinformatics Core Facility
> Weizmann Institute of Science
> PO Box 26, Rehovot 76100, Israel
> tel: +972-8-934-3456 | fax: +972-8-934-6006
> e-mail: hershel.safer at weizmann.ac.il | hsafer at alum.mit.edu
> url: http://bioportal.weizmann.ac.il
>
> ***************************************************
> Plan now for ECCB '06!
> 5th European Conference on Computational Biology
> Eilat, Israel, Sept 10 -- 13, 2006
> Visit www.eccb06.org for details
>
--
Dr. Guy Coates, Informatics System Group
The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK
Tel: +44 (0)1223 834244 x 6925
Fax: +44 (0)1223 494919