[Bioclusters] gpfs overload on ibm bladecenter cluster
Kathleen
kathleen at massivelyparallel.com
Thu Jan 26 08:07:34 EST 2006
Hi Hershel:
We use an alternate GPFS running on SuSE and tweak the hardware a bit,
because we find that the number of channels (not necessarily the type, but
the number) can significantly alter performance through better data
scheduling/routing without adding overhead. Just curious: what apps are you
trying to run on these small clusters, and how many nodes/processors/channels?
Cheers,
Kathleen
-----Original Message-----
From: Guy Coates [mailto:gmpc at sanger.ac.uk]
Sent: Thursday, January 26, 2006 2:28 AM
To: Clustering, compute farming & distributed computing in life science
informatics
Subject: Re: [Bioclusters] gpfs overload on ibm bladecenter cluster
On Thu, 26 Jan 2006, Hershel Safer wrote:
> We're running two small IBM BladeCenter clusters under SuSE, with GPFS
> for (we hope) fast file I/O. It seems to us that when user processes
> on a blade are particularly memory intensive, and GPFS needs to
> compete for a resource (memory in this case), GPFS most likely won't
> survive the competition and will die.
Recent kernels have an entry in
/proc/<PID>/oom_adj
If you echo a low number in there (google for sensible values), it will
protect processes (e.g. GPFS ones) from being zapped by the
out-of-memory killer.
You can also put a high number in there for user processes, so those are the
first against the wall, come the revolution.
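A rough sketch of what that might look like (the -17/+15 values and the
process names below are only illustrative; check your kernel documentation
for the valid range on your distribution):

    # Protect the GPFS daemon from the OOM killer.
    # On 2.6 kernels oom_adj runs from -17 to +15; -17 disables OOM
    # killing for that process entirely.
    for pid in $(pgrep mmfsd); do
        echo -17 > /proc/$pid/oom_adj
    done

    # Conversely, make a memory-hungry user job the preferred victim.
    # 'hungry_user_job' is a made-up process name for illustration.
    echo 15 > /proc/$(pgrep -o hungry_user_job)/oom_adj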
You can also enforce per-process memory limits via /etc/security/limits.conf,
or with your job scheduler, if you run one.
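A minimal limits.conf sketch (the group name and the 4 GB/6 GB figures are
made up; pam_limits takes its values in KB):

    # Cap address space for ordinary users so a runaway job hits its own
    # limit before it starves mmfsd -- run as root:
    echo '@users    soft    as    4194304' >> /etc/security/limits.conf
    echo '@users    hard    as    6291456' >> /etc/security/limits.conf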
You might also consider not running jobs on the machines which are GPFS NSD
servers.
We primarily use job-scheduler-enforced limits, which seem to work well for
us.
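For example, if you happen to run LSF or Sun Grid Engine, the limit is just a
submit-time option (the 2 GB figure and the job script name are arbitrary):

    # LSF: per-process memory limit in KB (~2 GB here)
    bsub -M 2000000 ./my_job.sh

    # SGE: request/enforce a 2 GB virtual memory limit
    qsub -l h_vmem=2G ./my_job.sh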
Cheers,
Guy
> This may happen on one or more nodes of the cluster. The GPFS daemon
> 'mmfsd' will lose its connection to other members of the cluster and
> lose its GPFS filesystem mounts, and consequently any services that
> reside on GPFS will fail. The blade will not necessarily crash after that;
> it may stay afloat and may even be accessible via ssh.
>
> Have others encountered this situation? How can we prevent this
> behavior? More generally, what kinds of limits do you impose on
> consumption of resources such as memory and CPU? Thanks,
>
> Hershel
>
>
> ________________________________________________________________________
> Hershel M. Safer, Ph.D.
> Chair, 5th European Conference on Computational Biology (ECCB '06)
> Head, Bioinformatics Core Facility, Weizmann Institute of Science
> PO Box 26, Rehovot 76100, Israel
> tel: +972-8-934-3456 | fax: +972-8-934-6006
> e-mail: hershel.safer at weizmann.ac.il | hsafer at alum.mit.edu
> url: http://bioportal.weizmann.ac.il
>
> ***************************************************
> Plan now for ECCB '06!
> 5th European Conference on Computational Biology
> Eilat, Israel, Sept 10 -- 13, 2006
> Visit www.eccb06.org for details
>
--
Dr. Guy Coates, Informatics System Group
The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK
Tel: +44 (0)1223 834244 x 6925
Fax: +44 (0)1223 494919
_______________________________________________
Bioclusters maillist - Bioclusters at bioinformatics.org
https://bioinformatics.org/mailman/listinfo/bioclusters