[Bioclusters] resources on administering clusters

Joe Landman bioclusters@bioinformatics.org
25 Mar 2002 16:13:47 -0500


On Mon, 2002-03-25 at 13:23, Jeff Layton wrote:

> > Remote power control is nice because I can remotely kill or reboot nodes
> > that are misbehaving and I can also turn on and turn off the entire
> > cluster in a staged manner (so you don't blow your power circuits!)
> >
> > With these 2 tools in hand, this is what my admin philosophy becomes:
> >
> > (1) If a node is behaving, don't touch it
> > (2) If a node acts strangely, use systemImager to automatically wipe
> > the disk and reinstall the OS from scratch (remotely)
> 
> I usually try to debug a node first before re-imaging it. I also plug into
> a node that is locked up to see if I can find out anything (Linux doesn't
> behave well under heavy memory pressure - "swapping itself to death").

The swapping-death comes from a number of places: VM issues in
pre-2.4.16 kernels, and poor swap layout.  Generally speaking, swapping
is not a good thing to do, but sometimes good applications swap, so you
should make sure they can do it reasonably well.

First off, spread the swap across as many spindles as you can.  Under
Linux, you can "stripe" swap across multiple partitions.  If you have 4
disks, look at using 4 equal-sized partitions (one per disk) for swap.
This needs to be done at system build time.  Never, ever put all your
swap in a single partition on one disk.  This is "A Bad Thing(TM)" and
leads to swap-death.
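
As a sketch (device names are illustrative): give each swap partition
the same priority in /etc/fstab, and the kernel will round-robin pages
across all of them, which is the "striping" effect you want:

  /dev/sda2  none  swap  sw,pri=1  0 0
  /dev/sdb2  none  swap  sw,pri=1  0 0
  /dev/sdc2  none  swap  sw,pri=1  0 0
  /dev/sdd2  none  swap  sw,pri=1  0 0

Then "swapon -a" to enable them, and "swapon -s" to check that all four
show up with equal priority.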

Second off, place the swap on the outermost (lowest-numbered) cylinders
of the disk.  From the various benchmarks on places like Tom's Hardware
and others, the highest sustained I/O rates come from the low-numbered
cylinders at the outer edge of the platter.  Even with small 18 GB
disks, lopping off 0.5 GB per disk for swap is not terribly difficult.
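
One way to do that at system build time is to make swap the first
partition on each disk, something like (fdisk dialogue abbreviated,
device name illustrative):

  fdisk /dev/sdb
    n   # new primary partition 1, starting at the first cylinder, +512M
    t   # set the partition type to 82 (Linux swap)
    n   # partition 2 gets the rest of the disk for the filesystem
    w   # write the table and exit
  mkswap /dev/sdb1
  swapon -p 1 /dev/sdb1

Since partition 1 starts at the first cylinder, the swap lands on the
fast outer tracks.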

Third off, buy enough RAM.  RAM is cheap.  Far too many groups make the
often painful decision that aggregate memory is important and per-CPU
memory is not.  This is not true for a memory-hungry application (like
BLAST with large databases).  The time spent swapping on a
memory-starved system can increase the runtime by an order of magnitude
or more.  If you convert that into the opportunity cost of being unable
to use the resource for other jobs while it grinds away at yours, you
get the idea that the RAM pays for itself over and over again.
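
To put illustrative numbers on it: if a BLAST run that fits in RAM
takes an hour, and the same run on a memory-starved node spends ten
hours grinding in swap, you have lost nine node-hours of other people's
work per run.  A few hundred dollars of extra DIMMs per node is
recovered after a handful of such runs; the exact break-even obviously
depends on your job mix and memory prices.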

Fourth off, if you have the choice of buying a single big disk or
several smaller disks, think of the calculus this way: one I/O pipe per
disk, and you can stripe your file systems across them.  So more,
smaller disks means more local I/O bandwidth.  This is "A Good
Thing(TM)".  Yes, you may argue that it increases your risk and reduces
the node's MTBF.  A good node regeneration procedure and some spares
cure that issue rather quickly.
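
A sketch of the striping half of that calculus, using the md software
RAID driver (device names illustrative; the older raidtools /etc/raidtab
route works too):

  mdadm --create /dev/md0 --level=0 --raid-devices=4 \
        /dev/sda3 /dev/sdb3 /dev/sdc3 /dev/sdd3
  mke2fs /dev/md0
  mount /dev/md0 /scratch

A RAID0 stripe gives you roughly the aggregate bandwidth of the
spindles for large sequential reads, which is exactly what database
scans like BLAST want.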

There are other points that could be made, but swapping need not be
deadly.  If it is, you have a local disk I/O issue that desperately
needs to be solved.  Local I/O is very important to certain
applications.  You really do not want to be hitting a set of files hard
over an NFS mount.  That does not scale.
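
The usual cure is to stage the working set onto local disk in the job
script, something like (paths and the BLAST invocation are
illustrative):

  mkdir -p /scratch/$$
  cp /home/joe/db/nr.* /scratch/$$/      # pull the database over NFS once
  cd /scratch/$$
  blastall -p blastp -d nr -i query.fa -o query.out
  cp query.out /home/joe/results/        # push only the small result back
  rm -rf /scratch/$$

One NFS copy per job beats thousands of random reads over the wire.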

-- 

Joseph Landman, Ph.D.
Senior Scientist,
MSC Software High Performance Computing
email		: joe.landman@mscsoftware.com
messaging	: page_joe@mschpc.dtw.macsch.com
Main office	: +1 248 208 3312
Cell phone	: +1 734 612 4615
Fax		: +1 714 784 3774