On Mon, 2002-03-25 at 13:23, Jeff Layton wrote:
> > Remote power control is nice because I can remotely kill or reboot nodes
> > that are misbehaving and I can also turn on and turn off the entire
> > cluster in a staged manner (so you don't blow your power circuits!)
> >
> > With these 2 tools in hand, this is what my admin philosophy becomes:
> >
> > (1) If a node is behaving, don't touch it
> > (2) If a node acts strangely use systemImager to automatically wipe the
> > disk and reinstall the OS from scratch (remotely)
>
> I usually try to debug a node first before re-imaging it. I also plug into
> a node that is locked up to see if I can find out anything (Linux doesn't
> behave well under heavy memory pressure - "swapping itself to death").

The swapping death comes from a number of places: VM issues in
pre-2.4.16 kernels, and poor swap layout. Generally speaking, swapping
is not a good thing to do. But sometimes good apps swap, so you should
make sure they can do it reasonably well.

First off, spread the swap across as many spindles as you can. Under
Linux, you can "stripe" swap across multiple partitions. If you have
4 disks, then look at the possibility of using 4 equal-sized
partitions (one per disk) for swap. This needs to be done at system
build time. Never ever put all your swap on a single partition. This
is "A Bad Thing(TM)" and leads to swap-death.

Second off, place the swap on the outermost cylinders of the disk.
From the various benchmarks on places like Tom's Hardware and others,
it seems that you will get the highest I/O rates at the lower-numbered
(outer) cylinders. Even with small 18 GB disks, lopping off 0.5 GB per
disk is not terribly difficult.

Third off, buy enough RAM. RAM is cheap. Far too many groups make the
often painful decision that aggregate memory is important and per-CPU
memory is not. This is not true for a memory-hungry application (like
BLAST with large databases).
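
To put rough numbers on the aggregate versus per-CPU distinction, here
is a minimal Python sketch. The cluster size, RAM per node and job
footprint are made-up figures, the function names are hypothetical,
and the check ignores memory used by the kernel and daemons, so treat
it as an illustration and not a sizing tool.

```python
def per_cpu_memory_mb(node_ram_mb, cpus_per_node):
    """Share of a node's RAM available to each CPU (ignores OS overhead)."""
    return node_ram_mb / cpus_per_node


def will_swap(node_ram_mb, cpus_per_node, job_footprint_mb):
    """True if one process per CPU does not fit in its share of node RAM."""
    return job_footprint_mb > per_cpu_memory_mb(node_ram_mb, cpus_per_node)


# Hypothetical cluster: 16 dual-CPU nodes with 1024 MB of RAM each.
nodes, cpus_per_node, node_ram_mb = 16, 2, 1024

print("aggregate RAM (MB):", nodes * node_ram_mb)
print("per-CPU RAM (MB):", per_cpu_memory_mb(node_ram_mb, cpus_per_node))
# A hypothetical 800 MB per-process job: plenty of aggregate memory,
# yet each process still overflows its per-CPU share.
print("800 MB job swaps:", will_swap(node_ram_mb, cpus_per_node, 800))
```

The aggregate figure looks generous, but the per-CPU share is what
decides whether a given process starts paging.
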
Time spent swapping on a memory-starved system can often increase the
runtime by an order of magnitude or more. If you convert that into the
opportunity cost of being unable to use the resource for other jobs
while it is grinding away at yours, well, you get the idea that the
RAM pays for itself over and over again.

Fourth off, if you have the choice of buying a single big disk or
several smaller disks, think of the calculus this way: one I/O pipe
per disk, and I can stripe my file systems across them. So more,
smaller disks mean more local I/O bandwidth. This is "A Good
Thing(TM)". Yes, you may argue that it increases your risk and reduces
the node's MTBF. A good node regeneration procedure and some spares
cure that issue rather quickly.

There are other points that could be made, but swapping need not be
deadly. If it is, you have a local disk I/O issue that desperately
needs to be solved. Local I/O is very important to certain
applications. You really do not want to be hitting a set of files hard
over an NFS mount. That does not scale.

-- 
Joseph Landman, Ph.D.
Senior Scientist, MSC Software High Performance Computing
email       : joe.landman@mscsoftware.com
messaging   : page_joe@mschpc.dtw.macsch.com
Main office : +1 248 208 3312
Cell phone  : +1 734 612 4615
Fax         : +1 714 784 3774
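
A note on the swap striping point in the message above: on Linux, swap
areas that share the same priority are used round-robin, which is what
produces the striping effect. The following is a minimal Python sketch
that prints /etc/fstab entries with equal `pri=` values. The device
names are placeholders for one partition per disk, and
`striped_swap_fstab` is a hypothetical helper, not part of any
existing tool.

```python
def striped_swap_fstab(partitions, priority=1):
    """Return /etc/fstab lines giving every swap partition the same priority."""
    if len(set(partitions)) != len(partitions):
        raise ValueError("each swap partition must be listed only once")
    # Equal pri= values let the kernel spread swap pages across all areas.
    return [f"{dev}\tnone\tswap\tsw,pri={priority}\t0\t0" for dev in partitions]


# Assumed layout: one equal-sized swap partition on each of four disks.
devices = ["/dev/sda2", "/dev/sdb2", "/dev/sdc2", "/dev/sdd2"]
for line in striped_swap_fstab(devices):
    print(line)
```

If the priorities differ, the kernel fills the highest-priority area
first and the spindles are no longer used in parallel, so the values
need to match on every line.
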