[Bioclusters] resources on administering clusters

Mon, 25 Mar 2002 12:05:45 -0500

Kris Boulez wrote:

>What do people consider good resources (books, websites, tools) for
>learning more about administering a compute cluster? I'm not looking for
>general unix sysadmin material (been doing this for 10 years), but stuff
>which comes into play when administering large numbers of machines.
>I looked at the biocluster install diary Chris posted a few days ago,
>but was wondering if people know of other resources.
>
>Kris,
>

Hey Kris-

Most of the available printed or online clustering resources are either
totally out of date or, more often, written from the perspective of
people who:

o Want to build tightly coupled supercomputer-like systems on the cheap
that will only really run parallel apps ("beowulf")

o Are willing to do silly and complicated things in order to get the
fastest possible raw performance at the expense of literally everything
else, including reliability and ease of management. There is a huge
bias out there in this direction.

Both of these approaches are generally not cool for life science
clusters, which typically are not "beowulf-style" systems anyway.

With some exceptions, biologists don't build clusters designed to run a
single instance of some massively parallel application at supercomputer
speeds. Biologists tend to use clusters as a way of distributing their
huge volume of independent ("embarrassingly parallel") compute jobs
across many inexpensive, loosely coupled systems. The software layer
that handles job scheduling, remote execution and dispatch is typically
something like PBS, GridEngine or Platform's LSF suite.

This is why I tend to use the term "compute farm" rather than "cluster" 
for most of the stuff I build.
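
To make the "embarrassingly parallel" idea concrete, here is a minimal
Python sketch. It is only an illustration: a local thread pool stands in
for the scheduler (on a real farm PBS, GridEngine or LSF does the
dispatching), and score_sequence and run_farm are made-up names, not part
of any of those products. The point is that each job depends only on its
own input, so jobs can land on any node in any order.

```python
from concurrent.futures import ThreadPoolExecutor


def score_sequence(seq):
    """Hypothetical stand-in for one independent job: GC fraction of a sequence."""
    if not seq:
        return 0.0
    gc_count = sum(1 for base in seq.upper() if base in "GC")
    return gc_count / len(seq)


def run_farm(sequences, workers=4):
    """Run every job independently; no job communicates with any other job."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, whatever order jobs finish in
        return list(pool.map(score_sequence, sequences))


if __name__ == "__main__":
    print(run_farm(["GGCC", "ATAT", "ATGC"]))
```

Because no job needs to talk to any other job, adding nodes just means
more jobs run at once; no fast interconnect is required.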

When it comes to administering large, loosely coupled systems used for
life science research, I have not found any good comprehensive books or
online references. I do know that people are working on such things for
O'Reilly and other publishers though...

You may want to try seeing if there is anything useful up at the 
SourceForge Clustering foundry: http://foundries.sourceforge.net/clusters/

Anyone else have links?

From my experience, here are the 2 biggest pain points that I have
found from a cluster admin perspective. If you can solve these to your
(and your manager's) satisfaction then you are in a very good position!
Knowing how to tackle these 2 things before you purchase your cluster is
even better. heh.

(**1**) Reducing administrative burden as much as possible

This is your #1 concern as a cluster administrator. The goal is to do
everything possible to avoid having to treat and manage your cluster as
dozens or hundreds of individual machines. When I was at Blackstone, one
of my internal research interests was figuring out how to make a 1,000
node cluster require only one half-time administrator to operate.

It boils down to ruthlessly automating and scripting everything that is 
humanly possible. In an ideal world your cluster compute elements will 
then become:

o anonymous (users should never care where their job actually runs)
o interchangeable (if a node dies the workload is migrated and a new
server is brought online)
o disposable (if a node breaks send it back to the vendor and pop in a
cold spare *whenever convenient*)

There are lots of methods for easing cluster administration. Some are 
commercial and some are free. I saw a company at the O'Reilly
Bioinformatics Conference called LinuxNetworx
(http://www.linuxnetworx.com/) that had these amazing "ICE boxes" in 
their rack that combined serial console, remote power control and 
temperature monitoring into one small package. Very cool - wish I could 
buy those as a standalone product.

My biggest tools in this area are (a) SystemImager and (b) remote power 
control

SystemImager (www.systemimager.org) kicks all kinds of ass. Using it I 
can completely install a cluster node from scratch without having to 
attach a keyboard or anything else. Just boot off an autoinstall CDROM
or floppy; in some cases a network-based PXE boot alone will do the
trick.

Besides automating the process of partitioning disks and installing the
operating system and layered software, SystemImager also allows you to
incrementally push out changes, which makes the process of installing or
upgrading software or libraries pretty trivial.

Remote power control is nice because I can remotely kill or reboot nodes 
that are misbehaving and I can also turn on and turn off the entire 
cluster in a staged manner (so you don't blow your power circuits!)
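
The staged power-up idea can be sketched roughly as follows. This is an
illustration under assumptions, not the interface of any real power
controller: power_on is a hypothetical function you would supply for
your own hardware, and the batch size and delay are placeholder values
you would tune to your circuits.

```python
import time


def staged_batches(nodes, batch_size):
    """Split the node list into consecutive groups of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [nodes[i:i + batch_size] for i in range(0, len(nodes), batch_size)]


def staged_power_on(nodes, power_on, batch_size=8, delay_seconds=30.0,
                    sleep=time.sleep):
    """Power nodes on one batch at a time, pausing between batches.

    power_on is a caller-supplied (hypothetical) function that talks to
    whatever remote power hardware is installed.
    """
    batches = staged_batches(nodes, batch_size)
    for index, batch in enumerate(batches):
        for node in batch:
            power_on(node)
        # No need to wait after the final batch
        if index < len(batches) - 1:
            sleep(delay_seconds)
    return batches
```

Passing sleep in as a parameter makes it easy to dry-run the sequence
without actually waiting between batches.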

With these 2 tools in hand, this is what my admin philosophy becomes:

(1) If a node is behaving, don't touch it
(2) If a node acts strangely, use SystemImager to automatically wipe the
disk and reinstall the OS from scratch (remotely)
(3) If a node acts strangely after it has been freshly imaged, then
remotely kill the power and leave it dead.
(4) Whenever it is _convenient_ for me as an administrator, take the dead
node out and pop in a spare. Thanks to SystemImager, in about 6 minutes
I'll have a fully operational cluster node that is again performing
useful work. The dead node can either be diagnosed onsite (if you feel
like it) or sent back to the vendor for replacement.

No muss, no fuss. The key is to never waste time dealing with any
individual machine.
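
The first three rules above amount to a small decision table. A hedged
Python sketch, where the health check and the "freshly imaged" flag are
assumed inputs you would have to supply from your own monitoring
(nothing here calls SystemImager or any power hardware, and the names
are made up for illustration):

```python
from enum import Enum


class Action(Enum):
    LEAVE_ALONE = "leave alone"
    REIMAGE = "reimage"
    POWER_OFF = "power off"


def next_action(healthy, freshly_imaged):
    """Pick the next step for one node.

    healthy:        True if the node is behaving (assumed external check)
    freshly_imaged: True if the node was reimaged since it last misbehaved
    """
    if healthy:
        return Action.LEAVE_ALONE   # rule 1: don't touch a working node
    if not freshly_imaged:
        return Action.REIMAGE       # rule 2: wipe and reinstall remotely
    return Action.POWER_OFF         # rule 3: still bad after a clean image
    # Rule 4 (swapping in a spare) is a human task done when convenient.
```

The function never asks *why* a node is misbehaving, which is the whole
idea: no per-machine debugging is needed to decide what to do next.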

(**2**) Research and install your load management software carefully

It makes me sad to see people go out and spend tens of thousands of
dollars (or even more) on cluster hardware only to turn around and
neglect the software side of things by throwing on a half-assed default
PBS rpm install and walking away.

PBS may be free but it requires care and attention to get it configured
and keep it online. Many people who don't do their due diligence end up
screwing themselves because they find that they need someone almost
full-time just to keep the darn load management layer running. This is
especially true for PBS, where people are constantly finding themselves
patching and recompiling the code from source.

This is why I recommend LSF software from Platform. It may be expensive
(really expensive...) but it installs in minutes, is easily configured
and is way more stable than any of the competition (PBS, PBSPro,
GridEngine, etc). In the long run the reduced administrative burden and
serious fault tolerance that LSF provides can make the cost of the
commercial license very reasonable.

Another alternative that is cheaper than LSF is to build the cluster 
yourself but hire professional consultants to come in and handle the 
tricky part of getting the load management system configured and 
tweaked. The good people at Veridian Systems sell a commercial version
of PBS called "PBSPro" that is reasonably priced. They'll even give you 
the source code if you need it. Paying Veridian for a few days of 
consulting time may be worth it if they leave you with a fully 
configured system that does not require lots of ongoing care and feeding.

Damn, I'm long-winded today.

-Chris

-- 
Chris Dagdigian, <dag@sonsorol.org>
Life Science IT & Research Computing Geek; http://BioTeam.net
Office: 617-666-6454, Mobile: 617-877-5498, Fax: 425-699-0193
PGP KeyID: 83D4310E  Yahoo IM: craffi