[Bioclusters] resources on administering clusters

Jeff Layton bioclusters@bioinformatics.org
Mon, 25 Mar 2002 13:23:39 -0500


chris dagdigian wrote:

> Kris Boulez wrote:
>
> >What do people find to be good resources (books, websites, tools) for
> >learning more about administering a compute cluster? I'm not looking for
> >general unix sysadmin (been doing this for 10 years), but stuff which
> >comes into play when administering large numbers of machines.
> >I looked at the biocluster install diary Chris posted a few days ago,
> >but was wondering if people know of other resources.
> >
> >Kris,
> >
>
> Hey Kris-
>
> Most of the available printed or online clustering resources are either
> totally out of date or, more often, written from the perspective of
> people who:
>
> o Want to build tightly coupled supercomputer-like systems on the cheap
> that will only really run parallel apps ('beowulf')
>
> o Are willing to do silly and complicated things in order to get the
> fastest possible performance at the expense of everything else,
> including reliability and ease of management. There is a huge bias out
> there towards getting the fastest possible raw performance at the
> expense of literally everything else.

I'll jump in and provide some perspective from someone who has
administered, programmed, and built these high-performance
clusters and is only starting to understand biocomputing (but I
think I understand some of the general principles).

First, I take issue with the idea that because I'm interested in
pure speed, I don't take the time or care to develop or find or
buy the tools I need to admin the clusters effectively and efficiently.
I probably do the same things you do, but I come up with tools that
ease my admin burden AND don't compromise speed. I don't know if
people document what they do in making these tools, but if you start
searching the web, you'll find them (at least I did).

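For what it's worth, most of my home-grown tools are nothing fancier
than a shell loop over a node list. A rough sketch (the
/etc/cluster/nodes file, one hostname per line, is just my convention
here, and it assumes passwordless rsh or ssh to the nodes):

    #!/bin/sh
    # crun -- run the same command on every compute node, one at a time.
    # Node list and rsh/ssh trust are site-specific assumptions.
    for n in `cat /etc/cluster/nodes`; do
        echo "=== $n ==="
        rsh $n "$@"
    done

Used as "crun uptime" or "crun 'df /scratch'". Dirt simple, adds no
overhead on the nodes, and covers most of the day-to-day poking around.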

>
>
> Both of these approaches are generally not cool for life science
> clusters, which typically are not "beowulf-style" systems anyway.
>
> With some exceptions, biologists don't build clusters designed to run a
> single instance of some massively parallel application at supercomputer
> speeds. Biologists tend to use clusters as a way of distributing their
> huge non-parallel ("embarrassingly parallel") compute demands across many
> inexpensive, loosely coupled systems. The software layer that handles
> job scheduling, remote execution and dispatch is typically something
> like PBS, GridEngine or Platform's LSF suite.
>
> This is why I tend to use the term "compute farm" rather than "cluster"
> for most of the stuff I build.

Agreed.

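To make Chris's point concrete for the archives, the jobs we farm out
look roughly like the sketch below. The queue name, resource limits and
program name are made up; the #PBS directives and $PBS_O_WORKDIR are
standard PBS.

    #!/bin/sh
    #PBS -N chunk_042
    #PBS -l nodes=1:ppn=1,walltime=12:00:00
    #PBS -q workq
    #PBS -j oe
    # Each job is one independent piece of the larger run -- no MPI,
    # no inter-node communication; the scheduler just farms them out.
    cd $PBS_O_WORKDIR
    ./my_serial_code input_042 > output_042

You qsub a few hundred of these from a loop and let PBS (or GridEngine,
or LSF) keep the farm busy.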

>
>
> When it comes to administering large, loosely coupled systems used for
> life science research I have not found any good comprehensive books or
> online references. I do know that people are working on such things for
> OReilly and other publishers though...
>
> You may want to try seeing if there is anything useful up at the
> SourceForge Clustering foundry: http://foundries.sourceforge.net/clusters/
>
> Anyone else have links?
>
> From my experience, here are the 2 biggest pain points I have found
> from a cluster admin perspective. If you can solve these to your (and
> your manager's) expectations then you are in a very good position!
> Knowing how to tackle these 2 things before you purchase your cluster
> is even better. heh.
>
> (**1**) Reducing administrative burden as much as possible
>
> This is your # 1 concern as a cluster administrator. The goal is to do
> everything possible to avoid having to treat and manage your cluster as
> dozens or hundreds of individual machines.  When I was at Blackstone one
> of my internal research interests was figuring out how to make a 1,000
> node cluster require only one half-time administrator to operate.

I can give you some numbers that I found over the last few years.
We have two 64-node, dual-CPU clusters, each with a separate master
node. I'll only give you numbers for one of them because of issues
that I explain below. This cluster was purchased from a vendor and
I'd estimate admin time at less than 2 hours a week.

If the cluster is well designed, uses good components, is tuned
properly (probably not an issue for bioclusters), and burned in
well, then usually you won't lose many nodes. For this one cluster,
in 2 years of 24/7 use, we only lost 2 hard drives. No power supplies,
no NICs, no CPUs, no memory, nothing else.

YMMV


>
>
> It boils down to ruthlessly automating and scripting everything that is
> humanly possible. In an ideal world your cluster compute elements will
> then become:
>
> o anonymous (users should never care where their job actually runs)
> o interchangeable (if a node dies the workload is migrated and a new
> server is brought online)
> o disposable (if a node breaks send it back to the vendor and pop in a
> cold spare *whenever convenient*)
>
> There are lots of methods for easing cluster administration. Some are
> commercial and some are free. I saw a company at the OReilly
> Bioinformatics Conference called LinuxNetworx
> (http://www.linuxnetworx.com/) that had these amazing "ICE boxes" in
> their rack that combined serial console, remote power control and
> temperature monitoring into one small package. Very cool - wish I could
> buy those as a standalone product.

I have a cluster from Linux Networx that has these ICE boxes. I have
used them very little (if at all). I can give a long perspective on this
machine if you like, but I won't do it publicly. Just let me say that
after a year we were ready to throw the machine out the window, and we
found somebody else to support their system.

However, I'd never think about buying or building a cluster without some
sort of remote power control. The bad thing about the ICE boxes is that
you can only get them from one place. You can buy COTS remote power
units very cheaply off the web from a number of places.

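Along the same lines, whichever power unit you buy, the script side of
a staged power-up is trivial. A rough sketch ('powerctl' is a stand-in
for whatever command-line tool your particular unit ships with, and the
node list file is my own convention):

    #!/bin/sh
    # Staged cluster power-up: bring nodes up a few at a time so the
    # inrush current doesn't trip the circuit breakers.
    BATCH=4      # nodes per batch
    DELAY=15     # seconds between batches
    i=0
    for n in `cat /etc/cluster/nodes`; do
        powerctl on $n           # placeholder for your unit's CLI
        i=`expr $i + 1`
        if [ `expr $i % $BATCH` -eq 0 ]; then
            sleep $DELAY
        fi
    done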

>
>
> My biggest tools in this area are (a) SystemImager and (b) remote power
> control
>
> SystemImager (www.systemimager.org) kicks all kinds of ass. Using it I
> can completely install a cluster node from scratch without having to
> attach a keyboard or anything else. Just boot off an autoinstall CD-ROM
> or floppy, or in some cases a network-based PXE boot will do the trick.
>
> Besides automating the process of partitioning disks and installing the
> operating system and layered software SystemImager also allows you to
> incrementally push out changes which makes the process of installing or
> upgrading software or libraries pretty trivial.

I use kickstart since we're Red Hat-based. Very easy to use (especially
with a vendor behind you helping you :). It rebuilds nodes very quickly.

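For anyone who hasn't looked at kickstart, a compute-node config file
really is short. Ours boils down to something like the sketch below
(server address, partition sizes and timezone are made up, and the
exact directives shift a bit between Red Hat releases):

    # ks.cfg -- hands-off rebuild of a compute node (illustrative only)
    install
    nfs --server 10.0.0.1 --dir /export/redhat
    lang en_US
    keyboard us
    network --bootproto dhcp
    rootpw --iscrypted XXXXXXXXXXXX
    timezone --utc America/New_York
    bootloader --location=mbr
    zerombr yes
    clearpart --all
    part / --size 4096
    part swap --size 512
    skipx
    reboot
    %packages
    @ Base
    %post
    # pull site-specific bits (hosts file, rc scripts, pbs_mom config)
    # from the master node here

Boot the node off a floppy or PXE with ks=nfs:10.0.0.1:/kickstart/ks.cfg
on the kernel command line and walk away.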

>
>
> Remote power control is nice because I can remotely kill or reboot nodes
> that are misbehaving and I can also turn on and turn off the entire
> cluster in a staged manner (so you don't blow your power circuits!)
>
> With these 2 tools in hand, this is what my admin philosophy becomes:
>
> (1) If a node is behaving, don't touch it
> (2) If a node acts strangely use SystemImager to automatically wipe the
> disk and reinstall the OS from scratch (remotely)

I usually try to debug a node first before re-imaging it. I also plug into
a node that is locked up to see if I can find out anything (Linux doesn't
behave well under heavy memory pressure - "swapping itself to death").

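My first pass is usually just a quick sweep of load and swap across the
farm, since a "dead" node is very often one that has swapped itself into
the ground. Something like this (again assuming working rsh/ssh and my
hypothetical /etc/cluster/nodes list):

    #!/bin/sh
    # Quick health sweep: show load and swap usage on every node so you
    # can decide whether to debug, reboot, or re-image.
    for n in `cat /etc/cluster/nodes`; do
        echo "--- $n ---"
        rsh $n 'uptime; free | grep -i swap' \
            || echo "    no response - candidate for power cycle / re-image"
    done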

>
> (3) If a node acts strangely after it has been freshly imaged then
> remotely kill the power and leave it dead.
> (4) Whenever it is _convenient_ for me as an administrator take the dead
> node out and pop in a spare. Thanks to SystemImager, in about 6 minutes
> I'll have a fully operational cluster node that is again performing
> useful work. The dead node can either be diagnosed onsite (if you feel
> like it) or sent back to the vendor for replacement.

Agreed.

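In practice the swap-in-a-spare drill is only a few commands plus the
walk to the rack. Roughly (pbsnodes is standard PBS; 'powerctl' is again
a stand-in for your remote power unit, and node17 is made up):

    #!/bin/sh
    # Replace a dead compute node with a cold spare -- rough outline.
    pbsnodes -o node17    # mark it offline so PBS stops scheduling onto it
    powerctl off node17   # kill the power remotely (stand-in command)
    # ...swap the hardware whenever convenient, then let the autoinstall
    # (SystemImager boot media or kickstart) rebuild it from scratch...
    pbsnodes -c node17    # clear the offline flag once it's back

The rebuild itself is minutes of wall-clock time; the only scarce
resource is your trip to the machine room.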

>
>
> No muss, No fuss. The key is to never waste time dealing with any
> individual machine.
>
> (**2**) Research and install your load management software carefully
>
> It makes me sad to see people go out and spend tens of thousands of
> dollars (or even more) on cluster hardware only to turn around and
> neglect the software side of things by throwing on a half-assed default
> PBS RPM install and walking away.
>
> PBS may be free but it requires care and attention to get it configured
> and keep it online. Many people who don't do their due diligence end up
> screwing themselves because they find that they need someone almost
> fulltime just to keep the darn load management layer running. This is
> especially true for PBS where people are constantly finding themselves
> patching and recompiling the code from source.

Agreed. However, once PBS was up in the fashion we wanted,
it worked just great! I never touch it, restart it, patch it, etc. It
just works (don't forget to mark nodes that are down as offline).
The learning curve for PBS was steeper than I had thought, but once
it was up, it just worked.

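On marking down nodes offline: that's easy to automate too. A small
cron job along these lines has kept our scheduler from tripping over
dead nodes (the awk assumes the two-column "name state" output our
OpenPBS prints for 'pbsnodes -l'; check yours, and the log path is
made up):

    #!/bin/sh
    # Run from cron every 10 minutes or so: anything PBS reports as
    # "down" gets explicitly marked offline so the scheduler leaves it
    # alone until a human clears it with "pbsnodes -c <node>".
    for n in `pbsnodes -l | awk '$2 ~ /down/ {print $1}'`; do
        pbsnodes -o $n
        echo "`date`: marked $n offline" >> /var/log/pbs-offline.log
    done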

>
>
> This is why I recommend LSF software from Platform. It may be expensive
> (really expensive...) but it installs in minutes, is easily configured
> and is way more stable than any of the competition (PBS, PBSPro,
> GridEngine, etc.). In the long run the reduced administrative burden and
> serious fault tolerance that LSF provides can make the cost of the
> commercial license very reasonable.

Define stable. We've had PBS running for at least 8 months without
needing a restart (the only time it was restarted was to upgrade the
CPUs in the nodes and to move the machine to a different location).

I've had exceptional luck with it and I'm sure other users have
too. Pop on to the PBS mailing list and ask people who have tried
all of the systems.

In fact, PBS is stable enough for us and so much cheaper than LSF
that, as a company, we are switching to PBSPro (we've been using
OpenPBS, but another site has been using LSF for years).


>
>
> Another alternative that is cheaper than LSF is to build the cluster
> yourself but hire professional consultants to come in and handle the
> tricky part of getting the load management system configured and
> tweaked. The good people at Veridian systems sell a commercial version
> of PBS called "PBSPro" that is reasonably priced. They'll even give you
> the source code if you need it. Paying Veridian for a few days of
> consulting time may be worth it if they leave you with a fully
> configured system that does not require lots of ongoing care and feeding.

Agreed. Take the PBS class if you can. Otherwise, it's on-the-job
training like I had :( The people on the PBS mailing list are pretty
good, and the Veridian folks chime in fairly often.

Just some issues I had to get off my chest.

Jeff Layton

Lockheed-Martin Aeronautical Company - Marietta


>
>
> Damn I'm long winded today.
>
> -Chris
>
> --
> Chris Dagdigian, <dag@sonsorol.org>
> Life Science IT & Research Computing Geek; http://BioTeam.net
> Office: 617-666-6454, Mobile: 617-877-5498, Fax: 425-699-0193
> PGP KeyID: 83D4310E  Yahoo IM: craffi
>
> _______________________________________________
> Bioclusters maillist  -  Bioclusters@bioinformatics.org
> http://bioinformatics.org/mailman/listinfo/bioclusters