Kris Boulez wrote:
> What do people find to be good resources (books, websites, tools) to learn
> more about administering a compute cluster? I'm not looking for general
> unix sysadmin (been doing this for 10 years), but stuff which comes into
> play when administering large numbers of machines.
> I looked at the biocluster install diary Chris posted a few days ago,
> but was wondering if people know of other resources.
>
> Kris

Hey Kris-

Most of the available printed or online clustering resources are either totally out of date or, more often, written from the perspective of people who:

o want to build tightly coupled, supercomputer-like systems on the cheap that will only really run parallel apps ("beowulf"), or

o are willing to do silly and complicated things to get the fastest possible performance at the expense of everything else, including reliability and ease of management.

There is a huge bias out there toward getting the fastest possible raw performance at the expense of literally everything else. Both of these approaches are generally not cool for life science clusters, which typically are not "beowulf-style" systems anyway. With some exceptions, biologists don't build clusters designed to run a single instance of some massively parallel application at supercomputer speeds. Biologists tend to use clusters as a way of distributing their huge non-parallel ("embarrassingly parallel") compute demands across many inexpensive, loosely coupled systems. The software layer that handles job scheduling, remote execution and dispatch is typically something like PBS, GridEngine or Platform's LSF suite. This is why I tend to use the term "compute farm" rather than "cluster" for most of the stuff I build.

When it comes to administering large, loosely coupled systems used for life science research I have not found any good comprehensive books or online references. I do know that people are working on such things for O'Reilly and other publishers though...
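To make the "compute farm" usage pattern concrete, here is a minimal sketch of farming an embarrassingly parallel run out through GridEngine's array-job support (the script name, input file layout and BLAST invocation are my own illustrative assumptions; PBS and LSF have equivalent mechanisms):

```shell
#!/bin/sh
# farm_blast.sh -- hypothetical example: spread one big sequence search
# across the farm as a GridEngine array job.
#
# Submit with:  qsub -t 1-100 farm_blast.sh
# GridEngine runs 100 independent tasks and sets $SGE_TASK_ID to 1..100,
# one value per task. The scheduler picks the node for each task, so the
# nodes stay "anonymous" from the user's point of view.

CHUNK="input/chunk.${SGE_TASK_ID}.fasta"    # pre-split input (assumed layout)
OUT="output/chunk.${SGE_TASK_ID}.report"

# Each task searches only its own chunk; no task talks to any other.
blastall -p blastp -d nr -i "$CHUNK" -o "$OUT"
```

The point is that the "parallelism" lives entirely in the scheduler splitting up independent tasks, not in the application itself.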
You may want to try seeing if there is anything useful up at the SourceForge Clustering foundry: http://foundries.sourceforge.net/clusters/ Anyone else have links?

From my experience, here are the 2 biggest pain points I have found from a cluster admin perspective. If you can solve these to your (and your manager's) expectations then you are in a very good position! Knowing how to tackle these 2 things before you purchase your cluster is even better. heh.

(**1**) Reducing administrative burden as much as possible

This is your #1 concern as a cluster administrator. The goal is to do everything possible to avoid having to treat and manage your cluster as dozens or hundreds of individual machines. When I was at Blackstone, one of my internal research interests was figuring out how to make a 1,000-node cluster require only one half-time administrator to operate. It boils down to ruthlessly automating and scripting everything that is humanly possible. In an ideal world your cluster compute elements then become:

o anonymous (users should never care where their job actually runs)

o interchangeable (if a node dies the workload is migrated and a new server is brought online)

o disposable (if a node breaks, send it back to the vendor and pop in a cold spare *whenever convenient*)

There are lots of methods for easing cluster administration. Some are commercial and some are free. I saw a company at the O'Reilly Bioinformatics Conference called LinuxNetworx (http://www.linuxnetworx.com/) that had these amazing "ICE boxes" in their rack that combined serial console, remote power control and temperature monitoring into one small package. Very cool; wish I could buy those as a standalone product.

My biggest tools in this area are (a) SystemImager and (b) remote power control.

SystemImager (www.systemimager.org) kicks all kinds of ass. Using it I can completely install a cluster node from scratch without having to attach a keyboard or anything else.
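The basic SystemImager workflow is: configure one "golden client" node by hand, pull its image to an image server, then clone that image onto every other node. Roughly, it looks like this (hostnames and the image name are my own placeholders, and the command syntax is from the SystemImager 2.x era as I remember it, so verify against the docs for your version):

```shell
# On the image server: capture the golden client's installed system.
# "node001" and "lifesci-node" are placeholder names, not defaults.
getimage -golden-client node001 -image lifesci-node

# Build autoinstall boot media so a bare machine can image itself
# with no keyboard or monitor attached:
mkautoinstallcd -out-file autoinstall.iso

# Later, on any already-installed node: pull incremental changes
# (new libraries, config tweaks, etc.) by re-syncing against the
# updated image on the server:
updateclient -server imageserver -image lifesci-node
```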
Just boot off an autoinstall CD-ROM or floppy, or in some cases a network-based PXE boot will do the trick. Besides automating the process of partitioning disks and installing the operating system and layered software, SystemImager also allows you to incrementally push out changes, which makes the process of installing or upgrading software or libraries pretty trivial.

Remote power control is nice because I can remotely kill or reboot nodes that are misbehaving, and I can also turn the entire cluster on and off in a staged manner (so you don't blow your power circuits!).

With these 2 tools in hand, this is what my admin philosophy becomes:

(1) If a node is behaving, don't touch it.

(2) If a node acts strangely, use SystemImager to automatically wipe the disk and reinstall the OS from scratch (remotely).

(3) If a node acts strangely after it has been freshly imaged, remotely kill the power and leave it dead.

(4) Whenever it is _convenient_ for me as an administrator, take the dead node out and pop in a spare. Thanks to SystemImager, in about 6 minutes I'll have a fully operational cluster node that is again performing useful work.

The dead node can either be diagnosed onsite (if you feel like it) or sent back to the vendor for replacement. No muss, no fuss. The key is to never waste time dealing with any individual machine.

(**2**) Research and install your load management software carefully

It makes me sad to see people go out and spend tens of thousands of dollars (or even more) on cluster hardware only to turn around and neglect the software side of things by throwing on a half-assed default PBS rpm install and walking away. PBS may be free but it requires care and attention to get it configured and keep it online. Many people who don't do their due diligence end up screwing themselves because they find that they need someone almost fulltime just to keep the darn load management layer running.
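To give a flavor of the "care and attention" involved: a default PBS install won't even schedule jobs until you hand-configure the server and at least one queue with `qmgr`. A minimal sketch (the queue name and limits here are made-up values for illustration, not recommendations):

```shell
# Run as root on the PBS server host after installing pbs_server.
# "workq" and the limits below are illustrative placeholders.
qmgr -c "create queue workq queue_type=execution"
qmgr -c "set queue workq resources_max.walltime=24:00:00"
qmgr -c "set queue workq enabled=true"
qmgr -c "set queue workq started=true"
qmgr -c "set server default_queue=workq"
qmgr -c "set server scheduling=true"
```

And that is just the happy path; the ongoing patching, recompiling and babysitting is where the real time goes.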
This is especially true for PBS, where people are constantly finding themselves patching and recompiling the code from source. This is why I recommend LSF software from Platform. It may be expensive (really expensive...) but it installs in minutes, is easily configured and is way more stable than any of the competition (pbs, pbsPro, gridengine, etc.). In the long run, the reduced administrative burden and serious fault tolerance that LSF provides can make the cost of the commercial license very reasonable.

Another alternative that is cheaper than LSF is to build the cluster yourself but hire professional consultants to come in and handle the tricky part of getting the load management system configured and tweaked. The good people at Veridian Systems sell a commercial version of PBS called "PBSPro" that is reasonably priced. They'll even give you the source code if you need it. Paying Veridian for a few days of consulting time may be worth it if they leave you with a fully configured system that does not require lots of ongoing care and feeding.

Damn, I'm long winded today.

-Chris

--
Chris Dagdigian, <dag@sonsorol.org>
Life Science IT & Research Computing Geek; http://BioTeam.net
Office: 617-666-6454, Mobile: 617-877-5498, Fax: 425-699-0193
PGP KeyID: 83D4310E  Yahoo IM: craffi