Hi John,

I've seen, toured, or been involved with a bunch of clustering projects over the years, and I've never seen anyone really shoot for five nines of uptime for their clusters. In most cases these are research systems, and the owners have decided to forgo expensive HA clustering in favor of either (a) saving money or (b) plowing more money into storage, network, or raw CPU power. This has been true in my experience across academic, biotech, and big pharma settings. These people understand that their cluster is a discovery research system and occasional downtime is not going to be unusual. Most people I know consider downtimes of less than 4 hours or so (in hardware failure cases) to be par for the course.

You will also find that hardware is not the most common failure case. Many times the cluster goes down because a user crashed the cluster or the DRM -- not a hardware cause at all.

Joe hit it on the head -- there is a whole body of best practices in the application server and robust RDBMS space that you can draw upon to learn what people are doing for HA. It typically involves shared storage, IP failover, and some sort of heartbeat mechanism between machines. I've heard good things about the Linux HA project but have never actually used it.

Please keep the list informed; I'd be interested in seeing how this project goes.

I have two suggestions for you to consider. They won't come close to getting you to 100% uptime, but they will get you closer, and you won't have to spend tons of $$$ on special filesystems, cross-connected storage, IP switches, etc.

(1) Purchase a cold/warm spare head node. Configure it so it is ready to go, or perhaps keep a set of clone disks on hand that you can throw in. If your cluster storage is separate (i.e., NAS or an external fileserver), you can bring up a new head node in a few minutes and reimage it to bring it up to date in another couple of minutes. You may find your management and users are willing to put up with an hour or two of downtime in case of head node failure. This will save you time, money, and complexity at the cost of some absolute downtime if the head node goes down.

(2) I like this solution best -- why don't you configure multiple head nodes? It is trivial to add N more multi-homed servers to your cluster, and DRM software layers like Grid Engine and Platform LSF can be configured to fail over the scheduling and resource allocation daemons; all they need between themselves is a common NFS filesystem. Grid Engine has "shadow masters" that will activate upon failure of the master node, and Platform LSF has a mechanism whereby the cluster will select, elect, and promote a new machine to be the cluster master. If you combine multiple head nodes that are each capable of acting as the cluster scheduler and gateway, you can also put one of those simple load balancer boxes that companies sell into the web farm space in front of them -- these boxes do round-robin DNS or load-based balancing of IP traffic between machines on the same subnet.
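If you go the Grid Engine route, the shadow master setup is pretty lightweight. From memory (details vary a bit between SGE releases, so check the docs for your version), it looks roughly like this -- the hostnames and install path here are made up, and this assumes $SGE_ROOT sits on the shared NFS filesystem and you are using the default cell:

    # List the master host first, then the shadow hosts in failover order.
    # The file lives in the cell's "common" directory on shared storage:
    $ cat $SGE_ROOT/default/common/shadow_masters
    head1
    head2

    # On each shadow host, pick up the cluster environment and start the
    # shadow daemon. It watches the qmaster heartbeat file in the spool
    # area and starts a replacement qmaster if the master stops checking in:
    head2$ . /opt/sge/default/common/settings.sh    # sets SGE_ROOT, PATH, etc.
    head2$ sge_shadowd

Note that failover is not instantaneous; the shadow daemon waits out a configurable interval (the SGE_CHECK_INTERVAL and SGE_GET_ACTIVE_INTERVAL environment variables, if I remember right) before taking over, so budget a minute or two of dead air.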
-Chris

Osborne, John wrote:
> Hello,
>
> I'm the unofficial admin for a 20 node (40 CPU) Linux cluster here at the
> CDC and I'm looking for some advice. Our setup here relies upon a *single*
> master node which acts as a gateway to the internal cluster network. If
> something were to happen to the master node, we'd be in serious trouble if
> we were aiming for 100% uptime.
>
> So far we aren't that serious about 100% uptime (although we've had it for
> this master node thus far), but as the popularity of the cluster grows it
> is becoming more important. I am wondering what is the best way to ensure
> failover for a master node in a cluster. Right now I just write out a
> master node image to network storage every night, and if something goes
> wrong, the cluster is effectively down and it could take hours to get it
> fixed.
>
> Is it possible to have 2 master nodes with a single virtual IP address?
> How are other people solving this problem?
>
> -John
>
> _______________________________________________
> Bioclusters maillist - Bioclusters@bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/bioclusters

-- 
Chris Dagdigian, <dag@sonsorol.org>
BioTeam Inc. - Independent Bio-IT & Informatics consulting
Office: 617-666-6454, Mobile: 617-877-5498, Fax: 425-699-0193
PGP KeyID: 83D4310E, Yahoo IM: craffi, Web: http://bioteam.net