[Bioclusters] Request for discussions-How to build a biocluster Part 3 (the OS)

Donald Becker bioclusters@bioinformatics.org
Sun, 19 May 2002 00:00:03 -0400 (EDT)


On 18 May 2002, Mike Coleman wrote:
> Donald Becker <becker@scyld.com> writes:
> > With the Scyld system, compute nodes are dramatically simplified.  They
> > run a fully capable standard kernel with extensions, and start out with
> > no file system (actually a RAM-based filesystem).
> >
> > There are many advantages of this approach.
> >   Adding new compute nodes is fast and automatic
> >   The system is easily scalable to over a thousand nodes
> >   Single-point updates for kernel, device drivers, libraries and applications
> >   Jobs run faster on compute nodes than on a full installation
>
> For the sake of argument, I'm comparing this with an nfsroot setup (with a
> common /etc for all slaves, /var on ramdisk, /dev on devfs, everything else
> mounted read-only straight off the master, except for application files
> mounted r/w).

Setting up an NFS root isn't quite as simple as that.  There are many
exceptions: most of /etc/ can be shared, but a few files
(e.g. /etc/fstab) must exist per node.  There are enough exceptions in
different directories that most NFS root systems use an explicitly
created file tree on the file server for each node in the cluster.
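For concreteness, a hypothetical server-side layout (all names made up)
might look like:

    /export/nfsroot/common/               shared, read-only tree
    /export/nfsroot/node001/etc/fstab     per-node exception files
    /export/nfsroot/node001/etc/HOSTNAME
    /export/nfsroot/node002/etc/fstab
    ...

with each node's boot configuration pointing its root path at its own
subtree.  Creating those per-node trees, and keeping them consistent,
is part of the administrative overhead.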

>  It looks like points 1 and 3 would work the same and I don't
> see why point #4 would be true.

The faster execution is the effect of
    - cluster directory (name) services
    - wired-down core libraries
    - no housekeeping daemons
    - integrated MPI and PVM
    - fast BProc job initiation

Examples of point #1 are
   Using cluster node names such as ".23".  Destination addresses can
   be calculated from the IP address of ".0" alone (a rough sketch
   follows below).
   Sending user names along with processes, to avoid /etc/passwd
   look-ups or network traffic.
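As a rough sketch only (the addresses and the exact numbering scheme
here are assumptions, not the actual Scyld code), resolving a name like
".23" needs nothing more than integer arithmetic on the address of ".0":

    /* Sketch: map a cluster node name like ".23" to an IPv4 address
     * by adding the node number to the address of node ".0".
     * The base address is only an example. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <arpa/inet.h>

    int main(void)
    {
        const char *name = ".23";            /* cluster node name */
        struct in_addr base;
        inet_aton("10.0.0.1", &base);        /* address of node ".0" (made up) */

        long node = strtol(name + 1, NULL, 10);  /* number after the dot */
        struct in_addr addr;
        addr.s_addr = htonl(ntohl(base.s_addr) + node);

        printf("node %s -> %s\n", name, inet_ntoa(addr));
        return 0;
    }

No lookup files, no name server traffic, no per-node configuration.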

The last two points relate only to job start-up time.
Cluster directory services and the clean BProc semantics make MPI and
PVM job start-up much simpler.  BProc reduces node process creation
time to roughly 1/10 that of 'rsh' and 1/20 that of 'ssh'.

Job start-up time isn't important for long-running jobs, but it is
important if you want to use a cluster as an interactive machine rather
than a batch system.


> It does seem like scalability is something you'd have to keep an eye on.  In
> the comparison setup, you're read-only mounting a lot of files off of the
> master.  I'm not sure how many hosts you can do this with before you start
> running into trouble, but it does seem like it should scale somewhat
> (at least with parameter tweaking, as you pointed out).

Clusters are a much different situation from workstations -- all nodes
want the same services at the same time.
Using NFS root with default parameters scales very poorly.  You can do
much better by using multiple NFS mounts with different attributes and
caching parameters.  But that requires expertise and ongoing
monitoring.  Just managing NFS quickly becomes complex, both for users
and for administrators.
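As an illustration of the kind of tuning involved (the paths and option
values below are made up), a tuned setup typically splits what could be
a single mount into several with different caching and locking
attributes, e.g. fstab lines of the form:

    master:/export/root  /      nfs  ro,nolock,actimeo=120  0 0
    master:/export/usr   /usr   nfs  ro,nolock,actimeo=600  0 0
    master:/export/work  /work  nfs  rw,hard,intr           0 0

Each of those knobs, and the matching server-side export options, has
to be chosen and then re-checked as the node count grows.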

> To me, BProc seems considerably more complex.  You pretty much have to
> understand BProc.

You only need to understand BProc if you are writing system tools.  End
users, administrators and MPI developers don't need to learn new
concepts.

> > The user sees BProc as the unified process space over the cluster.  They
> > can see and control all processes of their job using Unix tools they
> > already know, such as a 'top', 'ps', 'suspend' and 'kill'.
>
> Yes, and this is really nice.  But BProc also seems to be stretching POSIX
> pretty hard.

In what way is it stretching POSIX?  Are there any semantics that are
questionable?

> In normal Linux, when I kill(2) a process, failure (as in
> communications failure) is not a possibility.  It seems like a lot of subtle
> failure modes could be hiding here.

[[ Please avoid innuendo such as that last statement.  You are implying
a problem without being specific about what it might be or standing
behind the claim. ]]

I am often asked about the semantics of node failure.
First, much like a segmentation violation in a local child process, a
node failure is not the normal or typical mode of operation.

Process control takes place over reliable streams, and failure policy is
implemented in a user-level control process.  When the master can no
longer communicate with a slave process, the process is considered to
have died abnormally.  You won't be able to get process exit
accounting information, but there are few other differences from a local
process.
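To a front end the effect is much like a local child killed by a
signal.  A hypothetical sketch of the wait-side handling (ordinary
POSIX, nothing BProc-specific) would be:

    /* Sketch: a front end reaping a (possibly remote) child.  Under
     * BProc the child may run on a compute node, but the wait/exit
     * interface the parent sees is the ordinary POSIX one. */
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    void reap(pid_t pid)
    {
        int status;
        if (waitpid(pid, &status, 0) < 0) {
            perror("waitpid");
            return;
        }
        if (WIFEXITED(status))
            printf("%d exited with status %d\n", pid, WEXITSTATUS(status));
        else if (WIFSIGNALED(status))
            printf("%d died abnormally (signal %d)\n", pid, WTERMSIG(status));
    }

A job that already handles the abnormal case needs no new code paths
for a lost node.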

Application recovery after a node failure still has to be handled on a
case-by-case basis.  But that is always true.  Most HPTC users accept
the tradeoff of running at full speed and re-running the jobs affected
by the rare node failure.  Long-running jobs do their own
application-specific checkpointing, which is much more efficient than
full-memory-image checkpointing.
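A minimal sketch of what application-specific checkpointing means here
(the file name, interval and saved state are all made up): the job
periodically writes only the state it needs to restart, not its whole
memory image.

    /* Sketch: periodic application-level checkpointing.  Only the
     * iteration counter and the working array are saved. */
    #include <stdio.h>

    #define N 1000000
    static double data[N];

    static void checkpoint(long iter)
    {
        FILE *f = fopen("job.ckpt.tmp", "wb");
        if (!f) return;
        fwrite(&iter, sizeof iter, 1, f);
        fwrite(data, sizeof data[0], N, f);
        fclose(f);
        rename("job.ckpt.tmp", "job.ckpt");  /* atomic replace */
    }

    int main(void)
    {
        for (long iter = 0; iter < 100000; iter++) {
            /* ... one step of the real computation ... */
            if (iter % 1000 == 0)
                checkpoint(iter);
        }
        return 0;
    }

After a node failure the job is simply restarted from "job.ckpt",
rather than restoring a multi-gigabyte process image.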


-- 
Donald Becker				becker@scyld.com
Scyld Computing Corporation		http://www.scyld.com
410 Severn Ave. Suite 210		Second Generation Beowulf Clusters
Annapolis MD 21403			410-990-9993