[Bioclusters] cluster hardware question

Joe Landman bioclusters@bioinformatics.org
16 Jan 2003 19:10:26 -0500


On Thu, 2003-01-16 at 23:26, Chris Dagdigian wrote:

> The biggest performance bottleneck in 'bioclusters' is usually disk I/O 
> throughput. Bio people tend to do lots of things that involve streaming 
> massive text and binary files through the CPU and RAM (think running a 
> blast search). The speed of your storage becomes the rate limiting 
> performance bottleneck. Often there will be terabytes of this sort of 
> data lying around so the "/data" volume is usually an NFS mount.

Note:  moving to gigabit simply moves the pain and stress to another
location in the cluster; it doesn't "solve" this problem.  You have to
be careful in choosing which problems you wish to deal with...

> If disk I/O is not your bottleneck then memory speed and size will
> likely be the next bottleneck. Some applications like blast and sequence
> clustering algorithms will always be better off with as much physical
> RAM as you can cram into a box. Other applications are rate-limited by 
> memory access speeds which is why Joe recommends fast DDR memory for 
> users who need high mem performance.

I tell everyone to think about the highly over-simplified equation for
the time it takes to move some data and compute with it.

	T(total) = T(moving-data) + T(computing-with-data)

The T(moving-data) portion may be simply represented by

	T(moving-data) = Memory_Latency 
			+ (size_of_stuff_to_move)/memory_bandwidth.

If you go walking randomly through memory, the latency will limit you. 
If you go walking with a cache friendly memory access pattern, the
memory bandwidth will limit you.

DDR buys you good latency and good bandwidth.  RDRAM buys you ok latency
and great bandwidth.
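
To make the latency-vs-bandwidth point concrete, here is a minimal
back-of-the-envelope sketch in Python.  The latency and bandwidth
numbers are rough illustrative guesses for a DDR-class memory system,
not measurements:

    # Rough model: T(moving-data) = latency + size / bandwidth.
    # The numbers below are illustrative guesses, not benchmarks.
    latency   = 100e-9   # ~100 ns per random access (guess)
    bandwidth = 2.1e9    # ~2.1 GB/s sustained streaming (PC2100-class guess)

    def t_move(size_bytes, accesses=1):
        # time to move size_bytes in the given number of separate accesses
        return accesses * latency + size_bytes / bandwidth

    one_gb = 1e9
    # Streaming: one big sequential pass, so bandwidth dominates.
    print("streaming 1 GB: %.2f s" % t_move(one_gb, accesses=1))
    # Random walk: one 64-byte cache line per access, so latency dominates.
    print("random 1 GB:    %.2f s" % t_move(one_gb, accesses=int(one_gb / 64)))

With those guesses the random-access pass comes out roughly four times
slower than the streaming pass, even though the same gigabyte of data
moves.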

I have heard a quote attributed to Hennessy at Stanford: "You can always
buy bandwidth, but latency is forever..."

The computing time is also interesting.  The net idea is that MHz is
only part of the performance picture; when you work on real problems
you quickly discover that the other aspects matter more.

[...]

> General rule of thumb: no matter how big, fast and expensive the storage 
> solution is you will always be able to drive it to its knees with enough 
> cheap cluster nodes hitting it. This is something that has to be lived 
> with.

I have a mantra which might be useful here:

	"Fast storage is local storage"

This is not universally true, but it is a good working assumption in
many cases.  You have costs to pay with this mode of thinking though.

With a striped basic IDE system (i.e. without working too hard at it) I can
get 70+ MB/s sustained on large sequential reads (think of reading in
these large GB sized files for analysis).  I wrote up an analysis
framework a while ago on this.  You can pull that from
http://scalableinformatics.com/scalable_fs_part_1.html
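
As a crude illustration of why "fast storage is local storage" for this
kind of streaming work, here is a hedged back-of-the-envelope sketch in
Python.  The 70 MB/s figure is the striped-IDE number above; the NFS
server rate and the file size are assumptions made up for the example:

    # Time to stream a large file from local striped IDE vs. a shared
    # NFS server.  The rates below are assumptions, not benchmarks.
    local_mb_s      = 70.0   # sustained local sequential read (from above)
    nfs_server_mb_s = 60.0   # guess at what one gigabit NFS server sustains

    def stream_seconds(file_gb, rate_mb_s):
        return file_gb * 1024.0 / rate_mb_s

    file_gb = 4.0            # e.g. a large sequence database
    for nodes in (1, 8, 32):
        shared = nfs_server_mb_s / nodes   # server bandwidth split N ways
        print("%2d nodes: local %4.0f s each, shared NFS %5.0f s each"
              % (nodes, stream_seconds(file_gb, local_mb_s),
                 stream_seconds(file_gb, shared)))

The exact numbers don't matter; the point is that the local-disk time
stays flat as you add nodes, while the shared-server time grows roughly
linearly with the number of nodes hitting it.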


> There are faster solutions out there but they get expensive (think 
> having to run switched fibre channel to every cluster node) and are 
> often very proprietary. People in the 'super fast' storage space 
> include: DataDirect, BlueArc, Panasas, etc. etc.

Some of these designs are quite compelling, BTW.  If you really need
scalable global-namespace IO, you can do it, and it has some specific
advantages over local IO.  The costs I mentioned above for local storage
are the costs of moving data between the compute nodes; with a global
namespace, you don't have those problems.

[...]

> A linux box with some SCSI or ATA drives serving as an NFS fileserver
> will cost just a few thousand dollars

A Promise UltraATA card is cheap, and one of the fastest around.

> A linux box with a terabyte of NexSan ATA RAID attached via fibrechannel 
> or SCSI will cost about $12-$15,000

[...]

> >> Also very much overlooked is the issue of cluster management.  This
> >> tends to guide the choice of Linux distribution.  Management gets to be
> >> painful after the 8th compute node, the old models don't work well on
> >> multiple system image machines.
> 
> I use and love systemimager (www.systemimager.org) for automating the 
> process of managing my full-on 'install an OS on that node from scratch' 
> as well as my 'update this file or directory across all cluster nodes' 
> needs. It's a great product.

SystemImager is one of several products you can use.  It is in the more
"roll your own solution" category, which is great if that is what you
want to do.  There are other tools for this kind of work, including
RedHat kickstart and related items.  Kickstart has its advantages and
disadvantages relative to other solutions, but it works quite nicely.

I use a variety of tools, but right now my favorite is the Linux ROCKS
distribution (http://www.rocksclusters.org), an enhanced RedHat 7.3
distribution.  It is one of the few cluster distributions that do things
right from the outset.  They use kickstart and a programmatically
generated ks.cfg file; no fiddling with it by hand.

The downside to kickstart and RH-based distributions is the rather poor
disk partitioning decisions made by anaconda.  There are simple ways to
work around this (within ROCKS it is simple), but for most RH kickstart
users the default assumptions and partition layouts are ... uh ...
interesting.
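
For what it's worth, one straightforward workaround outside of ROCKS is
simply to spell the layout out in the ks.cfg rather than taking the
defaults.  Something along these lines; the sizes and mount points here
are just an example, not a recommended layout:

    # explicit partitioning in ks.cfg rather than anaconda's defaults
    clearpart --all --initlabel
    part /boot --size 64
    part swap --size 1024
    part / --size 8192
    # hand the rest of the disk to a local scratch/data area
    part /scratch --size 1 --grow

ROCKS sidesteps this by generating these stanzas programmatically in its
ks.cfg, which is why it is less painful there.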

[...]

> For cluster management you need to treat your compute nodes as anonymous
> and disposable. You cannot afford to be messing with them on an 
> individual basis because your admin burden will scale linearly with 
> cluster size.

This is the philosophy behind ROCKS.
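
Even the most naive version of that philosophy is a loop over node names
rather than logging into machines one at a time.  A trivial sketch (the
node names are made up, and real tools like SystemImager or the ROCKS
utilities do this much better):

    # run the same command on every compute node; purely illustrative
    import os
    nodes = ["compute-0-%d" % i for i in range(32)]   # made-up node names
    for n in nodes:
        os.system("ssh %s 'uptime'" % n)

Anything you cannot express this way (or as a reinstall from a golden
image) is something you will end up doing by hand N times.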

[...]

> > Joe, I know this could easily turn into a thick book :), but how does one
> > get more educated about these things?

Pay Chris (Bioteam) and me (Scalable Informatics) to come out and have a
beer with you ... it will be a rather long beer....  :)

More seriously, attend Chris' talk at O'Reilly, grab his slides.  Bug
all of us.

If there is enough interest, and people/organizations wouldn't mind
paying for this, then Bioteam/Scalable Informatics/Others might be
persuaded to organize some sort of conference call or seminar on this. 
It wouldn't be cheap, but it might be helpful to the biocluster
community.

> I'm probably biased but I've heard lots of people speak positively about 
> reading the archived threads of this mailing list. Many of the questions 
> you are asking about have been debated and discussed in the past on this 
> very mailing list. The list archives are online at 
> http://bioinformatics.org/pipermail/bioclusters/
> 
> 
> I'm also going to be rehashing a lot of this stuff in a talk at the
> upcoming OReilly Bioinformatics Technology conference. May be of 
> interest to some people and really obvious and boring to others.
> 
> 
> 
> > Another question that I am curious about is Itanium 2 and using them in a
> > cluster - any experiences with these? How about bioinformatics software -
> > any benefits in your regular programs like blast, clustalw... when running
> > on an itanium system?

I have to be careful here.

Itanium and Itanium2 give you a really nice-sized memory system to play
with.  They also give you very nice memory bandwidth for sequential access.

I am not sure I can discuss the benchmarks I have run in the past.  I
think it is safe to say that they are not going to push the IA32-based
machines off the performance lead (blast, clustalw, hmmer, etc.) for a
while.  The G4-based machine performance was both interesting and
surprising (in a positive sense).  Granted, some compiler assistance for
the chip's SIMD features was used, but the results are real, and for
similar calculations people seem to be getting good performance.

Interestingly, the most overlooked issue I have seen thus far, pertinent
to this discussion, is power and cooling.  Athlons generate about 70-160
W/node (singles vs. duals).  Xeons generate 90-200 W/node (please correct
me if I am wrong on these; I am getting them from totaling heat
dissipation info).  Itaniums generate a bit more.  G4s generate somewhat
less.  The Hammer series will be closer to the Xeons in power, from what
I have seen (in public, no NDA stuff).

You have to plan for cooling and power.  This planning exposes how
expensive these aspects can be if you are not set up for it.  Site
planning is critical.  If you can plan during the building stages,
great.  If you have to renovate an existing infrastructure, this can be
expensive.  If you can't renovate, and you have to work with what you
have, some measure of creativity will be needed to avoid problems.

More cluster nodes = more power and cooling needs.  More, hotter chips =
more failures if cooling is inadequate.
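
A quick sanity check is to total up the node wattage and convert it to a
cooling load.  A minimal sketch in Python, using the rough per-node
wattages above (the node count, wattage, and line voltage here are
assumptions for illustration, not a sizing recommendation):

    # Rough cluster power/cooling estimate.  Watts per node are the
    # approximate figures discussed above, not vendor specs.
    watts_per_node = 200.0   # e.g. a dual Xeon node (rough guess)
    nodes = 64

    total_w    = nodes * watts_per_node
    btu_per_hr = total_w * 3.412         # 1 W is about 3.412 BTU/hr
    tons       = btu_per_hr / 12000.0    # 1 ton of cooling = 12,000 BTU/hr
    amps       = total_w / 208.0         # very rough, ignores power factor

    print("%d nodes: %.1f kW, %.0f BTU/hr, %.1f tons, ~%.0f A @ 208 V"
          % (nodes, total_w / 1000.0, btu_per_hr, tons, amps))

With those guesses, 64 dual-CPU nodes already come to roughly 13 kW and
three to four tons of cooling, which is exactly the kind of number that
makes site planning critical.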