[Bioclusters] Call for information.

Chris Dagdigian bioclusters@bioinformatics.org
Tue, 16 Apr 2002 09:13:22 -0400

Some of the justifications will depend on your audience: are they IT 
people who control your hardware budget, or are they senior scientists 
who need to approve new research computing directives?

Here are some of the justifications I've used for both types of groups:

(1) Bioclusters preserve any current investment you may have in big 
expensive unix SMP machines by significantly reducing the computational 
load on your legacy hardware. Basically you use the large memory and big 
SMP systems for things like EST clustering and data warehouses that need 
such environments and you offload everything else you can onto piles of 
cheap mass market hardware. I know several companies who were able to 
postpone or actually cancel plans to upgrade or replace large Sun, Alpha 
and SGI machines because they were able to extend the useful server life 
by migrating load to the far cheaper cluster or compute farm. Not having 
to replace or upgrade one of those large systems can save hundreds of 
thousands or even millions of dollars in capital expense.

(2) Fine-grained scaling on demand. In a biocluster it is trivial to add 
more CPUs. As long as your architecture is correct you can 
incrementally scale easily and cheaply from tens of CPUs to hundreds or 
thousands. Compare and contrast this to the problem of upgrading a large 
unix machine. That 64-CPU enterprise unix system may be great but what 
happens when you need that 65th CPU? It may require purchase of a whole 
new cabinet and expensive interconnects just to get that next processor 
fired up. The other nice thing about scaling with bioclusters is that it 
is easy to take advantage of newer and faster hardware. Load management 
layers like LSF, PBS etc. can trivially handle heterogeneous hardware 
environments so it is not a problem to have your cluster composed of 
different machine classes. This allows you to efficiently purchase the 
fastest available commodity CPU power each year with little waste. Plus, 
if you work the proper magic with the load management software layer, 
your end users will never know or have to understand the back end. All 
they know is that their jobs get done.
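
To make the heterogeneous-hardware point concrete, here is a toy sketch in 
Python. It is not how LSF or PBS actually schedule; it is just a greedy 
"next job goes to whichever node frees up first" simulation. The function 
name, job sizes and node speeds are all made up for illustration.

```python
import heapq


def simulate_dispatch(job_costs, node_speeds):
    """Greedy dispatch: each job goes to whichever node frees up first.

    job_costs   -- work units per job (hypothetical numbers)
    node_speeds -- work units per second for each node (hypothetical)
    Returns (makespan_seconds, jobs_run_per_node).
    """
    # Min-heap of (time the node becomes free, node index).
    free_at = [(0.0, i) for i in range(len(node_speeds))]
    heapq.heapify(free_at)
    jobs_per_node = [0] * len(node_speeds)
    makespan = 0.0
    for cost in job_costs:
        start, node = heapq.heappop(free_at)
        finish = start + cost / node_speeds[node]
        jobs_per_node[node] += 1
        makespan = max(makespan, finish)
        heapq.heappush(free_at, (finish, node))
    return makespan, jobs_per_node


if __name__ == "__main__":
    jobs = [10.0] * 100            # 100 identical jobs
    old_nodes = [1.0] * 4          # last year's machines
    mixed = old_nodes + [2.0] * 4  # add four nodes that are twice as fast
    print(simulate_dispatch(jobs, old_nodes))
    print(simulate_dispatch(jobs, mixed))
```

In this toy model the faster nodes simply pick up more of the jobs and the 
total run time drops, without the person submitting the work having to know 
which machine class ran what.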

(3) For high throughput, embarrassingly parallel situations like massive 
BLAST and hmmsearch runs, a biocluster will blow away any 
enterprise unix system you can think of. As a concrete example of this, 
when I was at Blackstone Computing we were able to build a 
proof-of-concept dedicated BLAST farm with $30,000 USD worth of 
commodity hardware.

That $30,000 demo BLAST farm was tested by the customer (a large pharma 
company) and was found to be significantly faster than the $300,000+ 
unix system they were currently using. The system was so fast 
(throughput, not turnaround) that the customer was able to perform 
calculations and experiments that had not been possible before due to 
time and horsepower constraints.
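
The arithmetic behind that example can be sketched in a few lines of 
Python. The two prices come from the paragraph above; the throughput 
figures are placeholders, since only "significantly faster" was measured, 
and the function name is just for illustration.

```python
def price_performance_ratio(cost_a, throughput_a, cost_b, throughput_b):
    """How many times more throughput per dollar system A delivers than B."""
    return (throughput_a * cost_b) / (throughput_b * cost_a)


if __name__ == "__main__":
    farm_cost = 30_000   # demo BLAST farm, from the text
    unix_cost = 300_000  # lower bound for the existing unix system

    # Conservative case: assume the farm only matches the unix box.
    print(price_performance_ratio(farm_cost, 1.0, unix_cost, 1.0))

    # Hypothetical case: assume the farm is twice as fast (made-up factor).
    print(price_performance_ratio(farm_cost, 2.0, unix_cost, 1.0))
```

Even at equal throughput the farm is ahead by the ratio of the prices, and 
any real speedup multiplies that advantage.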

This (#3) is the primary reason that I see people building bioclusters. 
They know that they have a huge requirement to run lots of conveniently 
embarrassingly parallel applications in a high throughput mode. As it 
turns out a loosely coupled cluster or compute farm tends to be a really 
nice and effective platform for doing this. Many of the first 
"bioclusters" were actually dedicated BLAST, genescan, hmm etc. 
resources although these days they are being used for more.
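
For anyone who has not seen why these workloads are called embarrassingly 
parallel, here is a minimal Python sketch: a file of query sequences is 
dealt out into independent chunks, and each chunk can then be searched on 
a different node with no communication between jobs. The helper names are 
hypothetical, and the actual search command and job submission are left 
out on purpose since they depend on your site.

```python
def read_fasta(lines):
    """Yield (header, sequence) records from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line, []
        elif header is not None:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def split_round_robin(records, n_chunks):
    """Deal records into n_chunks lists; each list becomes one independent job."""
    chunks = [[] for _ in range(n_chunks)]
    for i, record in enumerate(records):
        chunks[i % n_chunks].append(record)
    return chunks


if __name__ == "__main__":
    demo = [">q1", "ACGT", ">q2", "GGCC", "TTAA", ">q3", "ATAT"]
    for n, chunk in enumerate(split_round_robin(read_fasta(demo), 2)):
        print(n, [header for header, _ in chunk])
```

Because no chunk depends on any other, doubling the number of nodes roughly 
doubles throughput, which is exactly the case where a loosely coupled farm 
does well.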

(4) Linux on commercial mass market hardware is _incredibly_ powerful 
from a price/performance standpoint. The Intel/AMD CPUs are amazing. If 
you have a software application or algorithm that runs well under Linux 
and you need to run lots of them, then a cluster is a great choice.

(5) What it comes down to is that leveraging piles of inexpensive 
commodity hardware is the only cost effective way that life science 
researchers can really get the flexible "supercomputer scale" CPU power 
they need to perform their work.

(6) A hell of a lot of bioinformatics software is now being 
developed primarily on, or ported to, linux-on-i386.

I do have some links that may be useful, particularly Matthew Trunnel's 
article in Scientific Computing World, but I don't have the URLs handy 
and I need to run out to a meeting :) I'll follow up with URLs when I 
get back.

Anyone else with comments?


Paul Gardner wrote:

> Hi All,
> I have to give a talk on thurs 12pm (NZST) that justifies the expense of
> purchasing 128 PentiumIVs for a BioCluster at our weekly Research Group
> meeting.
> I already know a bit about using the MPI compiler and PBS queuing system.
> What I'm really interested in is the solutions BioClusters are currently
> being used for. Any URLs, papers, and/or suggestions would be greatly
> appreciated.

Chris Dagdigian, <dag@sonsorol.org>
Independent life science IT & research computing consulting
Office: 617-666-6454, Mobile: 617-877-5498, Fax: 425-699-0193
Work: http://BioTeam.net PGP KeyID: 83D4310E  Yahoo IM: craffi