[Bioclusters] Clusters for bioinformatics... Some numbers or statistics?

Thu, 30 Aug 2001 00:03:40 -0400

Hi Claris,

I haven't seen any replies to this on the list, so I'll try to chime in on 
your questions, though I'm not sure how much help I can be. I'd love to 
hear what others have to say as well.

In general I have yet to see any really comprehensive (public) statistics 
or even actual case studies of cluster use in bioinformatics applications. 
In many cases people have been building cluster systems to solve a very 
specific computational need and their primary interest is the data 
generated, not the system or its optimized architecture/benchmark 
characteristics. This is starting to change as people apply their 
experience with testbed or single-application clusters to building 
clustered systems that support general life science informatics research.

For popular algorithms like BLAST, HMMER, etc. that fall into the 
"embarrassingly parallel" category, the benefits of clusters are immediate 
and measurable, as many people on this list can no doubt confirm. With a 
dozen or so dual-CPU Intel Linux boxes and a good fileserver one can easily 
build a high-throughput BLAST farm that will blow away "enterprise" Unix 
systems that are sold to companies for many hundreds of thousands of 
dollars. This is generally how many life science people get into 
clustering: they need to reduce the burden on (or free up resources on) the 
expensive Sun/Alpha/SGI servers, so they build a cheap cluster to soak up 
whatever workload they can throw at it.
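
As a rough illustration of why independent jobs scale so well, here is a 
minimal Python sketch of the ideal case. The function names and the numbers 
in the demo are made up for illustration, not measurements. The model 
assumes every job takes the same time and that disk, RAM and network never 
get in the way, which real systems will not honor.

```python
import math


def ideal_wall_clock(n_jobs: int, seconds_per_job: float, n_cpus: int) -> float:
    """Wall-clock seconds for n_jobs equal-length independent jobs on n_cpus.

    Idealized: jobs run in rounds of n_cpus at a time, with no I/O,
    memory or scheduling overhead.
    """
    if n_jobs < 0 or n_cpus < 1:
        raise ValueError("n_jobs must be >= 0 and n_cpus must be >= 1")
    return math.ceil(n_jobs / n_cpus) * seconds_per_job


def ideal_speedup(n_jobs: int, seconds_per_job: float, n_cpus: int) -> float:
    """Ratio of single-CPU time to the idealized cluster time."""
    parallel = ideal_wall_clock(n_jobs, seconds_per_job, n_cpus)
    if parallel == 0:
        return 1.0  # nothing to run, so nothing to speed up
    return (n_jobs * seconds_per_job) / parallel


if __name__ == "__main__":
    # Hypothetical workload: 1000 searches of 60 s each on 12 dual-CPU nodes.
    jobs, secs, cpus = 1000, 60.0, 24
    print("wall clock (s):", ideal_wall_clock(jobs, secs, cpus))
    print("speedup: %.1f" % ideal_speedup(jobs, secs, cpus))
```

With those made-up inputs the jobs run in 42 rounds, so the ideal wall 
clock is 2520 seconds against 60000 seconds on one CPU. Treat that as an 
upper bound on what a real farm would deliver.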

There may be people on this list who are willing to share with you some 
concrete performance metrics from the systems they have built. The groups 
that build the big/expensive clusters generally have to produce some fairly 
good benchmarks and case studies to justify themselves to the budget people 
so I'm sure such documents exist.

I would be very cautious about putting too much weight on any public 
bioinformatics cluster benchmarks that you may run across (unless you 
intend to copy the architecture exactly). There are way too many variables 
in such systems, and it often turns out that things like RAM, network 
bandwidth and disk I/O are the rate-limiting bottlenecks. There just is no 
"standard" way to do this type of work, so benchmarks as a means of 
comparison are going to be fairly meaningless because the hardware and 
architecture approaches are likely to differ wildly.

Back to your other question:

There are few (if any) parallel applications that are widely used for 
sequence analysis and basic bioinformatics. About the only program I can 
think of is FASTA, which can support running in a PVM environment. There 
may be some assembly programs that run in true parallel mode as well (I'm 
not sure).

{Anyone have a good journal reference for an article that reviews this area?}

Instead of running a single parallel application to handle thousands or 
hundreds of thousands of sequence analysis operations, what you end up 
doing is invoking (and potentially distributing) many separate instances of 
the non-parallel algorithm, each with different command line arguments 
(input sequence, target database, threshold cutoffs, etc). This is the type 
of workflow that runs beautifully on a commodity cluster architecture, as 
there are lots of software suites available that handle batch scheduling 
and distributed load management.
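
To make that workflow concrete, here is a minimal Python sketch that groups 
FASTA records into batches and builds one command line per batch. The 
function names, the chunk file names and the database name are all 
hypothetical. The command template assumes the legacy NCBI blastall flags 
(-p, -d, -i, -o, -e), so swap in whatever your own tool expects. Writing 
the chunk files and handing the commands to a batch scheduler are left out.

```python
from typing import Iterable, Iterator, List, Tuple

# Assumed command template (legacy NCBI blastall flags); adjust for your tool.
TEMPLATE = "blastall -p blastp -d {db} -i {query} -o {out} -e {evalue}"

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif header is not None:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def split_records(records: Iterable[Record], per_job: int) -> List[List[Record]]:
    """Group records into batches of at most per_job sequences each."""
    if per_job < 1:
        raise ValueError("per_job must be >= 1")
    batches, batch = [], []
    for rec in records:
        batch.append(rec)
        if len(batch) == per_job:
            batches.append(batch)
            batch = []
    if batch:
        batches.append(batch)
    return batches


def job_commands(n_jobs: int, db: str, evalue: str, prefix: str = "chunk") -> List[str]:
    """Build one independent command line per batch (file names are hypothetical)."""
    return [
        TEMPLATE.format(
            db=db,
            query=f"{prefix}_{i:04d}.fa",
            out=f"{prefix}_{i:04d}.out",
            evalue=evalue,
        )
        for i in range(n_jobs)
    ]


if __name__ == "__main__":
    demo = [">seq1", "MKT", "AYI", ">seq2", "GGS", ">seq3", "PLV"]
    batches = split_records(read_fasta(demo), per_job=2)
    for cmd in job_commands(len(batches), db="nr", evalue="1e-5"):
        print(cmd)
```

Each printed line is a self-contained job with no dependency on the others, 
which is what lets a scheduler spread them across whatever nodes are free.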

There are many more parallel applications in use as you get away from 
sequence analysis and start doing things like molecular modelling, virtual 
screening, protein structure prediction etc. These are also the areas where 
you will actually see commercial companies like Accelrys/MSI selling 
parallel software products into the research community.

Many people developing their own in-house code and proprietary algorithms 
are thinking about parallel processing on clusters. The level of interest 
I've seen in high speed system interconnects like Myrinet and Dolphin SCI 
has risen significantly over the past year or so.

In summary:

  o benchmarks and case studies are very hard to find or are treated as 
confidential
  o life science clusters tend to be used to bulk process many totally 
independent "embarrassingly parallel" jobs
  o little if any use of parallel applications for sequence analysis
  o some areas of genomics/proteomics/{insert buzzword here} are using 
parallel code (commercial and non-commercial)

My $.02

-Chris

At 11:16 PM 8/27/01 -0500, you wrote:
>I am working on a project called Scientific Computing Strategy for the 
>Smithsonian Tropical Research Institute and I was wondering if I can get 
>any statistics about using clusters in bioinformatics. I mean information 
>like how much the time factor has improved in some applications like 
>sequence alignment  (multiple) using clusters, do we have software for 
>parallel computing in the field (BLAST cluster version?), etc. Where can I 
>get this information? Any idea?
>Thanks in advance,
>Claris

-- 
Chris Dagdigian  (Home:Work) Blackstone Computing
dag@sonsorol.org      : dagdigian@blackstonecomputing.com
www.open-bio.org      : www.blackstonecomputing.com