[BiO BB] Linux Clustering for Bioinformatics???

Sat Mar 10 01:49:51 EST 2001

Hi,

As I see it, clustering is the solution to tomorrow's computational
requirements, as has been ably demonstrated by Celera, Incyte Genomics,
Google and IBM. For an introduction to clustering and parallel processing,
please see the

Beowulf-HOWTO
Parallel-Processing-HOWTO
at
http://linuxdocs.org/HOWTOs/HOWTO-INDEX/howtos.html
and

High Availability HOWTO
at
http://www.ibiblio.org/pub/Linux/ALPHA/linux-ha/High-Availability-HOWTO.html

If you're on a slow internet line (DSQ does have a wide channel), get hold
of a Linux CD; the HOWTOs will be on it. You might even consider
installing Linux, since bioinformatics and Linux now go very much
hand in hand and it's really much simpler to work on an open platform.

On Fri, 9 Mar 2001, Ayyagari Kiran wrote:

> Iam a bioinformatics programmer having my skillsets in
> Advanced Java programming and implementing three tier
> applications in bioinformatics.

I don't know Java, but its threaded nature will probably allow some
parallelisation. I don't know whether the threads can communicate among
themselves, which is required for true parallelisation.

> I worked in NT platform and have no much idea about
> Linux applications in Bioinformatics. 

I've not heard of many (any?) bioinfo tools on NT. The best option is
probably to shift to Linux or FreeBSD and keep an Irix machine handy.

> Can Anyone help me understyanding the "LINUX
> CLUSTERING TECHNOLOGY "for Bioinformatics
> applications.

Basically, clustering is of two types:

High availability: provides uninterrupted resources, e.g. for stock markets
High processing: provides enormous processing power

One of the high-processing types is "Beowulf" clustering
( www.beowulf.org ), in which the cluster is built from off-the-shelf
commodity parts. This allows you to join a score of PIII/Athlons to
(technically) form a tiny supercomputer. For a (debatable) list of
powerful supercomputers go to top500.org. A search on google.com will
give you the names of many clustered supercomputers.

> precisely I wanted to know,
> 
> 1) what sort of bioinfo applications (mention
> names)can run on linux clusters and why?

Some of the more famous Bioinformatics applications developed for Linux
clusters are:
NAMD ( http://www.ks.uiuc.edu/Research/namd/ )
PMD
Amber
Charmm
BLAST
FastA
(and many others)

Please search on google for a more complete list and check out these
pages:
http://zeus.polsl.gliwice.pl/~nikodem//linux4chemistry.html
http://sal.kachinatech.com/Z/2/index.shtml

Some software (clustered and otherwise) was discussed about 4 months back
on the bionet.software newsgroup.

Most of the credit for Beowulf clustering goes to Don Becker. Because the
Linux kernel is closely tied to the network protocols, it is very easy to
do things like remote installation and booting of diskless nodes. This
allows giant clusters to be installed and managed with ease.

The robust and secure kernel allows jobs to run for months without
rebooting (frequent reboots being what you are probably used to on NT).
This is a key feature: an application cannot reach kernel memory without
authorisation, so a misbehaving program should not bring down the OS.
This matters because a node failure forces the other nodes to wait or
reschedule the task, which delays the job. If even a random fraction of
the nodes is continuously down, the job may never finish (there is some
maths behind this which I don't know).

Essentially, the different modules of a program are passed on to
different nodes (as processes) and the results are collated before
output. Messages between nodes are kept to a minimum to save time on
network communication. Another method is to use threading (where there is
more communication between different parts of a program), but this is
generally better suited to an SMP system than to a Beowulf, since the
data bus has to be fast. Also, in a Beowulf cluster the memory of each
node is generally not shared. There are two popular libraries for use in
coding:
PVM (Parallel Virtual Machine)
MPI (Message Passing Interface)
though the MPI standard is probably more in use now.
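
On the maths of node failure mentioned above, here is a rough sketch in
Python. It rests on two simplifying assumptions that are mine and not
measured facts: each node is up independently with the same probability,
and the job can only progress while every node is up. The function names
are made up for illustration.

```python
def all_nodes_up_probability(node_uptime, nodes):
    """Chance that all `nodes` are up at the same moment.

    Assumes every node is up independently with probability `node_uptime`.
    """
    return node_uptime ** nodes


def expected_nodes_down(node_uptime, nodes):
    """Average number of nodes that are down at any given moment."""
    return (1.0 - node_uptime) * nodes


if __name__ == "__main__":
    uptime = 0.99  # hypothetical per-node availability
    for n in (1, 10, 100, 1000):
        p = all_nodes_up_probability(uptime, n)
        print(f"{n:5d} nodes: all up with probability {p:.4f}")
```

Under these assumptions, with nodes that are each up 99% of the time, a
100-node cluster has all its nodes up only about 37% of the time. This is
why per-node stability matters more and more as the cluster grows.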

> 2) How can Linux clustering be of use to application
> service providers.( people who develop linux
> applications for Bioinfo comunity)

Sooner or later all of bioinformatics will be running on Beowulf clusters,
whether in countries with huge resources like the USA or places with
limited resources like India. Only the type of nodes will vary. It is
actually very inexpensive to set up a Beowulf cluster (such clusters exist
at CDAC, very near to where you stay). Once a program has been written for
a cluster it can be run on one to hundreds of nodes with no change in the
code. Since the code has to be inherently modular, it is more scalable
in terms of further development. And of course, you can test out your code
on a supercomputer at a hundredth of the cost. And for this same reason
your users and customers will be setting up clusters rather than buying
costly stand-alone machines.
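
To make the "one to hundreds of nodes with no change in the code" point
concrete, here is a minimal sketch of the split, process and collate
pattern. It is only a stand-in: it uses Python's multiprocessing module
on a single machine instead of MPI or PVM across real nodes, and the
function names and the example sequences are made up for illustration.

```python
from multiprocessing import Pool


def gc_count(sequence):
    """Count the G and C bases in one sequence."""
    return sum(1 for base in sequence.upper() if base in "GC")


def split_work(items, parts):
    """Deal the items into `parts` chunks, one chunk per worker."""
    return [items[i::parts] for i in range(parts)]


def chunk_total(chunk):
    """Work done by one worker: process its own chunk, return one number."""
    return sum(gc_count(seq) for seq in chunk)


def total_gc(sequences, workers=1):
    """Scatter chunks to `workers` processes, then collate the partial sums."""
    chunks = split_work(sequences, workers)
    if workers == 1:
        partials = [chunk_total(chunks[0])]  # no pool needed for one worker
    else:
        with Pool(processes=workers) as pool:
            partials = pool.map(chunk_total, chunks)
    return sum(partials)


if __name__ == "__main__":
    reads = ["ATGCGC", "GGGTTT", "ATATAT", "CCGGAA"]  # made-up example data
    for n in (1, 2, 4):
        print(n, "worker(s):", total_gc(reads, workers=n))
```

Only the `workers` argument changes between runs; the per-chunk code
stays the same and each run should print the same total. Each worker
returns a single number, which keeps the messages between workers small.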

HTH,
Indraneel

-- 
http://www.indialine.org