[Bioclusters] blast server on OpenMosix cluster

Chris Dagdigian bioclusters@bioinformatics.org
Mon, 05 Jan 2004 13:02:26 -0500


{hmmm. I think this is still on-topic for the list at large...}

Hi Hong Zhang,

One of our clusters should be very close to you at the Dana-Farber Cancer
Institute, depending on which building you are in. I forgot to add that
one to the list of Harvard-affiliated systems that I knew about.

On to your questions about Grid Engine (SGE) and blast:

1. Nothing about SGE will force you to have a single-application cluster 
unless you choose to use it that way. Some people desire 'appliance' 
type systems that are designed and specially tuned to run a single 
application really, really well. Other groups of researchers want a 
general purpose system that can run all sorts of applications.

Configuring a cluster to run BLAST really well is a nice target for 
informatics researchers since BLAST tends to beat heavily on memory and 
storage subsystems. Optimizing for blast tends to mean that the cluster 
will stand up well to other sorts of informatics workloads.

Making applications run on clusters -- as a general rule, if you can run 
a program on the Unix command line it is very easy to set things up so 
that the same program can be run across a cluster or compute farm while 
under the control of Grid Engine. It gets more complicated if the 
program requires a specialized environment, a license server or if the 
application requires a parallel MPI or PVM environment.

I can't tell you how easy it would be to make your other WWW tools 'SGE
aware' but in general the process is similar to what you would have to
do to cluster-enable the www-blast CGI code. In most cases a simple
wrapper script will do the job.
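
A minimal sketch of such a wrapper, assuming the legacy NCBI 'blastall'
binary and Grid Engine's 'qrsh' are both on the PATH. The function name,
database name and query file are hypothetical placeholders. The optional
'-now no' flag asks Grid Engine to queue the request instead of giving up
right away when no slot is free. The sketch only builds the command line;
a CGI script would hand the resulting list to subprocess.run().

```python
import shlex


def build_qrsh_blast_command(program, database, query_file, queue_if_busy=True):
    """Return the argv list that runs blastall on a cluster node via qrsh."""
    cmd = ["qrsh", "-cwd"]  # run in the current working directory
    if queue_if_busy:
        # Queue the request rather than failing when no resources are free.
        cmd += ["-now", "no"]
    # blastall options: -p program, -d database, -i query input file
    cmd += ["blastall", "-p", program, "-d", database, "-i", query_file]
    return cmd


if __name__ == "__main__":
    # Hypothetical example values; substitute your own database and query.
    argv = build_qrsh_blast_command("blastn", "nt", "query.fasta")
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Returning a list rather than a shell string avoids quoting problems when
the query path comes from a web form.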

Space for blast data is another hard-to-answer question. There are
bioclusters in the Boston area that have terabyte-scale storage arrays 
serving up hundreds of gigabytes of blastable databases and there are 
others that just need a few gigs of disk space to store the particular 
NCBI datasets that they care about.

To figure out how much space you need, list the databases you'd like to
have available and document their (uncompressed) file sizes. Then
take that total and triple it (or more) because you'll need space to
build, uncompress and curate your datasets as well as handle normal growth.
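
The rule of thumb above can be sketched as a small calculation. The
database names and sizes below are made-up placeholders, not real NCBI
figures, and the factor of 3 is simply the "triple it" starting point.

```python
def estimate_storage_gb(uncompressed_sizes_gb, headroom_factor=3.0):
    """Sum the uncompressed database sizes and apply the headroom factor."""
    total = sum(uncompressed_sizes_gb.values())
    return total * headroom_factor


if __name__ == "__main__":
    # Hypothetical sizes in gigabytes; measure your own databases instead.
    wanted = {"nt": 12.0, "nr": 2.5, "est_human": 4.0}
    print("Plan for at least %.1f GB" % estimate_storage_gb(wanted))
```

Raise headroom_factor if you expect rapid database growth or keep several
dated copies of each dataset.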

Most people can easily store their favorite blast databases within a
single IDE or SATA disk drive these days. Because blast is rate-limited
by disk performance, these drives are often mirrored with hardware or
software RAID in sets of two or more. Searching BLAST datasets across
an NFS share can be a big performance bottleneck so many people will 
install the mirrored-disk pairs in each of their blast compute nodes so 
that all blast databases are replicated on local storage. This removes a 
ton of NFS traffic from the cluster network although the extra work of 
making sure that all your big blast files are _correctly_ replicated 
across many nodes can be time consuming.
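
One way to check that replication went _correctly_ is to compare
checksums of each file against a master copy. The sketch below is only an
illustration: it assumes the master and replica directories are both
reachable from wherever the script runs (for example, run on each node
with the master copy mounted read-only), and the function names are
hypothetical.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_bad_replicas(master_dir, replica_dir):
    """List files in master_dir that are missing or different in replica_dir."""
    bad = []
    for name in sorted(os.listdir(master_dir)):
        master_path = os.path.join(master_dir, name)
        if not os.path.isfile(master_path):
            continue  # skip subdirectories in this simple sketch
        replica_path = os.path.join(replica_dir, name)
        if (not os.path.isfile(replica_path)
                or file_digest(master_path) != file_digest(replica_path)):
            bad.append(name)
    return bad
```

An empty list means every master file has an identical copy on the node.
Hashing large databases takes a while, so a check like this is better run
after each database update than before every search.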

If I was building a Linux blast cluster node from scratch today I'd use
pairs of the 160 GB Seagate SATA drives mirrored with software RAID. The
big computer vendors may not be as flexible with IDE storage offerings
but they'd at least have products using disks in the 80-120 GB range,
which should be fine for your needs. You'd want hardware RAID and more
redundancy in your cluster 'master' or head node, but in general the
compute nodes are disposable, so a simple software RAID mirror on
inexpensive disks is all you need for the worker nodes.

If I was building an Apple G4 or G5 cluster I'd wait until the end of 
the day tomorrow to see what product announcements come out at MacWorld!

-Chris


hong.zhang@research.dfci.harvard.edu wrote:

> Hi Chris,
> Thanks for your message. It is really encouraging. My further question is:
> we have other www tools besides wwwblast installed on the cluster, so
> does SGE make all tools migratable, or is it just for a single-job cluster (I
> mean only for blast, such as mpiblast)?
> 
> And also how much space is needed to host blast data?
> 


>>Hong Zhang,
>>
>>There are several clusters doing Blast and Blast over WWW at Harvard.
>>Contact me in private if you want contact information for the people
>>running them.
>>
>>The Bauer Center for Genomics Research has a big cluster system running
>>Platform LSF. (http://cgr.harvard.edu)
>>
>>The Harvard Stats department over in the Science Center is running Grid
>>Engine on a small Linux cluster.
>>
>>The Flybase project people are using Grid Engine on Mac OS X (apple
>>Xserves) for some lightweight web bioinformatics portal stuff
>>(http://inquiry.flybase.harvard.edu)
>>
>>There are several more systems I've heard about or visited over at the
>>Medical school etc.
>>
>>Regarding your questions:
>>
>>1. wwwblast servers are easy to set up on clusters. For a lightweight
>>system you can just take the LSF 'lsrun' or Grid Engine 'qrsh' commands
>>and use them to wrap the call to the blastall executable. This will not
>>work in a large setting as qrsh/lsrun will fail silently if there are no
>>resources available; in that case you need to go asynchronous and get
>>used to the batch system.
>>
>>2. SGE easily runs on Debian Linux.
>>
>>Regards,
>>Chris
>>
>>
>>
>>
>>Hong Zhang wrote:
>>
>>
>>>Thanks for your information. I read the article before.
>>>I'd like to know
>>>1. whether it is possible to set up a wwwblast server on a
>>>cluster. Our goal is to allow users to access the blast database through a web
>>>page instead of the command line. I am not sure whether a query from a web
>>>page can be migrated.
>>>
>>>2. whether SGE can be used in Debian.
>>>
>>>
>>> On Fri, 2 Jan 2004, Ron Chen wrote:
>>>
>>>
>>>
>>>>It takes time to let openmosix to migrate your jobs.
>>>>SGE is more suitable in the compute farm environment.
>>>>
>>>>"Integrating BLAST with Sun ONE Grid Engine Software"
>>>>available at:
>>>>http://developers.sun.com/solaris/articles/integrating_blast.html
>>>>
>>>>-Ron
>>>>
>>>>--- Hong Zhang <hzhang@research.dfci.harvard.edu> wrote:
>>>>
>>>>
>>>>>But I have trouble making the blast command line execute
>>>>>on every node.
>>>>>
>>>>>And don't you think openmosix is suitable for a blast
>>>>>cluster? You suggested
>>>>>SGE?
>>>>>
>>>>>
>>>>>
>>>>>On Thu, 11 Dec 2003, Farul Mohd. Ghazali wrote:
>>>>>
>>>>>
>>>>>
>>>>>>On Wed, 10 Dec 2003 hong.zhang@research.dfci.harvard.edu wrote:
>>>>>>
>>>>>>>I am working on set up a blast server on Debian/OpenMosix cluster with 4
>>>>>>>nodes. Actually it is totally new to me. So is there anyone can give me
>>>>>>>some advice? Thanks.
>>>>>>
>>>>>>I've used OpenMosix in the form of ClusterKnoppix some months back to test
>>>>>>it out. The setup was very easy, boot off the CD, configure some settings
>>>>>>and the rest of the nodes boot off the network. Applications are
>>>>>>automatically load balanced across nodes.
>>>>>>
>>>>>>While configuration and actual use was very easy, performance wasn't too
>>>>>>great. I think the main reason was that OpenMosix dynamically migrates
>>>>>>applications to the different nodes to automatically load balance the
>>>>>>system thus the overhead of migration for long running jobs suddenly
>>>>>>became apparent.
>>>>>>
>>>>>>To be honest, we didn't try to optimize it much and went to implement our
>>>>>>blast cluster with SGE and hopefully soon mpiblast.
>>>>>>
>>>>>>_______________________________________________
>>>>>>Bioclusters maillist  -  Bioclusters@bioinformatics.org
>>>>>>https://bioinformatics.org/mailman/listinfo/bioclusters
>>>>
>>>>
>>>>>--
>>>>>Hong Zhang, MIS
>>>>>Bioinformatics Analyst
>>>>>Dana Farber Cancer Institute
>>>>>Harvard Medical School
>>>>>44 Binney St, D1510A
>>>>>Boston MA 02115
>>>>>Email: hong.zhang@research.dfci.harvard.edu
>>>>>Phone: 617-632-3824
>>>>>Fax: 617-632-3351