[Bioclusters] blast server on OpenMosix cluster

Lewis, John bioclusters@bioinformatics.org
Mon, 5 Jan 2004 13:24:47 -0600


As far as storage goes, our BLAST data sets are about 60 GB on a 
2.5-terabyte XRAID. We export the BlastDB via NFS to each node, and 
the nodes are on their own gigabit switch. We have not had any 
bottleneck issues yet, but we did order Fibre Channel cards for the 
portal and the cluster nodes; if we need to, we'll get a Fibre Channel 
switch, and that should take care of the issue. Our XRAID has dual 
2-gigabit Fibre Channel ports.
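
For anyone curious about the plumbing, the client side of that export 
is just an ordinary read-only NFS mount. A sketch (the server name and 
paths here are made up):

  # /etc/fstab entry on each Linux node: mount the BlastDB
  # read-only over the gigabit network
  xserve:/Volumes/XRAID/blastdb  /blastdb  nfs  ro,hard,intr  0  0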

Also, we have the user shares on there too. Plus, we moved our Linux 
home directories over to our XRAID, and our Xserve exports them 
directly to the Linux servers. It's a pretty nice setup.
So now, whenever our bioinformatics folks log into any server, they 
always see the same directory no matter the platform: Unix (Mac OS X, 
or our SGIs), Linux, or even "Wintel" if they really want that. (Lol, 
yeah, I'll take the insecure system with virus issues because the code 
is so crappy. Hehe.)

I mirrored the XRAID and set up a hot-swappable spare drive for each 
set on the RAID, so the total available space is about 900 GB.

John


On Jan 5, 2004, at 12:02 PM, Chris Dagdigian wrote:

> {hmmm. I think this is still on-topic for the list at large...}
>
> Hi Hong Zhang,
>
> One of our clusters should be very close to you at the Dana-Farber 
> Cancer Institute, depending on what building you are in...I forgot to 
> add that one to the list of Harvard-affiliated systems that I knew 
> about.
>
> On to your questions about Grid Engine (SGE) and blast:
>
> 1. Nothing about SGE will force you to have a single-application 
> cluster unless you choose to use it that way. Some people desire 
> 'appliance' type systems that are designed and specially tuned to run 
> a single application really, really well. Other groups of researchers 
> want a general purpose system that can run all sorts of applications.
>
> Configuring a cluster to run BLAST really well is a nice target for 
> informatics researchers since BLAST tends to beat heavily on memory 
> and storage subsystems. Optimizing for blast tends to mean that the 
> cluster will stand up well to other sorts of informatics workloads.
>
> Making applications run on clusters -- as a general rule, if you can 
> run a program on the Unix command line, it is very easy to set things 
> up so that the same program can be run across a cluster or compute 
> farm under the control of Grid Engine. It gets more complicated if 
> the program requires a specialized environment, a license server, or 
> a parallel MPI or PVM environment.
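>
> As a concrete (and entirely hypothetical) sketch -- the paths and 
> database name below are invented -- the whole submit script is two 
> directives plus the same command line you'd type by hand:
>
>   #!/bin/sh
>   # blast_job.sh -- submit with: qsub blast_job.sh query.fa
>   #$ -S /bin/sh
>   #$ -cwd
>   blastall -p blastp -d /blastdb/nr -i "$1" -o "$1.out"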
>
> I can't tell you how easy it would be to make your other WWW tools 
> 'SGE aware', but in general the process is similar to what you would 
> have to do to cluster-enable the wwwblast CGI code. In most cases a 
> simple wrapper script will do the job.
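>
> As a rough sketch of such a wrapper (this is not wwwblast's actual 
> code, and it assumes the HTML form posts a bare FASTA sequence):
>
>   #!/bin/sh
>   # blast.cgi -- hand the posted query to Grid Engine and stream
>   # the result straight back to the browser
>   echo "Content-type: text/plain"
>   echo ""
>   QUERY=/tmp/query.$$.fa
>   head -c "$CONTENT_LENGTH" > "$QUERY"   # read the posted sequence
>   qrsh blastall -p blastp -d /blastdb/nr -i "$QUERY"
>   rm -f "$QUERY"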
>
> Space for blast data is another hard-to-answer question. There are 
> bioclusters in the Boston area that have terabyte-scale storage arrays 
> serving up hundreds of gigabytes of blastable databases and there are 
> others that just need a few gigs of disk space to store the particular 
> NCBI datasets that they care about.
>
> To figure out what space you need: list what databases you'd like to 
> have available and document what the (uncompressed) file sizes are. 
> Then take that number and triple it (or more) because you'll need 
> space to build, uncompress and curate your datasets as well as handle 
> normal growth.
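>
> A quick worked example with made-up numbers: suppose the uncompressed 
> databases you care about total 50 GB; tripled, that means budgeting 
> 150 GB or more.
>
>   # check the uncompressed sizes of the datasets you plan to keep
>   du -csh /blastdb/*     # suppose the grand total is ~50 GB
>   # triple it: provision at least 150 GB of disk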
>
> Most people can easily store their favorite blast databases on a 
> single IDE or SATA disk drive these days. Because blast is 
> rate-limited by disk performance, these drives are often mirrored 
> with hardware or software RAID in sets of two or more. Searching 
> BLAST datasets across an NFS share can be a big performance 
> bottleneck, so many people will install the mirrored-disk pairs in 
> each of their blast compute nodes so that all blast databases are 
> replicated on local storage. This removes a ton of NFS traffic from 
> the cluster network, although the extra work of making sure that all 
> your big blast files are _correctly_ replicated across many nodes can 
> be time-consuming.
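>
> Here is a minimal replication sketch (the node names and paths are 
> invented); rsync's checksum mode catches a stale or truncated copy 
> that a timestamp comparison would miss:
>
>   #!/bin/sh
>   # push the master blastdb copy out to every node's local mirror
>   for node in node01 node02 node03 node04; do
>       rsync -a --delete -c /blastdb/ $node:/local/blastdb/
>   done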
>
> If I were building a Linux blast cluster node from scratch today, I'd 
> use pairs of the 160 GB Seagate SATA drives mirrored with software 
> RAID. The big computer vendors may not be as flexible with IDE 
> storage offerings, but they'd at least have products using disks in 
> the 80-120 GB range, which should be fine for your needs. You'd want 
> hardware RAID and more redundancy in your cluster 'master' or head 
> node, but in general the compute nodes are disposable, so a simple 
> software RAID mirror on inexpensive disks is all you need for the 
> worker nodes.
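>
> Building that mirror on a worker node takes only a few commands with 
> Linux software RAID (the device names below are examples only):
>
>   # pair two drives into a RAID-1 set and put a filesystem on it
>   mdadm --create /dev/md0 --level=1 --raid-devices=2 \
>         /dev/sda1 /dev/sdb1
>   mke2fs -j /dev/md0            # ext3 on the mirror
>   mount /dev/md0 /local/blastdb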
>
> If I were building an Apple G4 or G5 cluster, I'd wait until the end 
> of the day tomorrow to see what product announcements come out at 
> MacWorld!
>
> -Chris
>
>
> hong.zhang@research.dfci.harvard.edu wrote:
>
>> Hi Chris,
>> Thanks for your message. It is really encouraging. My further 
>> question: we have other WWW tools besides wwwblast installed on the 
>> cluster, so does SGE make all tools migratable, or does it give us 
>> just a single-job cluster (I mean one only for blast, such as 
>> mpiblast)?
>> And also, how much space is needed to host the blast data?
>
>
>>> Hong Zhang,
>>>
>>> There are several clusters doing Blast and Blast over WWW at Harvard.
>>> Contact me in private if you want contact information for the people
>>> running them.
>>>
>>> The Bauer Center for Genomics Research has a big cluster system 
>>> running
>>> Platform LSF. (http://cgr.harvard.edu)
>>>
>>> The Harvard Stats department over in the Science Center is running 
>>> Grid
>>> Engine on a small Linux cluster.
>>>
>>> The Flybase project people are using Grid Engine on Mac OS X (Apple
>>> Xserves) for some lightweight web bioinformatics portal stuff
>>> (http://inquiry.flybase.harvard.edu)
>>>
>>> There are several more systems I've heard about or visited over at 
>>> the
>>> Medical school etc.
>>>
>>> Regarding your questions:
>>>
>>> 1. wwwblast servers are easy to set up on clusters. For a lightweight
>>> system you can just take the LSF 'lsrun' or Grid Engine 'qrsh' 
>>> commands
>>> and use them to wrap the call to the blastall executable. This will 
>>> not
>>> work in a large setting as qrsh/lsrun will fail silently if there 
>>> are no
>>> resources available; in that case you need to go asynchronous and get
>>> used to the batch system.
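>>>
>>> For instance (a sketch only; the query and database names are
>>> invented), the lightweight version is a one-liner, and qrsh's -now
>>> option, where your SGE version supports it, queues the job instead
>>> of failing:
>>>
>>>   # lightweight: runs on whichever node has a free slot right now,
>>>   # but fails when nothing is free
>>>   qrsh blastall -p blastn -d nr -i query.fa
>>>
>>>   # with '-now no' the job waits its turn like a batch job
>>>   qrsh -now no blastall -p blastn -d nr -i query.fa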
>>>
>>> 2. SGE runs easily on Debian Linux.
>>>
>>> Regards,
>>> Chris
>>>
>>>
>>>
>>>
>>> Hong Zhang wrote:
>>>
>>>
>>>> Thanks for your information. I read the article before.
>>>> I'd like to know:
>>>> 1. whether it is possible to set up a wwwblast server on a
>>>> cluster. Our goal is to allow users to access the blast databases
>>>> through a web page instead of the command line. I am not sure
>>>> whether queries from the web page can be migrated.
>>>>
>>>> 2. whether SGE can be used on Debian.
>>>>
>>>>
>>>> On Fri, 2 Jan 2004, Ron Chen wrote:
>>>>
>>>>
>>>>
>>>>> It takes time for openMosix to migrate your jobs.
>>>>> SGE is more suitable in a compute farm environment.
>>>>>
>>>>> "Integrating BLAST with Sun ONE Grid Engine Software"
>>>>> available at:
>>>>> http://developers.sun.com/solaris/articles/integrating_blast.html
>>>>>
>>>>> -Ron
>>>>>
>>>>> --- Hong Zhang <hzhang@research.dfci.harvard.edu>
>>>>> wrote:
>>>>>
>>>>>
>>>>>> But I have trouble making the blast command line execute on
>>>>>> every node.
>>>>>>
>>>>>> And don't you think openMosix is suitable for a blast cluster?
>>>>>> You suggested SGE?
>>>>>>
>>>>>> On Thu, 11 Dec 2003, Farul Mohd. Ghazali wrote:
>>>>>>>
>>>>>>> On Wed, 10 Dec 2003, hong.zhang@research.dfci.harvard.edu wrote:
>>>>>>>
>>>>>>>> I am working on setting up a blast server on a Debian/OpenMosix
>>>>>>>> cluster with 4 nodes. Actually, it is totally new to me. Can
>>>>>>>> anyone give me some advice? Thanks.
>>>>>>>
>>>>>>> I've used OpenMosix in the form of ClusterKnoppix some months
>>>>>>> back to test it out. The setup was very easy: boot off the CD,
>>>>>>> configure some settings, and the rest of the nodes boot off the
>>>>>>> network. Applications are automatically load balanced across
>>>>>>> nodes.
>>>>>>>
>>>>>>> While configuration and actual use were very easy, performance
>>>>>>> wasn't too great. I think the main reason was that OpenMosix
>>>>>>> dynamically migrates applications to the different nodes to
>>>>>>> automatically load balance the system, so the overhead of
>>>>>>> migration for long-running jobs suddenly became apparent.
>>>>>>>
>>>>>>> To be honest, we didn't try to optimize it much and went on to
>>>>>>> implement our blast cluster with SGE and, hopefully soon,
>>>>>>> mpiblast.
>>>>>
>>>>>> --
>>>>>> Hong Zhang, MIS
>>>>>> Bioinformatics Analyst
>>>>>> Dana Farber Cancer Institute
>>>>>> Harvard Medical School
>>>>>> 44 Binney St, D1510A
>>>>>> Boston MA 02115
>>>>>> Email: hong.zhang@research.dfci.harvard.edu
>>>>>> Phone: 617-632-3824
>>>>>> Fax: 617-632-3351
>
>
> _______________________________________________
> Bioclusters maillist  -  Bioclusters@bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/bioclusters
>
John F. Lewis III
www.danforthcenter.org
jlewis@danforthcenter.org
314-587-1028