[Bioclusters] BLAST/ PBS / Grid Engine

Steve Pittard bioclusters@bioinformatics.org
Fri, 17 May 2002 17:29:05 -0400 (EDT)


First the question:

Does anyone know of particular combinations of load management software
and OS (e.g. PBS on Scyld or LSF on Red Hat) which are particularly good
at helping one manage web-based Blast submissions?

Now the context:

I've been offering a web-based blast service
to my local user community. It's a small emulation of what one finds
at the NCBI Blast site. We currently have an NCBI-ish web front end
for some Perl scripts which perform the blast and return the results. All
in all it's pretty usable stuff, except that demand has driven up the load
averages on my server (a dual-CPU Dell PowerEdge with Red Hat 7.2). Several
searches of "nr" can slow things down quite rapidly.

So I've begun experimenting with OpenPBS to smooth the load
on the server and keep it running well. So far so good, but since
I don't have a cluster yet, I haven't experimented with passing
jobs off to other nodes.
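
To make the PBS setup described above concrete, here is a minimal sketch
of how a web front end might wrap one search in a PBS job script. It is
not the actual code behind the service: the function name, the default
job name "webblast" and the ".out" output naming are assumptions made up
for illustration. The blastall flags (-p, -d, -i, -o) and the #PBS
directives are the standard ones.

```python
def make_pbs_script(query_path, db="nr", program="blastp", job_name="webblast"):
    """Return the text of a PBS job script that runs one serial BLAST search.

    The function name, the default job name and the ".out" suffix are
    hypothetical choices for this sketch.
    """
    out_path = query_path + ".out"
    lines = [
        "#!/bin/sh",
        "#PBS -N " + job_name,   # job name shown by qstat
        "#PBS -l nodes=1",       # one node; the search itself is serial
        "#PBS -j oe",            # merge stdout and stderr into one file
        "cd $PBS_O_WORKDIR",     # run from the directory qsub was called in
        "blastall -p %s -d %s -i %s -o %s" % (program, db, query_path, out_path),
    ]
    return "\n".join(lines) + "\n"
```

The returned text would be written to a file and handed to qsub; the
queue then decides when the search actually starts, which is what keeps
the load average down.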

Knowing that Blast (as distributed by NCBI)
is not parallel, I think that the best
I can do for the web-based queries is to let PBS assign
the blast jobs to less busy PBS nodes to avoid the logjam.
I'm fairly certain that no load management software (PBS, Grid Engine,
LSF) can take Blast (or, more generally, any non-parallel app)
and spread out its CPU needs amongst the cluster. Is this
assessment correct?
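
The idea of handing each whole job to the least busy node can be
illustrated with a toy simulation. This is only a sketch of the general
"least-loaded first" policy, not how PBS, Grid Engine or LSF actually
schedule; the function name, the node names and the per-job cost
estimates are all assumptions for illustration.

```python
import heapq

def assign_jobs(job_costs, node_names):
    """Assign each whole job to the node with the least accumulated work.

    job_costs are hypothetical run-time estimates. A job is never split:
    it lands on exactly one node, as a serial program must.
    """
    heap = [(0.0, name) for name in node_names]  # (accumulated load, node)
    heapq.heapify(heap)
    placement = []
    for cost in job_costs:
        load, name = heapq.heappop(heap)          # currently least-loaded node
        placement.append(name)
        heapq.heappush(heap, (load + cost, name))
    return placement
```

For example, assign_jobs([10, 1, 1, 1], ["n1", "n2"]) places the long
"nr"-style search on n1 and the three short searches on n2. The long job
still takes just as long; only the queue behind it moves faster.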

I realize that for batch blasting many people "chop
up" the database over the nodes, formatdb the chunks, and
blast the queries against these chunks. Perl scripts
like disperse.pl also segment the larger Blast into more
manageable pieces. But this isn't scalable for web queries
that might occur several times a minute. So in my situation
I have the DBs (e.g. nr, swissprot, plant, etc.) "formatdbed"
on a server disk, with the ultimate intention of having them
on cluster nodes, perhaps with NFS over gigabit.
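
For reference, the "chop up" step used in batch blasting can be sketched
as below. This is an illustrative sketch only, not disperse.pl or any
existing tool: the function name and the round-robin policy are
assumptions. It takes FASTA text and deals the records out into a fixed
number of chunks, each of which could then be written to its own file
and run through formatdb on a separate node.

```python
def split_fasta(text, n_chunks):
    """Distribute FASTA records round-robin into n_chunks lists of records.

    Hypothetical helper: each record is kept whole (header line plus its
    sequence lines), so no sequence is cut in the middle.
    """
    records = []
    for line in text.splitlines():
        if line.startswith(">"):
            records.append([line])          # start a new record at each header
        elif records and line.strip():
            records[-1].append(line)        # sequence line of the current record
    chunks = [[] for _ in range(n_chunks)]
    for i, rec in enumerate(records):
        chunks[i % n_chunks].append("\n".join(rec))
    return chunks
```

Round-robin keeps the record counts even but not necessarily the total
sequence length per chunk, so a real splitter would probably balance on
residues instead.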

RLX Technologies sells an LSF-based
"Blast server" which is aimed squarely at the "I want to blast
thousands of sequences at once" batch blast market, though
again, what I'm doing is not really that, since my blast requests
come in over the web on a frequent basis. But I've been working
with them a bit on my particular situation.

Anyway I have been looking at other "proper" cluster systems 
and have been wondering which setup would best benefit 
the type of Blasting that I'm interested in. Strongly 
related to this question is the type of load management 
software to use and on what platform. I've been using PBS
on Red Hat and so far so good, but I have heard good things
about LSF and Grid Engine.

Any feedback is appreciated. Regards,

Steve Pittard	 | http://catalina.bimcore.emory.edu (HOME PAGE)
Emory University | wsp@emory.edu, wsp@bimcore.emory.edu  (INTERNET) 
BIMCORE Support	 | 404 727 0038