[Bioclusters] Versions of Blast that run on a cluster?

Wed Jan 5 16:57:11 EST 2005

On 5 Jan 2005, at 6:39 pm, Bernard Li wrote:

> Hi Malay:
>
> Are there any documents and/or papers which describe such a setup?
> I would assume that there would be general interest in seeing how such
> a setup could be implemented.

You could look at the code for the Ensembl pipeline, which uses a 
similar technique, but it's much more general, and handles a large 
number of algorithms, not just BLAST.

The scientist creates a configuration file which describes the analyses 
to be performed (BLAST, RepeatMasker, genewise, whatever), and runs a 
Perl script known as the RuleManager.  The RuleManager works out how 
many jobs need to be run and in what order, and populates a database 
table with that information.  It then submits the appropriate number of 
jobs to the batch queueing system.  These jobs don't inherently know 
what they have to do; as they land on an execution host they query the 
MySQL database for their particular work unit, then execute the 
analyses in that work unit against local BLAST databases duplicated on 
each node.  (We do use cluster filesystems on some groups of nodes now 
as well, but we *don't* use NFS; it's totally broken at this scale.)  
When the analyses are completed, each job writes its result features 
back into the MySQL database, from where the web server component of 
Ensembl can extract them and display them to the end user.
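
To make the flow above concrete, here is a rough sketch of the "jobs 
pull their work unit from a database" idea.  It is not the Ensembl 
code: it is written in Python rather than Perl, it uses an in-memory 
SQLite database as a stand-in for MySQL, and the table, column and 
function names (job, populate_jobs, claim_job, run_worker) are all 
made up for illustration.  It also ignores the ordering of analyses 
that the RuleManager handles.

```python
import sqlite3


def populate_jobs(conn, input_ids, analyses):
    """Stand-in for the RuleManager step: one row per work unit."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS job ("
        " job_id INTEGER PRIMARY KEY,"
        " input_id TEXT,"
        " analysis TEXT,"
        " status TEXT DEFAULT 'PENDING',"
        " result TEXT)"
    )
    rows = [(i, a) for i in input_ids for a in analyses]
    conn.executemany(
        "INSERT INTO job (input_id, analysis) VALUES (?, ?)", rows
    )
    conn.commit()
    return len(rows)


def claim_job(conn):
    """Claim one pending work unit, or return None if none remain."""
    with conn:  # commits on success, rolls back on error
        row = conn.execute(
            "SELECT job_id, input_id, analysis FROM job"
            " WHERE status = 'PENDING' ORDER BY job_id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # Only take the job if nobody else has claimed it meanwhile.
        cur = conn.execute(
            "UPDATE job SET status = 'RUNNING'"
            " WHERE job_id = ? AND status = 'PENDING'",
            (row[0],),
        )
        return row if cur.rowcount == 1 else None


def run_worker(conn, run_analysis):
    """What a batch job does on an execution host: pull, run, write back."""
    done = 0
    while True:
        job = claim_job(conn)
        if job is None:
            return done
        job_id, input_id, analysis = job
        result = run_analysis(analysis, input_id)
        with conn:
            conn.execute(
                "UPDATE job SET status = 'DONE', result = ?"
                " WHERE job_id = ?",
                (result, job_id),
            )
        done += 1


conn = sqlite3.connect(":memory:")
n_jobs = populate_jobs(conn, ["contig1", "contig2"],
                       ["BLAST", "RepeatMasker"])
# The lambda is a placeholder for actually running the analysis.
n_done = run_worker(
    conn, lambda analysis, input_id: f"{analysis} features for {input_id}"
)
```

With one connection there is no real competition for jobs.  With many 
workers hitting a shared server, the claim step would have to lean on 
the database's own locking, and that is the sort of place where 
contention shows up as the number of machines grows.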

All of this code can be downloaded from the Ensembl CVS repository; 
instructions are at www.ensembl.org.

This pipeline code scales well up to a few hundred machines; after 
that, database contention becomes the limiting factor.  Other pipeline 
architectures that are less database-intensive are being developed by 
the teams at the EBI and Sanger (and can actually be found in the same 
CVS repository).

Tim

-- 
Dr Tim Cutts
Informatics Systems Group, Wellcome Trust Sanger Institute
GPG: 1024D/E3134233 FE3D 6C73 BBD6 726A A3F5  860B 3CDD 3F56 E313 4233