On 5 Jan 2005, at 6:39 pm, Bernard Li wrote:

> Hi Malay:
>
> Are there any documentations and/or papers which describe such a setup?
> I would assume that there would be general interest in seeing how such
> a setup could be implemented.

You could look at the code for the Ensembl pipeline, which uses a similar technique but is much more general: it handles a large number of algorithms, not just BLAST.

The scientist creates a configuration file which describes the analyses to be performed (BLAST, RepeatMasker, genewise, whatever), and runs a Perl script known as the RuleManager. The RuleManager works out how many jobs need to be run and in what order, and populates a database table with that information. It then submits the appropriate number of jobs to the batch queueing system.

These jobs don't inherently know what they have to do. As each one lands on an execution host, it queries the MySQL database for its particular work unit and executes the analyses in that work unit against local BLAST databases duplicated on each node. (We do use cluster filesystems on some groups of nodes now as well, but we *don't* use NFS; it's totally broken at this scale.) When the analyses are completed, each job writes its result features back into the MySQL database, from where the web server component of Ensembl can extract them and display them to the end user.

All of this code can be downloaded from the Ensembl CVS repository; instructions are at www.ensembl.org.

This pipeline code scales well up to a few hundred machines; after that, database contention becomes the limiting factor. Other pipeline architectures that are less database-intensive are being developed by the teams at the EBI and Sanger (and can actually be found in the same CVS repository).

Tim

--
Dr Tim Cutts
Informatics Systems Group, Wellcome Trust Sanger Institute
GPG: 1024D/E3134233 FE3D 6C73 BBD6 726A A3F5 860B 3CDD 3F56 E313 4233
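
For readers who want a concrete picture of the pattern described in the message above (a planner fills a job table, then generic workers pull their work units from the database and write results back), here is a minimal sketch. It is not the Ensembl code or schema: the `job` table, its columns, and the functions `populate_jobs`, `claim_job`, and `run_worker` are hypothetical names chosen for illustration, and Python's built-in `sqlite3` stands in for MySQL so the example runs without a server.

```python
import sqlite3


def populate_jobs(conn, input_ids, analyses):
    """Planner stand-in: create one pending row per (input, analysis) pair."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS job ("
        " job_id INTEGER PRIMARY KEY,"
        " input_id TEXT NOT NULL,"
        " analysis TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending',"
        " result TEXT)"
    )
    rows = [(i, a) for i in input_ids for a in analyses]
    conn.executemany("INSERT INTO job (input_id, analysis) VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)


def claim_job(conn):
    """Claim one pending work unit; return None when nothing is left."""
    while True:
        row = conn.execute(
            "SELECT job_id, input_id, analysis FROM job"
            " WHERE status = 'pending' ORDER BY job_id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # The status check in the WHERE clause means the update only
        # succeeds if no other worker claimed this row in the meantime.
        cur = conn.execute(
            "UPDATE job SET status = 'running'"
            " WHERE job_id = ? AND status = 'pending'",
            (row[0],),
        )
        conn.commit()
        if cur.rowcount == 1:
            return row


def run_worker(conn, run_analysis):
    """Generic worker: pull work units until none remain, storing each result."""
    done = 0
    while True:
        job = claim_job(conn)
        if job is None:
            return done
        job_id, input_id, analysis = job
        result = run_analysis(analysis, input_id)
        conn.execute(
            "UPDATE job SET status = 'done', result = ? WHERE job_id = ?",
            (result, job_id),
        )
        conn.commit()
        done += 1


def fake_analysis(analysis, input_id):
    """Placeholder for a real tool invocation (hypothetical)."""
    return f"{analysis}:{input_id}:ok"


conn = sqlite3.connect(":memory:")
n_jobs = populate_jobs(conn, ["contig1", "contig2"], ["blast", "repeatmasker"])
n_done = run_worker(conn, fake_analysis)
```

Because every claim and every result is a round trip to the one shared database, the sketch also makes it easy to see why database contention would become the limiting factor as the number of workers grows.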