[Bioclusters] NCBI database download and format code

Chris Dagdigian bioclusters@bioinformatics.org
Fri, 02 May 2003 09:29:35 -0400


Jeremy Mann wrote:

> Then how would you tell blastall which nodes have which *piece* of the
> database?
> 

Depends. If all your nodes have all the pieces then you just submit 
multiple blastall searches to your cluster, each search specifying only 
the database segment you want to query against. Easy. The harder part is 
getting the multiple responses back and merging them into something 
sensible.
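
For what it's worth, here is a rough Python sketch of the "all nodes have all pieces" case: build one blastall command per database segment, then merge the per-segment results. The segment names (nt.00, nt.01), the output file naming and both helper functions are made up for illustration. It assumes tabular output (-m 8), where the bit score is the 12th column. Hits are ranked by bit score rather than e-value, because e-values depend on the size of the database searched and so are not directly comparable across segments.

```python
def blastall_commands(query, fragments, program="blastn"):
    """Build one blastall command line per database fragment.

    Fragment names and the '<fragment>.out' naming are illustrative only.
    """
    return [
        ["blastall", "-p", program, "-d", frag, "-i", query,
         "-m", "8", "-o", f"{frag}.out"]
        for frag in fragments
    ]


def merge_tabular_hits(outputs):
    """Merge several -m 8 (tabular) result texts into one hit list.

    Hits are grouped by query ID and, within a query, ordered by
    descending bit score (12th column).
    """
    hits = []
    for text in outputs:
        for line in text.splitlines():
            if not line.strip() or line.startswith("#"):
                continue  # skip blank lines and comment lines
            fields = line.split("\t")
            hits.append(fields)
    hits.sort(key=lambda f: (f[0], -float(f[11])))
    return hits


if __name__ == "__main__":
    for cmd in blastall_commands("query.fa", ["nt.00", "nt.01"]):
        print(" ".join(cmd))
```

Each command would then be handed to the cluster as a separate job, and merge_tabular_hits would run once all of the output files are back.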

If your nodes do not have all the fragments on hand then you don't tell 
blastall. You tell your cluster load management system (PBS, GridEngine, 
LSF, etc.) to run your searches on a specific machine, queue or 
consumable/static resource. There are lots of ways to do this -- you can 
manually tell GridEngine or LSF to run job X on host Y or you can make 
this a bit more abstract by making your cluster job scheduler aware of 
which nodes have which pieces. This can be done by configuring LSF or 
GridEngine with custom static or dynamic resource attributes. Once that 
is done you can tell LSF, for instance, to "run this blast job on any 
machine in my cluster that has the attribute NCBI-GENBANK-PART-1 set to 
'true'".
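
As a hedged illustration of the resource-attribute approach, the Python sketch below only builds the submit command lines; it does not run anything. The resource name genbank_part<N> is hypothetical (written with underscores in place of the hyphenated attribute name used above) and would have to be defined by the cluster administrator as a boolean resource on the nodes that hold that piece. The helper functions are likewise invented for this example.

```python
def resource_name(part):
    """Hypothetical boolean resource marking nodes that hold a DB piece."""
    return f"genbank_part{part}"


def gridengine_submit(part, blast_cmd):
    """qsub command requesting a host where the resource is true."""
    return ["qsub", "-b", "y", "-cwd",
            "-l", f"{resource_name(part)}=true", *blast_cmd]


def lsf_submit(part, blast_cmd):
    """bsub command selecting hosts that have the boolean resource."""
    return ["bsub", "-R", f"select[{resource_name(part)}]", *blast_cmd]


if __name__ == "__main__":
    # Database piece name 'nt.01' is illustrative only.
    blast = ["blastall", "-p", "blastn", "-d", "nt.01", "-i", "query.fa"]
    print(" ".join(gridengine_submit(1, blast)))
    print(" ".join(lsf_submit(1, blast)))
```

The scheduler then picks any eligible host on its own, so the job script never has to name a machine.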

Back in my Blackstone Computing days we had a cool solution to this 
called smartcache. We basically added "data aware" scheduling 
capabilities to LSF or GridEngine. The end result was that the scheduler 
"knew" where the database pieces were and could allocate jobs 
accordingly to the proper machine or queue.

-Chris


-- 
Chris Dagdigian, <dag@sonsorol.org>
BioTeam Inc. - Independent Bio-IT & Informatics consulting
Office: 617-666-6454, Mobile: 617-877-5498, Fax: 425-699-0193
PGP KeyID: 83D4310E Yahoo IM: craffi Web: http://bioteam.net