[Bioclusters] Daemonizing blast, ie running many sequences through 1 process

Chris Dwan (CCGB) bioclusters@bioinformatics.org
Fri, 7 Nov 2003 10:50:39 -0600 (CST)


> anyway.  I have to say that we've always gone with distributing the
> data set to all the machines anyway; NFS, or relying on caching at all,
> only helps if the users are arranging their work in a way that
> takes advantage of caching, and that's not the case in my experience.

Ditto.

We've recently moved from using our (limited) local cluster to
a more "grid" <shudder> setup where jobs run on a number of
administratively distinct clusters and workflow is handled by a
metascheduler.

In this sort of world, I have yet to see a better solution than to
decouple the data transfer from the workflow and demand that any
node where a job is to be scheduled have fast access to an up-to-date copy
of the complete set of targets.  In the case of standard sequence
analysis tools (BLAST, HMMER, ...), this works out to keeping about 30GB
per node up to date.  Large, but it doesn't have to break the bank.

If your data allow it (i.e., the full set of target files fits on most local
disks), then solving the data sync problem independently of job scheduling
makes both problems much simpler.
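
For what it's worth, a minimal sketch of what that decoupled sync might look
like on each node (assuming an rsync daemon on the machine holding the master
copy; the host, paths, and options below are placeholders, not what we
actually run):

  #!/usr/bin/env python
  """Sketch: keep a node's local copy of the target databases up to date,
  run from cron on each node, entirely outside the job scheduler."""

  import subprocess
  import sys

  # Assumed locations; adjust for your site.
  MASTER = "dbserver::blastdb/"       # rsync module holding the master copy
  LOCAL = "/local/scratch/blastdb/"   # fast local disk on the compute node

  def sync_targets():
      """Mirror the master target set onto local disk."""
      cmd = [
          "rsync", "-av", "--delete",  # mirror exactly, removing stale files
          "--partial",                 # keep partial transfers so restarts resume
          MASTER, LOCAL,
      ]
      return subprocess.run(cmd).returncode

  if __name__ == "__main__":
      sys.exit(sync_targets())

The point isn't the particular tool; it's that jobs can assume the targets are
already on local disk, and the scheduler never has to think about moving data.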

-C