> anyway. I have to say that we've always gone with the distribute the
> data set to all the machines anyway; NFS, or relying on caching at all,
> only helps if the users are arranging their work in such a way that
> takes advantage of caching, and that's not the case in my experience.

Ditto. We've recently moved from using our (limited) local cluster to a more "grid" <shudder> setup where jobs run on a number of administratively distinct clusters and workflow is handled by a metascheduler. In this sort of world, I have yet to see a better solution than to decouple the data transfer from the workflow and demand that any node where a job may be scheduled have fast access to an up-to-date copy of the complete set of targets.

In the case of standard sequence analysis tools (BLAST, HMMER, ...), this works out to keeping about 30GB per node up to date. Large, but it doesn't have to break the bank.

If your data allow it (the set of all target files fits on most local disks), then solving the data sync problem independently of job scheduling makes both problems much simpler.

-C
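To make the decoupling concrete, here's a minimal sketch of the kind of mirror job I mean, written in Python for illustration (in practice rsync from cron does the same thing); the function name and directory layout are hypothetical, not from any particular site:

```python
# Hypothetical cron-style sync job: mirror the master copy of the target
# databases onto this node's local disk, copying only files whose size or
# modification time differ. Runs entirely outside the job scheduler.
import os
import shutil

def sync_targets(master: str, local: str) -> list:
    """Mirror `master` into `local`; return the relative paths copied."""
    copied = []
    for root, _dirs, files in os.walk(master):
        rel = os.path.relpath(root, master)
        dest_dir = os.path.join(local, rel)
        os.makedirs(dest_dir, exist_ok=True)
        for name in files:
            src = os.path.join(root, name)
            dst = os.path.join(dest_dir, name)
            s = os.stat(src)
            # Copy only if missing, truncated, or stale.
            if (not os.path.exists(dst)
                    or os.stat(dst).st_size != s.st_size
                    or os.stat(dst).st_mtime < s.st_mtime):
                shutil.copy2(src, dst)  # copy2 preserves the mtime
                copied.append(os.path.normpath(os.path.join(rel, name)))
    return copied
```

Run something like this periodically on every node where jobs may land; the scheduler then only has to assume the local target directory is current, and never moves data itself.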