[Bioclusters] SGE and local output

Wed, 15 May 2002 13:34:53 -0400

Hi Joe and others,

In our case of running 30,000 Blast jobs on a 100-CPU cluster, you 
recommended not writing the output directly to the central file 
server, but writing it to the local node and collecting it at the 
end in a non-random manner, in order to avoid NFS server hiccups 
and the like.
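
To make sure we mean the same thing, here is a minimal Python sketch of 
the scheme as I understand it.  The function names (write_local, 
collect), the job_NNNNN.out file layout and the "first copy wins" rule 
are my own assumptions for illustration; none of this is something SGE 
provides.

```python
import shutil
import tempfile
from pathlib import Path


def write_local(job_id: int, text: str, local_dir: Path) -> Path:
    """Write one job's output to node-local scratch instead of the NFS share."""
    local_dir.mkdir(parents=True, exist_ok=True)
    out = local_dir / f"job_{job_id:05d}.out"
    tmp = out.with_name(out.name + ".part")
    tmp.write_text(text)
    tmp.replace(out)  # the final name only appears once the output is complete
    return out


def collect(local_dirs, central_dir: Path):
    """Copy finished outputs to the central directory, node by node, in sorted order."""
    central_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for node_dir in local_dirs:
        # "job_*.out" skips unfinished "*.part" files left behind by a dead node
        for src in sorted(node_dir.glob("job_*.out")):
            dst = central_dir / src.name
            if dst.exists():
                continue  # assumed policy: a restarted job's first collected copy wins
            shutil.copy2(src, dst)
            copied.append(dst.name)
    return copied


if __name__ == "__main__":
    # Simulate two nodes with temporary directories (hypothetical layout).
    root = Path(tempfile.mkdtemp())
    node_a, node_b = root / "nodeA", root / "nodeB"
    write_local(1, "hits for query 1\n", node_a)
    write_local(2, "hits for query 2\n", node_b)
    print(collect([node_a, node_b], root / "central"))
```

The sequential loop in collect is what I take "non-random" to mean: 
the file server sees one orderly stream of copies rather than 100 
nodes writing at once.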

I love that idea, but people from Germany have the strange habit of 
always trying to think of the worst possible scenario before accepting 
a new idea, so here comes a set of German questions:

Assume one slave node (A) dies.  I suppose that SGE will restart the 
non-finished jobs X from node A on a new node B.

Question 1: Is that correct?

Assume the dead node (A) comes back to life at some point.

Question 2: Is SGE smart enough to notice that jobs X that were started 
before node A went down have been restarted on node B, and is SGE smart 
enough to remove the old (and useless) output of jobs X on node A?

Question 3: Alternatively, can SGE be told to try to restart jobs X on 
node A after that node is back to life?  How?

Question 4: If the answer to Q3 is yes, can SGE restart jobs X at the 
point where they stopped, or does SGE always restart jobs from the 
beginning?  I mean: does SGE support checkpointing?  How?

Best regards, Ivo