[Bioclusters] SGE and local output
Ivo Grosse
bioclusters@bioinformatics.org
Wed, 15 May 2002 13:34:53 -0400
Hi Joe and others,
In our case of running 30,000 BLAST jobs on a 100-CPU cluster, you
recommended not writing the output directly to the central file
server, but writing it to the local node and collecting it at the end
in a non-random manner, in order to avoid NFS server hiccups and the
like.
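To make sure I understand the recommendation, here is a minimal sketch of how I picture it, written in Python. It is not SGE-specific. The function name `run_with_local_output`, the directory arguments, and the `.part` suffix are my own assumptions for illustration, not anything SGE or BLAST provides.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def run_with_local_output(command, job_name, local_root, central_dir):
    """Run `command`, capture stdout on local disk, then copy it once.

    All names and the directory layout are hypothetical.
    """
    local_root = Path(local_root)
    central_dir = Path(central_dir)
    local_root.mkdir(parents=True, exist_ok=True)
    central_dir.mkdir(parents=True, exist_ok=True)

    # Per-job scratch directory on the node's local disk.
    scratch = Path(tempfile.mkdtemp(prefix=job_name + "-", dir=local_root))
    local_out = scratch / (job_name + ".out")
    try:
        # The job writes only to local disk while it is running.
        with open(local_out, "w") as handle:
            subprocess.run(command, stdout=handle, check=True)

        # One sequential copy to the central directory under a temporary
        # name, followed by a rename, so that a half-copied file never
        # carries the final name.
        partial = central_dir / (job_name + ".out.part")
        shutil.copyfile(local_out, partial)
        final = central_dir / (job_name + ".out")
        partial.replace(final)
        return final
    finally:
        # Remove the local scratch data whether or not the job succeeded.
        shutil.rmtree(scratch, ignore_errors=True)
```

In this sketch the central file server sees one sequential write per job instead of many small ones. It does not address what happens to the scratch directory if the node itself dies in the middle of a job, which is what the questions below are about.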
I love that idea, but people from Germany have the strange habit of
always trying to think of the worst possible scenario before accepting
a new idea, so here comes a set of German questions:
Assume one slave node (A) dies. I suppose that SGE will restart the
unfinished jobs X from node A on a new node B.
Question 1: Is that correct?
Assume the dead node (A) comes back to life at some point.
Question 2: Is SGE smart enough to notice that jobs X that were started
before node A went down have been restarted on node B, and is SGE smart
enough to remove the old (and useless) output of jobs X on node A?
Question 3: Alternatively, can SGE be told to try to restart jobs X on
node A after that node is back to life? How?
Question 4: If the answer to Q3 is yes, can SGE restart jobs X at the
point where they stopped, or does SGE always restart jobs from the
beginning? I mean: does SGE support checkpointing? How?
Best regards, Ivo