Hi Joe and others,

In our case of running 30,000 BLAST jobs on a 100-CPU cluster, you recommended not writing the output directly to the central file server, but writing it to the local node and collecting it at the end in a non-random manner, in order to avoid NFS server hiccups and the like.

I love that idea, but people from Germany have the strange habit of always trying to think of the worst possible scenario before accepting a new idea, so here comes a set of German questions:

Assume one slave node (A) dies. I suppose that SGE will restart the unfinished jobs X from node A on a new node B.

Question 1: Is that correct?

Assume the dead node (A) comes back to life at some point.

Question 2: Is SGE smart enough to notice that jobs X, which were started before node A went down, have been restarted on node B, and is SGE smart enough to remove the old (and useless) output of jobs X on node A?

Question 3: Alternatively, can SGE be told to try to restart jobs X on node A once that node is back to life? How?

Question 4: If the answer to Q3 is yes, can SGE restart jobs X at the point where they stopped, or does SGE always restart jobs from the beginning? I mean: does SGE support checkpointing? How?

Best regards,
Ivo