[Bioclusters] OpenPBS problems
   
    Donald Becker
     
    bioclusters@bioinformatics.org
       
    Tue, 2 Dec 2003 20:06:35 -0500 (EST)
    
    
  
On Tue, 2 Dec 2003, Ron Chen wrote:
> For those who are interested in using checkpointing, the
> place to start is to link your applications against a
> checkpointing library:
That doesn't address the challenge in the previous message:
checkpointing a pipeline.  The Scyld cluster system has built-in
process checkpointing (process migration and remote fork are
implemented by checkpointing down a socket and restarting on the remote
machine) and a single cluster-wide process space with remote signal
forwarding.  But even with that core functionality, an arbitrary
process pipeline can't be checkpointed in the general case.
> http://www.checkpointing.org/
> 
> For SGE, follow the steps here:
> 
> http://gridengine.sunsource.net/project/gridengine/howto/condorckpt.html
Condor implements checkpointing by using a special library that records
calls.  To oversimplify: when it sees foofd = open("/foo"), it
remembers the path name "/foo".  While this frequently works, it can be
easily misled.  Anonymous scratch files (open() then unlink()) and
ioctl() calls are two obvious examples.
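To make the scratch-file case concrete, here is a minimal Python sketch
(POSIX only).  It is not Condor's code; the function name and the
"recorded_path" variable are made up for illustration, and stand in for
whatever a path-recording library would remember at open() time.

```python
import os
import tempfile

def record_then_unlink():
    """Open a scratch file, note its path as a call recorder might, then unlink it."""
    fd, path = tempfile.mkstemp()
    recorded_path = path            # hypothetical: what a path recorder would keep
    os.write(fd, b"scratch data")
    os.unlink(path)                 # file is now anonymous; only the descriptor holds it
    still_readable = os.pread(fd, 12, 0) == b"scratch data"
    path_exists = os.path.exists(recorded_path)
    os.close(fd)
    return still_readable, path_exists
```

The descriptor still reads back the data, but the remembered path no
longer names anything, so a restart that reopens by path has nothing to
reopen.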
Back to the core point: to checkpoint a pipeline, the in-pipe data has
to be either
   throttled and drained, or
   extracted and stored.
This goes beyond checkpointing a single process.  And a pipeline
spanning machines is even more interesting.
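As a rough illustration of the "extracted and stored" option, here is a
small Python sketch for a single local pipe.  It is only a toy under
stated assumptions: the helper names are invented, closing the write end
stands in for stopping the producer, and the data is small enough to fit
in the kernel pipe buffer.  A real pipeline, let alone one spanning
machines, needs far more coordination than this.

```python
import os

def drain_pipe(read_fd):
    """Read from the pipe until EOF and return everything that was buffered."""
    chunks = []
    while True:
        chunk = os.read(read_fd, 4096)
        if not chunk:               # empty read means all writers have closed
            break
        chunks.append(chunk)
    return b"".join(chunks)

def demo():
    r, w = os.pipe()
    os.write(w, b"record 1\nrecord 2\n")  # bytes now live in the kernel, in neither process
    os.close(w)                           # stand-in for throttling the producer
    saved = drain_pipe(r)                 # extract the in-flight data so it can be stored
    os.close(r)

    # On restart, the stored bytes are pushed into a fresh pipe for the consumer.
    r2, w2 = os.pipe()
    os.write(w2, saved)
    os.close(w2)
    replayed = drain_pipe(r2)
    os.close(r2)
    return saved, replayed
```

The point is that the in-flight bytes belong to neither process image,
so checkpointing each process separately would silently lose them.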
-- 
Donald Becker				becker@scyld.com
Scyld Computing Corporation		http://www.scyld.com
914 Bay Ridge Road, Suite 220		Scyld Beowulf cluster system
Annapolis MD 21403			410-990-9993