[Bioclusters] Is the "OR" job dependency useful??

Tim Cutts tjrc at sanger.ac.uk
Fri Jan 7 17:05:53 EST 2005


On 7 Jan 2005, at 6:46 pm, Malay wrote:

> A pipeline of any kind by nature depends on the previous process.
>
> A -> B -> C

It's not necessarily as linear as that.  You can sometimes have
parallel tasks.  ASCII art will probably defeat me here, but consider:


   B-C
  /   \
A     F-G
  \   /
   D-E

Things like this are not uncommon.

> I don't understand what you mean by jobs here. These rules can't be
> hardcoded in the scheduler, or can they?

Not hardcoded in the scheduler as such, but in LSF you can tell the
scheduler what the dependencies are, and probably in others too.  The
set of jobs above could be submitted to LSF with:

bsub -JA ...

bsub -JB -w'done(A)' ...

bsub -JC -w'done(B)' ...

bsub -JD -w'done(A)' ...

bsub -JE -w'done(D)' ...

bsub -JF -w'done(C) && done(E)' ...

bsub -JG -w'done(F)' ...
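
If the graph gets bigger, the same commands can be generated from a
small table rather than typed by hand.  This is only a dry-run sketch:
it prints the bsub lines instead of submitting anything, the
"job:dep1,dep2" table format and the function names emit_bsub and
emit_pipeline are made up for illustration, and the trailing "..."
still stands for whatever the real command line is.

```shell
#!/bin/sh
# Dry-run sketch: print one bsub command per job from a "job:deps" table.
# Nothing is submitted; "..." stands for the real command line.

# emit_bsub JOB [DEPS]   DEPS is a comma-separated list of job names.
emit_bsub() {
    job=$1
    deps=$2
    if [ -z "$deps" ]; then
        # No dependencies: plain named submission.
        printf "bsub -J%s ...\n" "$job"
        return
    fi
    cond=""
    old_ifs=$IFS
    IFS=,
    for d in $deps; do
        # Join the individual done() terms with a logical AND.
        if [ -z "$cond" ]; then
            cond="done($d)"
        else
            cond="$cond && done($d)"
        fi
    done
    IFS=$old_ifs
    printf "bsub -J%s -w'%s' ...\n" "$job" "$cond"
}

# The diamond-shaped graph from the ASCII art above.
emit_pipeline() {
    for entry in A: B:A C:B D:A E:D F:C,E G:F; do
        emit_bsub "${entry%%:*}" "${entry#*:}"
    done
}

emit_pipeline
```

Piping the output through sh would submit the jobs for real, once the
"..." placeholders have been replaced with actual commands.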

> In bioinformatics each of these steps is actually not a job at all; they
> are what are called "steps". Each of these steps, like A, is composed of
> 1,000,000 BLAST jobs which have no dependency on each other.

In LSF, multiple jobs or job arrays can be given the same name with the 
-J parameter, and then the dependency condition applies to all jobs 
with that name.
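
As a rough sketch of what that looks like for a step made of many
independent tasks: the step is split into job arrays that all share the
name A, and a single follow-on job B waits on that name.  This is a dry
run that only prints the commands.  The function name emit_step and the
chunk size are assumptions for illustration (the real limit on array
size is a site setting), and "..." again stands for the real command.

```shell
#!/bin/sh
# Dry-run sketch: split TOTAL tasks into job arrays that all share the
# name A, then print a follow-on job B that waits for every job named A.
# Nothing is submitted; "..." stands for the real command line.

# emit_step TOTAL CHUNK
emit_step() {
    total=$1      # number of independent tasks in the step
    chunk=$2      # tasks per array (hypothetical site limit)
    start=1
    while [ "$start" -le "$total" ]; do
        end=$((start + chunk - 1))
        if [ "$end" -gt "$total" ]; then
            end=$total
        fi
        # Each array covers one contiguous index range of the step.
        printf "bsub -J'A[%d-%d]' ...\n" "$start" "$end"
        start=$((end + 1))
    done
    # One dependency condition covers all jobs named A.
    printf "bsub -JB -w'done(A)' ...\n"
}

emit_step 2500 1000
```

Inside each array element, LSF sets LSB_JOBINDEX, which the job can use
to pick its own slice of the input.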

> As I said. But do you actually suggest completing a "job" pipeline
> before a "step" pipeline? Do you actually carry out the analysis of a
> small region of genome sequence and finish it to the end, or finish the
> BLAST searches for the whole genome at a time?

The Ensembl pipeline does a mixture of both.

> That's what I meant! The whole dependency issue is in user space, and can
> be very well maintained by user software. In the software world,
> unnecessary means "the thing can be managed in an easier way".

Yes, but that means every time someone has to write a pipeline they 
have to write stuff to manage their own dependencies, whereas if the 
scheduler can do it, the pipeline code the user has to write is much 
simpler.

Ensembl only has its own rule manager because it is designed to be 
independent of the batch queueing system in use.

Letting LSF get on with it is a lot simpler than having some nasty, and 
hard to write, code which polls the scheduling system to check that the 
previous jobs have finished before the next lot can be started.
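
For comparison, here is a minimal sketch of the user-space polling
approach.  The names poll_until and step_a_done are hypothetical, and
the check is just a marker-file test standing in for whatever query
your batch system offers.  Real code would also have to tell "finished"
from "failed", cope with the scheduler being unreachable, and so on.

```shell
#!/bin/sh
# Sketch of user-space polling: retry a check command until it succeeds.

# poll_until INTERVAL MAX_TRIES CMD [ARGS...]
poll_until() {
    interval=$1
    max_tries=$2
    shift 2
    tries=0
    while [ "$tries" -lt "$max_tries" ]; do
        if "$@"; then
            return 0        # the check reported that the step finished
        fi
        tries=$((tries + 1))
        sleep "$interval"
    done
    return 1                # gave up; the caller must handle this
}

# Hypothetical stand-in for "have all jobs of step A finished?".
# Here it only tests for a marker file.
step_a_done() {
    [ -e "${MARKER:-/tmp/step_A.done}" ]
}

# Example use (not run here):
#   poll_until 60 1440 step_a_done && bsub -JB ...
```

Every pipeline author ends up maintaining a loop like this for each
step, which is exactly the work a -w dependency hands to the scheduler.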

I imagine SGE can do all this stuff too.

Tim

-- 
Dr Tim Cutts
Informatics Systems Group, Wellcome Trust Sanger Institute
GPG: 1024D/E3134233 FE3D 6C73 BBD6 726A A3F5  860B 3CDD 3F56 E313 4233
