[Bioclusters] Is the "OR" job dependency useful??

Malay mbasu at mail.nih.gov
Fri Jan 7 17:33:03 EST 2005


Thanks, Tim, for your detailed reply. But I am still not sure whether it 
is the right path to take. Here is my reply.


Tim Cutts wrote:
> 
> On 7 Jan 2005, at 6:46 pm, Malay wrote:
> 
>> A pipeline of any kind by nature depends on the previous process.
>>
>> A -> B -> C
> 
> 
> It's not necessarily as linear as that.  You can sometimes have parallel 
> tasks.  Ascii art will probably defeat me here, but consider:
> 
> 
>   B-C
>  /   \
> A     F-G
>  \   /
>   D-E
> 
> Things like this are not uncommon.

No batch queuing system can express a branched relationship like that 
directly. You need to decompose the branched relation into linear ones 
to submit it to the job queue.

> 
>> I don't understand what you mean by jobs here. These rules can't be
>> hardcoded in the scheduler, or can they?
> 
> 
> Not hardcoded in the scheduler as such, but you can tell the scheduler 
> what the dependencies are in LSF, and probably in others too.  The above 
> sets of jobs in LSF would be done with:
> 
> bsub -JA ...
> 
> bsub -JB -w'done(A)' ...
> 
> bsub -JC -w'done(B)' ...
> 
> bsub -JD -w'done(A)' ...
> 
> bsub -JE -w'done(D)' ...
> 
> bsub -JF -w'done(C) && done(E)' ...
> 
> bsub -JG -w'done(F)'

You are doing exactly what I said: decomposing the branched relation into linear ones.

> 
>> In bioinformatics each of these steps is actually not a job at all;
>> they are what are called "steps". Each of these steps, like A, is
>> composed of 1,000,000 BLAST jobs which have no dependency on each other.
> 
> 
> In LSF, multiple jobs or job arrays can be given the same name with the 
> -J parameter, and then the dependency condition applies to all jobs with 
> that name.

Agreed.
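
For illustration, and if I remember the LSF syntax correctly, one of 
those million-BLAST steps could go in as a single named job array and the 
next step could wait on the whole thing. This is only a sketch; the job 
names, database, and file names are made up, not anything from your mail:

  bsub -J "stepA[1-1000000]" 'blastall -p blastp -d nr -i chunk_$LSB_JOBINDEX.fa -o chunk_$LSB_JOBINDEX.out'
  bsub -J stepB -w 'done(stepA)' ./merge_results.sh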

>> As I said. But do you actually suggest completing a "job" pipeline
>> before a "step" pipeline? Do you actually carry out the analysis of a
>> small region of genome sequence and finish it to the end, or finish the
>> BLAST searches for the whole genome at a time?
> 
> 
> The Ensembl pipeline does a mixture of both.


>> That's what I meant! The whole dependency issue is in user space, and can
>> be very well maintained by user software. In the software world,
>> "unnecessary" means "the thing can be managed in an easier way".
> 
> 
> Yes, but that means every time someone has to write a pipeline they have 
> to write stuff to manage their own dependencies, whereas if the 
> scheduler can do it, the pipeline code the user has to write is much 
> simpler.

Here I disagree. I'll give you an example. Since you are talking about 
LSF, I can tell you my own experience. If one of the LSF nodes has an NFS 
problem, the LSF job never returns. The job just stays there forever 
unless it's killed manually. If you don't code for timeout values there 
is no way you can relaunch the job. But it is far easier to check for a 
timeout and to kill and respawn the job yourself, using a simple script. 
Any of the dependency rules you mentioned will leave the whole pipeline 
stuck. At least, LSF is not as intelligent as user code can be. It's 
about fine-grained control.
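
The kind of script I mean is nothing fancier than this. It's only a rough 
sketch; the timeout value, job name, and BLAST command line are made up, 
and the bjobs output check may need adjusting for a particular LSF setup:

  #!/bin/sh
  # Watchdog: kill and resubmit the job if it has not finished within LIMIT seconds.
  LIMIT=3600                                  # made-up timeout in seconds
  NAME=blastA                                 # made-up job name
  CMD="blastall -p blastp -d nr -i chunk.fa -o chunk.out"   # made-up command

  bsub -J "$NAME" "$CMD"
  WAITED=0
  # bjobs without -a lists only jobs that are still pending or running
  while bjobs -J "$NAME" 2>/dev/null | grep -q "$NAME"; do
      sleep 60
      WAITED=`expr $WAITED + 60`
      if [ $WAITED -ge $LIMIT ]; then
          bkill -J "$NAME"                    # kill the stuck job
          sleep 10
          bsub -J "$NAME" "$CMD"              # and respawn it
          WAITED=0
      fi
  done

Wrap something like that in a loop over chunks and you have the 
fine-grained control that no dependency expression gives you.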

> Ensembl only has its own rule manager because it is designed to be 
> independent of the batch queueing system in use.

In other words, making the user's life easier.

> 
> Letting LSF get on with it is a lot simpler than having some nasty, and 
> hard to write, code which polls the scheduling system to check that the 
> previous jobs have finished before the next lot can be started.
> 

The nature of makeshift clusters is such that they are still not highly 
dependable. There are zillions of problems. As the whole system is not 
dependable, having only coarse-grained control of it through higher-level 
software is, in my opinion, building a castle in the clouds. Maybe one 
day the system will be dependable enough to use such high-level logic. 
For the time being I am sticking to my script.

Malay

