[Bioclusters] Is the "OR" job dependency useful??

Fri Jan 7 13:46:13 EST 2005

Tim Cutts wrote:
> 
> On 6 Jan 2005, at 5:49 pm, Malay wrote:
> 
>> Rayson Ho wrote:
>>
>>> Gridengine currently has the "AND" operator job dependency:
>>> A,B -> C
>>> i.e. we need to wait for jobs A and B to finish before we start job C.
>>> There are discussions on the SGE dev mailing list about adding the OR
>>> job dependency:
>>> A|B -> C
>>> So job C will start as soon as job A or job B finishes.
>>> I am wondering if this is useful in bioinformatics job flows??
>>
>>
>> As far as bioinformatics goes, I am afraid most bioinformatics 
>> applications are embarrassingly independent :) Such dependency 
>> resolution will have its niche applications, but I guess they are 
>> very limited as far as bioinformatics goes.
> 
> 
> I don't think that's true - when you consider something like a gene 
> annotation process, there are lots of dependencies.  Consider what goes 
> on with Ensembl; before any analyses are performed, the sequences have 
> to be dusted and RepeatMasked.  After that raw features such as blast 
> hits, ab initio gene predictors and EST alignments can be calculated.  
> Once the BLAST hits have been done, genewise alignments can be performed 
> (using the BLAST results to narrow down the areas genewise needs to 
> analyse). Only once the EST alignments, ab initio predictors and 
> genewise are complete can the code be run to combine these into a 
> coherent set of gene structures.

A pipeline of any kind by nature depends on the previous process.

A -> B -> C

I don't understand what you mean by jobs here. These rules can't be
hardcoded in the scheduler, or can they?

In bioinformatics each of these steps is actually not a job at all; they
are what are called "steps". Each of these steps, like A, is composed of
1,000,000 BLAST jobs which have no dependency on each other.
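
To make the distinction concrete, here is a minimal sketch in Python. The
function names and toy sequences are made up for illustration and stand in
for real tools such as RepeatMasker or BLAST; nothing here is taken from
Ensembl or from any scheduler. Each step fans out into independent jobs,
and the next step starts only when all of them are done.

```python
from concurrent.futures import ThreadPoolExecutor

def run_step(job, inputs, workers=4):
    """Run one pipeline step: every input is handled by an independent job."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Jobs within a step do not depend on each other, so they can run
        # in any order; map() still returns results in input order.
        return list(pool.map(job, inputs))

# Hypothetical stand-ins for the real analyses.
def mask(seq):            # step A: "mask" part of the sequence
    return seq.replace("a", "N")

def search(seq):          # step B: count masked positions per sequence
    return (seq, seq.count("N"))

def summarize(hit):       # step C: format one result line
    return f"{hit[0]}:{hit[1]}"

sequences = ["acgt", "aacc", "ggtt"]

# A -> B -> C: each call returns only after all of its jobs have finished.
masked = run_step(mask, sequences)
hits = run_step(search, masked)
report = run_step(summarize, hits)
```

The only dependency here is between steps, and it is expressed by ordinary
program order rather than by anything the scheduler knows about.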

> 
> Although each of these processes consists of thousands of independent 
> jobs, each type of analysis is dependent on the completion of the 
> previous ones.

As I said. But do you actually suggest completing a "job" pipeline
before a "step" pipeline? Do you actually carry out the analysis of a
small region of genome sequence and finish it to the end, or finish the
BLAST searches for the whole genome at a time?

> As it happens, all of these dependencies are handled in the Ensembl 
> RuleManager rather than by the scheduling system.

That's what I meant! The whole dependency issue is in user space, and can
be very well maintained by user software. In the software world,
unnecessary means "the thing can be managed in an easier way".
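
As a rough illustration of how little is needed in user space, here is a
minimal Python sketch that handles both the AND and the proposed OR
dependency. The rule table and the names `ready` and `runnable` are
hypothetical; this is not how the Ensembl RuleManager or Gridengine
actually implements it.

```python
def ready(rule, finished):
    """Return True if a rule's prerequisites are satisfied.

    rule is a (mode, prerequisites) pair, where mode is "and" or "or".
    """
    mode, prereqs = rule
    check = all if mode == "and" else any
    return check(p in finished for p in prereqs)

def runnable(rules, finished):
    """List the steps that have not run yet and whose rule is satisfied."""
    return sorted(
        step for step, rule in rules.items()
        if step not in finished and ready(rule, finished)
    )

# A and B have no prerequisites. They are written as "and" with an empty
# list because all([]) is True, whereas any([]) would be False.
rules = {
    "A": ("and", []),
    "B": ("and", []),
    "C": ("and", ["A", "B"]),   # A,B -> C
    "D": ("or", ["A", "B"]),    # A|B -> D
}

print(runnable(rules, set()))         # nothing finished yet
print(runnable(rules, {"A"}))         # only A finished
print(runnable(rules, {"A", "B"}))    # both finished
```

With only A finished, D becomes runnable while C still waits for B, which
is exactly the difference between the two operators.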

> They're all AND dependencies as far as I can tell, and I've never needed 
> anything other than AND dependencies in my own pipelines, but I wouldn't 
> like to claim that OR dependencies aren't useful to someone.
> 

You are an expert, Tim. But the majority of cluster users are not like
you, doing genome pipelines. While I can't speak for all of them,
what I can say is that I have never used any dependency resolution system
on any scheduler so far. I never felt the need for it. All the rules I made
are in the software. But maybe I am stretching my own experience to others.

-Malay
mbasu(at)ncbi.nlm.nih.gov