[Bioclusters] Is the "OR" job dependency useful??
Malay
mbasu at mail.nih.gov
Fri Jan 7 13:46:13 EST 2005
Tim Cutts wrote:
>
> On 6 Jan 2005, at 5:49 pm, Malay wrote:
>
>> Rayson Ho wrote:
>>
>>> Gridengine currently has the "AND" operator job dependency:
>>> A,B -> C
>>> i.e. we need to wait for jobs A and B to finish before we start job C.
>>> There are discussions on the SGE dev mailing list about adding the OR
>>> job dependency:
>>> A|B -> C
>>> So job C will start as soon as job A or job B finishes.
>>> I am wondering if this is useful in bioinformatics job flows??
>>
>>
>> As far as bioinformatics goes, I am afraid most bioinformatics
>> applications are embarrassingly independent :) Such dependency
>> resolution will have its niche applications, but I guess they are
>> very limited as far as bioinformatics goes.
>
>
> I don't think that's true - when you consider something like a gene
> annotation process, there are lots of dependencies. Consider what goes
> on with Ensembl; before any analyses are performed, the sequences have
> to be dusted and RepeatMasked. After that raw features such as blast
> hits, ab initio gene predictors and EST alignments can be calculated.
> Once the BLAST hits have been done, genewise alignments can be performed
> (using the BLAST results to narrow down the areas genewise needs to
> analyse). Only once the EST alignments, ab initio predictors and
> genewise are complete can the code be run to combine these into a
> coherent set of gene structures.
A pipeline of any kind by nature depends on the previous process:
A -> B -> C
I don't understand what you mean by "jobs" here. These rules can't be
hardcoded in the scheduler, or can they?
In bioinformatics each of these stages is actually not a job at all; they
are what are called "steps". Each step, like A, is composed of
1,000,000 BLAST jobs which have no dependency on each other.
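
To make the step/job distinction concrete, here is a minimal Python sketch. The names (run_step, run_pipeline and the toy job functions) are made up for illustration and do not come from any scheduler or from Ensembl; a local thread pool stands in for the cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def run_step(job, inputs, workers=4):
    """Run one step: many independent jobs, one per input (hypothetical helper)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Jobs inside a step do not depend on each other, so they can run in any order.
        # pool.map still returns the results in input order.
        return list(pool.map(job, inputs))


def run_pipeline(steps, inputs):
    """Run steps A -> B -> C; a step starts only after the previous step has fully finished."""
    data = inputs
    for job in steps:
        data = run_step(job, data)
    return data


# Toy stand-ins for the real analyses (assumptions, not real tools)
def mask(seq):
    return seq.replace("a", "N")


def search(seq):
    return (seq, seq.count("N"))


def summarise(hit):
    return f"{hit[0]}:{hit[1]}"


results = run_pipeline([mask, search, summarise], ["acgt", "aacc", "ggtt"])
```

The only dependency here is between steps; inside a step every job is free to run whenever a worker is available.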
>
> Although each of these processes consists of thousands of independent
> jobs, each type of analysis is dependent on the completion of the
> previous ones.
As I said. But do you actually suggest completing a "job" pipeline
before a "step" pipeline? Do you actually carry out the analysis of a
small region of genome sequence and finish it to the end, or finish the
BLAST searches for the whole genome at one time?
> As it happens, all of these dependencies are handled in the Ensembl
> RuleManager rather than by the scheduling system.
That's what I meant! The whole dependency issue is in user space, and can
be maintained very well by user software. In the software world,
"unnecessary" means "the thing can be managed in an easier way".
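
As a rough illustration of how little is needed in user space, the sketch below checks both kinds of rule from the original question (A,B -> C and A|B -> C). The function names and the rule format are my own assumptions for this example, not the Ensembl RuleManager or any Grid Engine interface.

```python
def ready(rule, finished):
    """Return True if a job with this dependency rule may start.

    rule is ("AND", names) or ("OR", names); finished is the set of
    names of jobs that have already completed.
    """
    kind, names = rule
    if kind == "AND":
        return all(n in finished for n in names)
    if kind == "OR":
        return any(n in finished for n in names)
    raise ValueError(f"unknown rule kind: {kind}")


def runnable(rules, finished):
    """List the unfinished jobs whose dependencies are satisfied, sorted by name."""
    return sorted(
        job for job, rule in rules.items()
        if job not in finished and ready(rule, finished)
    )


# Hypothetical rule table: C needs both A and B, D needs either one.
rules = {
    "C": ("AND", ["A", "B"]),  # A,B -> C
    "D": ("OR", ["A", "B"]),   # A|B -> D
}
```

With only A finished, runnable(rules, {"A"}) gives just D; once B is also done, C becomes runnable as well.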
> They're all AND dependencies as far as I can tell, and I've never needed
> anything other than AND dependencies in my own pipelines, but I wouldn't
> like to claim that OR dependencies aren't useful to someone.
>
You are an expert, Tim. But the majority of cluster users are not like
you, doing genome pipelines. While I can't speak for all of them,
what I can say is that I have never used a dependency resolution system on
any scheduler so far. I never felt the need for it. All the rules I made are
in the software. But maybe I am stretching my own experience to cover others.
-Malay
mbasu(at)ncbi.nlm.nih.gov