[Bioclusters] Is the "OR" job dependency useful??
Tim Cutts
tjrc at sanger.ac.uk
Fri Jan 7 17:05:53 EST 2005
On 7 Jan 2005, at 6:46 pm, Malay wrote:
> A pipeline of any kind by nature depends on the previous process.
>
> A -> B -> C
It's not necessarily as linear as that. You can sometimes have
parallel tasks. ASCII art will probably defeat me here, but consider:
      B-C
     /   \
    A     F-G
     \   /
      D-E
Things like this are not uncommon.
> I don't understand what you mean by jobs here. These rules can't be
> hardcoded in the scheduler, or can they?
Not hardcoded in the scheduler as such, but you can tell the scheduler
what the dependencies are in LSF, and probably in others too. In LSF,
the set of jobs above would be submitted with:
bsub -JA ...
bsub -JB -w'done(A)' ...
bsub -JC -w'done(B)' ...
bsub -JD -w'done(A)' ...
bsub -JE -w'done(D)' ...
bsub -JF -w'done(C) && done(E)' ...
bsub -JG -w'done(F)' ...
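
For larger graphs, the same command lines could be generated rather than
typed by hand. The following is only a minimal sketch, assuming Python 3.9
or later for the standard graphlib module; the DEPENDENCIES table and the
bsub_commands helper are hypothetical names, not part of LSF, and the only
LSF options used are the -J and -w'done(...)' forms shown above. Parents
are emitted before children, on the assumption that a job should already
have been submitted before another job names it in a dependency.

```python
from graphlib import TopologicalSorter

# Each job name maps to the jobs it must wait for (the diagram above).
DEPENDENCIES = {
    "A": [],
    "B": ["A"],
    "C": ["B"],
    "D": ["A"],
    "E": ["D"],
    "F": ["C", "E"],
    "G": ["F"],
}


def bsub_commands(dependencies, command="..."):
    """Return one bsub command line per job, parents before children."""
    lines = []
    # static_order() yields every job after all of its predecessors.
    for job in TopologicalSorter(dependencies).static_order():
        parents = dependencies[job]
        parts = ["bsub", f"-J{job}"]
        if parents:
            # A job with several parents waits for all of them.
            condition = " && ".join(f"done({p})" for p in parents)
            parts.append(f"-w'{condition}'")
        parts.append(command)  # placeholder for the real command line
        lines.append(" ".join(parts))
    return lines


if __name__ == "__main__":
    print("\n".join(bsub_commands(DEPENDENCIES)))
```

The sketch only prints the command lines; nothing is submitted.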
> In bioinformatics each of these steps is actually not a job at all;
> they are what are called "steps". Each of these steps, like A, is
> composed of 1,000,000 BLAST jobs which have no dependency on each other.
In LSF, multiple jobs or job arrays can be given the same name with the
-J parameter, and then the dependency condition applies to all jobs
with that name.
> As I said. But do you actually suggest completing a "job" pipeline
> before a "step" pipeline? Do you actually carry out the analysis of a
> small region of genome sequence and finish it to the end, or finish
> the BLAST searches for the whole genome at a time?
The Ensembl pipeline does a mixture of both.
> That's what I meant! The whole dependency issue is in user space, and
> can be very well maintained by user software. In the software world,
> unnecessary means "the thing can be managed in an easier way".
Yes, but that means every time someone has to write a pipeline they
have to write stuff to manage their own dependencies, whereas if the
scheduler can do it, the pipeline code the user has to write is much
simpler.
Ensembl only has its own rule manager because it is designed to be
independent of the batch queueing system in use.
Letting LSF get on with it is a lot simpler than having some nasty, and
hard to write, code which polls the scheduling system to check that the
previous jobs have finished before the next lot can be started.
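
To give an idea of what that user-space code tends to look like, here is a
rough sketch of a polling loop. It is an illustration only, not how Ensembl
or any real pipeline does it: submit and is_done are hypothetical callables
standing in for whatever the batch system provides, and job failures are
not handled at all, which is a large part of what makes the real thing
nasty.

```python
import time


def run_with_polling(dependencies, submit, is_done, interval=30.0,
                     sleep=time.sleep):
    """Submit each job once all of its parents are done, polling in between.

    dependencies maps a job name to the names it must wait for.
    submit(job) and is_done(job) are hypothetical callbacks supplied by
    the caller; this sketch does not talk to any scheduler itself.
    """
    pending = set(dependencies)
    running, finished = set(), set()
    while pending or running:
        # Ask the batch system about every job we are still waiting on.
        for job in list(running):
            if is_done(job):
                running.remove(job)
                finished.add(job)
        # Submit every pending job whose parents have all finished.
        ready = {j for j in pending if set(dependencies[j]) <= finished}
        for job in sorted(ready):
            submit(job)
            pending.remove(job)
            running.add(job)
        # Nothing running and nothing submittable means we are stuck.
        if pending and not ready and not running:
            raise ValueError("dependency cycle or unknown parent job")
        if pending or running:
            sleep(interval)
    return finished


if __name__ == "__main__":
    # In-memory stand-ins: every job "finishes" immediately.
    deps = {"A": [], "B": ["A"], "C": ["B"], "D": ["A"],
            "E": ["D"], "F": ["C", "E"], "G": ["F"]}
    submitted = []
    run_with_polling(deps, submitted.append, lambda job: True,
                     sleep=lambda seconds: None)
    print(" ".join(submitted))
```

Even this toy version has to track three sets of jobs and detect a stuck
graph, all of which the scheduler's own dependency handling makes
unnecessary.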
I imagine SGE can do all this stuff too.
Tim
--
Dr Tim Cutts
Informatics Systems Group, Wellcome Trust Sanger Institute
GPG: 1024D/E3134233 FE3D 6C73 BBD6 726A A3F5 860B 3CDD 3F56 E313 4233