[Bioclusters] Is the "OR" job dependency useful??

Tim Cutts tjrc at sanger.ac.uk
Fri Jan 7 17:59:39 EST 2005


On 7 Jan 2005, at 10:33 pm, Malay wrote:

> Thanks, Tim, for your detailed reply. But I am still not sure whether 
> it is the right path to go. Here is my reply.
>
>
> Tim Cutts wrote:
>> On 7 Jan 2005, at 6:46 pm, Malay wrote:
>>> A pipeline of any kind by nature depends on the previous process.
>>>
>>> A -> B -> C
>> It's not necessarily as linear as that.  You can sometimes have 
>> parallel tasks.  ASCII art will probably defeat me here, but 
>> consider:
>>   B-C
>>  /   \
>> A     F-G
>>  \   /
>>   D-E
>> Things like this are not uncommon.
>
> No batch queueing system can establish a circular relationship. You 
> need to decompose this circular relation into a linear one to submit 
> it to the job queue.

Told you ASCII art would defeat me - all those arrows were supposed to 
be read from left to right, so the graph is acyclic, not circular.
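
For example (just a sketch; the job names and scripts are made up, and 
this assumes LSF's -w dependency expressions), that graph could be 
submitted directly as:

    bsub -J A a.sh
    bsub -J B -w 'done(A)' b.sh
    bsub -J C -w 'done(B)' c.sh
    bsub -J D -w 'done(A)' d.sh
    bsub -J E -w 'done(D)' e.sh
    # F waits for both branches to finish; G waits for F.
    bsub -J F -w 'done(C) && done(E)' f.sh
    bsub -J G -w 'done(F)' g.sh

The "OR" form from the subject line would be -w 'done(C) || done(E)', 
i.e. start F as soon as either branch has finished.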

>> Yes, but that means every time someone has to write a pipeline they 
>> have to write stuff to manage their own dependencies, whereas if the 
>> scheduler can do it, the pipeline code the user has to write is much 
>> simpler.
>
> Here I disagree. I'll give you an example. As you are talking about 
> LSF, I can tell you my own experience. If one of the LSF nodes has an 
> NFS problem, the LSF job never returns. The job just stays there 
> forever unless it's killed manually. If you don't code for timeout 
> values, there is no way you can relaunch the job. But it is far 
> easier to check for a timeout yourself and kill and respawn the job, 
> using a simple script. Any dependency rule like the ones you 
> mentioned will leave the whole pipeline stuck. At least, LSF is not 
> as intelligent as user code can be. It's about fine-grained control.

That's true.  But there are cases where neither level will work.  Let's 
say the node drops off the network for some reason.  The job might 
still be running successfully, or it might just be blocked until the 
network comes back.

In that case, LSF will mark the job as UNKWN, and neither LSF nor your 
script can do anything about it until the node comes back.  You have no 
way of knowing whether the job is still running OK, and you can't kill 
and restart it.  (Well, you can use bkill -r, but that's risky: if the 
job is still running correctly and you've re-run it somewhere else, you 
might end up with duplicated results.)
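
To make that concrete, here is a minimal sketch of the watchdog script 
you describe (the job ID, timeout and resubmitted script are all 
hypothetical).  Note the UNKWN branch - there is nothing safe the 
script can do there:

    #!/bin/sh
    JOBID=1234      # hypothetical job ID
    TIMEOUT=3600    # hypothetical timeout, in seconds

    sleep $TIMEOUT
    # The third column of bjobs output is the state (PEND/RUN/UNKWN...)
    STAT=`bjobs $JOBID 2>/dev/null | awk 'NR==2 {print $3}'`

    if [ "$STAT" = "RUN" ]; then
        # Overran its timeout: kill it and resubmit.
        bkill $JOBID
        bsub myjob.sh               # myjob.sh is made up
    elif [ "$STAT" = "UNKWN" ]; then
        # The node is unreachable.  bkill -r would erase LSF's record,
        # but the job may still be running on the lost node, so
        # resubmitting now risks duplicated results.
        :
    fi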

>> Ensembl only has its own rule manager because it is designed to be 
>> independent of the batch queueing system in use.
>
> In other words, making the user's life easier.

Yes, but the flip side is that the rule manager has to make repeated 
polling requests to the scheduling system.  This can impose so much 
load, especially if several pipelines are being run by a few users, 
that the mbatchd starts to suffer.
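
The pattern is roughly this (a sketch; the job name pattern and 
polling interval are made up).  Every iteration is a query that 
mbatchd has to answer, and several of these loops running at once 
add up:

    # Each bjobs call is answered by mbatchd; a rule manager polling
    # like this, multiplied across pipelines, becomes real load.
    while true; do
        bjobs -a -J 'pipeline_*'
        sleep 30
    done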

> The nature of makeshift clusters is such that they are still not 
> highly dependable. There are zillions of problems. As the whole 
> system is not dependable, having coarse-grained control of it through 
> higher-level software is, in my opinion, building a castle in the 
> clouds. Maybe one day the system will be dependable enough to use 
> such high-level logic. For the time being I am sticking to my script.

I think you're probably right - I wouldn't depend entirely on LSF.  
But our cluster isn't that makeshift; it's made from dedicated 
machines, most of which are actually pretty reliable, especially the 
newer Xeon blades.

Tim

-- 
Dr Tim Cutts
Informatics Systems Group, Wellcome Trust Sanger Institute
GPG: 1024D/E3134233 FE3D 6C73 BBD6 726A A3F5  860B 3CDD 3F56 E313 4233


