On 7 Jan 2005, at 10:33 pm, Malay wrote:

> Thanks Tim for your detailed reply. But I am still not sure whether it
> is the right path to go. Here is my reply.
>
>
> Tim Cutts wrote:
>
>> On 7 Jan 2005, at 6:46 pm, Malay wrote:
>>
>>> A pipeline of any kind by nature depends on the previous process.
>>>
>>> A -> B -> C
>>
>> It's not necessarily as linear as that. You can sometimes have
>> parallel tasks. Ascii art will probably defeat me here, but
>> consider:
>>
>>      B-C
>>     /   \
>>   A       F-G
>>     \   /
>>      D-E
>>
>> Things like this are not uncommon.
>
> No batch queuing system can establish a circular relationship. You
> need to decompose this circular relation into a linear one to submit
> to the job queue.

Told you ascii art would defeat me - all those arrows were supposed to
be read from left to right.

>> Yes, but that means every time someone has to write a pipeline they
>> have to write stuff to manage their own dependencies, whereas if the
>> scheduler can do it, the pipeline code the user has to write is much
>> simpler.
>
> Here I disagree. I'll give you an example. As you are talking about
> LSF, I can tell you my own experience. If one of the LSF nodes has an
> NFS problem, the LSF job never returns. The job just stays there
> forever unless it's killed manually. If you don't code for timeout
> values there is no way you can relaunch the job. But it is far easier
> to check for a timeout, kill the job, and respawn it again using a
> simple script. Any dependency rule that you mentioned will make the
> whole pipeline stuck. At least LSF is not as intelligent as user
> coding can be. It's about fine-grained control.

That's true. But there are cases where neither level will work. Let's
say that node drops off the network for some reason. The job might
still be running, successfully, or might at least just be blocked until
the network comes back. In that case, LSF will mark the job as UNKWN,
and neither LSF nor your script can do anything about it until the node
comes back.
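Incidentally, a branched graph like the one Tim drew above doesn't have to be decomposed by hand: LSF can be told the dependencies directly through bsub's -w option. A sketch (the job names and run_*.sh scripts are invented for illustration):

```shell
# Submit the branched pipeline A -> {B-C, D-E} -> F -> G and let LSF
# track the dependencies.  -J names a job; -w "done(...)" holds a job
# until the named prerequisites have finished successfully.
bsub -J A                         ./run_A.sh
bsub -J B -w "done(A)"            ./run_B.sh
bsub -J C -w "done(B)"            ./run_C.sh
bsub -J D -w "done(A)"            ./run_D.sh
bsub -J E -w "done(D)"            ./run_E.sh
bsub -J F -w "done(C) && done(E)" ./run_F.sh
bsub -J G -w "done(F)"            ./run_G.sh
```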
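For concreteness, here is a minimal, generic sketch of the check-timeout/kill/respawn loop Malay describes. It is not his actual script: GNU coreutils' `timeout` stands in for manually bkill-ing a hung LSF job, and the limits and commands are placeholders.

```shell
# Run a command with a wall-clock limit; if it times out or fails,
# respawn it, up to a maximum number of attempts.
# Usage: run_with_retry <seconds> <max_tries> <command...>
run_with_retry() {
    limit=$1; shift
    tries=$1; shift
    n=0
    until timeout "$limit" "$@"; do
        n=$((n + 1))
        if [ "$n" -ge "$tries" ]; then
            echo "giving up after $tries attempts" >&2
            return 1
        fi
        echo "attempt $n timed out or failed; respawning" >&2
    done
}

# e.g. retry a flaky pipeline step up to 3 times, 600s each:
# run_with_retry 600 3 ./run_blast_chunk.sh
```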
You have no way of knowing whether the job is still running OK, and you
can't kill and restart it. (Well, you can use bkill -r, but that's
risky - if the job is still running correctly, but you've re-run the
job somewhere else, you might end up with duplicated results.)

>> Ensembl only has its own rule manager because it is designed to be
>> independent of the batch queueing system in use.
>
> In other words, making the user's life easy.

Yes, but the flip side is that the rule manager has to make repeated
polling requests to the scheduling system. This can impose so much
load, especially if several pipelines are being run by a few users,
that the mbatchd starts to suffer.

> The nature of makeshift clusters is such that they are still not
> highly dependable. There are zillions of problems. As the whole
> system is not dependable, having coarse-grained control of the system
> through higher-level software is, in my opinion, building a castle in
> the clouds. Maybe one day the system will be dependable enough to use
> such high-level logic. For the time being I am sticking to my script.

I think you're probably right - I wouldn't depend on LSF altogether.
But our cluster isn't that makeshift; it's made from dedicated
machines, most of which are actually pretty reliable, especially the
newer Xeon blades.

Tim

--
Dr Tim Cutts
Informatics Systems Group, Wellcome Trust Sanger Institute
GPG: 1024D/E3134233  FE3D 6C73 BBD6 726A A3F5 860B 3CDD 3F56 E313 4233