Hey all; Jean-Marc, Jarl and I have been trying to work out good plans for how to organize the BL and PL and get some good connectivity going. Yesterday there was a big ol' IRC chat about this, which is pasted below. It discusses the organization of the BL and how this will interact with the PL. The discussion introduces the idea of "mother BLs" (MBLs), central controlling BLs that will interact with the DL (though maybe not directly). Each mother BL can then have "Baby BLs" (BBLs), which will directly interact with the PL. That is a rough summary of what is going on, but the actual structure is still under heated discussion, as you can see from reading the log. Part 2 of this discussion will happen today at 4:30pm EST. Anyone who is interested is welcome to join in (or just listen). The chats happen on ChatNet (see www.chatnet.org for a list of servers) on the #piper channel (naturally enough). See you there!

Brad

--> jm (~jm at cn-997.143-201-24.mtl.mc.videotron.net) has joined #piper
<bstard> hi Jean-Marc
<jm> I'm here!
<brad> Hi Jean-Marc!
<jm> Have you been talking for a long time?
<bstard> did you get the email or the ICQ?
<jm> e-mail... got no ICQ!
<bstard> almost an hour already :)
<bstard> ICQ sucks :(
<jm> ...oops, I actually had the ICQ!
--- bstard gives channel operator status to jm
<brad> talking <- mostly about compiling the BL and fun stuff like that :-)
<jm> And how's the C++ compiling going?
<bstard> and I sketched a mechanism for the work the BL should do
<jm> Go ahead...
<bstard> C++ port <- went smoothly until I met the C++ CORBA
<bstard> C++ CORBA is very different from C CORBA.
<bstard> 0: DL logs into BL
<bstard> 1: DL uploads XML network to BL
<bstard> 3: BL checks XML for resources
<bstard> 4: BL distributes nodes to BL instances across the internet
<bstard> 5: BL acknowledges to the DL and PL that it is ready and waiting for a go
<bstard> 6: {undefined yet}
<bstard> step 4 will be the one that undergoes big enhancements once we have the basics operational.
<bstard> stuff like resource management etc. will be done at runtime some day in the future :)
<jm> So wouldn't the BL involved in 3 and 4 be a "master" while the distributed ones would be "slaves"?
<bstard> oops, forgot 2: BL checks XML for authorized nodes
<bstard> master-slave: OK, that's one way of looking at things.
<bstard> the way you view it: master = local instance, slave = remote (on other machines)
<bstard> so a BL is a master if it's distributing, and a slave once it's receiving XMLs
<jm> Sort of. What I meant by "master" (may not be the best term) is the part that has the "global view", while all the other BL instances ("slaves") only see a small part of the problem.
<brad> Yeah, that is the view I had of master/slave as well.
<bstard> But once the network is running, the BL instances are 'same level' instances communicating results etc. to each other. The master-slave relation only exists during the 'distribution phase'
<bstard> jm: yep, that's the way I view the situation too.
<brad> No, but I think that things also need to be broken up so that you can return information back to the DL, like in the "viewer" nodes.
<bstard> slave = partial knowledge of the 'job', master has full knowledge
<jm> Well, I think there should be a master that is the only one allowed to talk to the DL.
<bstard> brad: what do you mean by 'no'?
<brad> no <- I mean, doesn't the master-slave situation also need to exist on a local level?
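For reference, here is a minimal C++ sketch of the numbered start-up sequence Jarl lists above (steps 0-5). Every class, method and string in it is invented purely for illustration; it is not the real Piper code or its CORBA interfaces.

    // Hypothetical sketch of the BL start-up sequence from the log (steps 0-5).
    #include <iostream>
    #include <string>

    struct BrokeringLayer {
        std::string network;
        bool login(const std::string& dl_user) {             // 0: DL logs into the BL
            std::cout << "DL '" << dl_user << "' logged in\n";
            return true;
        }
        void upload_network(const std::string& xml) {        // 1: DL uploads the XML network
            network = xml;
        }
        bool check_authorized_nodes() const { return true; } // 2: BL checks XML for authorized nodes
        bool check_resources() const { return true; }        // 3: BL checks XML for resources
        void distribute_nodes() {                            // 4: hand node chunks to BL instances
            std::cout << "distributing nodes across BL instances\n";
        }
        void notify_ready() {                                // 5: tell the DL and PL we're waiting for a go
            std::cout << "BL ready, waiting for a go\n";
        }
    };

    int main() {
        BrokeringLayer bl;
        if (bl.login("brad")) {
            bl.upload_network("<network>...</network>");
            if (bl.check_authorized_nodes() && bl.check_resources()) {
                bl.distribute_nodes();
                bl.notify_ready();
            }
        }
    }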
<bstard> jm: the DL logs into a BL, so only that DL can receive, only that DL will be offered the results and can upload jobs
<jm> Well, if one part of the processing is done locally, then there should be a slave running on the local machine.
<bstard> nah, why an extra thread/instance?
<bstard> the way I see things, only 1 BL will run on a machine.
<bstard> but try to convince me :)
<jm> What I meant was that when slaves want to communicate with the UI, they should all send the info to the master. Then the master is the only one allowed to talk to the DL.
<brad> convince <- Jean-Marc is probably the best person to do this, since it is his idea :-)
<bstard> hehe
<jm> I see the master as a library that's linked with the DL, so there's one process for the DL and the master.
<bstard> uhh, how do I put this.
<bstard> See the DL as the master BL, that's probably the best way to look at the situation
<jm> Also, if you have an SMP machine and want to run two parts of the processing on it, you should have two BL slaves running on this machine.
<bstard> all the BLs do the work for THAT specific DL
<bstard> they also do work for other DLs
<bstard> jm: SMP will be covered by internal threading
<bstard> which is already about 50% implemented..
<jm> Why not just say one BL slave process for each part of the processing?
<brad> multiple DLs <- Jarl, this is confusing me. There will only be one DL process per machine. A single process can handle multiple user interfaces.
<bstard> but.. I still cannot see why you people advocate a master-slave model. Try me again please..
<bstard> jm: each part of the processing will be done by the PL? So it seems more logical to thread multiple PLs
<bstard> brad: did I say "multiple DLs"??
<bstard> If I did, that was a mistake, sorry
<bstard> Of course we'll only need 1 DL per machine
<brad> multiple dls <- Okay, sorry. I was just getting confused :-)
<jm> I'm thinking 1 PL = 1 BL, so there are as many BLs as there are processing parts.
<bstard> brad: who isn't :)
<bstard> jm: good point..
<bstard> they are gonna be statically linked
<bstard> which kinda sucks regarding my views :-)
<jm> What's going to be statically linked?
<bstard> Overflow as a lib to the BL
<bstard> or will we use .so?
<jm> Well, it "has" to be dynamically linked
<bstard> I see
<bstard> why?
<jm> Overflow is meant to work dynamically, since the toolboxes are dynamic libraries that are loaded on startup.
<bstard> ok, no problem
<bstard> and Overflow can only execute 1 network at a time?
<jm> Basically, when Overflow starts, it looks for all the .so files in a certain path and loads them all. Each .so contains code for a bunch of nodes. That way, you can add nodes without recompiling.
<bstard> I see, that's the way GMS works with plugins too
<jm> Well, Overflow is just a library, so you could build 10 networks and run them at the same time if you have 10 threads.
<bstard> And could you implement threading in the library after the RUN object?
<bstard> so that the BL could upload multiple XMLs into the PL and give it a go?
<jm> Well, it would be equivalent to the BL starting multiple threads and starting a processing in each one.
<bstard> :)
<bstard> hehe
<bstard> so I'm buggered again? :-)
<bstard> don't get me wrong
<jm> The reason I prefer one PL per BL is that sometimes the PL could crash for some reason and you want to be able to track it. Also, you could restart only this part instead of all the PLs running on that machine.
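A rough illustration of the kind of .so loading Jean-Marc describes for Overflow toolboxes: scan a directory for shared objects, load each one, and let it register its nodes. The directory path and the "register_nodes" entry-point name are assumptions made up for this sketch; the real Overflow API is certainly different (link with -ldl).

    // Hypothetical plugin loader: dlopen every .so in a directory and call an
    // init hook so each toolbox can register its nodes.
    #include <dlfcn.h>
    #include <dirent.h>
    #include <iostream>
    #include <string>

    void load_toolboxes(const std::string& path) {
        DIR* dir = opendir(path.c_str());
        if (!dir) return;
        while (dirent* entry = readdir(dir)) {
            std::string name = entry->d_name;
            if (name.size() < 3 || name.substr(name.size() - 3) != ".so")
                continue;                               // skip anything that isn't a .so
            void* handle = dlopen((path + "/" + name).c_str(), RTLD_NOW);
            if (!handle) {
                std::cerr << "failed to load " << name << ": " << dlerror() << "\n";
                continue;
            }
            // "register_nodes" is an invented name for the toolbox entry point.
            if (void (*init)() = (void (*)())dlsym(handle, "register_nodes"))
                init();
        }
        closedir(dir);
    }

    int main() {
        load_toolboxes("/usr/local/lib/overflow/toolboxes");  // assumed path
    }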
<bstard> I can't really assess the situation because I don't really know how Overflow is implemented..
<bstard> If you think the best way is to have the BL do the threading, that's fine with me...
<bstard> crashing <- I see
<brad> crash <- I think this is very important. People will really want to know why things crashed, and during development we will definitely need to know this :-)
<bstard> don't expect the BL to be 100% stable.. it HAS been running for days under stress testing.. but you never know :)
<brad> stable <- No problem, I don't expect anything to be stable yet :-)
<bstard> OK, so the design will be: 1 DL -> 1 BL with internal threading -> 1 PL per BL thread??
<jm> I think in this case, multiple processes are better than multiple threads.
<jm> You're losing me...
<bstard> losing you where?
<bstard> at the threading vs. processes?
<jm> I don't understand what you want to thread. In my mind we wouldn't need any threads at all (just separate processes).
<bstard> I'm still in favor of 1 BL process per machine.. that's why I'm thinking about threading
<bstard> multiple BLs will be very hard to manage
<bstard> darn... if only we were sitting next to each other I could draw a design :)
<bstard> that would clear things up
<bstard> jm: have you seen the flowchart on http://sunsite.auc.dk/gms ?
<jm> The problem is that if the PL crashes, it will crash the BL and all the other PLs on the machine.
<bstard> hmm... I can have the BL run inside a debugger
<bstard> for as long as things are unstable
<bstard> a built-in debugger
<bstard> not gdb or anything like that..
<bstard> but the ONLY thing you have against a single-process design is the instability? Or do you just not feel comfortable with it?
<jm> it's more about PL instabilities and error recovery.
<bstard> hmm... just a sec, I'll give somebody a call about running the BL/PL set inside a debugger.. just to ask if it's possible.
<bstard> arf... just when you need people :(
<bstard> no answer :)
<bstard> otherwise we can have the DL do the resurrecting?
<jm> it's not only about debugging but also real life. If the PL crashes in real life, then you only need to restart one PL.
<bstard> the BL is capable of keeping track of a runtime history, so it can do a full restore after a crash
<bstard> no, not for debugging the errors!
<brad> DL resurrection <- I can detect a BL crash very easily, but I'm not sure how I'll be able to report back to the UI on what caused the crash...
<jm> Not if the crashing PL crashes the BL and the other PLs before they have time to terminate.
<bstard> a debugger receives the crash signals and can respawn the BL/PL process
<jm> but you lose all the computations from the other PLs
<bstard> DL resurrection <- the DL can restart the BL with a special parameter, and the BL will do the rest
<bstard> just like if we had BL slaves restarting a PL process.
<brad> DL resurrection <- What do you mean by a special parameter?
<bstard> jm: losing the other computations <- again, good point.
<bstard> like ./BL -restart :)
<bstard> ok, the PL cannot do history tracking.. that would become too slow. shit!
<bstard> sorry
<jm> I don't see what's so complicated about running two BLs on the same machine.
<bstard> ok. What about: the BL will start a child/slave process for all data streams coming out of the pipeline?
<bstard> out of the BL pipeline that is.
<jm> What do you mean by BL pipeline?
<bstard> all incoming data streams go through the pipeline, which does the authentication etc.
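A toy sketch of the crash-recovery idea discussed above: some supervisor (the built-in "debugger", or the DL itself) notices that the BL has died and starts it again with the -restart flag so it can restore from its runtime history. The "-restart" flag comes from the log; the binary path and everything else is invented for illustration.

    // Hypothetical respawn wrapper: run ./BL, and if it dies abnormally,
    // start it again with -restart so it can recover its runtime history.
    #include <sys/wait.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        const char* bl = "./BL";          // assumed location of the BL binary
        bool first_run = true;
        for (;;) {
            pid_t pid = fork();
            if (pid == 0) {
                if (first_run)
                    execl(bl, bl, (char*)0);
                else
                    execl(bl, bl, "-restart", (char*)0);
                _exit(127);               // exec failed
            }
            int status = 0;
            waitpid(pid, &status, 0);
            if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
                break;                    // clean shutdown, stop respawning
            std::fprintf(stderr, "BL died, respawning with -restart\n");
            first_run = false;
        }
        return 0;
    }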
<bstard> see: http://sunsite.auc.dk/gms/gms.gif
<bstard> other things like marshalling data and resource management are also located in the pipeline
<jm> Sorry, I don't understand your flowchart...
<bstard> if we are to use multiple BLs on 1 machine, that will become very hard
<bstard> look only at the middle part, with 'outside world' above it
<bstard> the 'sensors' are the data streams coming from external sources..
<jm> What's harder? For me it's just communicating with another BL, regardless of where it is (local or remote)
<bstard> the basic idea of GMS is that there is only 1 pipeline on a machine. That's why I make a fuss about multiple BLs :-)
<bstard> so the plan of stripped-down (no pipeline, no DL communication, no external input) slave BLs sounds OK?
<bstard> or don't you get it?
<jm> Sorry, I think I'm missing some basic information...
<bstard> to you (the PL) there won't be a difference. Also to the DL there's no difference, and the BL design will remain intact
<bstard> jm: maybe we need to discuss this by voice some day?
<bstard> brad: you're still here? :-)
<jm> Well, you're the BL specialist after all... it's just that from my point of view, things would look so much cleaner if "one part of the processing" = 1 BL "slave" = 1 PL, regardless of where everything runs.
<brad> Yup, just listening :-). This is you guys' battle.
<jm> ...Just thinking... is the multiple-BL problem about the authentication and stuff like that?
<bstard> yep. About deadlocks and so on, and about resource management
<jm> In this case, maybe there could be a permanent part of the BL on each machine, which would spawn (as processes) Baby-BLs linked with the PL.
<bstard> that's why I only want to have 1 process doing all that on 1 machine
<bstard> YEP!
<bstard> the BL 'master' will be permanent
<bstard> those 'slaves' will be spawned once there's something to do
<bstard> 1 spawned stripped-down BL per PL
<bstard> and they die when the job is done
<jm> That's what I'm thinking about... it's just that I had forgotten about all that authentication/resource stuff.
<jm> Can you wait a sec...
<bstard> ok, phew
<bstard> we've settled something, it seems :)
<brad> Good stuff! Baby-BLs are the way to go :-)
<bstard> of course I'll wait... /me having a thinking-smoke
<bstard> brad: lol
<bstard> hehe
<bstard> BBLs :)
<brad> :-)
<brad> I'm happy you are coming together on that. The Baby-BLs actually make a lot of sense with what both of you were saying.
<brad> Maybe we can have a parent/baby relationship instead of master/slave. Sounds a little nicer :-)
<bstard> brad: could you build into the DL a respawn check for the Mother BL?
<bstard> the MBL, so to speak
<brad> What do you mean by a respawn check?
<bstard> if the MBL dies, the DL should restart the thing
<bstard> like the BBLs do for the PLs
<brad> Yeah, that is no problem. We should add a little ping() function to the IDL, then when the ping fails, the DL will restart the BL.
<jm> Well, the BBL would link to the PL, so both die at the same time
<jm> it would be the mother BL that would respawn a BBL/PL
<bstard> brad: ping <- yep, I'll put it into the IDL
<bstard> jm: yep!
<brad> So what is our plan for actually implementing all of this?
<bstard> the MBL will do all the resource/AAA and (re)spawning work
<jm> As for what I was calling the "master", it could belong to the DL or BL (I don't care), and that's the one I see as responsible for splitting the processing, sending it to the MBLs, and respawning MBLs
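A small sketch of the ping()/respawn check Brad proposes for the DL. MotherBL here is a plain C++ stand-in, not the real CORBA interface from the IDL, and the restart command is only an assumption borrowed from the "./BL -restart" idea earlier in the log.

    // Hypothetical DL-side watchdog: ping the mother BL periodically and
    // restart it when the ping fails. Illustration only.
    #include <cstdlib>
    #include <unistd.h>

    struct MotherBL {                     // stand-in for the CORBA object
        virtual bool ping() = 0;          // the little ping() from the IDL
        virtual ~MotherBL() {}
    };

    void watch_mother_bl(MotherBL& mbl) {
        while (mbl.ping())                // as long as the MBL answers, do nothing
            sleep(5);
        std::system("./BL -restart &");   // assumed restart command
        // a real watchdog would go back to pinging the restarted MBL here
    }

    struct DeadMBL : MotherBL {
        bool ping() { return false; }     // simulate a crashed mother BL
    };

    int main() {
        DeadMBL mbl;
        watch_mother_bl(mbl);             // triggers a respawn immediately
    }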
<bstard> Actual impl. <- first the C++ CORBA should be done. After that I'll be eager to do some coding of new features :)
<bstard> jm: I see what you mean now. I think you can see my initial opposition to this now too, can't you?
<jm> opposition to what?
<bstard> to the master-slave design
<brad> impl <- Well, how about if we split it like this: I'll work on getting the CORBA C++ working, and you and Jean-Marc can work on implementing this whole mother/baby thing....
<bstard> if slaves start to do authentication etc., the system will become a big mess
<jm> I'm not sure you understand what I mean. What I was calling "master" is more like a library that the DL would use to communicate with all the mother-BLs
<bstard> brad: I cannot develop that without the sources compiling :(
<bstard> that darn CORBA must be running soon
<jm> For me master BL != mother BL.
<bstard> jm: ?? master != mother?
<jm> OK, let me say how I now see the whole thing:
<brad> impl <- Well, let's extract the CORBA out into its own separate module. Can you just write a little C++ code that will run to test things, and then we'll make things be driven by the CORBA after that? I think the CORBA should come out of the messaging_db stuff for cleanliness anyway...
<jm> The DL links (as a library) with the "master"-BL, which takes care of splitting the processing into small XML chunks that are sent to all the mother-BLs on the different machines. Then each mother-BL spawns a Baby-BL/PL for each part it receives.
<bstard> brad: let's not create too much chaos, shall we?
<bstard> shall we first finish the CORBA, and once it's working I'll separate the CORBA and the database, OK?
<bstard> jm: three BLs?
<bstard> hmm
<jm> Assuming the flow A->B->C, where A and C are run on machine X, B is run on machine Y, and the DL is run on machine Z, here are the processes I see:
<jm> 1 process on Z (the DL/master-BL)
<jm> 3 processes on X (1 mother-BL, 2 baby-BLs)
<jm> 2 processes on Y (1 mother-BL, 1 baby-BL)
<jm> Once again, the confusion probably arises from the fact that you probably don't consider my "master-BL" to be part of the BL...
<brad> impl <- I'll put this on hold until you and Jean-Marc are done...
<bstard> brad: OK :)
<bstard> jm: I see what you mean
<bstard> I see this master BL and mother BL as being the same process
<bstard> the mother BL gets a chunk of XML from the DL, splits it, sends it to the other mother BLs, and those mother BLs start producing babies
<jm> I see all the mother-BLs as being equal -> no reason why the splitting would be done by one particular mother-BL instance.
<bstard> and maybe the 'other' mother BLs will also split the XML once more if they see fit
<bstard> hehe
<bstard> jm: do you still see the need for a 3rd BL type, the master BL?
<jm> ...or you can call it a master-DL if you like, since I don't see where it belongs.
<jm> I think it's not the mother-BLs' job to split the processing into chunks. ...if you don't do processing on the local machine (where the DL runs), you don't need a mother-BL there.
<bstard> but the BL will handle the resource management; that cannot be done without doing the XML splitting
<jm> Sure, but since there are many BLs, the splitting decision has to be made somewhere. There should be a part that only does that.
<bstard> there will be only 1 MBL on a machine, so it's capable of deciding the splitting
<bstard> why introduce a new process just for doing the splitting?
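To make Jean-Marc's A->B->C example a bit more concrete, here is a toy sketch of a mother BL spawning one baby-BL/PL process per XML chunk it receives; run on machine X with the chunks for A and C, it would produce the 1 MBL + 2 BBL layout he describes. The "BBL" binary name, the command line and the chunk strings are all made up for illustration (compile with -std=c++11).

    // Hypothetical mother BL: fork one baby-BL/PL process per XML chunk.
    #include <sys/wait.h>
    #include <unistd.h>
    #include <string>
    #include <vector>

    std::vector<pid_t> spawn_babies(const std::vector<std::string>& chunks) {
        std::vector<pid_t> babies;
        for (const std::string& chunk : chunks) {
            pid_t pid = fork();
            if (pid == 0) {
                // Child: becomes a baby BL linked with the PL; here we just
                // pretend to exec it with its chunk on the command line.
                execl("./BBL", "./BBL", chunk.c_str(), (char*)0);
                _exit(127);
            }
            babies.push_back(pid);    // the MBL keeps the pids so it can respawn them
        }
        return babies;
    }

    int main() {
        // Machine X in the example receives the chunks for nodes A and C.
        std::vector<pid_t> babies =
            spawn_babies({"<node name=\"A\"/>", "<node name=\"C\"/>"});
        for (pid_t pid : babies)
            waitpid(pid, 0, 0);       // a real MBL would respawn on crash instead
    }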
<bstard> the MBL will do 1: authentication, 2: resource management/splitting, 3: spawning BBLs
<jm> Yes, but how do you decide how to split between different machines? What I'm saying is that there should be a part that sits in the middle, between the DL and all the mother-BLs
<jm> there are many MBLs, right? one on each machine
<bstard> I think an MBL should "take what it wants" and "pass on what it doesn't want to other MBLs"
<bstard> no, just ONE MBL per machine
<bstard> many BBL/PL sets
<jm> ...OK, let's go through it from the start.
<bstard> ok :-)
<jm> If we have 10 machines, we have 10 MBLs (one on each machine). Let's assume the DL runs on an 11th machine.
<bstard> ok..
<jm> The DL gets a complete (unsplit) XML network from the GUI.
<jm> (remember, there's no MBL running where the DL is)
<jm> There's no reason the DL would ask one particular BL (of the 10) to do the splitting.
<bstard> there is
<bstard> the DL doesn't have login access to every MBL
<bstard> but the DL COULD have it..
<bstard> very strange situation though
<jm> I think there should be "something" on the local machine that (maybe after communicating with all the MBLs) decides how the program should be split and then sends each part to the right MBL.
<jm> Why not have a special part of the BL that would do the interface between the DL and all the MBLs (on all the machines)?
<bstard> what about giving this "something" this logic: "take what you want" and "pass on what you don't want to other MBLs"
<bstard> that way the splitting can, in theory, start on any system on the net..
<jm> I think we disagree on how the processing will be split...
<jm> From what I understand, you say that only the BL will decide about the splitting. Am I right?
<bstard> you say splitting should be done by a splitting master; I think we can rely on smart algorithms that will do the job without any 'mastering'
<jm> I'm more thinking that the user will say "This runs here, that runs there, ..."
<bstard> the user CAN say that
<jm> Only, sometimes the user wants something to run in a specific place... and nobody should decide otherwise.
<bstard> that way the splitting will be very easy
<bstard> but what if the user says 'just run it wherever it will be done fastest'?
<jm> In this case the local "splitter" would ask all the MBLs about their resources, but I think most of the time the decision will not be made by the MBLs.
<jm> Let's say I have a sound acquisition node; I want it to record from a certain microphone that's on a certain machine... not from whatever microphone the BL happens to like.
<bstard> I think that way we'll build a very static system. You should have more faith in having the MBLs fight over getting the optimal resource load
<bstard> microphone: ok, in that case you specify that THAT specific mic must be used.
<bstard> Splitting highly specified XML will be easy; we'll just pass the XML on to wherever the user wants it to be
<bstard> I'm more interested in XML that does NOT contain any location constraints
<jm> I think the user needs to have a say in the splitting.
<bstard> sure he/she does!
<bstard> the user works with 'that file' and 'this soundcard'
<bstard> but also with 'some computation power, wherever the heck it might be'
<bstard> or 'information about this'
<jm> Why not have all the MBLs send their resource info to the same place, then combine that with what the user wants and take the decision *in one place*?
<bstard> how will this be done once Piper runs on 103430943 machines some day in the future?
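A toy sketch of Jarl's "take what you want, pass on the rest" splitting idea: each MBL keeps whatever nodes fit in its free capacity and forwards the remainder to the next MBL. The Node type, the cost numbers and the capacity units are entirely made up for illustration (compile with -std=c++11).

    // Hypothetical greedy splitter: keep the nodes that fit, pass the rest on.
    #include <iostream>
    #include <string>
    #include <vector>

    struct Node { std::string name; int cost; };   // cost in arbitrary units

    std::vector<Node> take_what_fits(const std::vector<Node>& offered,
                                     int capacity, std::vector<Node>& kept) {
        std::vector<Node> passed_on;
        for (const Node& n : offered) {
            if (n.cost <= capacity) {
                kept.push_back(n);                  // this MBL runs the node itself
                capacity -= n.cost;
            } else {
                passed_on.push_back(n);             // forward it to the next MBL
            }
        }
        return passed_on;
    }

    int main() {
        std::vector<Node> network = {{"A", 4}, {"B", 10}, {"C", 3}};
        std::vector<Node> mine;
        std::vector<Node> rest = take_what_fits(network, 8, mine);
        for (const Node& n : mine) std::cout << "keeping " << n.name << "\n";
        for (const Node& n : rest) std::cout << "passing on " << n.name << "\n";
    }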
<jm> There are always things the user will know that the BL won't know... like "if I run this part that launches this process on this machine, then this user will be mad at me because of the side effect". There's no way the BL can handle that!
<bstard> ok, that's why the MBL will get strict resource limits
<bstard> maxcpu = xx%
<bstard> maxmem = xx MB
<bstard> etc
<bstard> but ok, you're a tiny bit right :)
<jm> sometimes it's just that a certain node needs more I/O, or does something else that disturbs...
<bstard> the BL won't know everything.. and neither will the 'splitter BL'
<jm> Yes, but the user can talk to the splitter BL and help it do the splitting.
<bstard> that's why I want to implement a 'prediction' neural net once we have Piper running.. to predict available resources
<bstard> hmm.. so you think the user should guide the splitting? I think the user can specify anything he wants, after which it's up to the BL to fill in the gaps
<jm> Anyway, I don't see how distributing the splitting would be better than making a centralized decision based on the data (resources) from all the BLs.
<jm> Note that I don't see how the BL will know what resources (CPU time) are going to be required for each node, since it will not know what the nodes do.
<bstard> firstly because a centralized mechanism won't be possible once there are a lot of Pipers running, and secondly because I know prediction algorithms are much more effective than centralized algorithms
<jm> When I write new nodes, I don't want to have to tell the BL how long they will take to run, depending on all the variables (size of the problem, ...).
<bstard> not knowing how much it will use in advance <- ok, not the first time the node runs, but this is where the history logs and prediction algorithms come into play
<bstard> but I hope you'll reuse those nodes the next time you build a network
<bstard> so the MBLs will know from history what those nodes consumed the previous time
<jm> not exactly... there are some things you cannot predict... like how long it will take an algorithm to converge, depending on the characteristics of the data.
<bstard> ok, such nodes will exist. Those nodes will get the label 'unpredictable'.
<bstard> :)
<jm> Let's say my neural network takes twice as long to converge if I train it with speech than if I train it with music... how will the BL know what files I'm sending to it... also, even the idea of how long it will take to converge is only intuitive.
<jm> The problem is that these "unpredictable" nodes will often be the most resource-consuming... in my case at least.
<bstard> ok, but nodes will consume resources of more or less the same order of magnitude each time
<bstard> maybe this is something that needs experimenting..
<bstard> To me, the design of having 1 central boss BL for each network will be limiting.
<jm> Also, even with many Pipers running, I don't see the problem with centralizing decisions... Each MBL tells the splitter how many resources it has left; the splitter receives all that info and makes the decision.
<bstard> hmm... shall we continue this discussion tomorrow? It's 23:50 here and I'm a bit tired.
<bstard> maybe I'll see things much more clearly tomorrow :)
<bstard> I'll give your point of view some brainpower tomorrow... and let you know more next time, ok?
<bstard> brad: bye, you're around tomorrow too?
<brad> Sure, yeah, I can be really quiet tomorrow as well :-)
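Finally, a toy sketch of the history-based prediction Jarl mentions: record how much CPU time each node type used on past runs and predict the average for the next run; node types with no history would be the "unpredictable" ones. All names and numbers are invented for illustration.

    // Hypothetical resource predictor based on per-node-type run history.
    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    struct History {
        std::map<std::string, std::vector<double> > runs;  // node type -> past CPU seconds

        void record(const std::string& node, double cpu_seconds) {
            runs[node].push_back(cpu_seconds);
        }
        // Mean of past runs, or -1 when the node type has no history ("unpredictable").
        double predict(const std::string& node) const {
            std::map<std::string, std::vector<double> >::const_iterator it = runs.find(node);
            if (it == runs.end() || it->second.empty()) return -1.0;
            double total = 0.0;
            for (size_t i = 0; i < it->second.size(); ++i) total += it->second[i];
            return total / it->second.size();
        }
    };

    int main() {
        History h;
        h.record("FFT", 2.0);
        h.record("FFT", 3.0);
        std::cout << "predicted FFT cost: " << h.predict("FFT") << " s\n";           // 2.5
        std::cout << "predicted NeuralNet cost: " << h.predict("NeuralNet") << "\n"; // -1
    }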