I've been busy with school lately (in fact, I really should be studying right now for an exam Monday), so I haven't gotten much of anything done. However, I've been reading over BioML, BSML, and the bioperl site, and I have some ideas about the markup language.

First, reading BSML files makes a lot of things seem overly complex. Second, BioML looks cleaner, but I hate the organism tag enclosing everything. While that information could be useful for a structure or sequence, it would be better to reference it rather than enclose everything in it. Also, BSML doesn't seem to cover protein sequences, while BioML does. However, BSML does seem to allow for a more thorough definition of features in the sequence. Aesthetically, I prefer BioML over BSML, and I think that's just because BioML uses different tag names for the various features of a sequence, while BSML just has a general feature tag with lots of options.

Also, BSML, and even BioML to a degree, try to define display information as well. Do we want that in our ML? I can't see why we would need it, since we have an intelligent client. BSML seems to be intended for direct display in a generic BSML browser, in addition to defining data. BSML does have a second DTD with that layout stuff removed, however. BioML has tags for forms, which seem totally unnecessary.

I would like to effectively merge BioML and BSML, incorporating protein sequence information and feature specification, and use more descriptive tag names (like BioML) for defining the sequences and features. I wouldn't put any layout information in. Does anyone think we need it?

Also, for structure, there don't appear to be any MLs even attempting to do this, with the exception of CML. So, my idea is to take the PDB file format and XMLize it. If any of you know any glaring holes in PDB, let me know, and we can work around those.

Also, these sections will need some tags to allow for defining relationships between multiple objects. It might describe homology, alignment, etc.
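To make the merge idea concrete, here's a hypothetical fragment of what I mean by descriptive tag names plus BSML-style feature detail, with the organism referenced by id instead of enclosing everything. All tag and attribute names here are invented for illustration; none of this is from either DTD:

  <!-- hypothetical merged markup: descriptive tags as in BioML,
       feature detail as in BSML, organism referenced rather than
       enclosing the sequence -->
  <protein id="p1" organism-ref="org1">
    <name>example protein</name>
    <seq>MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ</seq>
    <domain start="1" end="20" name="signal-peptide"/>
    <site position="12" type="phosphorylation"/>
  </protein>
  <organism id="org1" genus="Escherichia" species="coli"/>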
between two or more sequences, or, for structures, it might relate 3D similarities, regions of high interaction (binding probabilities through free energy calculations), and other similar concepts. Generated data should also return information about the analysis process, like the algorithm used, statistical probabilities, etc.

Now, that is just the "data" section. A LociML file will have a variety of additional information as well. We'll probably need control, status, and query sections, too. Control has to describe the analysis pathway. Status holds information concerning the data returned at each analysis step: incidental information about an analysis that might be useful to the client. I don't really have any good examples, but I imagine some will come up. Query has to hold the actual query at each step. The control section is fairly straightforward, as is the status section, although both will need to be fairly flexible.

The query section is more complex, but here's my idea: when the user creates the analysis pathway, all of the query commands are generated at that time as well, but each query can make use of variables referencing data from queries in earlier stages. The workflow system will fill in the variables for a query before sending it off for that analysis.
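The variable fill-in step could be sketched like this. I'm assuming query inputs get parsed into a dict, and I'm using a simplified `data.step[id].tag` form of the reference syntax; all of the names here are invented for illustration:

```python
# Sketch of the wfs filling in variables before dispatching a query.
# The query representation and reference syntax are assumptions.
import re

VAR_RE = re.compile(r"data\.step\[(\w+)\]\.(\w+)")

def resolve_variables(query_inputs, completed_data):
    """Replace each variable reference with data from an earlier step.

    query_inputs   -- dict of input id -> literal value or variable string
    completed_data -- dict of step id -> {tag: value} for finished steps
    """
    resolved = {}
    for input_id, value in query_inputs.items():
        match = VAR_RE.match(value) if isinstance(value, str) else None
        if match:
            step_id, tag = match.groups()
            resolved[input_id] = completed_data[step_id][tag]
        else:
            resolved[input_id] = value
    return resolved

# Example: a homology step takes the protein produced by an earlier
# translate step (q1) plus a literal second sequence.
done = {"q1": {"protein": "MKTAYIAKQR"}}
inputs = {"bbb": "data.step[q1].protein", "ccc": "MSAKLDWWRT"}
print(resolve_variables(inputs, done))
```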
Here is a crude example:

  <control current="1">
    <step stage="1" server="paosp://some.host/whatever" id="q1"/>
    <step stage="2" server="paosp://foo/" id="q2"/>
  </control>
  <status>
    <step id="q1" state="processing">
      <message>Analyzing sequence...</message>
    </step>
  </status>
  <data>
    <step id="q1">
      <protein>lots of other stuff here</protein>
      <!-- note: this wouldn't show up until after the q1 step is done -->
    </step>
  </data>
  <query>
    <step id="q1">
      <data>
        <dna id="aaa">....</dna>
      </data>
      <operation type="translate">
        <input id="aaa"/>
        <!-- some other data for translation -->
      </operation>
    </step>
    <step id="q2">
      <data>
        <protein id="bbb">
          <variable>data.step[q1].protein.*</variable>
        </protein>
        <protein id="ccc">...</protein>
      </data>
      <operation type="homology">
        <input id="bbb"/>
        <input id="ccc"/>
        <!-- other stuff for query -->
      </operation>
    </step>
  </query>

There obviously needs to be a lot of detail filled in here, but I think this gets my basic idea across.

Also, there's no particular reason there couldn't be multiple entries for a stage. That's why I identified every component of a query by an id, rather than by its stage. Since the first few steps of an analysis pathway might not depend on previous data, we could have multiple steps occurring simultaneously. There's no reason for all of the steps to be sequential. This would be especially true of a pathway which had a number of database queries. Actually, we could probably get rid of the whole ordering thing completely, since the wfs could just figure out the dependencies from the variable references in the queries. Of course, the interface for this could be more complicated...

Also, it probably makes more sense to move all of the input data into the data section and have the query reference it there. The format for specifying variables, and input in general, will probably need to be improved as well.
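The "figure out dependencies from the variable references" idea could be sketched like this. Again, the query representation is an assumption (raw query text per step id), and the reference regex matches the `data.step[...]` style above:

```python
# Sketch of inferring step dependencies from variable references, so the
# explicit stage ordering could be dropped.  Representation is assumed.
import re

REF_RE = re.compile(r"data\.step\[(\w+)\]")

def infer_dependencies(queries):
    """Map each step id to the set of step ids its variables reference."""
    return {step_id: set(REF_RE.findall(text)) - {step_id}
            for step_id, text in queries.items()}

def runnable(deps, completed):
    """Steps whose dependencies are all satisfied and that aren't done."""
    return [s for s, d in deps.items()
            if d <= completed and s not in completed]

# q1 and q3 reference no earlier data, so they can run simultaneously;
# q2 waits on q1's output.
queries = {
    "q1": "<dna id='aaa'>....</dna>",
    "q2": "<variable>data.step[q1].protein.*</variable>",
    "q3": "<dna id='ddd'>....</dna>",
}
deps = infer_dependencies(queries)
print(runnable(deps, completed=set()))
```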
In terms of implementation, I imagine it would work like this: the wfs identifies the queries it can currently run and creates a Paos object on the specified server, giving it only the portions of the XML file necessary to run (the query and the relevant data sections). The input data goes into one attribute of the Paos object. The remote analysis system creates a second attribute containing the status tags, and when it's complete, it creates an output section with its new data. The wfs can grab the status attribute off the object frequently, since it's small, and update its local copy for any clients who want to know what is going on. When the analysis is complete, the wfs grabs the output attribute off the remote object, updates its copy, and moves on. The remote analysis system just drops its object once the wfs has acknowledged it.

Any thoughts on the markup language, the query syntax, variable references, asynchronous analyses, or the workflow system (wfs, if you were wondering what I was referring to)?

I'll start on my BioML/BSML/PDB merger/implementation/cleanup. Once we agree on how Loci works underneath, a rough wfs/paos/gatekeeper system can be set up fairly quickly. Then just a quick Python wrapper around some analysis tool and a simple viewer program will give us a functioning system (not a particularly easy-to-use system, but functioning nonetheless).

Justin Bradford
justin at ukans.edu
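The wfs side of that exchange might look something like this. Paos itself isn't specified here, so `RemoteObject` and its methods are invented stand-ins, and the polling is synchronous for simplicity:

```python
# Sketch of the wfs polling a remote analysis object: poll the small
# status attribute frequently, fetch the large output attribute once.
# RemoteObject is a hypothetical stand-in for a Paos object.
import time

class RemoteObject:
    """Hypothetical stand-in for a Paos object on an analysis server."""
    def __init__(self, query_xml, data_xml):
        self.attrs = {"input": data_xml, "query": query_xml,
                      "status": "<step state='queued'/>", "output": None}

    def get(self, name):
        return self.attrs[name]

def run_step(remote, local_copy, poll_interval=1.0):
    """Update local status until the remote side produces its output."""
    while remote.get("output") is None:
        local_copy["status"] = remote.get("status")  # small, cheap fetch
        time.sleep(poll_interval)
    local_copy["output"] = remote.get("output")      # large, fetched once
    return local_copy

# Usage: the remote analysis fills in "output" when it finishes; here we
# simulate a completed analysis so run_step returns immediately.
remote = RemoteObject("<query>...</query>", "<data>...</data>")
remote.attrs["output"] = "<protein>...</protein>"
copy = run_step(remote, {}, poll_interval=0)
```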