[Pipet Devel] Loci markup language and infrastructure things

Sat Feb 27 18:24:27 EST 1999

I've been busy with school lately (in fact, I really should be studying   
right now for an exam Monday), so I haven't gotten much of anything done. 

However, I've been reading over BioML, BSML, and the bioperl site, and I  
have some ideas about the markup language.

First, reading BSML files makes a lot of things seem overly complex.
Second, BioML looks cleaner, but I hate the organism tag enclosing 
everything. While that information could be useful for a structure or 
sequence, it would be better to reference it, rather than enclosing it.

Also, BSML doesn't seem to cover protein sequences, while BioML does.
However, BSML does seem to allow for more thorough definition of features
in the sequence.

Aesthetically, I prefer BioML over BSML, and I think that's just because
BioML uses different tag names for various features of the sequences,
while BSML just has a general feature tag with lots of options.

Also, BSML, and even BioML to a degree, try to define display information
as well. Do we want that in our ML? I can't see why we would need it,
since we have an intelligent client. BSML seems to be intended for direct
display in a generic BSML browser, in addition to defining data. BSML has
a second DTD with that layout stuff removed, however. BioML has tags for
forms, which seem totally unnecessary.

I would like to effectively merge BioML and BSML, incorporating protein
sequence information and feature specification, and use more descriptive
tag names (like BioML) for defining the sequences and features. I wouldn't
put any layout information in. Does anyone think we need it?

Also, for structure, there don't appear to be any MLs even attempting to
do this, with the exception of CML. So, my idea is to take the PDB file
format and XMLize it. If any of you know any glaring holes in PDB let
me know, and we can work around those.
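
To give a feel for what "XMLizing" PDB might involve, here is a minimal Python sketch that turns one fixed-column ATOM record into an element. The column positions follow the standard PDB ATOM record layout; the tag and attribute names (atom, coord, serial, resseq, and so on) are placeholders I made up for illustration, not a proposed schema, and the sample record values are invented.

```python
import xml.etree.ElementTree as ET

def atom_record_to_xml(line):
    """Convert one fixed-column PDB ATOM/HETATM record into an <atom> element."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    line = line.ljust(80)  # pad short lines so the slices below are safe
    atom = ET.Element("atom", {
        "serial": line[6:11].strip(),    # columns 7-11
        "name": line[12:16].strip(),     # columns 13-16
        "residue": line[17:20].strip(),  # columns 18-20
        "chain": line[21].strip(),       # column 22
        "resseq": line[22:26].strip(),   # columns 23-26
    })
    ET.SubElement(atom, "coord", {
        "x": line[30:38].strip(),        # columns 31-38
        "y": line[38:46].strip(),        # columns 39-46
        "z": line[46:54].strip(),        # columns 47-54
    })
    return atom

# Made-up sample record, assembled field by field to keep the columns exact.
record = (
    "ATOM  "            # columns 1-6: record name
    + "1".rjust(5)      # 7-11: atom serial number
    + " " + " N  "      # 13-16: atom name
    + " " + "MET"       # 18-20: residue name
    + " " + "A"         # 22: chain identifier
    + "1".rjust(4)      # 23-26: residue sequence number
    + " " * 4           # 27-30: insertion code and padding
    + "27.340".rjust(8) + "24.430".rjust(8) + "2.614".rjust(8)  # 31-54: x, y, z
)
print(ET.tostring(atom_record_to_xml(record), encoding="unicode"))
```

A real conversion would also have to cover the other record types (HETATM details, CONECT, SEQRES, and so on), which is where any holes in PDB would show up.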

Also, these sections will need some tags to allow for defining
relationships between multiple objects. For sequences, these might
describe homology, alignment, etc. between two or more sequences; for
structures, they might relate 3D similarities, regions of high
interaction (binding probabilities through free energy calculations),
and other similar concepts.

Generated data should also return information about the analysis process,
like the algorithm used, statistical probabilities, etc.

Now that is just the "data" section. A LociML file will have a variety of
additional information as well. We'll probably need control, status, and 
query sections, too.
Control has to describe the analysis pathway.
Status is information concerning the data returned at each analysis step.
Query has to hold the actual query at each step.

Now, the control section is fairly straightforward, as is the status
section, although both will need to be fairly flexible. The status
section, for instance, might carry incidental information concerning an
analysis that could be useful to the client. I don't really have any good
examples, but I imagine some will come up.

The query section is more complex, but here's my idea:
When the user creates the analysis pathway, all of the query commands are
generated at that time as well, but they can make use of variables
referencing data from queries in earlier stages. The workflow system will
fill in the variables for a query before sending it off for that analysis.

Here is a crude example:

<control current="1">
 <step stage="1" server="paosp://some.host/whatever" id="q1"/>
 <step stage="2" server="paosp://foo/" id="q2"/>
</control>
<status>
 <step id="q1" state="processing">
  <message>Analyzing sequence...</message>
 </step>
</status>

<data>
 <step id="q1">
  <protein>lots of other stuff here</protein>
  <!-- note: this wouldn't show up until after the q1 step is done -->
 </step>
</data>

<query>
 <step id="q1">
  <data>
   <dna id="aaa">....</dna>
  </data>
  <operation type="translate">
   <input id="aaa"/>
   <!-- some other data for translation -->
  </operation>
 </step>
 <step id="q2">
  <data>
   <protein id="bbb">
    <variable>data.step[q1].protein.*</variable>
   </protein>
   <protein id="ccc">...</protein>
  </data>
  <operation type="homology">
   <input id="bbb"/>
   <input id="ccc"/>
   <!-- other stuff for query -->
  </operation>
 </step>
</query>

There obviously needs to be a lot of detail filled in here, but I think
this gets my basic idea across.
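
As a rough illustration of the variable step, here is a minimal Python sketch of how the workflow system might fill in a reference like data.step[q1].protein.* before sending a query off. The function name fill_variables is hypothetical, the regular expression only handles that one reference shape, and reading ".*" as "everything inside the element" is my assumption about what the syntax should mean.

```python
import copy
import re
import xml.etree.ElementTree as ET

# Only handles references shaped like data.step[ID].TAG.* (an assumption).
REF = re.compile(r"data\.step\[(\w+)\]\.(\w+)\.\*$")

def fill_variables(query_step, data_section):
    """Replace each <variable> in a query step with the data it references."""
    # Materialize the traversal first, since the tree is modified below.
    for parent in list(query_step.iter()):
        for var in parent.findall("variable"):
            text = (var.text or "").strip()
            match = REF.match(text)
            if match is None:
                raise ValueError(f"unsupported reference: {text!r}")
            step_id, tag = match.groups()
            source = data_section.find(f"step[@id='{step_id}']/{tag}")
            if source is None:
                raise LookupError(f"no <{tag}> data yet for step {step_id}")
            parent.remove(var)
            # Copy the referenced element's text and children into place.
            parent.text = source.text
            parent.extend(copy.deepcopy(child) for child in source)

data = ET.fromstring("<data><step id='q1'><protein>MKV</protein></step></data>")
query = ET.fromstring(
    "<step id='q2'><data><protein id='bbb'>"
    "<variable>data.step[q1].protein.*</variable>"
    "</protein></data></step>"
)
fill_variables(query, data)
print(ET.tostring(query, encoding="unicode"))
```

The LookupError case is the interesting one: a query whose referenced data does not exist yet simply is not ready to run.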

Also, there's no particular reason there couldn't be multiple entries for 
a stage. That's why I defined every component of a query by an id, rather 
than by its stage. Since the first few steps of an analysis pathway 
might not depend on previous data, we could have multiple steps occurring
simultaneously. There's no reason for all of the steps to be sequential.
This would be especially true of a pathway which had a number of database
queries. Actually, we could probably get rid of the whole ordering thing
completely, since the wfs could just figure out dependencies by the 
variable references in the queries. Of course, the interface for this
could be more complicated...
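
To make the dependency idea concrete, here is a minimal Python sketch that reads the variable references out of a query section and works out which steps can run right now. The function names dependencies and runnable are hypothetical, and it assumes references always look like data.step[ID]...; the q3 step is an invented example of an independent database query.

```python
import re
import xml.etree.ElementTree as ET

# Pulls the step id out of references shaped like data.step[ID]... (an assumption).
STEP_REF = re.compile(r"data\.step\[(\w+)\]")

def dependencies(query_section):
    """Map each query step id to the set of step ids its variables reference."""
    deps = {}
    for step in query_section.findall("step"):
        refs = set()
        for var in step.iter("variable"):
            refs.update(STEP_REF.findall(var.text or ""))
        deps[step.get("id")] = refs
    return deps

def runnable(deps, finished):
    """Return the unfinished steps whose dependencies have all completed."""
    return sorted(step for step, needs in deps.items()
                  if step not in finished and needs <= finished)

query = ET.fromstring("""
<query>
 <step id="q1"><data><dna id="aaa">....</dna></data></step>
 <step id="q2"><data><protein id="bbb">
   <variable>data.step[q1].protein.*</variable>
 </protein></data></step>
 <step id="q3"><data><protein id="ccc">...</protein></data></step>
</query>
""")
deps = dependencies(query)
print(runnable(deps, set()))         # steps with no unmet dependencies
print(runnable(deps, {"q1", "q3"}))  # what becomes runnable afterwards
```

With this approach the stage attribute is not consulted at all; q1 and q3 have no references, so both could be dispatched at once, and q2 waits only on q1.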

Also, it probably makes more sense to move all of the input data into the
data section, and have the query reference it there. The format for
specifying variables and input in general will probably need to be
improved, too.

In terms of implementation, I imagine it would work like this:
The wfs identifies queries it can currently run, and creates a 
Paos object on the specified server, giving it only the portions
of the xml file necessary for it to run (query and relevant data
sections). The input data goes into one attribute of the Paos object.
The remote analysis system creates a second attribute containing the
status tags, and when it's complete, it creates an output section
with its new data. The wfs can frequently grab the status attribute
on the object, since it's small, and update its local copy for any
clients who want to know what is going on. When the analysis is
complete, the wfs grabs the output attribute off of the remote
object, updates its copy, and moves on. The remote analysis
system just drops its object once it has been acknowledged by the
wfs.
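
The lifecycle above can be sketched in Python roughly as follows. Everything here is a stand-in: RemoteObject and FakeAnalysisServer are hypothetical in-memory classes, not the Paos API, and the "analysis" just finishes after a fixed number of status reads. It is only meant to show the create / poll status / fetch output / acknowledge sequence.

```python
class RemoteObject:
    """Hypothetical stand-in for a Paos object: a bag of named attributes."""
    def __init__(self, input_xml):
        self.attrs = {"input": input_xml}

class FakeAnalysisServer:
    """Toy remote analysis system that finishes after a fixed number of polls."""
    def __init__(self, polls_needed=2):
        self.objects = {}
        self.polls_needed = polls_needed
        self._polls = {}

    def create(self, step_id, input_xml):
        obj = RemoteObject(input_xml)
        obj.attrs["status"] = '<step id="%s" state="processing"/>' % step_id
        self.objects[step_id] = obj
        self._polls[step_id] = 0

    def get(self, step_id, attr):
        # Each status read advances the fake analysis by one tick.
        if attr == "status":
            self._polls[step_id] += 1
            if self._polls[step_id] >= self.polls_needed:
                obj = self.objects[step_id]
                obj.attrs["output"] = "<protein>...</protein>"  # placeholder result
                obj.attrs["status"] = '<step id="%s" state="complete"/>' % step_id
        return self.objects[step_id].attrs.get(attr)

    def acknowledge(self, step_id):
        # The remote side drops its object once the wfs has the output.
        del self.objects[step_id]
        del self._polls[step_id]

def run_step(server, step_id, input_xml, local):
    """Drive one query through create, poll, fetch, acknowledge."""
    server.create(step_id, input_xml)
    while True:
        status = server.get(step_id, "status")
        local["status"][step_id] = status  # small, so refresh it often
        if 'state="complete"' in status:
            break
    local["data"][step_id] = server.get(step_id, "output")
    server.acknowledge(step_id)

local = {"status": {}, "data": {}}
server = FakeAnalysisServer()
run_step(server, "q1", '<dna id="aaa">....</dna>', local)
print(local["status"]["q1"])
print(local["data"]["q1"])
```

A real wfs would wait between polls and would run several steps concurrently rather than blocking on one; the loop here is kept synchronous only to keep the sketch short.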

Any thoughts on the markup language, the query syntax, variable
references, asynchronous analyses, or the workflow system (wfs, if
you were wondering what I was referring to)? I'll start my BioML, 
BSML, PDB merger/implementation/cleanup. Once we agree on how Loci
works underneath, a rough wfs/paos/gatekeeper system can be set up
fairly quickly. Then just a quick Python wrapper around some analysis 
tool and a simple viewer program will give us a functioning system 
(not a particularly easy-to-use system, but functioning 
nonetheless).

Justin Bradford
justin at ukans.edu