[Pipet Devel] BioML vs BSML

Justin Bradford justin at ukans.edu
Tue Jan 26 22:52:19 EST 1999


> How about making our own XML?  I think having four XML's has already diluted the
> field so that we can't complain about our XML being a proprietary format.  I
> think Justin and Konrad could coordinate this effort, and the others can offer
> input on sequence representations.  Really, we can get much of the sequence part
> from what we like about BSML and BioML.

I was thinking the same thing. Nothing seems to do exactly what we
want, and it will be simpler for querying purposes if we deal with
only one XML format. We could also convert to other formats for
exporting data outside the Loci system.

> Give me some feedback.

It would help if the people with lots of experience using existing formats
could comment on how they'd like it to work.

> > But we also want control information tagging along with the object? And
> > that would also be XML data?

What about storing control information in the Paos object, rather than in
the XML? Or could we make the Paos object a mirror of the XML format?
The purpose of this should become clearer as I explain other things.

> > Furthermore, I'd like it if this thing could query/update databases, too
> > (ie, a glyph for submitting my new protein structure to Brookhaven, or get
> > the sequence for some gene out of the GDB, etc.)
> 
> You mean have a Loci _tool_ for this?  You're not talking about XML here.

Well, whatever we use to describe queries should be capable of querying
and updating databases, ideally. That way, a database dependent step could
be as simple as an analysis step. This would require a gatekeeper
interface, of course. I just want to make sure we can fit it in
seamlessly.

> Yes, I think Paos can reside on both the server and client.  Carlos will have
> some documentation for us that can clear things up, and I think there is a
> README at the Paos Web site.

A Paos client has to make a connection to a Paos server. Therefore, there
must be a Paos server answering requests wherever an analysis tool is
located.

> > Also, a workflow/batch control system is in charge of directing the
> > movements of the object (via Paos). In case of failure, the Paos object is
> > updated with some exception, and the workflow system is notified and deals
> > with it appropriately.
> 
> Yes sir!

But there has to be something around constantly to monitor these Paos
objects throughout their lifetime. This would be the workflow system
(wfs). It would be responsible for directing objects, keeping track of
their status, and providing an interface for the user to check up on them.

> > Throughout this process, the workflow system is also updating the Paos
> > object with current status
> 
> The XML object can be changed, yes.

Now is the XML object in the Paos object, or are they the same thing?
Since Paos can deliver updates on only specific attributes, I wanted to
take advantage of that. As I mentioned earlier, the Paos object could be
a representation of the XML format we create, or it could contain XML data
from analysis steps. In the latter case, other attributes of the Paos
object would contain status and control information. That way it could be
updated "atomically", regardless of the other XML data it contains.
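To make the attribute split concrete, here is a minimal sketch in Python. Every name in it (LociJob, set_attribute, and so on) is invented for illustration; the real Paos API may look quite different:

```python
# Hypothetical sketch of the attribute split on a Paos-style object.
# All names here are invented; Paos itself would supply the real
# object and update machinery.

class LociJob:
    def __init__(self, job_id):
        self.job_id = job_id
        self.attributes = {
            "control": {},   # workflow directives built by GCL
            "status": {},    # progress/exception info, updated atomically
            "data": "",      # XML input for analysis
            "results": "",   # XML returned by analysis steps
        }

    def set_attribute(self, name, value):
        # Updating one attribute leaves the others untouched, so status
        # can change without re-sending the (possibly large) XML data.
        self.attributes[name] = value

    def get_attribute(self, name):
        return self.attributes[name]

job = LociJob("42")
job.set_attribute("status", {"step": "blast", "state": "running"})
```

The point of the sketch is just that status lives in its own slot, separate from the XML payload.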

> > If so, it makes sense to use the Paos object to store control, exception,
> > and status info. Data for analysis and analyzed data are stored in
> > separate attributes.
> 
> Yes.  These are complications that may require us to write our own XML.

Again, would it make sense for this to be in the Paos object, the XML it
contains, or is there a difference?

> > The gatekeeper takes the data from the appropriate
> > attribute (as told by relevant control information), modifies it as
> > necessary for the analysis tool, and runs that tool.
> 
> Now we are back to analyzing the XML data (Paos object), back up to where I
> typed @@@.  These are not two types of analyses.  The gatekeeper will work with
> the workflow system, etc.

Maybe. I was thinking that the Paos object contained the XML data in an
attribute, which was extracted and presented to the gatekeeper depending
on what it was supposed to do with it.
But if the whole Paos object is an XML representation, then the gatekeeper
takes what it needs.

> > In this model, the workflow system is a Paos server/client combo. It
> > would get the original object from the user, hand that to an analysis
> > server, but keep a local copy updated, which the user (status monitor)
> > would access for updates.
> 
> I'm not sure about keeping a local copy of the data.  You say that the data
> would be updated, which would require the whole XML object to be transferred many
> times.  I was thinking only once at the end, but the analysis locus could just
> keep reporting what is being done...like writing a log file.

Ok. I think you had envisioned the gatekeeper dealing with the
whole XML file, which contained control and status info, and was stored
in the Paos object. I want to take advantage of the object nature of Paos,
and use multiple attributes on the object: one for control, one for
status, one for data storage (the XML returned by the analyses).
That way, status could be updated independently of the rest of the XML data.
Of course, it would be even better if the Paos object were simply a
representation of the XML data. Then analyses could be updated atomically,
too. Also, this way, the user client wouldn't have to parse XML; it would
be given an object-oriented view of it right away. Sort of like DOM, for
which we could even provide an interface.
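The "object-oriented view" idea could look something like the sketch below, which wraps Python's standard ElementTree to give attribute-style access to XML elements. The XmlView class and the sample document are mine, purely for illustration; whatever mapping Paos actually supports would replace this:

```python
# Sketch of a DOM-like view: the client receives an object mirroring
# the XML structure instead of raw XML text. XmlView is a made-up
# illustration, not part of Paos.

import xml.etree.ElementTree as ET

class XmlView:
    """Read-only, attribute-style access over an XML element."""
    def __init__(self, element):
        self._element = element

    def __getattr__(self, name):
        child = self._element.find(name)
        if child is None:
            raise AttributeError(name)
        if len(child) == 0:
            return child.text      # leaf element: return its text
        return XmlView(child)      # nested element: wrap recursively

doc = ET.fromstring(
    "<sequence><name>my_gene</name><residues>ATGGCC</residues></sequence>"
)
view = XmlView(doc)
print(view.name)      # my_gene
print(view.residues)  # ATGGCC
```

The client never touches angle brackets; it just reads attributes.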

> > ...and then repeat the whole process (ie.
> > give the object to the next analysis server, ...)
> 
> Yes, when GCL is used to automate some analyses.

GCL is used to build the control data. The workflow system does the work,
according to the control information.

> > All the user client stuff access the workflow system directly, which deals
> > with the individual analysis servers. This runs as a separate process, so
> > you might have a server running this. The client starts up his Loci
> > GCL program on a networked computer anywhere, builds the analysis batch,
> > starts it, gets an ID number, and can close the program and walk away.
> 
> I never thought of that, but it's a great idea!

I had considered the possibility of our objects (which, for clarification,
refers to a batch of controls, data to be analyzed, data already analyzed,
and various status information) roaming independently of a "central"
server. They could be passed from gatekeeper to gatekeeper directly.
However, that would make it impossible to monitor them, unless the object
"called home" every now and then. But I don't like that. It makes more
sense for the user to query the object when they want information.
For that to be possible, there has to be some constant, central server
which is watching the object. This would be the workflow system.
It's in charge of directing the object, and it constantly keeps tabs on
its status. This is why I want atomic updates on status info.

The workflow system (wfs) is really a Paos server, but it only talks to
user clients. However, it pretends to be a Paos client to communicate with
the Paos server associated with an analysis tool. When sending an object
to be analyzed, the wfs commits the object to the remote (analysis)
server. It also requests notification on all updates to its status
attributes. The copy of the object local to the wfs is updated with the
remote status info. When analysis is complete, the wfs syncs its copy
with the remote copy, and then removes the remote copy.
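The commit/notify/sync cycle might look roughly like this. Both classes and every method name are placeholders I made up; Paos itself would provide the real commit and notification machinery:

```python
# Rough sketch of the wfs hand-off cycle. AnalysisServer, Workflow, and
# all method names are invented for illustration only.

class AnalysisServer:
    """Stand-in for the remote Paos server beside an analysis tool."""
    def __init__(self):
        self.objects = {}
        self.listeners = {}

    def commit(self, obj_id, obj, on_status_change):
        self.objects[obj_id] = dict(obj)
        self.listeners[obj_id] = on_status_change

    def update_status(self, obj_id, status):
        self.objects[obj_id]["status"] = status
        self.listeners[obj_id](status)   # push notification back to the wfs

    def remove(self, obj_id):
        return self.objects.pop(obj_id)

class Workflow:
    def __init__(self):
        self.local = {}

    def submit(self, obj_id, obj, server):
        self.local[obj_id] = dict(obj)
        # Ask for notification on status changes only, so the bulky
        # XML data is not re-sent on every update.
        server.commit(obj_id, obj,
                      lambda status: self.local[obj_id].update(status=status))

    def finish(self, obj_id, server):
        # Sync the full object once at the end, then drop the remote copy.
        self.local[obj_id] = server.remove(obj_id)

wfs = Workflow()
remote = AnalysisServer()
wfs.submit("42", {"status": "queued", "data": "<xml/>"}, remote)
remote.update_status("42", "running")   # wfs local copy now says "running"
```

Only the small status attribute crosses the wire during analysis; the full object moves once, at the end.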

Now, at any time, a user client can access the wfs, and get the status
information from the copy on the wfs. The user will always know where the
wfs is, since it's running locally (either just for that one user, or
maybe a department or university-wide instance).
When an object completes, it gets moved to an archive section of the wfs.
The user client accesses this object via a unique ID.

Since the wfs is networked, the object can be accessed from any Loci user
client. The user just has to know the wfs location and the object ID.

> Hmmm.  Turning the client off and getting the data from another client, means
> the server needs to know the original client is off and that the information
> should be held until the ID is provided.  I think it'll work.  The server may
> keep a copy on file for a time specified by the user.  That way, the server
> doesn't have to probe for the client loci that sent the data.

See above; the wfs doesn't care if the original client is still around. It
just holds onto the object until someone comes along to retrieve it.
The wfs shouldn't seek out the user client. The user client comes to it.
Also, the object ID is provided when the object is first started.
Click "Start Analysis", and the wfs responds with "OK, here's your
ID." The user client should have an option to keep track of those for you,
but the user should also be able to access the object from any other Loci
user client, just using that info.
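A sketch of that ID-based retrieval, with all names invented for illustration: any client that knows the wfs address and the object ID can fetch the result later, even after the original client has exited.

```python
# Illustrative sketch of ID-based retrieval from the wfs archive.
# WorkflowServer and its methods are hypothetical names.

import uuid

class WorkflowServer:
    def __init__(self):
        self.active = {}
        self.archive = {}

    def start_analysis(self, job):
        job_id = str(uuid.uuid4())   # "OK, here's your ID."
        self.active[job_id] = job
        return job_id

    def complete(self, job_id):
        # Finished objects move to the archive, keyed by the same ID.
        self.archive[job_id] = self.active.pop(job_id)

    def retrieve(self, job_id):
        # Works from any Loci user client, at any later time.
        return self.archive.get(job_id) or self.active.get(job_id)

wfs = WorkflowServer()
jid = wfs.start_analysis({"data": "<sequence/>"})
wfs.complete(jid)
result = wfs.retrieve(jid)   # same object, fetched by ID alone
```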

> Well, again, if we make up our own system it will be less complicated...but
> we'll have more work.

That's probably the next step, although I'm curious about your thoughts
on how to use the Paos object. I'm becoming rather fond of the object
representation of the XML format.

First, though, I should make sure Paos is capable of this.
Carlos, can it handle complex Python container classes?
And can we update elements in it atomically?

If not, just using separate attributes on the Paos object for the various
components would work (the status, control, query, and data attributes).

Justin Bradford
justin at ukans.edu



