[Pipet Devel] and still more infrastructure things

J.W. Bizzaro bizzaro at bc.edu
Sun Feb 28 05:13:19 EST 1999


Okay, so what you once considered the responsibility of the Benchtop/GCL, you
now consider that of the wfs.

So, I'll try to look at the XML as an object rather than a file this time.  And
the wfs launches the apps, not the individual loci/clients.


Justin Bradford wrote:

> Generated data should also return information about the analysis process,
> like the algorithm used, statistical probabilities, etc.

Of course we should make a sharp division at the start between data that is
biological and data that is for the workflow system.  I even imagine the very
top of the file/object to be all workflow stuff.

> Now that is just the "data" section. A LociML file will have a variety of
> additional information as well. We'll probably need control, status, and
> query sections, too.
> Control has to describe the analysis pathway.

...description of the whole pathway

> Status is information concerning the data returned at each analysis step.

...what was collected along the way

> Query has to hold the actual query at each step.

...what still needs to be collected

> Now, the control section is fairly straightforward, as is the status
> section, although both will need to be fairly flexible, holding incidental
> information concerning an analysis that might be useful to the client. I
> don't really have any good examples, but I imagine some will come up.

Status should contain the "log" of the analyses.  By the time the final
destination is reached, status will say what control says, among other things;
so at the final destination, control is irrelevant.

> The query section is more complex, but here's my idea:
> When the user creates the analysis pathway, all of the query commands are
> generated at that time as well, but each can make use of variables
> referencing data from queries in earlier stages. The workflow system will
> fill in the variables for a query before sending it off for that analysis.

Sure.  In other words, the query section is dynamic.
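
For instance, the wfs might resolve a reference like your
data.step[q1].protein.* below with something along these lines (a pure
sketch: resolve_variable and the path grammar are just my guesses, using
Python's ElementTree for illustration):

    import re
    import xml.etree.ElementTree as ET

    def resolve_variable(doc, ref):
        # Split "data.step[q1].protein.*" into section, step id, and tag.
        # The trailing ".*" means "all matching children"; this sketch
        # returns them all anyway, so it just ignores it.
        m = re.match(r'(\w+)\.step\[(\w+)\]\.(\w+)', ref)
        if m is None:
            raise ValueError('bad variable reference: ' + ref)
        section, step_id, tag = m.groups()
        for step in doc.find(section).findall('step'):
            if step.get('id') == step_id:
                return step.findall(tag)
        return []

    # doc = ET.fromstring(lociml_text)
    # proteins = resolve_variable(doc, 'data.step[q1].protein.*')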

> Here is a crude example:
> 
> <control current="1">
>  <step stage="1" server="paosp://some.host/whatever" id="q1"/>
>  <step stage="2" server="paosp://foo/" id="q2"/>
> </control>
> <status>
>  <step id="q1" state="processing">
>   <message>Analyzing sequence...</message>
>  </step>
> </status>

So status is reported back, via the wfs, to the Benchtop, a la my previous
e-mail.  Good.

> <data>
>  <step id="q1">
>   <protein>lots of other stuff here</protein>
>   <!-- note: this wouldn't show up until after the q1 step is done -->
>  </step>
> </data>
> 
> <query>
>  <step id="q1">
>   <data>
>    <dna id="aaa">....</dna>
>   </data>
>   <operation type="translate">
>    <input id="aaa"/>
>   <!-- some other data for translation -->
>   </operation>
>  </step>
>  <step id="q2">
>   <data>
>    <protein id="bbb">
>     <variable>data.step[q1].protein.*</variable>
>    </protein>
>    <protein id="ccc">...</protein>
>   </data>
>   <operation type="homology">
>    <input id="bbb"/>
>    <input id="ccc"/>
>   <!-- other stuff for query -->
>   </operation>
>  </step>
> </query>

Nice.  But how will Paos handle this?  Are we looking at some major changes to
Paos itself?

> Also, there's no particular reason there couldn't be multiple entries for
> a stage.

stage == step?  Or I guess a stage can contain multiple steps...

> That's why I defined every component of a query by an id, rather
> than by its stage. Since the first few steps of an analysis pathway
> might not depend on previous data, we could have multiple steps occurring
> simultaneously. There's no reason for all of the steps to be sequential.

Right.  That'd save time, but be difficult to manage.  Now we're talking about
concurrency.

> This would be especially true of a pathway which had a number of database
> queries. Actually, we could probably get rid of the whole ordering thing
> completely, since the wfs could just figure out dependencies by the
> variable references in the queries. Of course, the interface for this
> could be more complicated...

Hmmm.  Now are we dealing with the whole forking/sewing issue here?  Once an XML
object is split up, will it have to be put back together again?
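
Inferring the order from the variable references does seem doable, though.
Maybe something like this (a sketch; runnable_steps is invented, and I'm
assuming the wfs can get at each query step's text as a string):

    import re

    def runnable_steps(queries, finished):
        # queries: step id -> the raw text of that <step> in <query>.
        # finished: ids of steps whose output is already in <data>.
        # A step can be dispatched once every step it references is done.
        ready = []
        for step_id, text in queries.items():
            if step_id in finished:
                continue
            deps = set(re.findall(r'\w+\.step\[(\w+)\]', text))
            if deps <= finished:
                ready.append(step_id)
        return ready

    queries = {'q1': '<dna id="aaa">....</dna>',
               'q2': '<variable>data.step[q1].protein.*</variable>'}
    print(runnable_steps(queries, set()))    # ['q1'] -- q2 must wait
    print(runnable_steps(queries, {'q1'}))   # ['q2']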

> Also, it probably makes more sense to move all of the input data into the
> data section, and have the query reference it there. Also, the format
> of specifying variables and input in general will probably need to be
> improved.

I was thinking about keeping workflow data together.

Also, ID numbers could be longer and randomly generated.
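
Something like this would do (make_id is just an illustration):

    import random

    def make_id():
        # 48 random bits as hex -- long enough that two steps in the
        # same pathway will practically never collide.
        return 'q%012x' % random.getrandbits(48)

    print(make_id())   # e.g. q4f2a91c04be7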

> In terms of implementation, I imagine it would work like this:
> The wfs identifies queries it can currently run

How?  By the database of available loci/clients?
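
If so, I picture the wfs keeping a table of which operation types each
server advertises, something like (all names invented):

    def servers_for(operation, registry):
        # registry: server URL -> the set of operation types it offers.
        return [url for url, ops in registry.items() if operation in ops]

    registry = {'paosp://some.host/whatever': {'translate'},
                'paosp://foo/': {'translate', 'homology'}}
    print(servers_for('homology', registry))   # ['paosp://foo/']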

> and creates a
> Paos object on the specified server

...via Porta Internet or whatever, as long as it appears transparent.

> giving it only the portions
> of the XML file necessary for it to run (query and relevant data
> sections).

Yeah, this is where I see Porta Internet or Gatekeeper filtering out stuff the
server-side algorithms/databases don't need.
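
Roughly like this, I imagine (a sketch: extract_for_step is made up, and
I'm guessing the dependency scan can reuse the same variable syntax):

    import re
    import xml.etree.ElementTree as ET

    def extract_for_step(doc, step_id):
        # Build a stripped-down document holding only the <query> step
        # being dispatched plus the <data> steps it references.
        query_step = None
        for step in doc.find('query').findall('step'):
            if step.get('id') == step_id:
                query_step = step
        if query_step is None:
            raise KeyError(step_id)
        text = ET.tostring(query_step, encoding='unicode')
        refs = set(re.findall(r'data\.step\[(\w+)\]', text))
        out = ET.Element(doc.tag)
        ET.SubElement(out, 'query').append(query_step)
        data = ET.SubElement(out, 'data')
        for step in doc.find('data').findall('step'):
            if step.get('id') in refs:
                data.append(step)
        return out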

> The input data goes into one attribute of the Paos object.
> The remote analysis system creates a second attribute containing the
> status tags, and when it's complete, it creates an output section
> with its new data.

Okay.

> The wfs can frequently grab the status attribute
> on the object, since it's small, and update its local copy for any
> clients who want to know what is going on.

Yes.  Wonderful!

> When the analysis is
> complete, the wfs grabs the output attribute off of the remote
> object and updates its copy, and moves on. The remote analysis
> system just drops its object once it has been acknowledged by the
> wfs.

Okay.
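
So the whole round trip would look about like this.  I don't know what
calls Paos actually provides, so every method name here is a placeholder
for whatever Carlos gives us:

    import time

    def run_remote_step(server, step_xml, poll_interval=5):
        # 1. Hand the query + relevant data sections to the remote server.
        obj = server.create_object({'input': step_xml})
        # 2. Poll the (small) status attribute and relay it to clients.
        while not obj.has_attribute('output'):
            relay_to_benchtop(obj.get_attribute('status'))
            time.sleep(poll_interval)
        # 3. Grab the output, update our local copy, and acknowledge so
        #    the remote side can drop its object.
        output = obj.get_attribute('output')
        obj.acknowledge()
        return output

    def relay_to_benchtop(status):
        print(status)   # stand-in for pushing status up to the Benchtop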

> Any thoughts on the markup language, the query syntax, variable
> references, asynchronous analyses, or the workflow system (wfs, if
> you were wondering what I was referring to)?

Just work with Konrad on the markup of structure.

> I'll start my BioML,
> BSML, PDB merger/implementation/cleanup. Once we agree on how Loci
> works underneath, a rough wfs/paos/gatekeeper system can be set up
> fairly quickly. Then just a quick Python wrapper around some analysis
> tool and a simple viewer program will give us a functioning system
> (not a particularly easy-to-use system, but functioning
> nonetheless).

I'm glad you think this will go quickly.  Are you able to work with Paos as it
is, or will Carlos need to make changes?  How comfortable are you with
Python?


Buh-bye!
Jeff
bizzaro at bc.edu


