[Pipet Devel] Another XML proposal.

Tue Mar 30 03:19:20 EST 1999

> I'd like to propose that Loci use many small XML DTDs instead of
> trying for a kitchen sink DTD.

I agree, and that is basically the way I had been thinking.
Specific descriptions of data should be as small and modular as possible
(sequence, structure, phylogeny, etc). However, LocusML should also be able
to describe relationships between those pieces of data where necessary. We
might need specific DTDs for relationships (e.g. a restriction map, which
contains a number of short sequence components), as a lot of relationships
will be very hard to express generically.

> we could use a simplified BSML for sequence information (and just
> sequences).

I don't like how BSML is structured, but I do like the detail it allows. I
prefer the inner sections of BioML for their "flow". I had planned to
merge the two, looking more like BioML but with the versatility of BSML.
Also, BSML doesn't cover amino acid sequences (if I remember correctly),
while BioML does. The two kinds of sequence probably merit separate DTDs
anyway, though.

> a different XML dialect could be used for structure information
> (including structural annotations in sequences).

Yes. I'm not sure where to begin on structure. Someone here had ideas
on this, but I'm not sure who or what became of them.

Also, it sounds like we'll need a DTD for phylogeny, too. There are
probably others as well, but the concept remains the same: describe just
the relevant data, and use a unique ID to look up references and related
data elsewhere.
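
As a rough illustration of the unique-ID idea, here is a minimal Python
sketch. The element names, the "id" and "about" attributes, and both helper
functions are made up for this example; none of it is an existing Loci or
LocusML format.

```python
import xml.etree.ElementTree as ET

# Two small, independent documents. Element and attribute names are
# hypothetical and not taken from any existing DTD.
SEQUENCE_XML = '<sequence id="seq1" type="dna">GAATTCGGATCC</sequence>'
REFERENCE_XML = (
    '<reference id="ref1" about="seq1"><title>Example paper</title></reference>'
)


def index_by_id(documents):
    """Map each root element's id attribute to the parsed element."""
    index = {}
    for text in documents:
        root = ET.fromstring(text)
        index[root.get("id")] = root
    return index


def related_to(index, target_id):
    """Return the ids of all objects whose 'about' attribute is target_id."""
    return sorted(
        obj_id for obj_id, elem in index.items()
        if elem.get("about") == target_id
    )


index = index_by_id([SEQUENCE_XML, REFERENCE_XML])
print(related_to(index, "seq1"))
```

Each document stays small and only carries an ID pointing at the other, so
the sequence DTD never needs to know what a reference looks like.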

> separate XML DTDs could be defined for references, options to be passed
> to a program, and work paths.

A generic reference DTD is fairly simple. Describing relationships between
data will take a little more thought.
Loci-specific information will probably be filled in as we go further into
development.

Although, just so no one is confused, the XML format is really only for
transfer (data) and storage (both data and Loci info).
Actual Loci info will be kept in the Paos object as attributes rather than
as an XML stream that has to be parsed all the time. It will be written out
to XML for non-Paos storage.
Generic data will always be handled by the Loci framework as XML (since
it's basically meaningless to it), and the data-specific tools will handle
it internally in whatever way is appropriate (hash table, binary tree,
etc).

But a DTD is a good way to describe the data Loci uses.

> I retrieve a nucleotide sequence from Genbank, in genbank format, it
> is parsed into several XML objects: a nucleotide sequence object, several
> bibliographic reference objects, a protein sequence object for the
> "/translation=" feature found in the original genbank file.

Exactly. I believe the translation component is the gatekeeper, in Loci
terminology.

> Each xml object is displayed on the benchtop by the appropriate locus.
> Now I click on the button to perform a restriction map of my sequence.

I haven't thought about the UI much yet.

> The workspace contacts the restriction map locus, which returns an XML
> object describing the parameters and options this restriction map
> locus requires or supports. 

_That_ is an interesting idea. I had just been assuming a generic
interface for types of loci (for example, a restriction map locus has
three arguments and it doesn't vary), but rather than having a bunch of
hardcoded loci types, we can query the locus for its interface (of course
we'll want to cache interfaces).
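
A minimal Python sketch of querying a locus for its interface and caching
the answer might look like the following. The class names, the
describe_interface method, and the shape of the returned description are
all assumptions for illustration, not an existing Loci API.

```python
class RestrictionMapLocus:
    """Hypothetical locus that can describe its own interface."""

    def __init__(self):
        self.describe_calls = 0  # counts how often we were actually queried

    def describe_interface(self):
        self.describe_calls += 1
        return {
            "action": "restriction map",
            "options": {"distinguish enzyme cuts": "yes"},
            "arguments": ["template", "restriction enzyme"],
        }


class InterfaceCache:
    """Ask each locus for its interface once, then reuse the answer."""

    def __init__(self):
        self._cache = {}

    def interface_for(self, name, locus):
        if name not in self._cache:
            self._cache[name] = locus.describe_interface()
        return self._cache[name]


locus = RestrictionMapLocus()
cache = InterfaceCache()
first = cache.interface_for("restriction map", locus)
second = cache.interface_for("restriction map", locus)
print(first["arguments"], locus.describe_calls)
```

The second lookup never reaches the locus, which matters if the locus is
remote. A real cache would also need some way to notice that a locus has
changed its interface.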

> An option handling locus can then prompt
> me for the enzymes I want to cut with, the output format I prefer,
> etc.

Going back to Jeff's idea about embedding python in XML, a locus could
return an interface description with UI code to handle the query
configuration (probably optional for exotic cases; most of the time it
would be generic fields with default UI handlers).

> The restriction map locus can now return the results as several xml
> objects:  a bibliographic reference object describing the algorithm
> used to perform the analysis; a result object containing the requested
> results; a locus object containing the gnome-python source code for
> a gui-locus that can display the results.

Before we go overboard with passing interface code around though, I'd like
to strongly encourage the presence of powerful, high-level widgets in the
workspace app. We don't want to be passing around a generic sequence
viewer all the time.

> The workspace can check if it already has a gui-locus that can display
> the results, and passes the results to it, or downloads the code and
> generates the gui-locus.

Like I said just above, I'd like to see a nice API (from the loci
perspective) for the UI stuff, ranging from low-level building block
widgets to higher-level generic viewers, as well as the ability to plug in
additional generic viewers. That way, if you're always using some
non-standard locus gui, you can just load the script locally (and even
replace it with faster compiled code).

> As loci are loaded into the workspace, they can register the ability
> to handle a particular DTD or set of DTDs.

Possibly even more than that -- for instance, a locus to handle a specific
relationship between sets of DTDs (I don't have a good example, though).
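
One way to sketch registration in Python is to key handlers on a set of
DTD names, so that a locus can claim a single DTD or a combination of
them. LocusRegistry, its methods, and the DTD and locus names below are
hypothetical.

```python
from collections import defaultdict


class LocusRegistry:
    """Hypothetical registry mapping sets of DTD names to locus names."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def register(self, locus_name, dtds):
        # A frozenset key is order-independent and covers both the
        # single-DTD case and the "relationship between DTDs" case.
        self._handlers[frozenset(dtds)].append(locus_name)

    def handlers_for(self, dtds):
        # .get() avoids creating empty entries for unknown combinations.
        return list(self._handlers.get(frozenset(dtds), []))


registry = LocusRegistry()
registry.register("sequence viewer", ["sequence"])
registry.register("restriction map viewer", ["sequence", "restriction-map"])

print(registry.handlers_for(["restriction-map", "sequence"]))
print(registry.handlers_for(["phylogeny"]))
```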

> We also need not pass around the entire XML object each time, for
> example only a url for a reference need be included in the results
> from an analysis, not the entire paper.

Yes. It was my intention for the workflow system to just give a locus what
it needs (probably by creating a second Paos object). It should present it
with the necessary data and control information, rather than sending the
whole object with potentially extraneous data and control info.
The locus updates the object with status information (recorded to the
master Paos object, which the gui can get info from), and then transmits
the generated data back via Paos. That's consolidated into the master
object and fed to the gui client.
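
A rough Python sketch of that flow, with plain dicts standing in for Paos
objects. The function names, the "data"/"status" layout, and the
placeholder values are assumptions; this is not how Paos itself works.

```python
# Master object held by the workflow system (a dict standing in for Paos).
master = {
    "data": {
        "sequence_id": "GAATTCGGATCC",
        "notes": "annotation the restriction map locus does not need",
    },
    "status": {},
}


def make_work_object(master, needed_keys):
    """Build a second, smaller object holding only what a locus needs."""
    return {"data": {key: master["data"][key] for key in needed_keys}}


def consolidate(master, locus_name, status, results):
    """Record the locus's status and merge its results into the master."""
    master["status"][locus_name] = status
    master["data"].update(results)


work = make_work_object(master, ["sequence_id"])
# ... the locus would run here, using only the contents of `work` ...
consolidate(master, "restriction map", "done",
            {"map_id": "<restriction map XML goes here>"})

print(sorted(work["data"]))
print(master["status"])
```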

Also, for the Paos representation of the Loci XML info, I was imagining a
DOM-like interface. The XML is represented in a tree.

So, this Loci info:
<query id="aaaa">
 <action>restriction map</action>
 <option name="distinguish enzyme cuts" value="yes"/>
 <data argument="template">#sequence_id</data>
 <data argument="restriction enzyme">EcoR1</data>
 <data argument="restriction enzyme">BamH1</data>
</query>

becomes this Paos object:

query.id = "aaaa"
query.action = "restriction map"
query.option{distinguish enzyme cuts} = "yes"
query.data{template} = "#sequence_id"
query.data{restriction enzyme} = "EcoR1"
query.data{restriction enzyme} = "BamH1"

where #sequence_id means the XML can be extracted from the Paos data
attribute under the key "sequence_id".

Or something along these lines. This example is missing a lot of things.
I'm not sure how Python handles hashes either; the notation here is
actually Perl/C-ish.
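
For what it's worth, Python's hashes are dicts, and the mapping could be
sketched as below using the standard xml.etree.ElementTree module. The
query_to_dict function and the dict layout are assumptions, not a Paos
interface. Two details differ from the notation above: the option element
is written self-closing so that the XML parses, and repeated arguments are
collected in a list, since assigning the same hash key twice would keep
only the last enzyme.

```python
import xml.etree.ElementTree as ET

QUERY_XML = """
<query id="aaaa">
 <action>restriction map</action>
 <option name="distinguish enzyme cuts" value="yes"/>
 <data argument="template">#sequence_id</data>
 <data argument="restriction enzyme">EcoR1</data>
 <data argument="restriction enzyme">BamH1</data>
</query>
"""


def query_to_dict(xml_text):
    """Turn a <query> document into nested dicts (hypothetical layout)."""
    root = ET.fromstring(xml_text.strip())
    query = {
        "id": root.get("id"),
        "action": root.findtext("action"),
        "option": {},
        "data": {},
    }
    for option in root.findall("option"):
        query["option"][option.get("name")] = option.get("value")
    for data in root.findall("data"):
        # An argument may appear more than once, so keep a list per key.
        query["data"].setdefault(data.get("argument"), []).append(data.text)
    return query


query = query_to_dict(QUERY_XML)
print(query["action"])
print(query["data"]["restriction enzyme"])  # ['EcoR1', 'BamH1']
```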

(note: Ok, now I'm going to ramble some...)

Although, perhaps we don't even need to bother trying to express the
internal Loci data stuff as XML. Will we ever need to write it out to XML?
Possibly only the actual biological data needs XML expression, just to
facilitate interaction between Loci-derived data and non-Loci tools.
Theoretically, we don't need XML for anything, since structures in Paos
could hold all of the biological data, too. It just seems like a good way
to describe things for stuff that isn't entirely internal to Loci. But on
similar grounds, we will need to define the internal Loci info interface
adequately for tools to make use of it, and perhaps an XML representation
of that would make it more clear.

I'd really like to rig up a working demo. Does anyone have a pretty
simple analysis tool we could use for an example? In particular, the view
of the resulting data should be simple (that's probably where most of the
programming is). Actually a restriction map wouldn't be too bad...

Justin Bradford
justin at ukans.edu