[Pipet Devel] Data Storage Interfaces

J.W. Bizzaro bizzaro at bc.edu
Fri Jun 11 14:48:46 EDT 1999


Hi Alan.

I'm sorry your message went unanswered for a while.  I consider myself lucky
when someone answers me on this list :-)

Some of the terminology you use here is a little unclear to me.  And I've got to
say that it is a little difficult for me to get into a design so different from
Loci, since I have spent so much time trying to refine the Loci design.  IOW,
I'm happy with Loci's design, and I'd probably just recommend the same thing to
you ;-)

We've discussed licensing, which is a basic distinction between GST and Loci. 
But what are some other differences?  From what you have told me, I've come up
with the following:

(1) GST uses C; Loci uses Python
(2) GST requires Oracle DB; Loci uses PAOS and a simple DB

However, both GST and Loci use XML extensively.  And the GST DM is something
like Loci's server/daemon, "Locid".  I don't see many differences there.

> In the most basic sense, GST would not have a DM but rather a DM interface
> to DM plug-ins.  Some examples:

Hmmm.  But the DM's you listed are required, right?  Otherwise there would be no
XML parsing, file system access, etc.

> 1) Is this a good model?

It seems pretty good to me :-)  Are you planning on using an object server, like
CORBA?  How do you hope to get multiple components to communicate independently?

> 3) How will robust cross referencing be done (ie I am working on a project
>    and all the sequences are available through the oracle DM and the blast
>    results are available through the filesystem DM.)

In Loci, a single XML document "travels" a workpath, so everything is done
serially (within one path).  The document will collect various XML fragments
(conforming to different DTD's) along the path.  Some of these will be
understood by certain loci, and some by others.
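
To make the "traveling document" idea concrete, here is a rough sketch in
Python.  It is not Loci code: the element names and the two loci
(sequence_locus, gc_locus) are made up for illustration.  Each locus appends
its own fragment to the shared document and reads only the elements it
understands.

```python
import xml.etree.ElementTree as ET


def run_workpath(document, loci):
    """Pass one document through each locus in order (serially)."""
    for locus in loci:
        locus(document)
    return document


def sequence_locus(document):
    # Hypothetical locus: appends a fragment in its own vocabulary.
    seq = ET.SubElement(document, "sequence", {"id": "seq1"})
    seq.text = "ATGC"


def gc_locus(document):
    # Hypothetical locus: reads only <sequence> elements and ignores
    # anything else the document has collected along the way.
    for seq in document.findall("sequence"):
        text = seq.text or ""
        gc = text.count("G") + text.count("C")
        result = ET.SubElement(document, "gc_count", {"ref": seq.get("id")})
        result.text = str(gc)


doc = run_workpath(ET.Element("workpath_document"),
                   [sequence_locus, gc_locus])
print(ET.tostring(doc, encoding="unicode"))
```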

> Using URI's along with XLinks and Xpointers is probably the best way given
> that the data model is centered on XML. Along those lines, the URI might
> look like:
> 
>        gstoracle://oracle.server.bio.com:85/genbank/pri/U29875
> 
> where gstoracle defines the data access method (which DM plugin to use),
> oracle.server.bio.com:85 is the host computer for the data (and port
> number), and genbank/pri/U29875 is the remaining portion of the reference
> to get to the desired sequence.
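
As a sketch of how the quoted URI could be taken apart (assuming nothing
about how GST actually does it), Python's standard urllib.parse already
splits out the scheme, host, port and path.  The function name
parse_data_uri and the dictionary keys are hypothetical.

```python
from urllib.parse import urlsplit


def parse_data_uri(uri):
    """Split a DM-style URI into plug-in name, host, port and path parts."""
    parts = urlsplit(uri)
    return {
        "plugin": parts.scheme,      # which DM plug-in to use
        "host": parts.hostname,      # machine holding the data
        "port": parts.port,          # None if no port was given
        "path": parts.path.lstrip("/").split("/"),
    }


ref = parse_data_uri("gstoracle://oracle.server.bio.com:85/genbank/pri/U29875")
print(ref)
```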

This is something like the scheme Humberto came up with for obtaining widgets
that were needed but not present on the user's machine.  The remote locus
returns an XML with the results of an analysis and specifies the best "viewer"
to examine the result graphically, and even where to get the widgets.  If the
user doesn't have the widgets, a dialog asks if they would like to get them from
whatever URL.

> So working from this, we should allow for aliases of the
> protocol://hostname:port portion.  This is necessary if we are going to
> simplify the process of relocating data.  The question is where and how

This may be akin to what Justin recently called an "address book".  Loci has to
keep track of what loci are available, locally and remotely, including version
numbers.
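
One way to picture such an alias table or "address book" is a plain mapping
from a short name to the protocol://hostname:port prefix.  This is only a
sketch under my own assumptions: the alias "seqdb", the ADDRESS_BOOK name and
the resolve function are invented here, and where the table should live is
still the open question.

```python
# Hypothetical address book: alias -> protocol://hostname:port prefix.
ADDRESS_BOOK = {
    "seqdb": "gstoracle://oracle.server.bio.com:85",
}


def resolve(reference, address_book):
    """Expand 'alias/rest/of/path' into a full URI.

    Raises KeyError if the alias is not in the address book.
    """
    alias, sep, rest = reference.partition("/")
    return address_book[alias] + sep + rest


full_uri = resolve("seqdb/genbank/pri/U29875", ADDRESS_BOOK)
print(full_uri)
```

If the data moves to another host, only the one address book entry changes;
references written against the alias stay valid.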

> As for encoding these links, it would make sense to use xlink's to refer
> to xml entities and xpointers to reference within entities. In addition,
> the DM would be responsible for assigning new ID's so that within each DM
> ID's are unique.  Hence the URI for a GST xml resource would include the
> ID and would be unique.

Yeah.  Each XML object in Loci will need a unique ID.  This should help with
cross referencing between different XML's.  But again, the XML's will all be in
the same document, unless there is a good reason to split the document up (e.g.
an analysis could be performed on a subset of the biodata, rather than the whole
thing).
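
A small sketch of the ID idea, with assumptions flagged: the DataManager
class, the counter-based IDs and the "gstfile" scheme are all hypothetical.
The point is only that an ID needs to be unique within one DM, because the
DM's own prefix makes the full URI unique.

```python
import itertools


class DataManager:
    """Hypothetical DM that hands out IDs unique within itself."""

    def __init__(self, base_uri):
        self.base_uri = base_uri
        self._counter = itertools.count(1)

    def new_uri(self):
        # The same number may be issued by another DM, but the prefix
        # differs, so the full URI does not collide.
        return f"{self.base_uri}/{next(self._counter)}"


oracle_dm = DataManager("gstoracle://oracle.server.bio.com:85")
files_dm = DataManager("gstfile://localhost")  # made-up scheme

first = oracle_dm.new_uri()
second = oracle_dm.new_uri()
other = files_dm.new_uri()
print(first, second, other)
```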

> This first question in my mind is whether the DM should handle parsing of
> the XML files or should it just pass/take whole xml files. The interface

Heh.  That's an issue we've been discussing for a while.  We're using PAOS to
pass Python objects, but PAOS doesn't handle XML.  We do need XML, but can we
use XML and do without PAOS?  What if PAOS included an XML parser?  That would
make things simpler.  PAOS is actually good for letting the user graphically
monitor the progress of analyses.  XML is of course best for data management. 
We're working on a system that uses both.

But as far as Loci is concerned, can we make it so that the XML types (DTD's)
are not hard-coded into Loci?  What if each locus were responsible for finding
its own XML parser/translator?  That would pretty much make Loci a general
purpose command-line wrapper.  Just thinking out loud.

> would be simpler if the DM just passed the whole xml file (in other words
> the interface deals with xml entities only and not parts of entities).  On
> the other hand, if a user just needs one part of the entity, the whole
> entity must still be transferred over a potentially slow network
> connection.

That's a good point.  We haven't considered slow networks or very large files.

> The alternative is to have the xml parsing take place on the
> DM side of the interface.  This would allow for only the requested element
> to be transferred. In addition, only the requested portion from a non xml
> data set would have to be wrapped in xml. So despite the additional
> complexity, it would seem that having the interface be aware of xml parts
> and not just entities would be highly advantageous.
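
The two options can be compared with a short sketch.  Everything here is
assumed for illustration: the in-memory store, the record contents and the
function names fetch_entity and fetch_part are made up, and a real DM would
sit behind a network interface.

```python
import xml.etree.ElementTree as ET

# Made-up record standing in for an entity held by a DM.
STORE = {
    "U29875": "<entry><description>example record</description>"
              "<sequence>ATGCATGC</sequence></entry>",
}


def fetch_entity(store, entity_id):
    """Option 1: the DM passes the whole XML entity, unparsed."""
    return store[entity_id]


def fetch_part(store, entity_id, path):
    """Option 2: the DM parses the entity and returns one element only."""
    root = ET.fromstring(store[entity_id])
    element = root.find(path)
    if element is None:
        raise KeyError(path)
    return ET.tostring(element, encoding="unicode")


whole = fetch_entity(STORE, "U29875")
part = fetch_part(STORE, "U29875", "sequence")
print(len(whole), len(part))
```

The second function costs the DM a parse, but less data crosses the
connection when the caller wants a single element.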

Hmmm.  I'd like to hear what Justin, Humberto et al think about that.


:-)
Jeff
-- 
J.W. Bizzaro                  mailto:bizzaro at bc.edu
Boston College Chemistry      http://www.uml.edu/Dept/Chem/Bizzaro/
--



