[Pipet Devel] Data Storage Interfaces

Mon Jun 21 13:05:14 EDT 1999

<APOLOGY>Sorry for not responding to my own post sooner.  I had to switch
         email addresses and in the process didn't receive any of the
         follow-up posts.</APOLOGY>

Jeff wrote:
> (2) GST requires Oracle DB; Loci uses PAOS and a simple DB

I guess I didn't present this clearly enough.  My thought for GST was to
define an interface for data retrieval and storage and then provide
several standard data manager (DM) plug-ins that implement this interface.
(Initially it would probably be a local filesystem plug-in and a db
plug-in [either mysql or acedb].) In many ways, it sounds similar to what
is being talked about for Loci.  I guess the only difference is that the
GST approach would be limited to local plug-in based access (however, a
plug-in that utilized CORBA could be implemented). The advantage, from what
I can tell, of using a local plug-in approach as opposed to a PAOS approach
is that you are not tied to a specific transfer technology; you are only
limited in that the endpoint/startpoint on the client side must be a
plug-in that implements the DM interface.
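
To make the plug-in idea concrete, here is a minimal Python sketch. It is
only an illustration under assumptions: the names DataManager,
FilesystemDM, MemoryDM and round_trip are made up for this example and are
not part of GST or Loci, and MemoryDM merely stands in for a db plug-in.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class DataManager(ABC):
    """Hypothetical DM interface: the only thing client code depends on."""

    @abstractmethod
    def store(self, key: str, data: bytes) -> None:
        ...

    @abstractmethod
    def retrieve(self, key: str) -> bytes:
        ...


class FilesystemDM(DataManager):
    """Plug-in that keeps each item as one file under a root directory."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def store(self, key, data):
        (self.root / key).write_bytes(data)

    def retrieve(self, key):
        return (self.root / key).read_bytes()


class MemoryDM(DataManager):
    """Stand-in for a db plug-in (assumption: a real one would talk to a db)."""

    def __init__(self):
        self._items = {}

    def store(self, key, data):
        self._items[key] = data

    def retrieve(self, key):
        return self._items[key]


def round_trip(dm: DataManager, key: str, data: bytes) -> bytes:
    """Client code: works with any plug-in that implements DataManager."""
    dm.store(key, data)
    return dm.retrieve(key)


with tempfile.TemporaryDirectory() as tmp:
    from_disk = round_trip(FilesystemDM(tmp), "seq1", b"ACGT")
from_memory = round_trip(MemoryDM(), "seq1", b"ACGT")
```

The client function never learns where the data lives, which is the point:
swapping the transfer technology means swapping the plug-in, not the
client.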

Jeff wrote:
> And the GST DM is something like Loci's server/daemon, "Locid".
Sort of (I am probably not being careful about consistently using the term
DM). So the approach would be:
    1) A well defined DM interface (which operates on the client side)
    2) DM Plug-in(s) which implement the DM Interface (these also operate
       on the client side)
    3) Depending on the DM Plug-in, various backend or middle end
       "servers" may need to be implemented.
So with Loci, it sounds like the middle end and backend "server"
technologies are fixed, whereas with a plug-in approach only the plug-in
interface is fixed.

Jeff wrote:
> In Loci, a single XML document "travels" a workpath, so everything is 
> done serially (within one path). The document will collect various XML's 

So do you mean that the original data, and analysis results, etc will all
accumulate in one xml document? This doesn't sound like a good idea, but I
am probably misunderstanding what you are saying. Why might this be a bad
idea (I am thinking as I write, so don't flame me too hard ;o):

   1) Complicates data locking in a multi-user model
   2) Increases server load by forcing parsing of unneeded data (see recent
      post from bioperl with comment on server side XML parsing). If
      parsing isn't done on the server side, then you have the issue of
      having to transfer all that combined data.
It sounds better to me to just implement a robust cross referencing
mechanism and treat each data object as just one "item" (i.e. blast
results, or a sequence, or a restriction map). Then let the backend server
store the data as it sees fit (i.e. as one huge flat file, as individual
database entries, as individual files, ...).

This also simplifies the issue of redoing analysis (which I would assume
would be a common task -- i.e. monthly blast searches on the same
query sequence).
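
A small Python sketch of the cross-referencing idea, under assumptions: the
DataObject and ObjectStore names, the id scheme ("seq:1", "blast:1999-05")
and the payload strings are invented for illustration, and the in-memory
dict stands in for whatever storage the backend actually chooses.

```python
from dataclasses import dataclass, field


@dataclass
class DataObject:
    """One "item" (a sequence, a blast result, ...) plus its cross references."""
    obj_id: str
    kind: str
    payload: str
    refs: list = field(default_factory=list)  # ids of the objects this one refers to


class ObjectStore:
    """Hypothetical backend; how it stores objects is hidden from callers."""

    def __init__(self):
        self._objects = {}

    def put(self, obj):
        self._objects[obj.obj_id] = obj

    def get(self, obj_id):
        return self._objects[obj_id]

    def referrers(self, obj_id, kind=None):
        """Return all objects that point at obj_id, optionally filtered by kind."""
        return [
            o for o in self._objects.values()
            if obj_id in o.refs and (kind is None or o.kind == kind)
        ]


store = ObjectStore()
store.put(DataObject("seq:1", "sequence", "ACGTTGCA"))
# Each monthly run is its own object that references the same query sequence.
store.put(DataObject("blast:1999-05", "blast_result", "hits for May", refs=["seq:1"]))
store.put(DataObject("blast:1999-06", "blast_result", "hits for June", refs=["seq:1"]))

runs = [o.obj_id for o in store.referrers("seq:1", kind="blast_result")]
```

Redoing the search just adds another small object; the query sequence is
never copied or re-parsed, and each object can be locked on its own.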

Jeff wrote:
> But as far as Loci is concerned, can we make it so that the XML types
> (DTD's) are not hard-coded into Loci?  What if each locus were
> responsible for finding its own XML parser/translator?  That would
> pretty much make Loci a general purpose command-line wrapper.

Definitely! To take it to a further extreme, Loci should provide the XML
parsing and a SAX-like interface to the data.  Then each locus doesn't have
to implement an XML parser, only the SAX callbacks to handle the XML data.
Then Loci could handle any XML data for which a locus is provided, and it
could even handle data in the absence of a DTD. (The advantage of the SAX
interface over passing the whole tree is that memory isn't wasted on
building parts of the tree that aren't needed.) I am not really familiar
with the SAX interface, but hints could be provided so that only elements
the locus wants are passed through the SAX callbacks.
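
The "hints" idea could look roughly like this, using the SAX parser from
Python's standard library (xml.sax). This is a sketch under assumptions:
the FilteringHandler class, the element names and the sample document are
made up, and nested wanted elements are not handled.

```python
import xml.sax


class FilteringHandler(xml.sax.ContentHandler):
    """Forwards only the elements a locus asked for to that locus's callback."""

    def __init__(self, wanted, callback):
        super().__init__()
        self.wanted = set(wanted)   # the "hints": element names the locus wants
        self.callback = callback    # called as callback(element_name, text)
        self._current = None
        self._text = []

    def startElement(self, name, attrs):
        if name in self.wanted:
            self._current = name
            self._text = []

    def characters(self, content):
        # Text is buffered only while inside a wanted element.
        if self._current is not None:
            self._text.append(content)

    def endElement(self, name):
        if name == self._current:
            self.callback(name, "".join(self._text).strip())
            self._current = None


# Sample document with no DTD; only <sequence> is of interest to this locus.
DOC = b"""<entry>
  <sequence>ACGTTGCA</sequence>
  <blast_result>many hits this locus does not care about</blast_result>
  <restriction_map>EcoRI at 3</restriction_map>
</entry>"""

seen = []
handler = FilteringHandler({"sequence"}, lambda name, text: seen.append((name, text)))
xml.sax.parseString(DOC, handler)
```

The parser still has to scan the whole document, but no tree is built and
the locus only ever sees the one element it asked for.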

Justin's comments are basically what I was thinking.  Rather than passing
a program all the data, just pass it a reference (maybe an XLink/XPointer)
to the input data. Under the plug-in model, plug-ins for networked data
could implement a cache mechanism to speed up access.
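
As a rough Python sketch of passing a reference plus a caching plug-in (the
names CachingDM, slow_remote_fetch and run_analysis are hypothetical, and a
plain string id stands in for an XLink/XPointer):

```python
class CachingDM:
    """Hypothetical plug-in wrapper that remembers data it has already fetched."""

    def __init__(self, backend_fetch):
        self._fetch = backend_fetch   # function: reference -> bytes
        self._cache = {}

    def retrieve(self, ref):
        if ref not in self._cache:
            self._cache[ref] = self._fetch(ref)   # only hit the network once
        return self._cache[ref]


calls = []   # records every trip to the (pretend) remote server


def slow_remote_fetch(ref):
    """Stand-in for a networked backend; a real one would do I/O here."""
    calls.append(ref)
    return f"data for {ref}".encode()


def run_analysis(dm, input_ref):
    """The program receives a reference and resolves it itself."""
    data = dm.retrieve(input_ref)
    return len(data)


dm = CachingDM(slow_remote_fetch)
first = run_analysis(dm, "seq:1")
second = run_analysis(dm, "seq:1")   # served from the cache
```

Cache invalidation is left out here; a real plug-in would need some way to
notice that the referenced data has changed on the server.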

-Alan

************************************************************************  
Alan Williams           
------------------------------------------------------------------------  
University of California, Riverside   "Where observation is concerned,
Dept. of Botany and Plant Sciences     chance favors the prepared mind."  
Alan at TheWilliamsFamily.org                         -- Louis Pasteur
************************************************************************