[Pipet Devel] Data Storage Interfaces

Alan J. Williams Alan at TheWilliamsFamily.org
Tue Jun 1 15:16:10 EDT 1999


Hello Locians!

I am working on a package similar to Loci called the General Sequence
Toolkit (GST).  Currently I am designing the storage interface for the GST
package.  I thought I'd share some of my ideas on this list for the
benefit of both GST and Loci.  Warning: I am not a computer science
person but a molecular biology person, so I may miss the obvious (but
hopefully not :o).

I am loosely defining the "Data Manager" or DM as the interface and backend
(server-side) code for managing bioinformatics data of various types.
Some of the design goals for the DM:

   1) Allow for extensibility (i.e., DM plug-ins)
   2) Simple, general, but sufficient interface
   3) Minimize transfer of unneeded data
   4) Allow for relocation of data
   5) Allow for read-only access
   6) Enable wrappers for common non-XML, non-GST data sets (e.g., Genbank)
   7) Allow multi-user access
   8) The interface should not assume anything regarding the data except
      that it is in an XML format.
   9) Enable network transparency
   10) Simple and robust cross-referencing
   11) ???

In the most basic sense, GST would not have a DM but rather a DM interface
to DM plug-ins.  Some examples:

   1) Genbank/Entrez DM would consist of a local plug-in that provides the
      program with read-only access to NCBI's databases over the Internet.
      The non-xml genbank entries would automatically be wrapped/converted
      into xml by the DM.

   2) Intranet Oracle or AceDB database DM would consist of a local
      plug-in (as well as a server for the plug-in possibly). The plug-in
      would handle the network transparency as well as wrapping/converting
      the data to xml. 

   3) A file system DM would consist of a plug-in for GST as well as a
      server for the plug-in.  Transfer of data from the file system to
      GST would be handled by a socket between the plug-in and the server.
      When a user starts up GST, if a filesystem DM for his/her personal
      GST directory is not running, one is automatically started.  If the
      user is on another computer, they can still access their personal
      GST directory as long as the file system DM is running on the same
      computer as the personal GST directory tree. 

So rather than working with files straight off the filesystem (as in GDE),
a file or data set would have to be imported into one of the available DMs
before use in GST.
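
To make the "DM interface to DM plug-ins" idea concrete, here is a minimal
Python sketch of a plug-in registry.  Everything in it (the class names, the
`register` decorator, the `get_dm` function, the plug-in names) is
hypothetical and only illustrates the shape of the idea; it is not GST code.

```python
class DMPlugin:
    """Base class for hypothetical DM plug-ins."""
    name = None
    read_only = False

    def describe(self):
        mode = "read-only" if self.read_only else "read/write"
        return f"{self.name} ({mode})"


# Maps a plug-in name to the class that implements it.
REGISTRY = {}


def register(plugin_cls):
    """Class decorator that makes a plug-in available by name."""
    REGISTRY[plugin_cls.name] = plugin_cls
    return plugin_cls


@register
class GenbankDM(DMPlugin):
    # Would wrap remote, non-XML entries; read-only by nature.
    name = "genbank"
    read_only = True


@register
class FileSystemDM(DMPlugin):
    # Would talk to a per-user file system server over a socket.
    name = "filesystem"


def get_dm(name):
    """Instantiate the plug-in registered under `name`."""
    plugin_cls = REGISTRY.get(name)
    if plugin_cls is None:
        raise LookupError(f"no DM plug-in named {name!r}")
    return plugin_cls()


print(get_dm("genbank").describe())
```

The client only ever calls `get_dm`, so adding an Oracle or AceDB DM would
mean registering one more class rather than changing the client.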

So with that overall model in mind, there are several issues that I can
think of:

1) Is this a good model?
2) What are the functions that the interface must handle?
3) How will robust cross referencing be done?  (E.g., I am working on a
   project where all the sequences are available through the Oracle DM and
   the BLAST results are available through the filesystem DM.)

Thoughts on the cross referencing issue:

Using URIs along with XLinks and XPointers is probably the best way, given
that the data model is centered on XML.  Along those lines, the URI might
look like:

       gstoracle://oracle.server.bio.com:85/genbank/pri/U29875

where gstoracle defines the data access method (which DM plugin to use), 
oracle.server.bio.com:85 is the host computer for the data (and port
number), and genbank/pri/U29875 is the remaining portion of the reference
to get to the desired sequence. 
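
As a small illustration, the three parts described above can be pulled out
of such a URI with Python's standard `urllib.parse`.  The function name
`parse_gst_uri` is made up for this sketch.

```python
from urllib.parse import urlsplit


def parse_gst_uri(uri):
    """Split a GST-style URI into (access method, host, port, path)."""
    parts = urlsplit(uri)
    # parts.port is None when the URI carries no explicit port.
    return parts.scheme, parts.hostname, parts.port, parts.path.lstrip("/")


method, host, port, path = parse_gst_uri(
    "gstoracle://oracle.server.bio.com:85/genbank/pri/U29875")
print(method, host, port, path)
```

The access method would select the DM plug-in, the host and port would tell
that plug-in where to connect, and the remaining path would be handed to the
DM unchanged.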

So working from this, we should allow for aliases of the
protocol://hostname:port portion.  This is necessary if we are going to
simplify the process of relocating data.  The question is where and how
these alias lists should be maintained.  One possibility is to have an
alias list maintained by each client-side GST installation.  Another
possibility is to have the server side of each DM maintain its own alias
list.  A third option is to have both the client-side installations and
the DM server sides maintain alias lists.  The fourth option (which I am
leaning toward) is that each DM would provide its own alias upon request,
and the GST client program would update its own personal list of aliases
on startup.  This way the DMs could request the alias for cross
references and store just the alias.  Example:

      gstfile://server.bio.com/g349f7 would be converted to
      gstserver1/g349f7 before storing this reference.

      If the gstserver1 data were then moved to an Oracle database
      on a different computer, the administrator would only need to
      edit the personal alias lists on the DM and client installations
      in order to force the initial lookup.  No editing of the actual
      data would be needed.
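
The alias scheme in the example above could be sketched in Python roughly as
follows.  The `AliasTable` class and its method names are hypothetical, and
a real list would have to be loaded from the DMs or from a per-user file.

```python
class AliasTable:
    """Maps short aliases to the protocol://hostname:port part of a URI."""

    def __init__(self):
        self._locations = {}  # alias -> "protocol://hostname[:port]"

    def register(self, alias, location):
        """Add an alias, or repoint an existing one after data is moved."""
        self._locations[alias] = location.rstrip("/")

    def to_alias(self, uri):
        """Convert a full URI into the alias form that gets stored."""
        for alias, location in self._locations.items():
            if uri.startswith(location + "/"):
                return alias + uri[len(location):]
        raise KeyError(f"no alias covers {uri}")

    def to_uri(self, reference):
        """Expand a stored alias reference back into a full URI."""
        alias, _, rest = reference.partition("/")
        return self._locations[alias] + "/" + rest


aliases = AliasTable()
aliases.register("gstserver1", "gstfile://server.bio.com")
stored = aliases.to_alias("gstfile://server.bio.com/g349f7")
print(stored)

# The data moves to an Oracle DM: only the alias table changes.
aliases.register("gstserver1", "gstoracle://oracle.server.bio.com:85")
print(aliases.to_uri(stored))
```

The stored reference stays the same before and after the move, which is the
point of keeping only the alias in the data.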

As for encoding these links, it would make sense to use XLinks to refer
to XML entities and XPointers to reference within entities.  In addition,
the DM would be responsible for assigning new IDs so that IDs are unique
within each DM.  Hence the URI for a GST XML resource would include the
ID and would itself be unique.
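
For illustration only, a cross reference of this kind could be built with
Python's standard `xml.etree.ElementTree`.  The element name `seqref`, the
ID `feature2`, and the `make_link` helper are invented for this sketch, and
it assumes a simple XLink whose fragment names an element ID inside the
target entity.

```python
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("xlink", XLINK_NS)


def make_link(tag, target_ref, element_id=None):
    """Build an element carrying a simple XLink to another GST resource.

    target_ref is an alias-based reference such as "gstserver1/g349f7";
    element_id, if given, is appended as a fragment to point inside it.
    """
    href = target_ref if element_id is None else f"{target_ref}#{element_id}"
    return ET.Element(tag, {
        f"{{{XLINK_NS}}}type": "simple",
        f"{{{XLINK_NS}}}href": href,
    })


link = make_link("seqref", "gstserver1/g349f7", "feature2")
print(ET.tostring(link, encoding="unicode"))
```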

So the next issue is the interface:

The first question in my mind is whether the DM should handle parsing of
the XML files or should just pass/take whole XML files.  The interface
would be simpler if the DM just passed the whole XML file (in other words,
the interface deals with XML entities only and not parts of entities).  On
the other hand, if a user needs just one part of the entity, the whole
entity must still be transferred over a potentially slow network
connection.  The alternative is to have the XML parsing take place on the
DM side of the interface.  This would allow only the requested element
to be transferred.  In addition, only the requested portion of a non-XML
data set would have to be wrapped in XML.  So despite the additional
complexity, it would seem that having the interface be aware of XML parts
and not just entities would be highly advantageous.
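
A rough Python sketch of the difference, using the standard
`xml.etree.ElementTree` on the DM side.  The `<sequence>` document and the
function `fetch_element` are made up for the example; GST has no defined
schema yet.

```python
import xml.etree.ElementTree as ET

# A made-up entity as it might be stored inside a DM.
ENTITY = """<sequence id="g349f7">
  <name>example</name>
  <residues>ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG</residues>
  <annotation>hypothetical protein</annotation>
</sequence>"""


def fetch_element(xml_text, path):
    """Parse on the DM side and return only the requested element as XML."""
    root = ET.fromstring(xml_text)
    element = root.find(path)
    if element is None:
        raise KeyError(path)
    # tostring() also emits the whitespace that follows the element.
    return ET.tostring(element, encoding="unicode").strip()


part = fetch_element(ENTITY, "name")
print(part)
print(len(part), "characters sent instead of", len(ENTITY))
```

With whole-entity transfer the client would receive all of `ENTITY` and
parse it locally; here only the requested element crosses the connection.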

So what are some of the necessary interface functions?

1) Reserve/request unique ID/URI
2) Lock/unlock URI
3) Revoke a URI lock
4) Initialize/Login
5) Close/Shutdown
6) Delete Entity/Element
7) Update Entity/Element
8) Add Element
9) Search which returns an XML document with extended xlinks for hits
10) Ability to return browse lists
11) Retrieve alias
12) Retrieve DM Info:
    * Name (URI)
    * Description
    * Read only or Read/Write
    * ???
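
To give a feel for how a few of these functions might fit together, here is
a minimal Python sketch covering only ID reservation, locking,
update/retrieve/delete, and DM info.  The class and method names are
hypothetical, and the in-memory storage is a stand-in for a real backend.

```python
from abc import ABC, abstractmethod
import itertools


class DataManager(ABC):
    """Subset of the DM interface; every plug-in would implement this."""

    @abstractmethod
    def reserve_id(self): ...

    @abstractmethod
    def lock(self, uri, user): ...

    @abstractmethod
    def unlock(self, uri, user): ...

    @abstractmethod
    def update(self, uri, xml_text, user): ...

    @abstractmethod
    def retrieve(self, uri): ...

    @abstractmethod
    def delete(self, uri, user): ...

    @abstractmethod
    def info(self): ...


class InMemoryDM(DataManager):
    """Toy DM that keeps entities in a dict; for illustration only."""

    def __init__(self, alias, read_only=False):
        self.alias = alias
        self.read_only = read_only
        self._counter = itertools.count(1)
        self._entities = {}  # uri -> XML text
        self._locks = {}     # uri -> user holding the lock

    def reserve_id(self):
        # IDs only need to be unique within this DM.
        return f"{self.alias}/g{next(self._counter)}"

    def lock(self, uri, user):
        # True if `user` holds the lock after the call.
        return self._locks.setdefault(uri, user) == user

    def unlock(self, uri, user):
        if self._locks.get(uri) == user:
            del self._locks[uri]

    def _check_write(self, uri, user):
        if self.read_only:
            raise PermissionError("this DM is read-only")
        if self._locks.get(uri) != user:
            raise PermissionError(f"{user} does not hold the lock on {uri}")

    def update(self, uri, xml_text, user):
        self._check_write(uri, user)
        self._entities[uri] = xml_text

    def retrieve(self, uri):
        return self._entities[uri]

    def delete(self, uri, user):
        self._check_write(uri, user)
        del self._entities[uri]

    def info(self):
        return {"name": self.alias, "read_only": self.read_only}


dm = InMemoryDM("gstserver1")
uri = dm.reserve_id()
dm.lock(uri, "alan")
dm.update(uri, "<sequence/>", "alan")
print(uri, dm.retrieve(uri), dm.info())
```

Search, browse lists, lock revocation, and login/shutdown are left out to
keep the sketch short.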

Sorry about the lengthy posting, but hopefully it will stimulate some
profitable discussion.

-Alan

************************************************************************  
Alan Williams           (finger alan at avocado.ucr.edu for pgp public key)
------------------------------------------------------------------------  
University of California, Riverside   "Where observation is concerned,
Dept. of Botany and Plant Sciences     chance favors the prepared mind."  
Alan at TheWilliamsFamily.org                       -- Louis Pasteur
************************************************************************





