[Pipet Devel] Loci and data storage

Wed Sep 22 23:58:42 EDT 1999

Gary Van Domselaar wrote:
> 
> I don't think this aspect of the Loci core has been very thoroughly
> addressed.  Does anyone have any ideas on how we might implement data
> storage for Loci?

In early conversations, we realized the need to split the data into two basic
types according to how they would be managed:

  (1) Data kept (as XML?) on the filesystem: mostly for storage; data
      are not being passed via the Loci system

  (2) Data kept as (CORBA) objects: data that are being passed via Loci
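
To make the split concrete, here is a rough Python sketch.  It is only an
illustration under assumptions: a plain Python object stands in for the CORBA
object, and the names (SequenceRecord, save_as_xml, load_from_xml, roundtrip)
and the <sequence> element are made up, not part of Loci.

```python
import os
import tempfile
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class SequenceRecord:
    """Type (2): a live object, standing in for what Loci would pass around."""
    name: str
    residues: str


def save_as_xml(record, path):
    """Type (1): write the record to the filesystem as XML, for storage only."""
    root = ET.Element("sequence", name=record.name)
    root.text = record.residues
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)


def load_from_xml(path):
    """Turn a stored XML file back into a live object."""
    root = ET.parse(path).getroot()
    return SequenceRecord(name=root.get("name"), residues=root.text or "")


def roundtrip(record):
    """Store a record in a temporary file, read it back, and clean up."""
    fd, path = tempfile.mkstemp(suffix=".xml")
    os.close(fd)
    try:
        save_as_xml(record, path)
        return load_from_xml(path)
    finally:
        os.remove(path)
```

The point is only that the stored form and the passed form are different
representations of the same data, with a conversion step in between.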

Alan then proposed the concept of pluggable/modular 'data managers'.  Each DM
would manage data of one specific type and is pretty much synonymous with what I
have been calling a 'translator', which converts data from one format to
another, plus the underlying infrastructure (which will actually be handled by
CORBA).
Here is an excerpt from Alan's e-mail on June 1 (this is in the archive; GST
refers to General Sequence Toolkit):

---------------
[snip]
I am loosely defining the "Data Manager" or DM as the interface and backend
or server side code for managing bioinformatics data of various types.
Some of the design goals for the DM:

   1) Allow for extensibility (i.e., DM plug-ins)
   2) Simple, general, but sufficient interface
   3) Minimize transfer of unneeded data
   4) Allow for relocation of data
   5) Allow for read-only access
   6) Enable wrappers for common non-XML, non-GST data sets (e.g., GenBank)
   7) Allow multi-user access
   8) The interface should not assume anything regarding the data except
      that it is in an XML format.
   9) Enable network transparency
   10) Simple and robust cross-referencing
   11) ???

In the most basic sense, GST would not have a DM but rather a DM interface
to DM plug-ins.  Some examples:

   1) GenBank/Entrez DM would consist of a local plug-in that provides the
      program with read-only access to NCBI's databases over the Internet.
      The non-XML GenBank entries would automatically be wrapped/converted
      into XML by the DM.

   2) Intranet Oracle or AceDB database DM would consist of a local
      plug-in (as well as a server for the plug-in possibly). The plug-in
      would handle the network transparency as well as wrapping/converting
      the data to XML.

   3) A file system DM would consist of a plug-in for GST as well as a
      server for the plug-in.  Transfer of data from the file system to
      GST would be handled by a socket between the plug-in and the server.
      When a user starts up GST, if a filesystem DM for his/her personal
      GST directory is not running, one is automatically started.  If the
      user is on another computer, they can still access their personal
      GST directory as long as the file system DM is running on the same
      computer as the personal GST directory tree. 
[snip]
---------------
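
As a rough illustration of the plug-in idea in the excerpt (goals 1, 5, 6 and
8 in particular), a DM interface might be sketched in Python as below.  This is
an assumption-laden sketch, not Alan's design or anything in GST or Loci: the
class names (DataManager, FlatFileDM, MemoryDM), the PLUGINS registry, and the
toy "KEY  value" record format are all hypothetical, and network transparency
is left out entirely.

```python
from abc import ABC, abstractmethod
import xml.etree.ElementTree as ET


class DataManager(ABC):
    """Hypothetical plug-in interface: callers only ever see XML strings."""

    read_only = False

    @abstractmethod
    def fetch(self, key):
        """Return the entry stored under `key` as an XML string."""

    def store(self, key, xml_text):
        """Save an XML entry; read-only plug-ins refuse."""
        if self.read_only:
            raise PermissionError(f"{type(self).__name__} is read-only")
        self._write(key, xml_text)

    def _write(self, key, xml_text):
        raise NotImplementedError


class FlatFileDM(DataManager):
    """Read-only wrapper around non-XML 'KEY  value' records.

    The line format is a toy stand-in, not the real GenBank format.
    """

    read_only = True

    def __init__(self, entries):
        self._entries = entries  # maps key -> flat-file text

    def fetch(self, key):
        root = ET.Element("entry", id=key)
        for line in self._entries[key].splitlines():
            parts = line.split(None, 1)  # field name, then the rest
            if not parts:
                continue  # skip blank lines
            value = parts[1].strip() if len(parts) > 1 else ""
            ET.SubElement(root, parts[0].lower()).text = value
        return ET.tostring(root, encoding="unicode")


class MemoryDM(DataManager):
    """Writable plug-in that keeps XML entries in a dictionary."""

    def __init__(self):
        self._entries = {}

    def fetch(self, key):
        return self._entries[key]

    def _write(self, key, xml_text):
        ET.fromstring(xml_text)  # reject anything that is not well-formed XML
        self._entries[key] = xml_text


# Registry of plug-ins, looked up by a short source name.
PLUGINS = {}


def register(name, manager):
    PLUGINS[name] = manager


def fetch(name, key):
    """Dispatch a request to whichever plug-in handles `name`."""
    return PLUGINS[name].fetch(key)
```

The caller never learns how a plug-in stores its data; it asks for a key and
gets XML back, which is the only assumption the interface makes.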

There is very little difference between what Alan is talking about here and
what we are planning now (except that we are using CORBA and leaning away from
making our own bio-XML).  In fact, much of the most recent design for Loci
comes from Alan's description of the GST.

> I'm no database expert, so I'm a little hesitant to
> suggest how we should go about it, but it does seem important to me that
> Loci should be able to store analysis results in a relational database
> (Informax's VectorNTI uses a relational database to store its data).

Hmmm.  That holds provided (1) there are numerous analysis results to be stored
and 'related', and (2) the user needs to store the results this way.  Are you
suggesting this as an option or as a standard way of storing everything?

> This
> would facilitate the construction of customized, sharable databases.  In
> keeping with Loci's philosophy of not adopting specific data formats, it
> seems to me that Loci should probably not adopt a single database, but
> rather have the capability to interface with any of the popular
> databases, such as Oracle, SQL, MySQL, etc.  PHP has this capability, and
> is one of the biggest reasons why it is so successful.

Right.  I think that is exactly what Alan was suggesting with the data manager
proposal.  But again, I'm not sure everything has to be put in some database. 
What do you guys think?
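
For the "interface with any popular database" idea, one possible shape is a
small storage class written against Python's generic DB-API rather than one
particular product.  The sketch below is hypothetical: the ResultStore class
and the results table are invented for illustration, and sqlite3 is used only
because it ships with Python.  It assumes '?' placeholders, so a driver with a
different parameter style would need the SQL strings adjusted.

```python
import sqlite3


class ResultStore:
    """Store analysis results through a DB-API 2.0 connection.

    Assumes the driver uses '?' placeholders, as sqlite3 does.
    """

    def __init__(self, connection):
        self.conn = connection
        cur = self.conn.cursor()
        cur.execute(
            "CREATE TABLE IF NOT EXISTS results ("
            "sequence TEXT, analysis TEXT, output TEXT)"
        )
        self.conn.commit()

    def add(self, sequence, analysis, output):
        """Record the output of one analysis run on one sequence."""
        cur = self.conn.cursor()
        cur.execute(
            "INSERT INTO results (sequence, analysis, output) VALUES (?, ?, ?)",
            (sequence, analysis, output),
        )
        self.conn.commit()

    def for_sequence(self, sequence):
        """Return (analysis, output) pairs recorded for one sequence."""
        cur = self.conn.cursor()
        cur.execute(
            "SELECT analysis, output FROM results "
            "WHERE sequence = ? ORDER BY analysis",
            (sequence,),
        )
        return cur.fetchall()
```

Usage would be along the lines of ResultStore(sqlite3.connect(":memory:")),
followed by add(...) and for_sequence(...) calls.  Whether results belong in a
database at all is still the open question above.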

Cheers.
Jeff
-- 
                         +----------------------------+
                         |        J.W. Bizzaro        |
                         | jeff at bioinformatics.org |
                         |                            |
                         |        THE OPEN LAB        |
                         | Open Source Bioinformatics |
                         |                            |
                         | http://bioinformatics.org/ |
                         +----------------------------+