[Pipet Devel] WHAX and Loci storage ideas

Brad Chapman chapmanb at arches.uga.edu
Sun Dec 5 13:53:54 EST 1999

Oh Great Locians;
	Hello! I have been doing some more thinking about data storage for
loci and along these lines have read up on the WHAX stuff. Below I kind of
give a quick overview of WHAX (for those of you who didn't like the looks
of the 60 page technical document about it!) based on what I was able to
get out of it (since I'm not a database expert). Then I follow up with a
plan for data storage for Loci based on WHAX ideas, info from the archives,
and my own random ideas. Sorry it's so long, but I would really be
interested in hearing everyone's comments if they can make it all of the
way through!

WHAX (Warehouse Architecture for XML)
	Basically, this is a technical document detailing the
implementation of WHAX. What WHAX is designed to do is to take
information from a data source, which can be either a database or an XML
document, and represent it as an "XML Warehouse." This XML Warehouse
contains specific information from a database which has been selected by
the user. For instance, if you had a database full of books you've read,
you could create an XML warehouse of all of the books you've read that
were written by Stephen King. Some key characteristics of an XML Warehouse
are that it is in XML format and is represented by a tree structure. So
based on my limited XML knowledge, this seems analogous to a Document
Object Model (DOM).
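To make the warehouse idea concrete, here is a minimal runnable sketch of extracting such a selected view from an XML source, using the book example above. The element names, toy data, and selection function are my own illustration--WHAX defines its own query machinery, which this does not attempt to reproduce:

```python
# Sketch of the "XML warehouse" idea: extract a user-selected subset of an
# XML data source into a new XML tree (the warehouse/view).
import xml.etree.ElementTree as ET

SOURCE = """
<library>
  <book><author>Stephen King</author><title>The Shining</title></book>
  <book><author>Frank Herbert</author><title>Dune</title></book>
  <book><author>Stephen King</author><title>It</title></book>
</library>
"""

def build_warehouse(source_xml, author):
    """Return a new XML tree holding only the books by one author."""
    library = ET.fromstring(source_xml)
    warehouse = ET.Element("warehouse")
    for book in library.findall("book"):
        if book.findtext("author") == author:
            warehouse.append(book)  # copy the selected subtree into the view
    return warehouse

king = build_warehouse(SOURCE, "Stephen King")
print([b.findtext("title") for b in king.findall("book")])
```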
	What WHAX does is define a method for maintaining this XML
Warehouse. This upkeep is unlike the upkeep of ordinary databases because
XML is in a semi-structured format--the paper describes it as
"self-describing, irregular data." The paper details methods for changing
the XML warehouse when new data is added or removed, and for keeping the
warehouse consistent with changes in the underlying database the XML
warehouse got its information from.
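The incremental-maintenance idea can be sketched like so: when the underlying source changes, apply only the matching delta to the warehouse instead of rebuilding the whole view. The selection condition and element names here are my own illustrative assumptions, not the actual WHAX algorithms:

```python
import xml.etree.ElementTree as ET

def matches(book, author):
    """The warehouse's selection condition (illustrative)."""
    return book.findtext("author") == author

def on_insert(warehouse, book, author):
    """Source gained a record: mirror it into the warehouse if it matches."""
    if matches(book, author):
        warehouse.append(book)

def on_delete(warehouse, title):
    """Source lost a record: drop the corresponding warehouse entry."""
    for book in warehouse.findall("book"):
        if book.findtext("title") == title:
            warehouse.remove(book)

warehouse = ET.Element("warehouse")
new_book = ET.fromstring(
    "<book><author>Stephen King</author><title>Misery</title></book>")
on_insert(warehouse, new_book, "Stephen King")
after_insert = len(warehouse.findall("book"))
on_delete(warehouse, "Misery")
after_delete = len(warehouse.findall("book"))
print(after_insert, after_delete)
```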

Data Storage in Loci
	Reading through this document got me thinking about how this could
be applied to Loci and I came up with the following model of data storage
in Loci.

To make things simpler in my head, I split the data storage needs of Loci
(according to my, hopefully correct!, model of Loci) into three categories:

1. The data that comes in as a document (for instance, a set of
	sequences in FASTA format). These are the input files provided by
the user.

2. The actual setup of a workflow diagram--the underlying structure of the
diagram (how all of the loci are connected together). This is supplied by
the user in the workflow diagram by connecting all of the dots together and
constructing the command-lines (in the words of Jeff!).

3. The internal XML warehouse (to use my new WHAX-learned term!). This
would be a subset of the supplied data (1.) that is passed from locus to
locus according to the workflow diagram. Jeff describes this very well
(Data Storage Interfaces--June 11) as an XML document that travels from
locus to locus and changes XML formats (i.e. changes to different document
structures according to the specific DTD (document type definition) needed
at that locus).

Each of these points has specific storage needs, so I have come up with a
separate plan for each of them:

1. Input Data: Since the user supplied this data, it is their choice to
determine how they want to deal with it. If they want to store it as a
backup in a database of some sort, then they can do this through the work
flow diagram. So the data can be stored in a 'plug-in' database (which Gary
and Jeff have mentioned). This type of interface/data storage component
isn't "essential" to the functioning of Loci, so I will go on to the
essential data storage needs.

2. Workflow Data: Loci will need a method to store the user-defined
workflow diagram. This diagram includes: 1. the setup of the workflow
diagram (how everything is connected together) 2. the constructed command
line for each program 3. more??? This is the kind of storage need I was
thinking about when I wrote my incoherent message a couple of days ago
about trees and graphs. Basically, my thinking is that we can stick all of
the information from a workflow diagram into a data structure, and then
move through this structure in the specified order to execute the contents
of the workflow diagram. My new data structure of choice is a flow network
(still from Intro to Algorithms). Basically I think each element of the
network would have a setup kind of like the following sketch (written here
in Python-style pseudo-code):

class Locus:
    def __init__(self):
        self.next_loci = []       # pointers to the loci which come next
                                  # in the flow diagram
        self.type = ""            # the loci type
        self.io_name = ""         # the program or document represented
                                  # by the locus
        self.command_line = ()    # all of the command line arguments
        self.xml_document = None  # the info being processed
        self.dtd = None           # the document definition for this locus
        self.action_instructions = None  # what to do at this locus

Of course, this would require each locus to set up a DTD-type file that has
the specifications to create a document for the particular program (I talk
more about how I think this would work in point 3. below) and also an
ActionInstruction to determine what to do at that locus (i.e. display a PDB
file in RasMol, align sequences from the XML document, etc.).
	My mental image is that the XML document would move into a
particular locus, be converted to the DTD required for that particular
locus, and then processed according to the specifications of the program at
that locus. I imagine the setup of the DTD and action instructions would be
part of the plug-in process for each program that needs to read a document
into or get info from the workflow diagram.
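As a sanity check on this traversal idea, here is a small runnable sketch of walking the flow network and handing the document from locus to locus. The class and field names echo the structure sketched above but are only my guesses at a design, and the simple breadth-first walk would need refinement for loci with multiple inputs:

```python
from collections import deque

class Locus:
    def __init__(self, name, action):
        self.name = name
        self.action = action   # what to do at this locus
        self.next_loci = []    # edges of the flow network

    def connect(self, other):
        self.next_loci.append(other)

def run_workflow(start, document):
    """Breadth-first walk from the starting locus, transforming the document."""
    visited, order = set(), []
    queue = deque([start])
    while queue:
        locus = queue.popleft()
        if locus.name in visited:
            continue
        visited.add(locus.name)
        document = locus.action(document)  # process at this locus
        order.append(locus.name)
        queue.extend(locus.next_loci)
    return document, order

# Toy three-locus pipeline: read, align, display.
reader = Locus("reader", lambda doc: doc + ["read"])
aligner = Locus("aligner", lambda doc: doc + ["aligned"])
viewer = Locus("viewer", lambda doc: doc + ["displayed"])
reader.connect(aligner)
aligner.connect(viewer)
doc, order = run_workflow(reader, [])
print(order)
print(doc)
```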

3. Internal XML warehouse: My thoughts on this are pretty directly based on
the WHAX paper. Here is kind of what I imagine happening with a document
that comes into Loci. First the document will be converted into XML format
based on the DTD of the locus (i.e. the type of data in the document). This
XML document will then be put into an XML database. (Note: This is kind of
what I was thinking before--have a database to store info instead of a
specific internal format.) Then, as you progress through the workflow
diagram, each locus will create an XML warehouse from the XML database
based on the DTD requirements of that particular locus. So what I am
thinking is that we can use the WHAX system to maintain an XML document
that has all of the info needed for a particular locus. For instance, if we
come to a processor that requires sequences from the database in FASTA
format, we can pull out the sequences and other required info from the
database and update the XML warehouse to have this info. So we would
maintain a view of the data available in the database and update it for
the needs of a locus.
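A runnable sketch of that FASTA case, where the per-locus warehouse pulls just the sequence records out of the XML store and renders them--the element names, toy database, and FASTA conversion are all my own assumptions:

```python
import xml.etree.ElementTree as ET

# A toy stand-in for the XML database: sequences mixed with other data.
DATABASE = """
<loci-db>
  <sequence id="seq1"><residues>ATGCCA</residues></sequence>
  <annotation>some other data this locus does not need</annotation>
  <sequence id="seq2"><residues>GGATTA</residues></sequence>
</loci-db>
"""

def warehouse_as_fasta(db_xml):
    """Select only the <sequence> entries and format them as FASTA."""
    db = ET.fromstring(db_xml)
    records = []
    for seq in db.findall("sequence"):
        records.append(">%s\n%s" % (seq.get("id"), seq.findtext("residues")))
    return "\n".join(records)

print(warehouse_as_fasta(DATABASE))
```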
Okay, I should stop talking about this point before I get any more carried
away.

More ranting

Basically, I am proposing a plan whereby we eliminate a specific internal
storage format and essentially put everything into a database. Of course,
this type of plan "requires" a database, and here I was thinking that we
could use dbXML (http://www.dbXML.org), mentioned by Jeff in the archives.
The database is under a BSD-style license (which I think is compatible with
the LGPL) and although it still doesn't "do" anything yet, it is under
current development (most recent tarball = November 27th) and we could try
and coordinate development with Tom Bradford, the developer there. He is
developing it in C++ with a CORBA interface (he is using ORBacus as his
ORB), so ultimately the database could also be pluggable (you could use any
XML storage database), which fits in well with the Loci scheme.
The reason that I think this kind of plan is better than an internal format
is that it gives us a lot of flexibility to input any kind of information,
as Jennifer was talking about. For instance, say we had a program to plug
in that uses specific animal descriptors to build an evolutionary tree. So
you might have data for an anteater in the input file like:

<Claws> Sharp and Pointy </Claws>
<Nose> Long </Nose>
<Tongue> Really Long </Tongue>

(Okay, so I don't know anything about anteaters! Sorry!). With an internal
data format, we would have to define a new DTD to include these three
elements, but with a database format, I don't think this would be
necessary.
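A quick sketch of why no new DTD would be needed: whatever elements arrive (Claws, Nose, Tongue, ...) can be stored as generic name/value pairs. The wrapper <animal> element and the dict storage are my own illustration:

```python
import xml.etree.ElementTree as ET

# Arbitrary user-supplied data, mirroring the anteater example above.
record = ET.fromstring("""
<animal name="anteater">
  <Claws> Sharp and Pointy </Claws>
  <Nose> Long </Nose>
  <Tongue> Really Long </Tongue>
</animal>
""")

# No DTD required: store each child element as-is, whatever its tag.
stored = {child.tag: child.text.strip() for child in record}
print(stored["Tongue"])
```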

Okay, well basically this is what has been on my mind for the past couple
of days and hopefully I've managed to scrape it together in a
semi-organized fashion. I would be really interested to hear everyone's
comments about the ideas to see if they are along the lines of other
people's thinking or just really crazy. Also, thank you very much if you
read this through all of the way to the end!


