Brad,

It's good to know that someone is thinking about data storage issues for
Loci. This is an important and (in my personal opinion) underdiscussed
topic, so let's discuss some of these ideas now.

For clarity, let's keep in mind that Loci is constructed in a 'three-tier'
architecture:

1. The GUI 'front-end', with 'bindings' to the middleware.

2. The 'middleware': the CORBA, command-line, HTTP, or whatever interface
   is needed to access the back-end. These are the services that allow the
   back-end to interoperate, as dictated by the WFD. A 'data translator
   locus' is a good example of Loci middleware; the database used to store
   the individual loci contained within a 'container locus' would be
   another.

3. The 'back-end': the information repositories (filesystems, databases,
   and so on) and the analysis programs that manipulate the data. The
   back-end is likely to be diverse, both architecturally and
   geographically.

Note that nowhere in this description is there any mention of data type:
Loci can work for physicists as well as it can for bioinformaticists. But
we are all bioinformaticists here, so we always present our scenarios (and
will use Loci) as a bioinformatics application. A multiple-alignment
program is a good example of a back-end locus.

The back-end 'resources' are the 'loci'. They are represented by the
icons/nodes in the front-end and made interoperable by the middleware. The
front-end and the back-end don't even know about each other.

Although I'm not the absolute authority on Loci's architecture, and the
architecture will likely continue to evolve, I'm relatively certain that
this is the current 'Loci architectural paradigm'. I'm pretty sure you
already understand it, but I thought I should make it explicit for the
sake of discussing your ideas on data storage for Loci.
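To make the separation concrete, here is a toy sketch of the three tiers in
Python. All the names here (BackendLocus, Middleware, FrontEnd, the
"uppercase" locus) are hypothetical illustrations, not any real Loci
interface; the point is only that the front-end and back-end never touch
each other directly.

```python
# Toy sketch of the three-tier separation.  Hypothetical names throughout.

class BackendLocus:
    """Tier 3: an analysis program; knows nothing about the GUI."""
    def run(self, data):
        raise NotImplementedError

class Uppercaser(BackendLocus):
    """Stand-in for a real analysis program (e.g. an aligner)."""
    def run(self, data):
        return data.upper()

class Middleware:
    """Tier 2: routes requests between front-end and back-end loci."""
    def __init__(self):
        self._loci = {}
    def register(self, name, locus):
        self._loci[name] = locus
    def invoke(self, name, data):
        return self._loci[name].run(data)

class FrontEnd:
    """Tier 1: the graphical shell; talks only to the middleware."""
    def __init__(self, middleware):
        self._mw = middleware
    def execute_node(self, name, data):
        return self._mw.invoke(name, data)

mw = Middleware()
mw.register("uppercase", Uppercaser())
gui = FrontEnd(mw)
result = gui.execute_node("uppercase", "atgc")  # front-end never sees the back-end
```

Swapping the Middleware implementation (CORBA, command line, HTTP) should
leave both other tiers untouched; that is the whole point of the paradigm.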
Brad Chapman wrote:

> WHAX (Warehouse Architecture for XML)
> -------------------------------------
> Basically, this is a technical document detailing the implementation of
> WHAX. What WHAX is designed to do is take selected information from a
> data source, which can be either a database or an XML document, and
> represent it as an "XML warehouse." This XML warehouse contains specific
> information from a database which has been selected by the user. For
> instance, if you had a database full of books you've read, you could
> create an XML warehouse of all of the books you've read that were
> written by Stephen King. Key characteristics of an XML warehouse are
> that it is in XML format and is represented by a tree structure, so
> based on my limited XML knowledge, this seems analogous to a Document
> Object Model (DOM).
>
> What WHAX does is define a method for upkeeping this XML warehouse. The
> upkeep differs from the upkeep of databases because XML is in a
> semi-structured format -- the paper describes it as "self-describing,
> irregular data." The paper details methods for changing the XML
> warehouse when new data is added or removed, and for keeping the
> warehouse consistent with changes in the underlying database from which
> the XML warehouse got its information.

The URL for this document is:
http://db.cis.upenn.edu/cgi-bin/Person.perl?susan
The document title is: "Efficient View Maintenance in XML Data Warehouses"

> Data Storage in Loci
> --------------------
> Reading through this document got me thinking about how this could be
> applied to Loci, and I came up with the following model of data storage
> in Loci.
>
> To make things simpler in my head, I split the data storage needs of
> Loci (according to my, hopefully correct!, model of Loci) into three
> categories:
>
> 1. The data that comes in as a document (for instance, a set of
> sequences in FASTA format). These are the input files provided by the
> user.
Or retrieved from a database query, or output by an analysis program.

> 2. The actual setup of a workflow diagram -- the underlying structure of
> the diagram (how all of the loci are connected together). This is
> supplied by the user in the workflow diagram by connecting all of the
> dots together and constructing the command lines (in the words of
> Jeff!).

This is my understanding as well, although the WFD will be constructed via
a graphical shell, which has a 'thin interface' to the middleware. When
you say 'constructing the command lines', do you mean 'generating the
interface to the middleware'?

> 3. The internal XML warehouse (to use my new WHAX-learned term!). This
> would be a subset of the supplied data (1.) that is passed from locus to
> locus according to the workflow diagram. Jeff describes this very well
> (Data Storage Interfaces -- June 11) as an XML document that travels
> from locus to locus and changes XML formats (i.e. changes to different
> document structures according to the specific DTD (document type
> definition) needed at that locus).
>
> Each of these points has specific storage needs, so I have come up with
> a separate plan for each of them:
>
> 1. Input Data: Since the user supplied this data, it is their choice to
> determine how they want to deal with it. If they want to store it as a
> backup in a database of some sort, then they can do this through the
> workflow diagram. So the data can be stored in a 'plug-in' database
> (what Gary and Jeff mentioned it to be). This type of interface/data
> storage component isn't "essential" to the functioning of Loci, so I
> will go on to the essential data storage needs.

Exactly. Using Jeff's analogy, what if we were to retrieve an entire
2-terabyte sequence file, in GenBank format, from the NCBI database, and
wanted to search the entire file against the cDNA for alpha-hemoglobin?
Let's suppose further that we had access to a remote analysis program
running on a fancy supercomputer that did BLAST searches for us and
required GenBank-formatted files to perform the search. Suppose further
that the NCBI database and the supercomputer were on the same machine. We
could construct a WFD where we retrieve the 2-terabyte file from NCBI and
'pipe' it directly to the analysis program, along with our a-hemoglobin
cDNA, and BLAST away. In theory, Loci would send the data from the
database through the analysis program, possibly without the data ever
touching a network interface card, and without it ever being reformatted.
If, however, Loci required the data to be reformatted and stored in an
intermediate database, say on my 66 MHz 486 with a 400 MB hard drive and
4 MB of RAM, I'd be running for the fire extinguisher as my CPU exploded
in a core-dumping ball of fire.

On the other hand, what if we planned to do our entire thesis project
based upon the information kept in that 2-terabyte file? Would we want to
retrieve it from the NCBI database every time we wanted to do an analysis
on it, especially if we wanted to search only a small segment of it? No
way! We would want to have that file stored in a fashion wherein we could
easily extract only the parts we are interested in analyzing. This is
where Loci's ability to store sequence data in a database becomes
important.

> 2. Workflow Data: Loci will need a method to store the user-defined
> workflow diagram. This diagram includes: 1. the setup of the workflow
> diagram (how everything is connected together); 2. the constructed
> command line for each program; 3. more??? This is the kind of storage
> need I was thinking about when I wrote my incoherent message a couple of
> days ago about trees and graphs.
> Basically, my thinking is that we can stick all of the information from
> a workflow diagram into a data structure, and then move through this
> structure in the specified order to execute the contents of the workflow
> diagram. My new data structure of choice is a flow network (still from
> Intro Algorithms). Basically I think each element of the network would
> have a setup kind of like the following pseudo-code:
>
> data-structure loci:
>     array[pointers] TheNextLoci     # pointers to the loci which come
>                                     # next in the flow diagram
>     string Type                     # the locus type
>     string IOName                   # the program or document represented
>                                     # by the locus
>     tuple CommandLine               # all of the command-line arguments
>     pointer XMLDocument             # the info being processed
>     pointer DTD                     # the document definition for the
>                                     # particular locus
>     pointer ActionInstructions      # a document with what to do at that
>                                     # locus

We still need to formalize the interface to the command-line-run back-end
apps, but this sounds about right to me. The OMG LSR
(http://www.omg.org/homepages/lsr/) Biomolecular Sequence Analysis working
group has a nearly complete RFP
(http://www.omg.org/techprocess/meetings/schedule/Biomolecular_Sequ._Analysis_RFP.html)
for sequences and their alignment and annotation. Loci plans to adopt
their CORBA IDL for passing biomolecular sequence objects to
CORBA-compliant back-end apps. This RFP has 'XML extensions' for future
compatibility, btw.

> Of course, this would require each locus to set up a DTD-type file that
> has the specifications to create a document for the particular program
> (I talk more about how I think this would work in point 3. below) and
> also an ActionInstruction to determine what to do at that locus (i.e.
> display a pdb file in RasMol, align sequences from the XML document,
> etc.).
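Just to show that Brad's pseudo-structure maps directly onto real code,
here it is rendered as a Python class -- a sketch only, with field names
taken from his pseudo-code; none of this is a settled Loci interface, and
the "blastall" wiring at the bottom is a made-up example.

```python
# Brad's flow-network node, sketched in Python.  Field names follow his
# pseudo-code; nothing here is a committed Loci design.
from dataclasses import dataclass, field

@dataclass
class LocusNode:
    next_loci: list = field(default_factory=list)  # loci that come next in the flow diagram
    locus_type: str = ""                           # the locus type
    io_name: str = ""                              # program or document this locus represents
    command_line: tuple = ()                       # all of the command-line arguments
    xml_document: object = None                    # the info being processed
    dtd: object = None                             # document definition for this locus
    action_instructions: object = None             # what to do at this locus

# Wiring two nodes together as a tiny flow network (hypothetical example):
blast = LocusNode(locus_type="program", io_name="blastall",
                  command_line=("-p", "blastn"))
source = LocusNode(locus_type="document", io_name="query.fasta",
                   next_loci=[blast])
```

Executing a diagram would then just be a walk over next_loci in the
specified order, doing whatever each node's action_instructions say.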
> My mental image is that the XML document would move into a particular
> locus, be converted to the DTD required for that particular locus, and
> then processed according to the specifications of the program at that
> locus. I imagine the setup of the DTD and action instructions would be
> part of the plug-in process for each program that needs to read a
> document into, or get info from, the workflow diagram.

My understanding is that Loci will come with 'data translators'
(middleware) that will be placed between a document/database and the
analysis program, to accommodate the formatting requirements of the
program that will operate on the document.

> 3. Internal XML warehouse: My thoughts on this are pretty directly based
> off the WHAX paper. Here is kind of what I imagine happening with a
> document that comes into Loci. First the document will be converted into
> XML format based on the DTD of the locus (i.e. the type of data in the
> document). This XML document will then be put into an XML database.
> (Note: This is kind of what I was thinking before -- have a database to
> store info instead of a specific internal format.)

I think this is appropriate only for Loci's own internal data
requirements, but it violates Loci's 'laissez-faire' paradigm for
operating on 'exogenous' data. Jeff explained it to me best when he said
that Loci should be like the Bash shell. Bash has redirection operators
and pipes, which you can combine to do some fairly sophisticated data
processing, for example:

bash$ cat /var/adm/messages | grep "root" > /tmp/root.txt

Here bash will pipe the contents of /var/adm/messages to grep, which will
extract all the lines containing the word 'root' and place them in the
/tmp/root.txt file. Bash itself cares not about the contents of
/var/adm/messages: it doesn't reformat it, doesn't store it in an
intermediate database, then re-extract it from the database, reformat it
once again, and finally pump out the /tmp/root.txt file according to some
XML DTD.
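The same streaming behaviour is easy to sketch in Python with generators
(a hypothetical illustration, not Loci code): each stage pulls lines
lazily from the one before it, so nothing is buffered, reformatted, or
parked in an intermediate store along the way.

```python
# The bash pipeline above, sketched with lazy Python generators.
# Hypothetical illustration only; the sample log lines are made up.

def cat(lines):
    """Stand-in for reading /var/adm/messages: yield lines one at a time."""
    for line in lines:
        yield line

def grep(pattern, lines):
    """Yield only the lines containing the pattern, as they stream past."""
    for line in lines:
        if pattern in line:
            yield line

messages = [
    "su: root login on tty1",
    "kernel: eth0 up",
    "sudo: root session opened",
]

# Nothing is evaluated or stored until we actually consume the pipeline:
matches = list(grep("root", cat(messages)))
```

As with bash, the data just flows through; no stage reformats or warehouses
it.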
Neither should Loci, in its most abstracted form. Instead, the data
conversions and XML operations should be modular extensions to Loci that
we provide as valuable options for the end user, so that Loci becomes not
just a graphical 'bash' but a sophisticated distributed data processing
system. Not that a graphical bash wouldn't be nice: the GNOME dudes have
talked about using Loci's graphical shell to do just that!

Bottom line: maximum abstraction + maximum modularization = maximum
flexibility = maximum power!

gary

--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Gary Van Domselaar                gvd at redpoll.pharmacy.ualberta.ca
Faculty of Pharmacy               Phone: (780) 492-4493
University of Alberta             FAX:   (780) 492-5305
Edmonton, Alberta, Canada         http://redpoll.pharmacy.ualberta.ca/~gvd