[Pipet Devel] WHAX and Loci storage ideas

J.W. Bizzaro bizzaro at geoserve.net
Tue Dec 7 09:26:00 EST 1999


Hey Brad!

Having read through most of your message at this point, I first want to rehash
a couple of issues about the use of an 'internal format' or 'database':

(1) We have to distinguish between 'data to be processed' and 'workflow
data'.  My objection to a _required_ internal format or database is for data
to be processed, NOT workflow data.  Of course we need our own system for
handling workflow data, and as Brad suggested, it can be kept in a database.

(2) As for data to be processed (biological/bioinformatics data), we can come
up with our own system, using XML or a database, or whatever.  I just don't
want to _require_ that every bioinformatics datum be converted to that format,
without the user's knowledge.  As Brad says, the user is responsible for
knowing what to do with the data.

(3) Processable data can be encapsulated in the workflow data, provided the
format of the processable data is maintained.  So, if a locus represents a
FASTA document, our workflow data should just insert the whole document,
unchanged, between some tags: <document></document>.  Or if a database is
used for Loci's infrastructure and workflow management, the whole document is
kept there.  BUT THE DATA WITHIN THE DOCUMENT IS NOT CHANGED BY LOCI: Only
the user can make the change, and it is done via 'converter' loci.
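
To make (3) concrete, here is a hypothetical sketch of a FASTA document
embedded verbatim in workflow XML.  None of these element names are settled;
I'm inventing them for illustration:

    <locus type="document" format="FASTA">
      <document>
    >HSGLTH1 Human theta 1-globin gene
    CCACTGCACTCACCGCACCCGGCCAATTTTTGTGTTTTTAGTAGAGACTAAATACCATAT
      </document>
    </locus>

The bytes between <document> and </document> are exactly what the user
supplied; only a 'converter' locus may rewrite them.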

Brad Chapman wrote:
> 
> To make things simpler in my head, I split the data storage needs of Loci
> (according to my, hopefully correct, model of Loci) into three categories:
> 
> 1. The data that comes in as a document (for instance, a set of
>         sequences in FASTA format). These are the input files provided by
> the user.

Okay, in this case we're talking about data to be processed.

> 2. The actual setup of a workflow diagram--the underlying structure of the
> diagram (how all of the loci are connected together). This is supplied by
> the user in the workflow diagram by connecting all of the dots together and
> constructing the command-lines (in the words of Jeff!).

This is workflow data.

We're saying the WFD is a graphical script, but a _script_ nonetheless: it
has to be represented as text (underneath it all) and parsed by an
interpreter (of our own invention) during execution.  It may be obvious to
some here that this is what we're aiming for (a scripting language), but some
may be scared off thinking this is an enormous task.  I like to think that it
is exciting and challenging.
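
For instance (just a sketch; the actual element names and structure are
completely up for grabs), a two-locus workpath might serialize to something
like:

    <workflow>
      <locus id="1" type="document" name="seqs.fasta"/>
      <locus id="2" type="processor" name="blastall">
        <commandline>blastall -p blastn -d nr -i seqs.fasta</commandline>
        <input from="1"/>
      </locus>
    </workflow>

The interpreter would then walk this text, following the connections, much
the way a shell walks a script line by line.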

> 3. The internal XML warehouse (to use my new WHAX-learned term!). This
> would be a subset of the supplied data (1.) that is passed from locus to
> locus according to the workflow diagram. Jeff describes this very well
> (Data Storage Interfaces--June 11) as an XML document that travels from
> locus to locus

I think you're talking about workflow data here too.

I learned that 'travel' is not the best word to use here, because it implies
that everything has to be parsed and rewritten (literally moved) between every
locus, even if all loci are on the local system.  Humberto and Justin have
correctly remarked that we want to minimize 'travel' where we can.  In the
case of all local loci, Loci (the program) should 'know' there is no need to
move anything: Everything stays on the local filesystem.

And in most cases, the data accompanying any communication between remote loci
should _point_ (via URI) to where loci (documents, programs, etc.) lie and not
assume the user wants or needs them: The user may already have the locus on
his/her local computer.  Also, since the remote system may be only the first
in a chain/workpath of connected systems, it would be most efficient to have a
pointer to any locus, rather than moving the whole thing across some umpteen
nodes.  IOW, I want the DNA doc on the 13th system I'm connected to.  I can
either make a direct connection to the 13th server via IP, or I can have the
13th send the doc to the 12th, which sends the doc to the 11th, which sends
the doc to the 10th... (Get the picture?)
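
So the workflow data passed between systems might carry nothing more than a
pointer (again, a made-up sketch):

    <locus type="document" format="FASTA">
      <uri>http://node13.example.org/loci/dna.fasta</uri>
    </locus>

My local system can then decide whether to fetch dna.fasta directly from the
13th node, or to skip the fetch entirely because it already has a copy.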

> and changes XML formats (i.e. changes to different document
> structures according to the specific DTD (document type definition) needed
> at that locus).

I'm not sure if you're talking about workflow or processable data here.

> Each of these points has specific storage needs, so I have come up with a
> separate plan for each of them:
> 
> 1. Input Data: Since the user supplied this data, it is their choice to
> determine how they want to deal with it.

Amen brother!

> If they want to store it as a
> backup in a database of some sort, then they can do this through the
> workflow diagram. So the data can be stored in a 'plug-in' database (the
> kind Gary and Jeff mentioned). This type of interface/data storage
> component isn't "essential" to the functioning of Loci, so I will go on
> to the essential data storage needs.

Something that needs serious thought, however, on the extensions end of this
project.

> 2. Workflow Data: Loci will need a method to store the user-defined
> workflow diagram. This diagram includes: 1. the setup of the workflow
> diagram (how everything is connected together) 2. the constructed command
> line for each program 3. more???. This is the kind of storage need I was
> thinking about when I wrote my incoherent message a couple of days ago
> about trees and graphs. Basically, my thinking is that we can stick all of
> the information from a workflow diagram into a data structure, and then
> move through this structure in the specified order to execute the contents
> of the workflow diagram. My new data structure of choice is a flow network
> (still from Intro Algorithms). Basically I think each element of the
> network would have a setup kind of like the following pseudo-code:
> 
> data-structure loci:
>         array[pointers] TheNextLoci #pointers to the loci which come next
>                                     #in the flow diagram
>         string Type #the locus type
>         string IOName #the program or document represented by the locus
>         tuple CommandLine #all of the command-line arguments
>         pointer XMLDocument #the info being processed
>         pointer DTD #the document definition for this particular locus
>         pointer ActionInstructions #a document with what to do at this locus

There is some talk about the format of 'workflow data' in the mail archives. 
There were even thoughts that workflow and processable data could be
mixed...which gets back to a required internal data format.
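
For what it's worth, here is roughly how Brad's node might translate into
Python (our implementation language).  The field names are his; the class
layout and the connect() method are my guesses:

    class Locus:
        """One node in the workflow's flow network (a Python sketch of
        Brad's pseudo-code; only the fields come from his message)."""

        def __init__(self, type, io_name, command_line):
            self.next_loci = []        # loci which come next in the diagram
            self.type = type           # the locus type
            self.io_name = io_name     # program or document it represents
            self.command_line = tuple(command_line)  # command-line arguments
            self.xml_document = None   # the info being processed
            self.dtd = None            # document definition for this locus
            self.action_instructions = None  # what to do at this locus

        def connect(self, next_locus):
            """Wire this locus to one that follows it in the diagram."""
            self.next_loci.append(next_locus)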

> Of course, this would require each locus to set up a DTD-type file that
> has the specifications to create a document for the particular program (I
> talk more about how I think this would work in point 3. below) and also an
> ActionInstruction to determine what to do at that locus (i.e. display a
> pdb file in RasMol, align sequences from the XML document, etc.).

Hmmm.

>         My mental image is that the XML document would move into a
> particular locus, be converted to the DTD required for that particular
> locus, and then processed according to the specifications of the program at
> that locus. I imagine the setup of the DTD and action instructions would be
> part of the plug-in process for each program that needs to read a document
> into or get info from the workflow diagram.

Oh okay, you're talking about wrapping programs that weren't designed for
Loci so they can be used in Loci: workflow data.  As I think you're
suggesting, the same wrapping system should be used for all loci, whether
they be data or programs.  To a large extent, _something_ has to accompany
each locus.
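
Just to give a feel for what that 'something' might look like (everything
here is invented for illustration), a wrapper for RasMol might say:

    <locus-wrapper name="rasmol" type="viewer">
      <dtd href="pdb.dtd"/>
      <action>display the incoming document in RasMol</action>
    </locus-wrapper>

while a wrapper for a plain document might carry a DTD and no action at all.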

> 3. Internal XML warehouse: My thoughts on this are pretty directly based
> on the WHAX paper. Here is kind of what I imagine happening with a
> document that comes into Loci. First the document will be converted into
> XML format based on the DTD of the locus (i.e. the type of data in the
> document). This XML document will then be put into an XML database (Note:
> This is kind of what I was thinking before--have a database to store info
> instead of a specific internal format.)

I'm not sure what you mean by 'document'.  I usually use that word for
processable data, but I think you're referring to workflow data.

> Then, as you progress through the workflow
> diagram, each locus will create an XML warehouse from the XML database
> based on the DTD requirements of the particular locus. So what I am
> thinking is that we can use the WHAX system to maintain an XML document
> that has all of the info needed for a particular locus. For instance, if
> we come to a processor that requires sequences in the database in FASTA
> format, we can pull out the sequences and other required info from the
> database and update the XML warehouse to have this info. So we would
> maintain a view of the data available in the database and update it for
> the needs of a locus. Okay, I should stop talking about this point before
> I get any more confusing!

I think I may need some hand-holding on this.
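
But let me take a stab at restating it in Python, so you can tell me where
I'm off.  Every name below is invented, including the database's query()
interface; I'm only guessing at how the warehouse would work:

    def build_warehouse(database, dtd):
        """Pull just the records this locus's DTD requires out of the XML
        database and assemble them into a per-locus view (the 'warehouse').
        The query is hard-wired here; presumably the DTD would drive it."""
        view = []
        for record in database.query("//sequence"):  # hypothetical query API
            view.append((record.get("id"), record.text))
        return view

    def as_fasta(view):
        """Render the warehouse view as FASTA for a FASTA-hungry processor."""
        lines = []
        for name, residues in view:
            lines.append(">%s" % name)
            lines.append(residues)
        return "\n".join(lines)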

> More ranting
> ----------------------
> 
> Basically, I am proposing a plan whereby we eliminate a specific internal
> storage format and essentially put everything into a database. Of course,
> this type of plan "requires" a database, and here I was thinking that we
> could use dbXML (http://www.dbXML.org), mentioned by Jeff in the archives.

I'm still not sure if you're suggesting that all processable (bioinformatics)
data be broken up and converted into XML tags.

> The database is under a BSD-style license (which I think is compatible with
> the LGPL)

It is.  BSD allows proprietary/closed-source derivatives of your program,
which I don't like.  But it's not our program anyway.  Provided we can ship
it with Loci, that's all that matters to us.

> and although it still doesn't "do" anything yet, it is under
> current development (most recent tarball = November 27th) and we could try
> to coordinate development with Tom Bradford, the developer there.

Justin had some of his own ideas for an XML database, which he mentions on
this list.  He didn't give any details, so it's not worth searching for.  But
I thought you should know.  Of course, our own would be LGPL'd.

> He is
> developing it in C++ with a CORBA interface (he is using ORBacus as his
> ORB), so ultimately the database could also be pluggable (you could use
> any XML storage database), which fits in well with the Loci scheme.

We could use it until something better (uses ORBit, Python, LGPL) comes along.

> The reason that I think this kind of plan is better than an internal
> format is that it gives us a lot of flexibility to input any kind of
> information, as Jennifer was talking about. For instance, say we had a
> program to plug in that uses specific animal descriptors to build an
> evolutionary tree. So you might have data for an anteater in the input
> file like:
> 
> <Claws> Sharp and Pointy </Claws>
> <Nose> Long </Nose>
> <Tongue> Really Long </Tongue>
> 
> (Okay, so I don't know anything about anteaters! Sorry!). With an internal
> data format, we would have to define a new DTD to include these three
> elements, but with a database format, I don't think this would be
> necessary.

I would consider this for a plug-in database and not mix processable data with
workflow data.

So, are we looking at parallel (two, interconnected) databases?  If someone
wants to use Loci for, say, physics, would this be a problem?


Cheers.
Jeff
-- 
                      +----------------------------------+
                      |           J.W. Bizzaro           |
                      |                                  |
                      | http://bioinformatics.org/~jeff/ |
                      |                                  |
                      |           THE OPEN LAB           |
                      |    Open Source Bioinformatics    |
                      |                                  |
                      |    http://bioinformatics.org/    |
                      +----------------------------------+



