[Pipet Devel] workflow diagram data model and databases

Fri Jan 7 13:23:48 EST 2000

I'm quite glad you wrote about this! Very timely--I have just committed to
CVS my first attempt at coding the workspace as XML, and I was very happy to
see that it is very much in line with your thinking. Let me quickly describe
what I've coded, and then I'll try to figure out how that relates to your
message.
	The new loci-file has a few directory changes (sorry again!), so it
is probably best to just check out a clean version if you want to look at
it. By typing ./testgui.py & within the main directory, you get the
familiar loci workspace. You can add and link loci as before, but now the
workspace is being represented as xml in the created directory
loci-file/workxml. This directory is removed if you exit via the menu
(Session->Quit) to prevent the build-up of a lot of crap in the directory,
but if you want to look at it you can keep it after you quit by clicking
the close button on the X window instead (I disconnected this button from
the quit dialog in loci-test for this purpose). Anyway, the xml is created
according to the following plan:

1. When workspace.py starts up (i.e. when a new workspace is created):
	a. make a directory: loci-file/workxml/workspace#.
	The number refers to the number of the workspace being opened.
	Every time a composite locus is opened, a new directory should be
	created.
	b. copy the file baselocus.xml to the new directory. This will be the
	overall script for the whole container.
2. When a locus is added to the workspace:
	a. copy the file locustype.xml to loci-file/workxml/workspace#
	and rename it to locustype#.xml
	b. make an xml:link to the locus in baselocus.xml
3. When loci are connected:
	a. modify the two loci to indicate the connection:
	point the input xml:link of the input locus at the output locus's
	xml file
	point the output xml:link of the output locus at the input locus's
	xml file
4. When info is added about a locus:
(Note: this has only been sort of implemented for containers)
	a. load the DOM tree of the locus from its xml file
	b. make the modifications
	c. save the resulting xml file
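
As a rough illustration of steps 1 to 3, here is a minimal Python sketch.
It is not the actual loci-file code: the function names, the element names
(<locus>, <input>, <output>) and the href attribute are assumptions made
for the example, and it uses Python's standard xml.dom.minidom.

```python
import os
import shutil
from xml.dom import minidom


def new_workspace(workxml, number, base_template):
    """Step 1: create workspace<number> and copy baselocus.xml into it."""
    wdir = os.path.join(workxml, "workspace%d" % number)
    os.makedirs(wdir)
    shutil.copy(base_template, os.path.join(wdir, "baselocus.xml"))
    return wdir


def _save(doc, path):
    """Write a DOM document back to its xml file."""
    with open(path, "w", encoding="utf-8") as fh:
        doc.writexml(fh, encoding="utf-8")


def add_locus(wdir, locus_template, locustype, number):
    """Step 2: copy the locus template and link to it from baselocus.xml."""
    name = "%s%d.xml" % (locustype, number)
    shutil.copy(locus_template, os.path.join(wdir, name))
    base_path = os.path.join(wdir, "baselocus.xml")
    doc = minidom.parse(base_path)
    link = doc.createElement("locus")        # hypothetical element name
    link.setAttribute("xml:link", "simple")
    link.setAttribute("href", name)          # hypothetical attribute name
    doc.documentElement.appendChild(link)
    _save(doc, base_path)
    return name


def connect(wdir, output_name, input_name):
    """Step 3: record the connection in both locus files."""
    pairs = ((output_name, "output", input_name),
             (input_name, "input", output_name))
    for fname, tag, target in pairs:
        path = os.path.join(wdir, fname)
        doc = minidom.parse(path)
        node = doc.getElementsByTagName(tag)[0]
        node.setAttribute("xml:link", "simple")
        node.setAttribute("href", target)
        _save(doc, path)
```

The sketch assumes each locus template already contains one <input> and
one <output> element for connect() to fill in.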

So basically, the overall plan is that every workspace gets its own
directory containing baselocus.xml (an xml file with links to each of the
loci in the workspace) and one xml file for each locus in the workspace.
	So far, all 4 points seem to work okay, except that point 4 has
only been semi-implemented for containers. You can right-click (button 3)
on a container, and you will get an X window with options for the
container. So far, the "set container" and "show contents" buttons
work. When you click on "set container" you get a file-chooser dialog where
you can select a directory for the container to hold. Then the program will
load the xml file for this container, convert it into a DOM tree, add the
container contents to it, and then save the xml file. The "show contents"
button can then be used to retrieve these contents and display them as a
tree.
	This is much like what the ugly loci-file window did before, except
that now things are done in DOM trees. Unfortunately, dealing with DOM
trees has also led to a big slow-down in the time it takes to walk through
a directory tree and write it as xml. To sort-of counteract this, the
directory structure is only parsed to a certain depth (currently set to
something like 3). I'll try to think up speed-ups, but for now this is the
price of working with DOM trees. Sorry!
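
The "set container" / "show contents" cycle with a depth limit could be
sketched roughly as follows. This is only an illustration using Python's
standard xml.dom.minidom; the function names, the <contents>, <dir> and
<file> elements and the "unread" attribute are all made up for the example
and are not taken from loci-file.

```python
import os
from xml.dom import minidom

MAX_DEPTH = 3  # the message says the limit is "something like 3"


def dir_to_element(doc, path, max_depth=MAX_DEPTH, depth=0):
    """Build a <dir> element for path, reading at most max_depth levels."""
    node = doc.createElement("dir")
    node.setAttribute("name", os.path.basename(path) or path)
    if depth >= max_depth:
        # Too deep: record the directory but do not read its contents.
        node.setAttribute("unread", "yes")
        return node
    for entry in sorted(os.listdir(path)):
        full = os.path.join(path, entry)
        if os.path.isdir(full):
            node.appendChild(dir_to_element(doc, full, max_depth, depth + 1))
        else:
            child = doc.createElement("file")
            child.setAttribute("name", entry)
            node.appendChild(child)
    return node


def set_container(xml_path, directory, max_depth=MAX_DEPTH):
    """Load the locus DOM tree, attach the listing, save the xml file."""
    doc = minidom.parse(xml_path)
    root = doc.documentElement
    for old in root.getElementsByTagName("contents"):
        old.parentNode.removeChild(old)  # drop an earlier listing, if any
    contents = doc.createElement("contents")
    contents.appendChild(dir_to_element(doc, directory, max_depth))
    root.appendChild(contents)
    with open(xml_path, "w", encoding="utf-8") as fh:
        doc.writexml(fh, encoding="utf-8")


def show_contents(xml_path):
    """Return the stored listing as indented lines of text."""
    doc = minidom.parse(xml_path)
    lines = []

    def walk(node, level):
        for child in node.childNodes:
            if child.nodeType == child.ELEMENT_NODE:
                lines.append("  " * level + child.getAttribute("name"))
                walk(child, level + 1)

    for contents in doc.getElementsByTagName("contents"):
        walk(contents, 0)
    return lines
```

Directories at the depth limit still appear in the tree, but their
contents are not read, which is where the time saving comes from.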
	Okay, whew, I think that's it. Let me get into the message!

>today Jeff and I agonized over different methods of storing descriptions
>of the workspace in a database.  This led us to try and develop a data
>model for the workflow diagram, which is no easy task.  the workspace has
>elements of tree-based model,

Right, excellent point! When a workspace/composite locus is created within
another workspace, the newly created workspace directory should be inside
the previous workspace directory. I'll try to make my xml model do this.

>0.  The XML description of the WFD should be modular, but easily portable.

I think making it xml makes it intrinsically portable. Once you create an
xml workspace, you can zip it up (or use an xml compression tool) and send
it around to your heart's content.

>1. The WFD should be constructed from a number of smaller XML documents,
>essentially one per locus in the WFD. If the WFD contains a composite
>Locus, then that locus is itself a pointer to the xml documents contained
>within it.

Right-o. I think I've done this with my baselocus.xml thing. Let me know
your thoughts on whether this satisfies this condition.

>A WFD then should be represented a single database (collection of files).
>The DBMS should be able to manage multiple independent databases.

I think the directory structure that I currently have could be shoved into
a database in the following way:

directories		-> main databases
xml files 		-> sub-databases within the main database
info in xml files	-> the column/row info within the sub-database
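
To make this mapping a bit more concrete, here is a small Python sketch
that reads a workxml directory into nested dictionaries, with a plain dict
standing in for whatever database is eventually chosen. The function names
and the "element" key are hypothetical, and every element in a file is
simply treated as one row.

```python
import os
from xml.dom import minidom


def workspace_to_tables(wdir):
    """One workspace directory -> {xml file name: list of rows}."""
    tables = {}
    for name in sorted(os.listdir(wdir)):
        if not name.endswith(".xml"):
            continue
        doc = minidom.parse(os.path.join(wdir, name))
        rows = []
        for node in doc.getElementsByTagName("*"):
            # One row per element: its tag name plus its attributes.
            row = {"element": node.tagName}
            row.update(dict(node.attributes.items()))
            rows.append(row)
        tables[name] = rows
    return tables


def workxml_to_databases(workxml):
    """Each workspace directory becomes one 'main database'."""
    databases = {}
    for name in sorted(os.listdir(workxml)):
        wdir = os.path.join(workxml, name)
        if os.path.isdir(wdir):
            databases[name] = workspace_to_tables(wdir)
    return databases
```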

>Connectivity between Loci must be preserved:  If you want to extract a
>subset of loci from a WFD you must first disconnect thost loci from any
>'external' loci, or you must extract the entire superset of connected
>loci along with the selected subset. I hope that makes sense.

Okay. I've connected loci using the xml:link linking language. How does
this sound? Once we get the ability to disconnect links working, I think it
shouldn't be too hard to disconnect the xml:links.
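
The connectivity rule quoted above comes down to a small graph problem:
either cut every link that crosses the boundary of the selection, or grow
the selection until nothing crosses it. A minimal Python sketch of both
options, assuming the links have already been read out of the xml files as
(output locus, input locus) pairs (the function names and this pair format
are assumptions for the example):

```python
from collections import deque


def connected_superset(links, selected):
    """Return the selected loci plus every locus connected to them."""
    neighbours = {}
    for out_locus, in_locus in links:
        neighbours.setdefault(out_locus, set()).add(in_locus)
        neighbours.setdefault(in_locus, set()).add(out_locus)
    seen = set(selected)
    queue = deque(selected)
    while queue:
        current = queue.popleft()
        for nxt in neighbours.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def links_to_cut(links, selected):
    """Return the links joining a selected locus to an 'external' one."""
    chosen = set(selected)
    return [(a, b) for a, b in links if (a in chosen) != (b in chosen)]
```

For example, with links a->b, b->c and d->e, selecting only a pulls in b
and c, while selecting a and b without growing the selection means the
b->c link has to be disconnected first.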

>The DBMS should operate as client/server processes in order to
>accommodate distributed processing requirements.

Do we want to have a DBMS as a client/server process separate from the Loci
client/server stuff, or as a part of it?

>The DBMS should be able to quickly provide an XML description of
>information stored inside the database.

Okay, so we need xml to database and database to xml converters, right?

>Essentially our options, as far as we can see are:
>
>1. Make our own custom database to store our workflow diagram.  This may
>be easier than it sounds because the nature of our data storage needs are
>so unique and specific that trying to write an interface to an existing
>DBMS might be just as hard or harder that writing our own custom loci-db.
>
>2. Use the MySQL database with an XML->SQL->XML interface.  This would
>require some thinking in order to derive a relational data model that can
>accommodate the possibly quite complex Loci WFD.
>
>3.  Use the PostgreSQL Object-Relational database with an XML-SQL-XML
>interface.  I'm not 'up' on how postgres differs from MySQL, but if it
>can more naturally handle objects (loci) and the relationships between
>them  (connections) than this may be a better choice that MySQL.  The
>same considerations exist for creating an intelligent data model for the
>WFD as in option 2.
>
>4. Use an XML database.

Okay, I'll take an early stand on this issue and go straight for point
number 4, specifically using XDBM as our XML database (side note: did we
come to any conclusions about whether we can safely use this?). My
arguments for this:

1. I think it will be *a lot* of work to write a database, or map xml into
a relational database like MySQL/PostgreSQL.
2. I have been looking at XDBM and I really think it does a lot of what we
need (these points are taken from the xdbm documentation)
	a. Provides xdbm2xml and xml2xdbm converters.
	b. Stores the XML in a pre-parsed format so we don't need to go
through entire XML files to find stuff.
	c. You can load only parts of the XML file at a time.
	d. Allows you to store linked lists (the xml:links, I assume).
	e. Will support DOM-compliant interfaces.

The disadvantages are that XDBM is brand new and probably still has a lot
of bugs to work out. In addition, the "FreeDOM" interface, which will
supply the DOM-compliant interface, is still under design/development, and
it will require a set of python bindings once it is available.
	Okay, no more writing. Thanks much if you read all of the way here.
I'm looking forward to hearing everyone's thoughts on the new loci-file and
the xml stuff. Thanks again to Jeff and Gary for bringing this up!

Brad