Oh Great Locians;

Hello! I have been doing some more thinking about data storage for Loci, and along these lines have read up on the WHAX stuff. Below I give a quick overview of WHAX (for those of you who didn't like the looks of the 60 page technical document about it!) based on what I was able to get out of it (since I'm not a database expert). Then I follow up with a plan for data storage for Loci based on WHAX ideas, info from the archives, and my own random ideas. Sorry it's so long, but I would really be interested in hearing everyone's comments if they can make it all of the way through!

WHAX (Warehouse Architecture for XML)
-------------------------------------

The paper is a technical document detailing the implementation of WHAX. What WHAX is designed to do is take selected information from a data source, which can be either a database or an XML document, and represent it as an "XML Warehouse." This XML Warehouse contains specific information from the data source which has been selected by the user. For instance, if you had a database full of books you've read, you could create an XML warehouse of all of the books you've read that were written by Stephen King.

Some key characteristics of an XML Warehouse are that it is in XML format and is represented by a tree structure. So based on my limited XML knowledge, this seems analogous to a Document Object Model (DOM).

What WHAX does is define a method for maintaining this XML Warehouse. The upkeep differs from the upkeep of a database because XML is in a semi-structured format; the paper describes it as "self-describing, irregular data." The paper details methods for changing the XML warehouse when data is added or removed, and for keeping the warehouse consistent with changes in the underlying database that the XML warehouse got its information from.
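
To make the warehouse idea a bit more concrete, here is a minimal Python sketch of the Stephen King example. It is only an illustration under some assumptions: the element names (books, book, title, author, warehouse) and the function build_warehouse are made up, and the sketch rebuilds the warehouse from scratch each time rather than maintaining it incrementally, which is the part the WHAX paper is really about.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical source data: a tiny XML "database" of books that were read.
SOURCE_XML = """
<books>
  <book><title>The Shining</title><author>Stephen King</author></book>
  <book><title>Dune</title><author>Frank Herbert</author></book>
  <book><title>Misery</title><author>Stephen King</author></book>
</books>
""".strip()


def build_warehouse(source_root, author):
    """Return a new XML tree holding copies of the books by one author.

    The source tree is left untouched; the warehouse is a selected view.
    """
    warehouse = ET.Element("warehouse", {"author": author})
    for book in source_root.findall("book"):
        if book.findtext("author") == author:
            warehouse.append(copy.deepcopy(book))
    return warehouse


source = ET.fromstring(SOURCE_XML)
warehouse = build_warehouse(source, "Stephen King")
titles = [book.findtext("title") for book in warehouse.findall("book")]
print(titles)
```

Because the warehouse holds copies, editing it does not change the source; keeping the two consistent after an update is exactly the maintenance problem WHAX sets out to handle.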

Data Storage in Loci
--------------------

Reading through this document got me thinking about how this could be applied to Loci, and I came up with the following model of data storage. To make things simpler in my head, I split the data storage needs of Loci (according to my, hopefully correct!, model of Loci) into three categories:

1. The data that comes in as a document (for instance, a set of sequences in FASTA format). These are the input files provided by the user.

2. The actual setup of a workflow diagram: the underlying structure of the diagram (how all of the loci are connected together). This is supplied by the user in the workflow diagram by connecting all of the dots together and constructing the command-lines (in the words of Jeff!).

3. The internal XML warehouse (to use my new WHAX-learned term!). This would be a subset of the supplied data (1.) that is passed from locus to locus according to the workflow diagram. Jeff describes this very well (Data Storage Interfaces, June 11) as an XML document that travels from locus to locus and changes XML formats (i.e. changes to different document structures according to the specific DTD (document type definition) needed at that locus).

Each of these points has specific storage needs, so I have come up with a separate plan for each of them:

1. Input Data: Since the user supplied this data, it is their choice how they want to deal with it. If they want to store it as a backup in a database of some sort, then they can do this through the workflow diagram. So the data can be stored in a 'plug-in' database (as Gary and Jeff mentioned). This type of interface/data storage component isn't "essential" to the functioning of Loci, so I will go on to the essential data storage needs.

2. Workflow Data: Loci will need a method to store the user-defined workflow diagram. This diagram includes:

   1. the setup of the workflow diagram (how everything is connected together)
   2. the constructed command line for each program
   3. more???

This is the kind of storage need I was thinking about when I wrote my incoherent message a couple of days ago about trees and graphs. My thinking is that we can stick all of the information from a workflow diagram into a data structure, and then move through this structure in the specified order to execute the contents of the workflow diagram. My new data structure of choice is a flow network (still from Intro to Algorithms). I think each element of the network would have a setup kind of like the following pseudo-code:

    data-structure loci:
        array[pointers] TheNextLoci   # pointers to the loci which come next
                                      # in the flow diagram
        string Type                   # the locus type
        string IOName                 # the program or document represented
                                      # by the locus
        tuple CommandLine             # all of the command line arguments
        pointer XMLDocument           # the info being processed
        pointer DTD                   # the document definition for the
                                      # particular locus
        pointer ActionInstructions    # a document with what to do at
                                      # that locus

Of course, this would require each locus to set up a DTD-type file that has the specifications to create a document for the particular program (I talk more about how I think this would work in point 3. below), and also an ActionInstructions document to determine what to do at that locus (i.e. display a pdb file in RasMol, align sequences from the XML document, etc.). My mental image is that the XML document would move into a particular locus, be converted to the DTD required for that locus, and then be processed according to the specifications of the program at that locus. I imagine the setup of the DTD and action instructions would be part of the plug-in process for each program that needs to read a document into, or get info from, the workflow diagram.

3. Internal XML warehouse: My thoughts on this are pretty directly based on the WHAX paper. Here is roughly what I imagine happening with a document that comes into Loci.
First, the document will be converted into XML format based on the DTD of the locus (i.e. the type of data in the document). This XML document will then be put into an XML database. (Note: This is kind of what I was thinking before: have a database to store info instead of a specific internal format.) Then, as you progress through the workflow diagram, each locus will create an XML warehouse from the XML database based on the DTD requirements of that particular locus. So what I am thinking is that we can use the WHAX system to maintain an XML document that has all of the info needed for a particular locus. For instance, if we come to a processor that requires the sequences in the database in FASTA format, we can pull out the sequences and other required info from the database and update the XML warehouse to have this info. So we would maintain a view of the data available in the database and update it for the needs of each locus. Okay, I should stop talking about this point before I get any more confusing!

More ranting
------------

Basically, I am proposing a plan whereby we eliminate a specific internal storage format and essentially put everything into a database. Of course, this type of plan "requires" a database, and here I was thinking that we could use dbXML (http://www.dbXML.org), mentioned by Jeff in the archives. The database is under a BSD-style license (which I think is compatible with the LGPL), and although it doesn't "do" anything yet, it is under current development (most recent tarball = November 27th) and we could try to coordinate development with Tom Bradford, the developer there. He is developing it in C++ with a CORBA interface (he is using ORBacus as his ORB), so ultimately the database could also be pluggable (you could use any XML storage database), which fits in well with the Loci schema.
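
As a rough illustration of the "database plus per-locus views" idea, here is a small Python sketch. Everything in it is an assumption made for the example: the XMLStore interface, the InMemoryStore stand-in, the fasta_view function, and the sequence/residues element names are all hypothetical, and none of it reflects dbXML's actual API.

```python
import xml.etree.ElementTree as ET
from abc import ABC, abstractmethod


class XMLStore(ABC):
    """Hypothetical interface that any pluggable XML database would implement."""

    @abstractmethod
    def add(self, element):
        """Store one XML element."""

    @abstractmethod
    def select(self, tag):
        """Return all stored top-level elements with the given tag."""


class InMemoryStore(XMLStore):
    """Stand-in for a real XML database; keeps everything under one root."""

    def __init__(self):
        self._root = ET.Element("store")

    def add(self, element):
        self._root.append(element)

    def select(self, tag):
        return self._root.findall(tag)


def fasta_view(store):
    """Build the view a FASTA-consuming locus would need from the store."""
    lines = []
    for seq in store.select("sequence"):
        lines.append(">" + seq.get("id"))
        lines.append(seq.findtext("residues"))
    return "\n".join(lines)


store = InMemoryStore()
store.add(ET.fromstring('<sequence id="seq1"><residues>ACGT</residues></sequence>'))
store.add(ET.fromstring('<sequence id="seq2"><residues>GGCC</residues></sequence>'))
# The store takes any element; the FASTA view simply ignores what it does not need.
store.add(ET.fromstring("<note>anything goes</note>"))

view = fasta_view(store)
print(view)
```

A locus only ever talks to the XMLStore interface, so swapping InMemoryStore for a real database would not change the view-building code.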

The reason that I think this kind of plan is better than an internal format is that it gives us a lot of flexibility to input any kind of information, as Jennifer was talking about. For instance, say we had a program to plug in that uses specific animal descriptors to build an evolutionary tree. So you might have data for an anteater in the input file like:

    <Claws> Sharp and Pointy </Claws>
    <Nose> Long </Nose>
    <Tongue> Really Long </Tongue>

(Okay, so I don't know anything about anteaters! Sorry!) With an internal data format, we would have to define a new DTD to include these three elements, but with a database format I don't think this would be necessary.

Okay, well, this is what has been on my mind for the past couple of days, and hopefully I've managed to scrape it together in a semi-organized fashion. I would be really interested to hear everyone's comments about the ideas, to see if they are along the lines of other people's thinking or just really crazy. Also, thank you very much if you read this all of the way through to the end!

Brad