Brad,

It's good to know that someone is thinking about data storage issues for
Loci. This is an important and (in my personal opinion) underdiscussed
topic, so let's discuss some of these ideas now.

For clarity, let's keep in mind that Loci is constructed in a 'three-tier'
architecture:

1. The GUI 'front-end', with 'bindings' to the middleware.

2. The 'middleware': the CORBA, command-line, HTTP, or whatever interface
   is needed to access the back-end. These are the services that allow the
   back-end to interoperate, as dictated by the WFD. A 'data translator
   locus' is a good example of Loci middleware; the database used to store
   the individual loci contained within a 'container locus' would be
   another.

3. The 'back-end': the information repositories (filesystems, databases,
   and so on) and the analysis programs that manipulate the data. The
   back-end is likely to be diverse, both architecturally and
   geographically.

Note that nowhere in this description is there any mention of data type:
Loci can work for physicists as well as it can for bioinformaticists. But
we are all bioinformaticists here, so we always present our scenarios (and
will use Loci) as a bioinformatics application. A multiple-alignment
program is a good example of a back-end locus.

The back-end 'resources' are the 'loci'. They are represented by the
icons/nodes in the front-end and made interoperable by the middleware. The
front-end and the back-end don't even know about each other.

Although I'm not the absolute authority on Loci's architecture, and the
architecture will likely continue to evolve, I'm relatively certain that
this is the current 'Loci architectural paradigm'. I'm pretty sure you
already understand it, but I thought I should make it explicit for the
sake of discussing your ideas on data storage for Loci.
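To make the separation concrete, here is a toy sketch of the three tiers in
Python. All the names here (BackendLocus, Middleware, FrontEnd, the
"uppercase" locus) are hypothetical illustrations, not any real Loci
interface; the point is only that the front-end and back-end never touch
each other directly.

```python
# Toy sketch of the three-tier separation.  Hypothetical names throughout.

class BackendLocus:
    """Tier 3: an analysis program; knows nothing about the GUI."""
    def run(self, data):
        raise NotImplementedError

class Uppercaser(BackendLocus):
    """Stand-in for a real analysis program (e.g. an aligner)."""
    def run(self, data):
        return data.upper()

class Middleware:
    """Tier 2: routes requests between front-end and back-end loci."""
    def __init__(self):
        self._loci = {}
    def register(self, name, locus):
        self._loci[name] = locus
    def invoke(self, name, data):
        return self._loci[name].run(data)

class FrontEnd:
    """Tier 1: the graphical shell; talks only to the middleware."""
    def __init__(self, middleware):
        self._mw = middleware
    def execute_node(self, name, data):
        return self._mw.invoke(name, data)

mw = Middleware()
mw.register("uppercase", Uppercaser())
gui = FrontEnd(mw)
result = gui.execute_node("uppercase", "atgc")  # front-end never sees the back-end
```

Swapping the Middleware implementation (CORBA, command line, HTTP) should
leave both other tiers untouched; that is the whole point of the paradigm.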
Brad Chapman wrote:

> WHAX (Warehouse Architecture for XML)
> -------------------------------------
> Basically, this is a technical document detailing the implementation of
> WHAX. What WHAX is designed to do is take selected information from a
> data source, which can be either a database or an XML document, and
> represent it as an "XML warehouse." This XML warehouse contains specific
> information from a database which has been selected by the user. For
> instance, if you had a database full of books you've read, you could
> create an XML warehouse of all of the books you've read that were
> written by Stephen King. Key characteristics of an XML warehouse are
> that it is in XML format and is represented by a tree structure, so
> based on my limited XML knowledge, this seems analogous to a Document
> Object Model (DOM).
>
> What WHAX does is define a method for upkeeping this XML warehouse. The
> upkeep differs from the upkeep of databases because XML is in a
> semi-structured format -- the paper describes it as "self-describing,
> irregular data." The paper details methods for changing the XML
> warehouse when new data is added or removed, and for keeping the
> warehouse consistent with changes in the underlying database from which
> the XML warehouse got its information.

The URL for this document is:
http://db.cis.upenn.edu/cgi-bin/Person.perl?susan
The document title is: "Efficient View Maintenance in XML Data Warehouses"

> Data Storage in Loci
> --------------------
> Reading through this document got me thinking about how this could be
> applied to Loci, and I came up with the following model of data storage
> in Loci.
>
> To make things simpler in my head, I split the data storage needs of
> Loci (according to my, hopefully correct!, model of Loci) into three
> categories:
>
> 1. The data that comes in as a document (for instance, a set of
> sequences in FASTA format). These are the input files provided by the
> user.
Or retrieved from a database query, or output by an analysis program.

> 2. The actual setup of a workflow diagram -- the underlying structure of
> the diagram (how all of the loci are connected together). This is
> supplied by the user in the workflow diagram by connecting all of the
> dots together and constructing the command lines (in the words of
> Jeff!).

This is my understanding as well, although the WFD will be constructed via
a graphical shell, which has a 'thin interface' to the middleware. When
you say 'constructing the command lines', do you mean 'generating the
interface to the middleware'?

> 3. The internal XML warehouse (to use my new WHAX-learned term!). This
> would be a subset of the supplied data (1.) that is passed from locus to
> locus according to the workflow diagram. Jeff describes this very well
> (Data Storage Interfaces -- June 11) as an XML document that travels
> from locus to locus and changes XML formats (i.e. changes to different
> document structures according to the specific DTD (document type
> definition) needed at that locus).
>
> Each of these points has specific storage needs, so I have come up with
> a separate plan for each of them:
>
> 1. Input Data: Since the user supplied this data, it is their choice to
> determine how they want to deal with it. If they want to store it as a
> backup in a database of some sort, then they can do this through the
> workflow diagram. So the data can be stored in a 'plug-in' database
> (what Gary and Jeff mentioned it to be). This type of interface/data
> storage component isn't "essential" to the functioning of Loci, so I
> will go on to the essential data storage needs.

Exactly. Using Jeff's analogy, what if we were to retrieve an entire
2-terabyte sequence file, in GenBank format, from the NCBI database, and
wanted to search the entire file against the cDNA for alpha-hemoglobin?
Let's suppose further that we had access to a remote analysis program
running on a fancy supercomputer that did BLAST searches for us and
required GenBank-formatted files to perform the search. Suppose further
that the NCBI database and the supercomputer were on the same machine. We
could construct a WFD where we retrieve the 2-terabyte file from NCBI and
'pipe' it directly to the analysis program, along with our a-hemoglobin
cDNA, and BLAST away. In theory, Loci would send the data from the
database through the analysis program, possibly without the data ever
touching a network interface card, and without it ever being reformatted.
If, however, Loci required the data to be reformatted and stored in an
intermediate database, say on my 66 MHz 486 with a 400 MB hard drive and
4 MB of RAM, I'd be running for the fire extinguisher as my CPU exploded
in a core-dumping ball of fire.

On the other hand, what if we planned to do our entire thesis project
based upon the information kept in that 2-terabyte file? Would we want to
retrieve it from the NCBI database every time we wanted to do an analysis
on it, especially if we wanted to search only a small segment of it? No
way! We would want to have that file stored in a fashion wherein we could
easily extract only the parts we are interested in analyzing. This is
where Loci's ability to store sequence data in a database becomes
important.

> 2. Workflow Data: Loci will need a method to store the user-defined
> workflow diagram. This diagram includes: 1. the setup of the workflow
> diagram (how everything is connected together); 2. the constructed
> command line for each program; 3. more??? This is the kind of storage
> need I was thinking about when I wrote my incoherent message a couple of
> days ago about trees and graphs.
> Basically, my thinking is that we can stick all of the information from
> a workflow diagram into a data structure, and then move through this
> structure in the specified order to execute the contents of the workflow
> diagram. My new data structure of choice is a flow network (still from
> Intro Algorithms). Basically I think each element of the network would
> have a setup kind of like the following pseudo-code:
>
> data-structure loci:
>     array[pointers] TheNextLoci     # pointers to the loci which come
>                                     # next in the flow diagram
>     string Type                     # the locus type
>     string IOName                   # the program or document represented
>                                     # by the locus
>     tuple CommandLine               # all of the command-line arguments
>     pointer XMLDocument             # the info being processed
>     pointer DTD                     # the document definition for the
>                                     # particular locus
>     pointer ActionInstructions      # a document with what to do at that
>                                     # locus

We still need to formalize the interface to the command-line-run back-end
apps, but this sounds about right to me. The OMG LSR
(http://www.omg.org/homepages/lsr/) Biomolecular Sequence Analysis working
group has a nearly complete RFP
(http://www.omg.org/techprocess/meetings/schedule/Biomolecular_Sequ._Analysis_RFP.html)
for sequences and their alignment and annotation. Loci plans to adopt
their CORBA IDL for passing biomolecular sequence objects to
CORBA-compliant back-end apps. This RFP has 'XML extensions' for future
compatibility, btw.

> Of course, this would require each locus to set up a DTD-type file that
> has the specifications to create a document for the particular program
> (I talk more about how I think this would work in point 3. below) and
> also an ActionInstruction to determine what to do at that locus (i.e.
> display a pdb file in RasMol, align sequences from the XML document,
> etc.).
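Just to show that Brad's pseudo-structure maps directly onto real code,
here it is rendered as a Python class -- a sketch only, with field names
taken from his pseudo-code; none of this is a settled Loci interface, and
the "blastall" wiring at the bottom is a made-up example.

```python
# Brad's flow-network node, sketched in Python.  Field names follow his
# pseudo-code; nothing here is a committed Loci design.
from dataclasses import dataclass, field

@dataclass
class LocusNode:
    next_loci: list = field(default_factory=list)  # loci that come next in the flow diagram
    locus_type: str = ""                           # the locus type
    io_name: str = ""                              # program or document this locus represents
    command_line: tuple = ()                       # all of the command-line arguments
    xml_document: object = None                    # the info being processed
    dtd: object = None                             # document definition for this locus
    action_instructions: object = None             # what to do at this locus

# Wiring two nodes together as a tiny flow network (hypothetical example):
blast = LocusNode(locus_type="program", io_name="blastall",
                  command_line=("-p", "blastn"))
source = LocusNode(locus_type="document", io_name="query.fasta",
                   next_loci=[blast])
```

Executing a diagram would then just be a walk over next_loci in the
specified order, doing whatever each node's action_instructions say.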
> My mental image is that the XML document would move into a particular
> locus, be converted to the DTD required for that particular locus, and
> then processed according to the specifications of the program at that
> locus. I imagine the setup of the DTD and action instructions would be
> part of the plug-in process for each program that needs to read a
> document into, or get info from, the workflow diagram.

My understanding is that Loci will come with 'data translators'
(middleware) that will be placed between a document/database and the
analysis program, to accommodate the formatting requirements of the
program that will operate on the document.

> 3. Internal XML warehouse: My thoughts on this are pretty directly based
> off the WHAX paper. Here is kind of what I imagine happening with a
> document that comes into Loci. First the document will be converted into
> XML format based on the DTD of the locus (i.e. the type of data in the
> document). This XML document will then be put into an XML database.
> (Note: This is kind of what I was thinking before -- have a database to
> store info instead of a specific internal format.)

I think this is appropriate only for Loci's own internal data
requirements, but it violates Loci's 'laissez-faire' paradigm for
operating on 'exogenous' data. Jeff explained it to me best when he said
that Loci should be like the Bash shell. Bash has redirection operators
and pipes, which you can combine to do some fairly sophisticated data
processing, for example:

bash$ cat /var/adm/messages | grep "root" > /tmp/root.txt

Here bash will pipe the contents of /var/adm/messages to grep, which will
extract all the lines containing the word 'root' and place them in the
/tmp/root.txt file. Bash itself cares not about the contents of
/var/adm/messages: it doesn't reformat it, doesn't store it in an
intermediate database, then re-extract it from the database, reformat it
once again, and finally pump out the /tmp/root.txt file according to some
XML DTD.
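The same streaming behaviour is easy to sketch in Python with generators
(a hypothetical illustration, not Loci code): each stage pulls lines
lazily from the one before it, so nothing is buffered, reformatted, or
parked in an intermediate store along the way.

```python
# The bash pipeline above, sketched with lazy Python generators.
# Hypothetical illustration only; the sample log lines are made up.

def cat(lines):
    """Stand-in for reading /var/adm/messages: yield lines one at a time."""
    for line in lines:
        yield line

def grep(pattern, lines):
    """Yield only the lines containing the pattern, as they stream past."""
    for line in lines:
        if pattern in line:
            yield line

messages = [
    "su: root login on tty1",
    "kernel: eth0 up",
    "sudo: root session opened",
]

# Nothing is evaluated or stored until we actually consume the pipeline:
matches = list(grep("root", cat(messages)))
```

As with bash, the data just flows through; no stage reformats or warehouses
it.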
Neither should Loci, in its most abstracted form. Instead, the data
conversions and XML operations should be modular extensions to Loci that
we provide as valuable options for the end user, so that Loci becomes not
just a graphical 'bash' but a sophisticated distributed data processing
system. Not that a graphical bash wouldn't be nice: the GNOME dudes have
talked about using Loci's graphical shell to do just that!

Bottom line: maximum abstraction + maximum modularization = maximum
flexibility = maximum power!

gary

--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Gary Van Domselaar                gvd at redpoll.pharmacy.ualberta.ca
Faculty of Pharmacy               Phone: (780) 492-4493
University of Alberta             FAX:   (780) 492-5305
Edmonton, Alberta, Canada         http://redpoll.pharmacy.ualberta.ca/~gvd