[Pipet Devel] WHAX and Loci storage ideas

Mon Dec 6 20:18:08 EST 1999

Gary et al.;
	Thanks for getting back with me about my data storage thinking! I
think I may have the idea now--so I'll kind of work through everything in the
rest of this e-mail, and then humbly propose a short-term development
plan(!) just for the sake of argument.

Gary Van Domselaar wrote:
>in mind that Loci is constructed in a 'three-tier' architecture:
>1. The GUI 'Front-end' with 'bindings' to the 'Middleware'.
>2. The 'Middleware', which is the CORBA, or command line interface, or
>3. The Back-end, which are the information repositories (filesystems,
>I'm pretty certain that you already understand this paradigm, but I
>thought I should make it explicit for the sake of discussing your ideas
>on data storage for Loci.

Yeah, I have a firm grasp on the theory but in practice, I know that I have
a lot of difficulty separating Front-End (ie. Loci proper) and Middleware
(ie. plug-ins to Loci). I apologize about that--I know that some of my
thoughts probably reflect my inability to separate these components. I'm
working at it!

Gary Van Domselaar wrote:
>> WHAX (Warehouse Architecture for XML)
>> -------------------------------------
>The URL for this document is
>http://db.cis.upenn.edu/cgi-bin/Person.perl?susan
>
>The document title is:  Efficient View Maintenance in XML Data Warehouses

Thanks, I meant to include that info!

Gary Van Domselaar wrote:
>> 1. The data that comes in as a document (for instance, a set of
>> sequences in FASTA format). These are the input files provided by
>> the user.
>
>Or retrieved from a database query, or output by an analysis program.

Right-o!

Gary Van Domselaar wrote:
>>
>> 2. The actual setup of a workflow diagram--the underlying structure of the
>> diagram (how all of the loci are connected together). This is supplied by
>> the user in the workflow diagram by connecting all of the dots together and
>> constructing the command-lines (in the words of Jeff!).
>
>This is my understanding as well, although the WFD will be constructed
>via a graphical shell, which has a 'thin interface' to the middleware.
>When you say 'constructing the command-lines', do you mean 'generating
>the interface to the middleware'?

What I think this refers to is generating a command-line for a program by
using a GUI to input all of the switches. For instance, if I were using
program foo that used a -l switch to specify a log file, I would use the
Loci interface to generate the equivalent of 'foo -l /var/mylogfile'. My
thinking was that 'the interface to the middleware' would be worked out
during the programming of the plug-in to work with Loci. For instance, to
get Loci to use my sequence viewer program, I would have to tell it by
writing the plug-in:

1. What kind of file the program needs (e.g. PDB, FASTA, etc.)
2. How to work the program (i.e. the command line stuff: the switches it
takes, etc.)

Loci would then take this info and have a GUI for 'constructing the command
line' (getting the switches set up) and do error checking to make sure the
user supplies the right file for the program.
At least, this is my current understanding of how stuff would work.
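
A minimal sketch of that idea in Python, assuming a plug-in is described by a
plain dictionary. The field names (program, input_formats, switches) and the
build_command helper are made up for illustration and are not part of Loci:

```python
import shlex

# Hypothetical plug-in description; these field names are illustrative only.
FOO_PLUGIN = {
    "program": "foo",
    "input_formats": {"FASTA", "PDB"},   # 1. what kind of file it needs
    "switches": {"log_file": "-l"},      # 2. how to work the program
}


def build_command(plugin, input_format, **options):
    """Check the input format, then return the program's argument list."""
    if input_format not in plugin["input_formats"]:
        raise ValueError(
            f"{plugin['program']} cannot read {input_format} files"
        )
    args = [plugin["program"]]
    for name, value in options.items():
        if name not in plugin["switches"]:
            raise ValueError(f"unknown option: {name}")
        args += [plugin["switches"][name], str(value)]
    return args


command = build_command(FOO_PLUGIN, "FASTA", log_file="/var/mylogfile")
print(shlex.join(command))  # foo -l /var/mylogfile
```

A GUI would fill in the keyword options from its widgets; the format check is
the "error checking" step described above.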

Gary Van Domselaar wrote:

>We still need to formalize the interface to the command-line-run
>backend apps, but this sounds about right to me.
>
>The OMG LSR ( http://www.omg.org/homepages/lsr/) Biomolecular Sequence
>Analysis working group has a nearly complete RFP
>(http://www.omg.org/techprocess/meetings/schedule/Biomolecular_Sequ._Analysis_RFP.html)
>for sequences and their alignment and annotation.  Loci plans to adopt
>their CORBA IDL for passing biomolecular sequence objects to
>CORBA-compliant backend apps.  This RFP has 'XML extensions' for future
>compatibility, btw.

Thanks--I'll take a look at it (whenever I am feeling up to looking at a
huge document with half the lines crossed out!). I just came up with that
"interface" specification off the top of my head--just wanted to make sure
I was on the right track.

Gary Van Domselaar wrote:
>I think this is appropriate only for Loci's own internal data
>requirements, but violates Loci's 'laissez-faire' paradigm for operating
>on 'exogenous' data. Jeff explained to me best when he said that Loci
>should be like the Bash shell: the bash shell has redirection operators
>and pipes, which you can combine to do some fairly sophisticated data
>processing, for example:
>
>bash$ cat /var/adm/messages | grep "root" > /tmp/root.txt
>
>Here bash will pipe the contents of /var/adm/messages to grep, which
>will extract all the lines containing the word 'root' and place them in
>the /tmp/root.txt file.  Bash itself cares not about the contents of
>/var/adm/messages, doesn't reformat it, doesn't store it in an
>intermediate database, then re-extract it from the database, reformat it
>once again, and finally pump out the /tmp/root.txt file according to
>some xml dtd.  Neither should Loci, in its most abstracted form.

I really like the idea of piping! You (and Jeff) are right, there is no
reason to stick stuff in a database if you could just pipe it around.
However, I have a couple of practical questions for using a piping approach
like this:

1. If you have data from a number of sources in a bunch of different
formats, how would you get them together to pipe them into a program that
would require them all in one text document in, say, FASTA format? Would
you have to run each of them through a converter to get them in a common
format, then pipe them all into a processor that would stick them into a
single file?

2. Conversely, what if you had a huge document and wanted to break it up
into smaller documents? For example, what if you had a swiss-prot file and
wanted to get just the protein sequences for all Zea mays (corn)
accessions--how would this be done?

3. How could individual parts of the data be queried or reordered? For
instance, if I wanted to separate all sequences with a particular motif out
of a file and then reorder them by organism.

4. What about doing things like generating GUIs on the fly, as Jeff talked
about in the 'constructing the command line' mail? He mentioned getting a
pyGTK GUI directly from a Glade output XML document in this case, but
similarly, what if we wanted to put the output into a web browser? Would we
convert the file to XML, then process it into HTML/GladeXML and then output
it?

These are just a few concerns I thought up for discussion regarding the
piping system you described. I really like the idea, and think it would be
more straightforward to do, but my only concern is how well it would
scale as operations got more complicated. I guess I have been thinking of
Loci more as a graphical scripting language, which I imagine having a lot
more options than just a redirection shell.
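
For the sake of discussion, here is a rough Python sketch of how questions 2
and 3 might look as small filter stages chained together like pipes, without
any database in between. It assumes FASTA headers of the form
"ID|Organism name", which is a made-up layout for this example, and all of the
function names are hypothetical:

```python
import re


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def organism(header):
    """Assumed header layout: 'ID|Organism name'."""
    return header.split("|", 1)[1] if "|" in header else ""


def from_organism(records, name):
    """Question 2: keep only the records for one organism."""
    return (rec for rec in records if organism(rec[0]) == name)


def with_motif(records, pattern):
    """Question 3: keep only the sequences containing a motif (a regex)."""
    motif = re.compile(pattern)
    return (rec for rec in records if motif.search(rec[1]))


def by_organism(records):
    """Question 3: reorder the records by organism name."""
    return sorted(records, key=lambda rec: organism(rec[0]))


SAMPLE = """>P1|Zea mays
MKTAYIAKQR
>P2|Arabidopsis thaliana
MKTGGSAYQR
>P3|Zea mays
GGGSSS
"""

corn = list(from_organism(read_fasta(SAMPLE.splitlines()), "Zea mays"))
hits = by_organism(with_motif(read_fasta(SAMPLE.splitlines()), "MKT"))
print([header for header, _ in corn])
print([header for header, _ in hits])
```

Each stage only reads records and passes them on, so the stages compose the
way shell filters do. Note that the sort step has to hold every record in
memory, which is where the scaling concern above would start to show.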

Gary Van Domselaar wrote:
>Instead, the data conversions and XML operations should be the modular
>extensions to Loci that we provide as valuable options for the end-user,
>so that Loci becomes not just a graphical 'bash', but a sophisticated
>distributed data processing system.  Not that a graphical bash wouldn't
>be nice:  the gnome dudes have talked about using Loci's graphical shell
>to do just that!  Bottom line:  maximum abstraction + maximum
>modularization = maximum flexibility = maximum power!

	You are absolutely right! The best way to combine the piping
backbone with the scripting extensions would be to use a pluggable database
type option (the container) within the pipeline as I was mentioning before.
There I was thinking more in the context of a relational database for
long-term storage, but now I am thinking more in terms of an XML type database
for short-term storage for Loci's internal data requirements. Alright, yet
another separation between Front-end and Middleware! Sorry that I did not
grasp this sooner!
	So, how does this new paradigm for storage sound?:

1. Front-end: No storage capabilities of its own. Used to organize the
connections to the middleware and pass data around.

2. Middleware--2 storage options:
a. Provide an option for XML storage of an "internal XML format." If a user
has a need for more complicated data-handling (as I described in my
questions above), they can utilize this option to place things in an
internal XML database and then use the XML warehouse kind of stuff I
described in point 3 in my last e-mail.
b. Provide an option for permanent storage with relational databases (e.g.
MySQL, PostgreSQL, Sybase ...), so that the data can be available after
Loci has quit.

The middleware would handle the connections between the Loci front-end,
which asks for a database or internal format, and the back-end, which
provides it.

3. Back-end: All of the databases themselves.
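
To make the split concrete, here is a small Python sketch of two
interchangeable stores behind one middleware entry point. Everything in it is
an assumption for illustration: the class names, the put/get interface, and
the use of sqlite3 as a stand-in for a relational database such as MySQL or
PostgreSQL.

```python
import sqlite3
import xml.etree.ElementTree as ET


class XMLStore:
    """Option a: short-term storage in an in-memory XML tree."""

    def __init__(self):
        self.root = ET.Element("records")

    def put(self, name, sequence):
        record = ET.SubElement(self.root, "record", name=name)
        record.text = sequence

    def get(self, name):
        for record in self.root.iter("record"):
            if record.get("name") == name:
                return record.text
        return None


class SQLStore:
    """Option b: permanent storage; sqlite3 stands in for a real RDBMS."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS records "
            "(name TEXT PRIMARY KEY, sequence TEXT)"
        )

    def put(self, name, sequence):
        with self.db:  # commits on success
            self.db.execute(
                "INSERT OR REPLACE INTO records VALUES (?, ?)",
                (name, sequence),
            )

    def get(self, name):
        row = self.db.execute(
            "SELECT sequence FROM records WHERE name = ?", (name,)
        ).fetchone()
        return row[0] if row else None


def open_store(kind, **kwargs):
    """Middleware entry point: the front-end only names the kind it wants."""
    stores = {"xml": XMLStore, "sql": SQLStore}
    return stores[kind](**kwargs)


store = open_store("xml")
store.put("seq1", "ACGT")
print(store.get("seq1"))  # ACGT
```

The front-end never touches either store directly; passing a file path to
open_store("sql", path=...) would keep the data around after the program quits.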

If this sounds like a plan, then I would like to humbly propose an
immediate development focus: Get the piping stuff working with the Loci
front-end so that we can do something like the following:

1. Input a sequence in FASTA format
2. Convert it to a new format
3. View it in a sequence viewer

This type of activity would not require any storage options, so this would
simplify things. In addition, Jeff has the GUI set up to make the
connections, so we are currently able to construct this kind of workflow
diagram. I think reaching this kind of short-term goal would be extremely
exciting, as Loci would actually "do" something and would provide us with a
base for further development. How does this sound? Anyone for this?
Hip-hip-hooray? Booooo? Whatta you think?
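
As a toy illustration of that three-step workflow, here is a Python sketch
where each step is a plain function and the output of one feeds the next. The
tab-separated target format and the text-only "viewer" are placeholders chosen
for this example, not anything Loci defines:

```python
from functools import reduce


def read_fasta(text):
    """Step 1: parse a single FASTA record into (name, sequence)."""
    lines = text.strip().splitlines()
    if not lines or not lines[0].startswith(">"):
        raise ValueError("not a FASTA record")
    name = lines[0][1:].strip()
    return name, "".join(line.strip() for line in lines[1:])


def to_tabular(record):
    """Step 2: convert to a one-line, tab-separated format."""
    name, sequence = record
    return f"{name}\t{sequence}"


def view(tabular, width=10):
    """Step 3: a stand-in viewer that wraps the sequence for display."""
    name, sequence = tabular.split("\t")
    rows = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([f"Sequence: {name}"] + rows)


def run_workflow(stages, data):
    """Feed the output of each stage into the next, like a shell pipe."""
    return reduce(lambda value, stage: stage(value), stages, data)


fasta = ">example\nACGTACGTAC\nGTACG\n"
print(run_workflow([read_fasta, to_tabular, view], fasta))
```

In a workflow diagram, the list passed to run_workflow would come from the
connections drawn between loci rather than being written out by hand.
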
	Well, if you have made it to the end again, thank you very much! I
would love to hear comments, etc. Also, I hope I don't step on any toes by
making a development direction suggestion. I just want to get an idea of the
short- and long-term goals of Loci and kind of find my place somewhere in
there so I can have Loci working for my thesis project needs. Thanks again
for listening!

Brad