[Pipet Devel] Tools for parsing XML with Python

J.W. Bizzaro bizzaro at bc.edu
Mon Sep 27 13:42:27 EDT 1999


Humberto Ortiz Zuazaga wrote:
> 
> No, an XML database is to avoid filling a database with blobs. It instead uses
> the XML tags to set up the fields for a database automagically. We also want
> the database to be able to efficiently answer queries like "show me all the
> transmembrane domains of the following proteins".

Again, I think we need a database that can facilitate queries of just about any
kind.  So, for example:

    DATABASE LOCUS -----> "Show me all the transmembrane -----> DATA LOCUS
                            domains of the proteins
                            in the DB."
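
To make the idea concrete, here is a minimal sketch in Python of tags setting up
the fields, followed by the query above.  The XML layout, the table name
`domain`, and both function names are assumptions made up for illustration;
nothing like this exists in Loci yet.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical record layout; the thread does not define a schema.
SAMPLE_XML = """
<proteins>
  <protein id="P1">
    <domain type="transmembrane" start="10" end="32"/>
    <domain type="kinase" start="40" end="290"/>
  </protein>
  <protein id="P2">
    <domain type="transmembrane" start="5" end="27"/>
  </protein>
</proteins>
"""

def load_domains(xml_text):
    """Build an in-memory table whose columns come from the XML attribute names."""
    root = ET.fromstring(xml_text)
    domains = [(p.get("id"), d) for p in root.iter("protein") for d in p.iter("domain")]
    # The fields are discovered from the data rather than declared up front.
    columns = sorted({name for _, d in domains for name in d.attrib})
    for name in columns:
        if not name.isidentifier():
            raise ValueError("unsupported attribute name: " + name)
    conn = sqlite3.connect(":memory:")
    # Names are double-quoted because some (such as "end") are SQL keywords.
    col_sql = ", ".join('"%s" TEXT' % c for c in columns)
    conn.execute('CREATE TABLE domain ("protein_id" TEXT, %s)' % col_sql)
    marks = ", ".join("?" for _ in range(len(columns) + 1))
    for protein_id, d in domains:
        row = [protein_id] + [d.get(c) for c in columns]
        conn.execute("INSERT INTO domain VALUES (%s)" % marks, row)
    return conn

def transmembrane_domains(conn, protein_ids):
    """Answer: show me all the transmembrane domains of the following proteins."""
    marks = ", ".join("?" for _ in protein_ids)
    sql = ('SELECT "protein_id", "start", "end" FROM domain '
           'WHERE "type" = ? AND "protein_id" IN (%s) '
           'ORDER BY "protein_id"' % marks)
    return conn.execute(sql, ["transmembrane"] + list(protein_ids)).fetchall()
```

All values are stored as text here; a real DB would want typed fields.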

> We'll need a number of features, including a container for files. This kind of
> container is the simplest to set up, and requires the least amount of effort
> on the biologist's part to understand.

Yep.

> We'll also want containers that can
> parse FASTA format files, GenBank files, etc.

This would depend on the query mechanism (locus) used.  But the DB should be
generic enough to handle just about anything.

> Another simple container is a
> list of links, where an XML file contains a list of references to other loci.

This may just be a matter of implementation: could the DB itself be a list of
links to files on the file system?
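
A list-of-links container could be read with very little code.  This is only a
sketch: the element name `locus` and the attribute `href` are assumptions, not
an agreed format.

```python
import xml.etree.ElementTree as ET

# Hypothetical container file: each <locus> only points at data stored elsewhere.
LINKS_XML = """
<links>
  <locus href="file:~/loci/containers/genbank/x5559999"/>
  <locus href="genbank:x5559999"/>
</links>
"""

def read_links(xml_text):
    """Return the references held by a list-of-links container, in document order."""
    root = ET.fromstring(xml_text)
    return [el.get("href") for el in root.iter("locus") if el.get("href")]
```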

> This is a separate issue: Loci needs a way to reference objects without
> copying them. So if I have a file with the genbank entry for X5559999 already
> on my hard drive, I can include this in an analysis by referring to (say)
> file:~/loci/containers/genbank/x5559999. This scheme should be able to handle
> live network queries as well (with a URI like genbank:x5559999?).

Yes, this would be the case for both data (e.g., protein structure) and
processor (e.g., find lowest energy conformation of protein via molecular
dynamics).
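
A rough sketch of how such references might be told apart, using the two example
forms from the quoted text.  The function names and the local/remote split are
assumptions for illustration; the scheme is still undecided.

```python
import os

def parse_locus_ref(ref):
    """Split a reference such as 'genbank:x5559999' into (scheme, rest)."""
    scheme, sep, rest = ref.partition(":")
    if not sep or not scheme or not rest:
        raise ValueError("not a locus reference: %r" % ref)
    return scheme.lower(), rest

def describe(ref):
    """Classify a reference as a local file or something fetched over the network."""
    scheme, rest = parse_locus_ref(ref)
    if scheme == "file":
        # Expand '~' so the reference points at a real path on this machine.
        return ("local", scheme, os.path.expanduser(rest))
    # Assumption: every other scheme names a remote source such as GenBank.
    return ("remote", scheme, rest)
```

The same split would apply whether the locus is data or a processor.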

> In the best possible world, asking for x5559999 should retrieve it from
> genbank the first time, then cache it and transparently refer to the local
> copy for every subsequent reference until it expires from the loci cache. This
> is where we really need a database, to keep track of the source and current
> location of any locus. The database may also serve as primary storage for
> locally defined loci, where loci is used here in its most general form,
> referring to sequence data, results, programs, user interfaces, etc.

I couldn't have said it better myself.  We want _minimal_ transfer of
information.
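
The fetch-once, reuse-until-expiry behaviour can be sketched in a few lines.
The class name `LocusCache`, the `fetch` callback, and the time-based expiry
rule are all assumptions; a real version would keep its entries in the DB rather
than in memory.

```python
import time

class LocusCache:
    """Fetch a locus the first time it is asked for, then reuse the local copy."""

    def __init__(self, fetch, ttl_seconds, clock=time.time):
        self._fetch = fetch        # hypothetical callable that retrieves a reference
        self._ttl = ttl_seconds    # how long a cached copy stays valid
        self._clock = clock        # injectable so expiry can be tested
        self._entries = {}         # reference -> (data, time it was fetched)

    def get(self, ref):
        now = self._clock()
        entry = self._entries.get(ref)
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]        # still fresh: nothing is transferred
        data = self._fetch(ref)    # missing or expired: go back to the source
        self._entries[ref] = (data, now)
        return data
```

The caller never needs to know whether the copy came from the source or from
the cache.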

> A third kind of "file" we need to deal with is a compound document, say a
> figbuilder page with a multiple sequence alignment and a 3D structure. If we
> wanted to send this figure to a collaborator we could send it as a database
> containing the figure elements, as a gzip'ed tar file with the individual xml
> files, or as a single xml file containing all the required data and layout.

This is something that will take a lot of thought.  I called this compound
document a 'composite locus', meaning you can turn an entire section of a
Workflow Diagram (containing loci) into a single locus.  But when that composite
locus is sent to someone else, do we send just the links or the actual
data/processor?  If we send just the links and one locus resides (links to
actual data/processor) on the sender's computer, that data/processor has to be
accessible to the recipient via the link; otherwise the data/processor has to be
sent in its entirety.  But this is what you wrote below.

> In any case, while we build this figure, the pieces may come from different
> sources: several genbank sequences cached in the DM, two ASCII files with
> sequence info in a "container directory", the standard loci for sequence
> alignment editor and PDB structure viewer, etc. Then clicking on the "export
> for collaborator" button would build a file we can mail or publish by bringing
> together all the data, or perhaps just the data that's local, leaving the
> exported file with references to the standard items (genbank IDs, standard
> loci, PDB entries).

Ditto.
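
For what it's worth, the "embed what is local, link to the rest" export might
look something like the sketch below.  The element names, the `embedded`
attribute, and the rule that only `file:` references count as local are
assumptions for illustration.

```python
import xml.etree.ElementTree as ET

def export_composite(refs, load_local):
    """Build one XML document for a composite locus.

    refs       -- references to the loci that make up the composite
    load_local -- hypothetical callable returning the text behind a file: reference
    """
    root = ET.Element("composite")
    for ref in refs:
        part = ET.SubElement(root, "part", {"ref": ref})
        if ref.startswith("file:"):
            # Only the sender can reach this, so the data travels with the export.
            part.set("embedded", "yes")
            part.text = load_local(ref)
        # Anything else is assumed reachable by the recipient and stays a link.
    return ET.tostring(root, encoding="unicode")
```

This only covers text data; a processor or binary data would need an encoding
step that is not shown.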


Cheers.
Jeff
-- 
                         +----------------------------+
                         |        J.W. Bizzaro        |
                         |  jeff at bioinformatics.org   |
                         |                            |
                         |        THE OPEN LAB        |
                         | Open Source Bioinformatics |
                         |                            |
                         | http://bioinformatics.org/ |
                         +----------------------------+



