[Pipet Devel] Tools for parsing XML with Python
J.W. Bizzaro
bizzaro at bc.edu
Mon Sep 27 13:42:27 EDT 1999
Humberto Ortiz Zuazaga wrote:
>
> No, an XML database is to avoid filling a database with blobs. It instead uses
> the XML tags to set up the fields for a database automagically. We also want
> the database to be able to efficiently answer queries like "show me all the
> transmembrane domains of the following proteins".
Again, I think we need a database that can facilitate queries of just about any
kind. So, for example:
DATABASE LOCUS -----> "Show me all the transmembrane -----> DATA LOCUS
                       domains of the proteins
                       in the DB."
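To make the "fields from tags" idea concrete, here is a rough Python sketch. It is only an illustration under assumptions: the element and attribute names (protein, domain, kind, first_res, last_res) are made up, and SQLite stands in for whatever database we end up using.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical record layout; these element and attribute names are invented.
SAMPLE = """
<proteins>
  <protein id="P1">
    <domain kind="transmembrane" first_res="12" last_res="34"/>
    <domain kind="kinase" first_res="80" last_res="300"/>
  </protein>
  <protein id="P2">
    <domain kind="transmembrane" first_res="5" last_res="27"/>
  </protein>
</proteins>
""".strip()

def load_domains(xml_text):
    """Build an in-memory table whose columns come from the XML attribute names."""
    root = ET.fromstring(xml_text)
    # Every attribute name seen on a <domain> element becomes a field.
    fields = sorted({name for d in root.iter("domain") for name in d.attrib})
    conn = sqlite3.connect(":memory:")
    columns = ", ".join('"%s" TEXT' % f for f in fields)
    conn.execute("CREATE TABLE domain (protein TEXT, %s)" % columns)
    placeholders = ", ".join("?" for _ in range(len(fields) + 1))
    for protein in root.iter("protein"):
        for d in protein.iter("domain"):
            row = [protein.get("id")] + [d.get(f) for f in fields]
            conn.execute("INSERT INTO domain VALUES (%s)" % placeholders, row)
    return conn

def transmembrane_domains(conn, protein_ids):
    """Answer: show me all the transmembrane domains of the following proteins."""
    marks = ", ".join("?" for _ in protein_ids)
    sql = ("SELECT protein, first_res, last_res FROM domain "
           "WHERE kind = ? AND protein IN (%s) ORDER BY protein" % marks)
    return conn.execute(sql, ["transmembrane"] + list(protein_ids)).fetchall()
```

Calling transmembrane_domains(load_domains(SAMPLE), ["P1", "P2"]) returns one row per matching domain. All values are stored as text here; a real schema would want types.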
> We'll need a number of features, including a container for files. This kind of
> container is the simplest to set up, and requires the least amount of effort
> on the biologist's part to understand.
Yep.
> We'll also want containers that can
> parse FASTA format files, GenBank files, etc.
This would depend on the query mechanism (locus) used. But the DB should be
generic enough to handle just about anything.
> Another simple container is a
> list of links, where an XML file contains a list of references to other loci.
This may just be a matter of implementation: The DB is a list of links to files
on the file system?
> This is a separate issue: loci needs a way to reference objects without
> copying them. So if I have a file with the genbank entry for X5559999 already
> on my hard drive, I can include this in an analysis by referring to (say)
> file:~/loci/containers/genbank/x5559999. This scheme should be able to handle
> live network queries as well (with a URI like genbank:x5559999?).
Yes, this would be the case for both data (e.g., protein structure) and
processor (e.g., find lowest energy conformation of protein via molecular
dynamics).
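A minimal Python sketch of such a reference scheme follows. The function names and the resolver table are hypothetical, and the GenBank lookup is deliberately left unimplemented rather than guessed at.

```python
import os

def parse_locus_ref(ref):
    """Split a reference like 'genbank:x5559999' into (scheme, rest)."""
    scheme, sep, rest = ref.partition(":")
    if not sep:
        raise ValueError("not a locus reference: %r" % ref)
    return scheme.lower(), rest

def read_file_locus(rest):
    """Resolve a file: reference by reading the local file; '~' is expanded."""
    with open(os.path.expanduser(rest), "rb") as handle:
        return handle.read()

def fetch_genbank(rest):
    """Placeholder: a live network query would go here."""
    raise NotImplementedError("network lookup is left out of this sketch")

# Hypothetical table mapping a scheme to the function that resolves it.
RESOLVERS = {"file": read_file_locus, "genbank": fetch_genbank}

def resolve(ref, resolvers=RESOLVERS):
    """Look up the resolver for the reference's scheme and call it."""
    scheme, rest = parse_locus_ref(ref)
    try:
        handler = resolvers[scheme]
    except KeyError:
        raise ValueError("no resolver for scheme %r" % scheme) from None
    return handler(rest)
```

The same table could hold processor loci as well as data loci; nothing in the lookup cares which kind of object comes back.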
> In the best possible world, asking for x5559999 should retrieve it from
> genbank the first time, then cache it and transparently refer to the local
> copy for every subsequent reference until it expires from the loci cache. This
> is where we really need a database, to keep track of the source and current
> location of any locus. The database may also serve as primary storage for
> locally defined loci, where loci is used here in its most general form,
> referring to sequence data, results, programs, user interfaces, etc.
I couldn't have said it better myself. We want _minimal_ transfer of
information.
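The fetch-once-then-cache behaviour could look roughly like this in Python. It is a sketch, not a design: the class name, the fixed time-to-live, and the in-memory dictionary are all assumptions, and a real version would keep its bookkeeping in the database.

```python
import time

class LocusCache:
    """Fetch a locus on first use, then serve the local copy until it expires."""

    def __init__(self, fetch, ttl_seconds, clock=time.time):
        self._fetch = fetch        # callable(source, locus_id) -> data
        self._ttl = ttl_seconds    # assumed: one expiry time for every entry
        self._clock = clock        # injectable so expiry can be tested
        self._entries = {}         # (source, locus_id) -> (data, time fetched)

    def get(self, source, locus_id):
        key = (source, locus_id)
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]        # still fresh: nothing is transferred
        data = self._fetch(source, locus_id)
        self._entries[key] = (data, now)
        return data
```

Keying on both the source and the ID keeps track of where each cached copy came from, so an expired entry can be refetched from the same place.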
> A third kind of "file" we need to deal with is a compound document, say a
> figbuilder page with a multiple sequence alignment and a 3D structure. If we
> wanted to send this figure to a collaborator we could send it as a database
> containing the figure elements, as a gzip'ed tar file with the individual xml
> files, or as a single xml file containing all the required data and layout.
This is something that will take a lot of thought. I called this compound
document a 'composite locus', meaning you can turn an entire section of a
Workflow Diagram (containing loci) into a single locus. But when that composite
locus is sent to someone else, do we send just the links or the actual
data/processor? If we send just the links and one locus resides (links to
actual data/processor) on the sender's computer, that data/processor has to be
accessible to the recipient via link; otherwise the data/processor has to be
sent in its entirety. But this is what you wrote below.
> In any case, while we build this figure, the pieces may come from different
> sources: several genbank sequences cached in the DM, two ascii files with
> sequence info in a "container directory", the standard loci for sequence
> alignment editor and PDB structure viewer, etc. Then clicking on the "export
> for collaborator" button would build a file we can mail or publish by bringing
> together all the data, or perhaps just the data that's local, leaving the
> exported file with references to the standard items (GenBank IDs, standard
> loci, PDB entries).
Ditto.
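As a sketch of that export step in Python: references anyone can resolve stay as links, and everything else is embedded. The element names, the "mode" attribute, and the set of public schemes are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Assumed: schemes the recipient can resolve without access to our machine.
PUBLIC_SCHEMES = {"genbank", "pdb"}

def export_composite(name, refs, read_local):
    """Build one XML document for a composite locus.

    refs is a list of locus references such as 'genbank:x5559999';
    read_local is a callable(ref) -> str that returns the local data.
    """
    root = ET.Element("composite", name=name)
    for ref in refs:
        scheme = ref.partition(":")[0]
        node = ET.SubElement(root, "locus", ref=ref)
        if scheme in PUBLIC_SCHEMES:
            node.set("mode", "link")      # recipient follows the reference
        else:
            node.set("mode", "embedded")  # local only, so ship the data itself
            node.text = read_local(ref)
    return ET.tostring(root, encoding="unicode")
```

For example, export_composite("fig1", ["genbank:x5559999", "file:~/seqs/a.txt"], reader) would link the first locus and embed the second.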
Cheers.
Jeff
--
+----------------------------+
| J.W. Bizzaro |
| jeff at bioinformatics.org |
| |
| THE OPEN LAB |
| Open Source Bioinformatics |
| |
| http://bioinformatics.org/ |
+----------------------------+