[Pipet Devel] Tools for parsing XML with Python

Humberto Ortiz Zuazaga hortiz at neurobio.upr.clu.edu
Mon Sep 27 12:29:57 EDT 1999


> When I wrote to the BioXML mailing list about an XML database, Guy Hulbert gave
> this reply:
> 
> > Because this fills your database with blobs, so why use a database at all ?
> > You'd be better off, performance-wise, storing the XML docs in the file system
> > and just use the database to manage the file store (I worked as a sysadmin for
> > a product that did just this for scanned images).

No, an XML database is meant to avoid filling a database with blobs. Instead, it 
uses the XML tags to set up the database fields automagically. We also want 
the database to be able to efficiently answer queries like "show me all the 
transmembrane domains of the following proteins".
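
Something like this minimal sketch is what I have in mind; the <protein> 
record and the "features" table are hypothetical, and Python's xml.etree and 
sqlite3 are just stand-ins for illustration:

    import sqlite3
    import xml.etree.ElementTree as ET

    record = """
    <protein id="X55599">
      <feature type="transmembrane" start="12" stop="34"/>
      <feature type="signal_peptide" start="1" stop="11"/>
    </protein>
    """

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE features"
               " (protein TEXT, type TEXT, start INT, stop INT)")

    root = ET.fromstring(record)
    for feat in root.findall("feature"):
        # Each XML attribute becomes a database field -- no blobs anywhere.
        db.execute("INSERT INTO features VALUES (?, ?, ?, ?)",
                   (root.get("id"), feat.get("type"),
                    int(feat.get("start")), int(feat.get("stop"))))

    # "Show me all the transmembrane domains of the following proteins":
    for row in db.execute("SELECT protein, start, stop FROM features"
                          " WHERE type = 'transmembrane'"
                          " AND protein IN ('X55599')"):
        print(row)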

> but I think we want a system that will manage loci as files on the filesystem. 
> This is why: Loci will not be of any particular data format (as I've tried to
> stress recently).  This will avoid any substantial 'import and translation'
> function that will require the Loci system to (1) spend time and space on large
> datasets and (2) lock Loci into a one-of-a-kind data format.  But it will also
> give us a neat way of 'opening' loci: The 'container loci' can merely be set to
> read from/write to a certain directory on the filesystem, and the directories
> will serve to separate locus categories.  So, for example, the user can put all
> GenBank docs for Dictyostelium under the directory
> 
>     ~/loci/containers/Dictyostelium/
> 
> and then set one container locus to point to that directory.

We'll need a number of features, including a container for files. This kind of 
container is the simplest to set up and requires the least effort on the 
biologist's part to understand. We'll also want containers that can parse 
FASTA-format files, GenBank files, etc. Another simple container is a list of 
links, where an XML file contains a list of references to other loci.
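
To make that concrete, here's a rough sketch of the first two kinds, a 
directory-backed container and a FASTA parser. The class name and layout are 
my own invention:

    import os

    class DirectoryContainer:
        """A container locus that just points at a directory of files."""
        def __init__(self, path):
            self.path = path
        def loci(self):
            return [os.path.join(self.path, f)
                    for f in os.listdir(self.path)]

    def fasta_records(path):
        """Yield (header, sequence) pairs from a FASTA-format file."""
        header, seq = None, []
        with open(path) as handle:
            for line in handle:
                line = line.rstrip()
                if line.startswith(">"):
                    if header is not None:
                        yield header, "".join(seq)
                    header, seq = line[1:], []
                elif line:
                    seq.append(line)
        if header is not None:
            yield header, "".join(seq)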

> And will serve as 'dead storage' for loci.  But if we really want to solve the
> '2 terabyte document problem' (for genome analyses, as an example) that Jim
> Freeman brought up to me a few weeks ago, we can't duplicate everything that
> goes from dead storage to active use.  Therefore, loci (treated as files) will
> have to either remain in place or be moved to another directory, and NOT
> duplicated.

This is a separate issue: Loci needs a way to reference objects without 
copying them. So if I already have a file with the GenBank entry for X5559999 
on my hard drive, I can include it in an analysis by referring to (say) 
file:~/loci/containers/genbank/x5559999. This scheme should be able to handle 
live network queries as well (with a URI like genbank:x5559999?).
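
A resolver for such references could be quite small. Here's a sketch; the 
genbank: scheme and the fetch URL are assumptions on my part, not a settled 
protocol:

    import os
    from urllib.request import urlopen

    def resolve(uri):
        scheme, _, rest = uri.partition(":")
        if scheme == "file":
            # A local file: read it in place, no duplication.
            with open(os.path.expanduser(rest)) as handle:
                return handle.read()
        if scheme == "genbank":
            # A live network query (hypothetical endpoint).
            return urlopen("https://example.org/genbank/" + rest).read()
        raise ValueError("unknown locus scheme: " + scheme)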

In the best possible world, asking for x5559999 should retrieve it from 
GenBank the first time, then cache it and transparently refer to the local 
copy for every subsequent reference until it expires from the Loci cache. This 
is where we really need a database: to keep track of the source and current 
location of any locus. The database may also serve as primary storage for 
locally defined loci, where "loci" is used here in its most general sense, 
referring to sequence data, results, programs, user interfaces, etc.
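
The cache lookup itself might work like this file-based sketch; the week-long 
expiry and the fetch callable (standing in for whatever actually talks to 
GenBank) are made up:

    import os
    import time

    CACHE_DIR = os.path.expanduser("~/loci/cache")
    MAX_AGE = 7 * 24 * 3600  # drop cached copies after a week

    def fetch_cached(accession, fetch):
        """Return accession from the cache, fetching it only on a
        miss or after expiry."""
        os.makedirs(CACHE_DIR, exist_ok=True)
        path = os.path.join(CACHE_DIR, accession)
        fresh = (os.path.exists(path) and
                 time.time() - os.path.getmtime(path) < MAX_AGE)
        if not fresh:
            with open(path, "w") as out:
                out.write(fetch(accession))
        with open(path) as handle:
            return handle.read()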

A third kind of "file" we need to deal with is a compound document, say a 
figbuilder page with a multiple sequence alignment and a 3D structure. If we 
wanted to send this figure to a collaborator, we could send it as a database 
containing the figure elements, as a gzip'ed tar file with the individual XML 
files, or as a single XML file containing all the required data and layout.
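
The gzip'ed tar option, for one, is only a few lines with the Python standard 
library (the file names here are invented):

    import tarfile

    with tarfile.open("figure.tar.gz", "w:gz") as archive:
        for part in ("layout.xml", "alignment.xml", "structure.xml"):
            archive.add(part)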

In any case, while we build this figure, the pieces may come from different 
sources: several GenBank sequences cached in the DM, two ASCII files with 
sequence info in a "container directory", the standard loci for the sequence 
alignment editor and PDB structure viewer, etc. Clicking on the "export for 
collaborator" button would then build a file we can mail or publish, either by 
bringing together all the data, or perhaps just the data that's local, leaving 
the exported file with references to the standard items (GenBank IDs, standard 
loci, PDB entries).
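
The export logic could then be as simple as: inline whatever is local, emit 
references for the standard items. A hypothetical sketch, with invented 
element names:

    import xml.etree.ElementTree as ET

    def export_figure(pieces):
        """pieces: a list of (name, is_standard, payload_or_id) tuples."""
        fig = ET.Element("figure")
        for name, is_standard, payload in pieces:
            if is_standard:
                # GenBank IDs, standard loci, PDB entries: reference only.
                ET.SubElement(fig, "ref", name=name, id=payload)
            else:
                # Local data gets inlined into the exported file.
                ET.SubElement(fig, "data", name=name).text = payload
        return ET.tostring(fig)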

What does a Bonobo compound document (e.g., a Guppi figure in a Gnumeric 
spreadsheet) look like anyway? Does anyone know?
-- 
Humberto Ortiz Zuazaga
Bioinformatics Specialist
Institute of Neurobiology
hortiz at neurobio.upr.clu.edu





