> When I wrote to the BioXML mailing list about an XML database, Guy Hulbert gave
> this reply:
>
> > Because this fills your database with blobs, so why use a database at all ?
> > You'd be better off, performance-wise, storing the XML docs in the file system
> > and just use the database to manage the file store (I worked as a sysadmin for
> > a product that did just this for scanned images).

No, an XML database is exactly how you avoid filling a database with blobs: it
uses the XML tags to set up the fields for the database automagically. We also
want the database to be able to efficiently answer queries like "show me all
the transmembrane domains of the following proteins".

> but I think we want a system that will manage loci as files on the filesystem.
> This is why: Loci will not be of any particular data format (as I've tried to
> stress recently). This will avoid any substantial 'import and translation'
> function that will require the Loci system to (1) spend time and space on large
> datasets and (2) lock Loci into a one-of-a-kind data format. But it will also
> give us a neat way of 'opening' loci: The 'container loci' can merely be set to
> read from/write to a certain directory on the filesystem, and the directories
> will serve to separate locus categories. So, for example, the user can put all
> GenBank docs for Dictyostelium under the directory
>
> ~/loci/containers/Dictyostelium/
>
> and then set one container locus to point to that directory.

We'll need a number of features, including a container for files. This kind of
container is the simplest to set up, and requires the least effort on the
biologist's part to understand. We'll also want containers that can parse FASTA
files, GenBank files, etc. Another simple container is a list of links, where
an XML file holds a list of references to other loci. (A rough sketch of a file
container appears below.)

> And will serve as 'dead storage' for loci. But if we really want to solve the
> '2 terabyte document problem' (for genome analyses, as an example) that Jim
> Freeman brought up to me a few weeks ago, we can't duplicate everything that
> goes from dead storage to active use. Therefore, loci (treated as files) will
> have to either remain in place or be moved to another directory, and NOT
> duplicated.

This is a separate issue: Loci needs a way to reference objects without copying
them. So if I already have a file with the GenBank entry for X5559999 on my
hard drive, I can include it in an analysis by referring to (say)
file:~/loci/containers/genbank/x5559999. This scheme should be able to handle
live network queries as well (with a URI like genbank:x5559999, perhaps). In
the best possible world, asking for x5559999 would retrieve it from GenBank the
first time, then cache it and transparently refer to the local copy for every
subsequent reference until it expires from the Loci cache. (That fetch-and-cache
step is also sketched below.) This is where we really need a database: to keep
track of the source and current location of any locus. The database may also
serve as primary storage for locally defined loci, where "loci" is used here in
its most general form, referring to sequence data, results, programs, user
interfaces, etc.

A third kind of "file" we need to deal with is a compound document, say a
figbuilder page with a multiple sequence alignment and a 3D structure. If we
wanted to send this figure to a collaborator, we could send it as a database
containing the figure elements, as a gzipped tar file with the individual XML
files, or as a single XML file containing all the required data and layout
(that last option is sketched below as well).
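To make the file container concrete, here is a very rough sketch in Python of
what a directory-backed container locus might look like. The class name, its
methods, and the way it hands files back are placeholders of mine, not a
proposal for the real Loci interface:

import os


class FileContainer:
    """A container locus that simply lists the files under one directory."""

    def __init__(self, directory):
        self.directory = os.path.expanduser(directory)

    def loci(self):
        """Return the paths of the loci (files) held in this container."""
        entries = []
        for name in sorted(os.listdir(self.directory)):
            path = os.path.join(self.directory, name)
            if os.path.isfile(path):
                entries.append(path)
        return entries

    def open_locus(self, name):
        """Hand back a file object for one locus; the caller decides how to
        parse it (GenBank record, FASTA, plain text, ...)."""
        return open(os.path.join(self.directory, name))


# One container pointed at the user's Dictyostelium GenBank documents
# (assuming that directory exists):
dicty = FileContainer("~/loci/containers/Dictyostelium/")
for locus in dicty.loci():
    print(locus)

A link-list container would look much the same, except that instead of scanning
a directory it would read an XML file listing references to other loci.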
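And a sketch of the fetch-and-cache step, again with made-up names (resolve(),
fetch_from_genbank(), ~/loci/cache); the actual network retrieval and the
database row that records source and current location are left as stubs:

import os

CACHE_DIR = os.path.expanduser("~/loci/cache")


def fetch_from_genbank(accession):
    """Placeholder for the real network retrieval; how Loci actually talks
    to GenBank is deliberately left open here."""
    raise NotImplementedError("network retrieval not sketched")


def resolve(uri):
    """Return a local path for a locus URI, fetching and caching on first use."""
    scheme, _, name = uri.partition(":")
    if scheme == "file":
        # Already on disk: use it in place, no copy made.
        return os.path.expanduser(name)
    if scheme == "genbank":
        cached = os.path.join(CACHE_DIR, name)
        if not os.path.exists(cached):      # first reference: go to GenBank
            data = fetch_from_genbank(name)
            os.makedirs(CACHE_DIR, exist_ok=True)
            out = open(cached, "w")
            out.write(data)
            out.close()
        return cached                       # later references: the local copy
    raise ValueError("unknown locus scheme: " + scheme)


# resolve("file:~/loci/containers/genbank/x5559999") just expands the path;
# resolve("genbank:x5559999") would fetch once, then keep answering from
# ~/loci/cache/x5559999 afterwards.

The real thing would also record the source and current location of each locus
in the database and honour some expiry policy; this only shows the "fetch once,
then use the local copy" behaviour.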
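Finally, a sketch of the single-XML-file export: pieces with a well-known
source (GenBank entries, PDB entries, standard loci) travel as references,
while purely local data gets inlined. The element names and the
(kind, source, data) shape of a figure piece are assumptions of mine, not a
format anyone has agreed on:

from xml.sax.saxutils import escape

# Sources the collaborator can fetch for themselves travel as references only.
STANDARD_SCHEMES = ("genbank", "pdb", "loci-standard")


def export_figure(pieces):
    """Build one XML document from (kind, source, data) tuples.

    'source' is a locus URI such as genbank:x5559999 or a file: path;
    'data' is the local text for pieces that have to travel with the figure."""
    parts = ["<figure>"]
    for kind, source, data in pieces:
        scheme = source.split(":", 1)[0]
        if scheme in STANDARD_SCHEMES:
            # Well-known item: ship a pointer, not the data.
            parts.append('  <%s ref="%s"/>' % (kind, escape(source)))
        else:
            # Local-only data has to be carried inside the exported file.
            parts.append("  <%s>%s</%s>" % (kind, escape(data), kind))
    parts.append("</figure>")
    return "\n".join(parts)


# The cached GenBank sequence stays a reference; a locally edited alignment
# (hypothetical path and contents) is inlined so everything fits in one file.
print(export_figure([
    ("sequence", "genbank:x5559999", None),
    ("alignment", "file:~/aligns/my.aln", "...alignment text here..."),
]))

The database and tar-file variants would just package the same pieces
differently.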
In any case, while we build such a figure the pieces may come from different
sources: several GenBank sequences cached in the DM, two ASCII files with
sequence info in a "container directory", the standard loci for the sequence
alignment editor and the PDB structure viewer, etc. Clicking the "export for
collaborator" button would then build a file we can mail or publish, either by
bringing together all the data, or perhaps just the data that's local, leaving
the exported file with references to the standard items (GenBank IDs, standard
loci, PDB entries).

What does a Bonobo compound document (i.e., a Guppi figure in a Gnumeric
spreadsheet) look like anyway? Does anyone know?

--
Humberto Ortiz Zuazaga
Bioinformatics Specialist
Institute of Neurobiology
hortiz at neurobio.upr.clu.edu