Humberto Ortiz Zuazaga wrote:
>
> No, an xml database is to avoid filling a database with blobs. It instead uses
> the xml tags to set up the fields for a database automagically. We also want
> the database to be able to efficiently answer queries like "show me all the
> transmembrane domains of the following proteins".

Again, I think we need a database that can facilitate queries of just about
any kind. So, for example:

  DATABASE LOCUS -----> "Show me all the transmembrane -----> DATA LOCUS
                         domains of the proteins in the DB."

> We'll need a number of features, including a container for files. This kind of
> container is the simplest to set up, and requires the least amount of effort
> on the biologist's part to understand.

Yep.

> We'll also want containers that can
> parse fasta format files, genbank files etc.

This would depend on the query mechanism (locus) used. But the DB should be
generic enough to handle just about anything.

> Another simple container is a
> list of links, where an xml file contains a list of references to other loci.

This may just be a matter of implementation: The DB is a list of links to
files on the file system?

> This is a separate issue: loci needs a way to reference objects without
> copying them. So if I have a file with the genbank entry for X5559999 already
> on my hard drive, I can include this in an analysis by referring to (say)
> file:~/loci/containers/genbank/x5559999. This scheme should be able to handle
> live network queries as well (with a URI like genbank:x5559999?).

Yes, this would be the case for both data (e.g., protein structure) and
processor (e.g., find lowest energy conformation of protein via molecular
dynamics).

> In the best possible world, asking for x5559999 should retrieve it from
> genbank the first time, then cache it and transparently refer to the local
> copy for every subsequent reference until it expires from the loci cache.
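
A minimal sketch of the fetch-once-then-cache behaviour described in the
quoted paragraph above, in Python. Everything here is an assumption made for
illustration: the names LocusCache, CacheEntry, fetch and ttl are invented and
are not part of loci, and the fetch function only stands in for a live
genbank query.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class CacheEntry:
    source: str        # URI the locus was originally fetched from
    data: str          # local copy of the locus contents
    fetched_at: float  # clock reading at fetch time, in seconds


class LocusCache:
    """Hypothetical cache keyed by URI, e.g. 'genbank:x5559999'."""

    def __init__(self, fetch: Callable[[str], str], ttl: float,
                 clock: Callable[[], float] = time.time) -> None:
        self._fetch = fetch  # stand-in for a live network query
        self._ttl = ttl      # seconds before a cached copy expires
        self._clock = clock  # injectable so expiry can be exercised
        self._entries: Dict[str, CacheEntry] = {}

    def get(self, uri: str) -> str:
        now = self._clock()
        entry = self._entries.get(uri)
        if entry is not None and now - entry.fetched_at < self._ttl:
            return entry.data        # still fresh: use the local copy
        data = self._fetch(uri)      # first request, or the copy expired
        self._entries[uri] = CacheEntry(uri, data, now)
        return data


if __name__ == "__main__":
    requests = []

    def fake_genbank(uri: str) -> str:
        # Records each remote request instead of touching the network.
        requests.append(uri)
        return "<entry for %s>" % uri

    cache = LocusCache(fake_genbank, ttl=3600.0)
    cache.get("genbank:x5559999")
    cache.get("genbank:x5559999")
    print(len(requests))  # the second call is served from the cache
```

Each entry keeps the source URI next to the local copy, which is the
bookkeeping the quoted text goes on to ask a database for.
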
> This is where we really need a database, to keep track of the source and
> current location of any locus. The database may also serve as primary
> storage for locally defined loci, where loci is used here in its most
> general form, referring to sequence data, results, programs, user
> interfaces, etc.

I couldn't have said it better myself. We want _minimal_ transfer of
information.

> A third kind of "file" we need to deal with is a compound document, say a
> figbuilder page with a multiple sequence alignment and a 3D structure. If we
> wanted to send this figure to a collaborator we could send it as a database
> containing the figure elements, as a gzip'ed tar file with the individual xml
> files, or as a single xml file containing all the required data and layout.

This is something that will take a lot of thought. I called this compound
document a 'composite locus', meaning you can turn an entire section of a
Workflow Diagram (containing loci) into a single locus. But when that
composite locus is sent to someone else, do we send just the links or the
actual data/processor? If we send just the links and one locus resides (links
to actual data/processor) on the sender's computer, that data/processor has
to be accessible to the recipient via link, otherwise the data/processor has
to be sent in its entirety. But this is what you wrote below.

> In any case, while we build this figure, the pieces may come from different
> sources: several genbank sequences cached in the DM, two ascii files with
> sequence info in a "container directory", the standard loci for sequence
> alignment editor and PDB structure viewer, etc. Then clicking on the "export
> for collaborator" button would build a file we can mail or publish by bringing
> together all the data, or perhaps just the data that's local leaving the
> exported file with references to the standard items (genbank IDs, standard
> loci, PDB entries).

Ditto.

Cheers.
Jeff
--
+----------------------------+
|        J.W. Bizzaro        |
| jeff at bioinformatics.org |
|                            |
|        THE OPEN LAB        |
| Open Source Bioinformatics |
|                            |
| http://bioinformatics.org/ |
+----------------------------+