> I'd like to propose that Loci use many small XML DTDs instead of
> trying for a kitchen sink DTD.

I agree, and that is basically the way I had been thinking. Specific
descriptions of data should be as small and modular as possible
(sequence, structure, phylogeny, etc). LocusML should also be able to
describe relationships between those pieces of data where necessary,
however. We might need specific DTDs for relationships (e.g. a
restriction map, which contains a number of short sequence components),
as a lot of relationships will be very hard to express generically.

> we could use a simplified BSML for sequence information (and just
> sequences).

I don't like how BSML is structured, but I do like the detail it allows.
I prefer the inner sections of BioML for their "flow". I had planned to
merge the two, looking more like BioML but with the versatility of BSML.
Also, BSML doesn't cover amino acid sequences (if I remember correctly),
while BioML does. The two different structures probably merit unique
DTDs anyway, though.

> a different XML dialect could be used for structure information
> (including structural annotations in sequences).

Yes. I'm not sure where to begin on structure. Someone here had ideas on
this, but I'm not sure who, or what became of them. Also, it sounds like
we'll need a DTD for phylogeny, too. There are probably others as well,
but the concept remains the same: describe just the relevant data, and
use a unique ID to find references and data relations elsewhere.

> separate XML DTDs could be defined for references, options to be passed
> to a program, and work paths.

A generic reference DTD is fairly simple. Describing relationships
between data will take a little more thought. Loci-specific information
will probably be filled in as we go further into development.

Although, just so no one is confused, the XML format is really only for
transfer (data) and storage (both data and Loci info).
Actual Loci info will be kept in the Paos object as attributes rather
than as an XML stream that has to be parsed all the time. It will be
written out to XML for non-Paos storage. Generic data will always be
handled by the Loci framework as XML (since it's basically meaningless
to the framework), and the data-specific tools will handle it internally
in whatever way is appropriate (hash table, binary tree, etc). But a DTD
is a good way to describe the data Loci uses.

> I retrieve a nucleotide sequence from Genbank, in genbank format, it
> is parsed into several XML objects: a nucleotide sequence object, several
> bibliographic reference objects, a protein sequence object for the
> "/translation=" feature found in the original genbank file.

Exactly. I believe the translation component is the gatekeeper, in Loci
terminology.

> Each xml object is displayed on the benchtop by the appropriate locus.
> Now I click on the button to perform a restriction map of my sequence.

I haven't thought about the UI much yet.

> The workspace contacts the restriction map locus, which returns an XML
> object describing the parameters and options this restriction map
> locus requires or supports.

_That_ is an interesting idea. I had just been assuming a generic
interface for types of loci (for example, a restriction map locus has
three arguments and that doesn't vary), but rather than having a bunch
of hardcoded locus types, we can query the locus for its interface (of
course we'll want to cache interfaces).

> An option handling locus can then prompt
> me for the enzymes I want to cut with, the output format I prefer,
> etc.

Going back to Jeff's idea about embedding python in XML, a locus could
return an interface description with UI code to handle the query
configuration (probably optional, for exotic cases; most of the time it
would be generic fields with default UI handlers).
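
To make the "query the locus for its interface, then cache it" idea a
bit more concrete, here is a minimal sketch in python. Everything named
here is an assumption for illustration only: the <interface> element and
its attributes, the RestrictionMapLocus stand-in, its
describe_interface() method, and the InterfaceCache class. None of it is
an existing Loci or Paos API.

```python
import xml.etree.ElementTree as ET

# Hypothetical interface description a restriction map locus might return.
# The element and attribute names are invented for this sketch.
INTERFACE_XML = """
<interface locus="restriction-map">
  <option name="distinguish enzyme cuts" default="yes"/>
  <argument name="template" type="sequence"/>
  <argument name="restriction enzyme" type="text"/>
</interface>
"""


class RestrictionMapLocus:
    """Stand-in for a remote locus; counts how often it is queried."""

    name = "restriction-map"

    def __init__(self):
        self.queries = 0

    def describe_interface(self):
        self.queries += 1
        return INTERFACE_XML


class InterfaceCache:
    """Workspace-side cache so each locus is asked for its interface once."""

    def __init__(self):
        self._cache = {}

    def get(self, locus):
        if locus.name not in self._cache:
            root = ET.fromstring(locus.describe_interface())
            self._cache[locus.name] = {
                "options": {o.get("name"): o.get("default")
                            for o in root.findall("option")},
                "arguments": [(a.get("name"), a.get("type"))
                              for a in root.findall("argument")],
            }
        return self._cache[locus.name]


locus = RestrictionMapLocus()
cache = InterfaceCache()
iface = cache.get(locus)
cache.get(locus)  # second lookup is served from the cache
print(iface["arguments"])
print(locus.queries)
```

A generic option-handling locus could then build its prompts from the
"options" and "arguments" entries, and only fall back to locus-supplied
UI code in the exotic cases.
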
> The restriction map locus can now return the results as several xml
> objects: a bibliographic reference object describing the algorithm
> used to perform the analysis; a result object containing the requested
> results; a locus object containing the gnome-python source code for
> a gui-locus that can display the results.

Before we go overboard with passing interface code around, though, I'd
like to strongly encourage the presence of powerful, high-level widgets
in the workspace app. We don't want to be passing around a generic
sequence viewer all the time.

> The workspace can check if it already has a gui-locus that can display
> the results, and passes the results to it, or downloads the code and
> generates the gui-locus.

Like I said just above, I'd like to see a nice API (from the loci
perspective) for the UI stuff, ranging from low-level building block
widgets to higher-level generic viewers, as well as the ability to plug
in additional generic viewers. That way if you're always using some
non-standard locus gui, you can just load the script locally (and even
replace it with faster compiled code).

> As loci are loaded into the workspace, they can register the ability
> to handle a particular DTD or set of DTDs.

Possibly even more than that -- for instance, a locus to handle a
specific relationship between sets of DTDs (I don't have a good example,
though).

> We also need not pass around the entire XML object each time, for
> example only a url for a reference need be included in the results
> from an analysis, not the entire paper.

Yes. It was my intention for the workflow system to just give a locus
what it needs (probably by creating a second Paos object). It should
present the locus with the necessary data and control information,
rather than sending the whole object with potentially extraneous data
and control info. The locus updates the object with status information
(recorded to the master Paos object, which the gui can get info from).
It then transmits the generated data back via Paos. That's consolidated
into the master object and fed to the gui client.

Also, for the Paos representation of the Loci XML info, I was imagining
a DOM-like interface. The XML is represented in a tree. So, this Loci
info:

    <query id="aaaa">
      <action>restriction map</action>
      <option name="distinguish enzyme cuts" value="yes"/>
      <data argument="template">#sequence_id</data>
      <data argument="restriction enzyme">EcoR1</data>
      <data argument="restriction enzyme">BamH1</data>
    </query>

becomes this Paos object:

    query.id = "aaaa"
    query.action = "restriction map"
    query.option{distinguish enzyme cuts} = "yes"
    query.data{template} = "#sequence_id"
    query.data{restriction enzyme} = "EcoR1"
    query.data{restriction enzyme} = "BamH1"

where #sequence_id means the XML can be extracted from the Paos data
attribute under the key "sequence_id".

Or something along these lines. This example is missing a lot of things.
I'm not sure how python handles hashes either; the notation here is
actually perl/c-ish.

(note: Ok, now I'm going to ramble some...)

Although, perhaps we don't even need to bother trying to express the
internal Loci data stuff as XML. Will we ever need to write it out to
XML? Possibly only the actual biological data needs XML expression, just
to facilitate interaction between Loci-derived data and non-Loci tools.
Theoretically, we don't need XML for anything, since structures in Paos
could hold all of the biological data, too. It just seems like a good
way to describe things for stuff that isn't entirely internal to Loci.
But on similar grounds, we will need to define the internal Loci info
interface adequately for tools to make use of it, and perhaps an XML
representation of that would make it more clear.

I'd really like to rig up a working demo. Does anyone have a pretty
simple analysis tool we could use for an example? In particular, the
view of the resulting data should be simple (that's probably where the
most programming is).
Actually a restriction map wouldn't be too bad...

Justin Bradford
justin at ukans.edu
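
A rough sketch of the query-to-tree mapping described above, in python,
using plain dicts and lists as a stand-in for whatever Paos would
actually provide. The <option> element is written as an empty element
here so that the XML parses. The function names query_to_tree() and
resolve(), the "store" dict, and the sample sequence are all made up for
illustration. One assumption worth flagging: a repeated argument (the
two restriction enzymes) is collected into a list, since assigning the
same hash key twice would simply overwrite the first value.

```python
import xml.etree.ElementTree as ET

QUERY_XML = """
<query id="aaaa">
  <action>restriction map</action>
  <option name="distinguish enzyme cuts" value="yes"/>
  <data argument="template">#sequence_id</data>
  <data argument="restriction enzyme">EcoR1</data>
  <data argument="restriction enzyme">BamH1</data>
</query>
"""


def query_to_tree(xml_text):
    """Turn a <query> document into nested python dicts and lists."""
    root = ET.fromstring(xml_text)
    tree = {
        "id": root.get("id"),
        "action": root.findtext("action"),
        "option": {},
        "data": {},
    }
    for opt in root.findall("option"):
        tree["option"][opt.get("name")] = opt.get("value")
    for item in root.findall("data"):
        # Repeated arguments are appended to a list rather than
        # overwriting the earlier value.
        tree["data"].setdefault(item.get("argument"), []).append(item.text)
    return tree


def resolve(value, store):
    """Follow a '#key' reference into a store of data keyed by ID."""
    if value.startswith("#"):
        return store[value[1:]]
    return value


query = query_to_tree(QUERY_XML)
# Made-up sequence data, standing in for the Paos data attribute.
store = {"sequence_id": "<sequence>GAATTCGGATCC</sequence>"}
print(query["data"]["restriction enzyme"])
print(resolve(query["data"]["template"][0], store))
```
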