Hello Locians!

I am working on a package similar to Loci called the General Sequence Toolkit (gst). Currently I am designing the storage interface for the gst package. I thought I'd share some of my ideas on this list for the benefit of both GST and Loci. Warning: I am not a computer science person but a molecular biology person, so I may miss the obvious (but hopefully not :o).

I am loosely defining the "Data Manager" or DM as the interface and the backend (server-side) code for managing bioinformatics data of various types.

Some of the design goals for the DM:

1) Allow for extensibility (i.e. DM plug-ins)
2) Simple, general, but sufficient interface
3) Minimize transfer of unneeded data
4) Allow for relocation of data
5) Allow for read-only access
6) Enable wrappers for common non-XML, non-GST data sets (e.g. GenBank)
7) Allow multi-user access
8) The interface should not assume anything regarding the data except that it is in an XML format.
9) Enable network transparency
10) Simple and robust cross-referencing
11) ???

In the most basic sense, GST would not have a DM but rather a DM interface to DM plug-ins. Some examples:

1) A GenBank/Entrez DM would consist of a local plug-in that provides the program with read-only access to NCBI's databases over the internet. The non-XML GenBank entries would automatically be wrapped/converted into XML by the DM.
2) An intranet Oracle or AceDB database DM would consist of a local plug-in (and possibly a server for the plug-in). The plug-in would handle the network transparency as well as wrapping/converting the data to XML.
3) A file system DM would consist of a plug-in for GST as well as a server for the plug-in. Transfer of data from the file system to GST would be handled by a socket between the plug-in and the server. When a user starts up GST, if a file system DM for his/her personal GST directory is not running, one is automatically started.
   If the user is on another computer, they can still access their personal GST directory as long as the file system DM is running on the same computer as the personal GST directory tree.

So rather than working with files straight off the file system (as in GDE), a file or data set would have to be imported into one of the available DMs before use in GST.

With that overall model in mind, there are several issues that I can think of:

1) Is this a good model?
2) What are the functions that the interface must handle?
3) How will robust cross-referencing be done? (For example, I am working on a project where all the sequences are available through the Oracle DM and the BLAST results are available through the file system DM.)

Thoughts on the cross-referencing issue:

Using URIs along with XLinks and XPointers is probably the best way, given that the data model is centered on XML. Along those lines, the URI might look like:

    gstoracle://oracle.server.bio.com:85/genbank/pri/U29875

where gstoracle defines the data access method (which DM plug-in to use), oracle.server.bio.com:85 is the host computer for the data (and the port number), and genbank/pri/U29875 is the remaining portion of the reference needed to get to the desired sequence.

Working from this, we should allow for aliases of the protocol://hostname:port portion. This is necessary if we are going to simplify the process of relocating data. The question is where and how these alias lists should be maintained:

1) One possibility is to have an alias list maintained by each client-side GST installation.
2) Another possibility is to have the server side of each DM maintain its own alias list.
3) A third option is to have both the client-side installations and the DM server sides maintain alias lists.
4) The fourth option (which I am leaning toward) is that each DM would provide its own alias upon request, and the GST client program would update its own personal list of aliases on startup.
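To make the URI layout and the alias idea a little more concrete, here is a minimal Python sketch. It is only an illustration under assumptions: the names `split_gst_uri` and `AliasTable`, the alias `oracle1`, and the host `newhost.example` are all hypothetical and not part of any existing GST code. It splits a URI into its access method, host, port, and remaining path, and swaps the protocol://hostname:port portion for an alias and back.

```python
from urllib.parse import urlsplit


def split_gst_uri(uri):
    """Split a DM URI into (access method, host, port, remaining path)."""
    parts = urlsplit(uri)
    return parts.scheme, parts.hostname, parts.port, parts.path.lstrip("/")


class AliasTable:
    """Hypothetical client-side list mapping an alias to protocol://host:port."""

    def __init__(self):
        self._base_by_alias = {}

    def register(self, alias, base):
        # Called when a DM reports its alias, e.g. at client startup.
        self._base_by_alias[alias] = base.rstrip("/")

    def to_alias(self, uri):
        # Replace a known protocol://host:port prefix with its alias.
        for alias, base in self._base_by_alias.items():
            if uri.startswith(base + "/"):
                return alias + uri[len(base):]
        return uri  # no alias known, keep the full URI

    def resolve(self, ref):
        # Expand "alias/rest" back into a full URI using the current table.
        alias, _, rest = ref.partition("/")
        base = self._base_by_alias.get(alias)
        if base is None:
            return ref
        return base + "/" + rest


table = AliasTable()
table.register("oracle1", "gstoracle://oracle.server.bio.com:85")

uri = "gstoracle://oracle.server.bio.com:85/genbank/pri/U29875"
stored = table.to_alias(uri)   # the form a DM would store in a cross-reference

# Relocating the data only means re-registering the alias, not editing data.
table.register("oracle1", "gstoracle://newhost.example:85")
relocated = table.resolve(stored)
```

The point of the sketch is that the stored reference never contains the host name, so moving the data touches only the alias table.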
This way, the DMs could request the alias for cross-references and store just the alias. Example:

    gstfile://server.bio.com/g349f7

would be converted to

    gstserver1/g349f7

before storing this reference. If the gstserver1 data was then moved to an Oracle database on a different computer, the administrator would only need to edit the alias lists on the DM and client installations in order to force the initial lookup. No editing of the actual data would be needed.

As for encoding these links, it would make sense to use XLinks to refer to XML entities and XPointers to reference within entities. In addition, the DM would be responsible for assigning new IDs, so that IDs are unique within each DM. Hence the URI for a GST XML resource would include the ID and would be unique.

So the next issue is the interface:

The first question in my mind is whether the DM should handle parsing of the XML files or should just pass/take whole XML files. The interface would be simpler if the DM just passed the whole XML file (in other words, the interface deals with XML entities only and not parts of entities). On the other hand, if a user needs just one part of the entity, the whole entity must still be transferred over a potentially slow network connection.

The alternative is to have the XML parsing take place on the DM side of the interface. This would allow only the requested element to be transferred. In addition, only the requested portion of a non-XML data set would have to be wrapped in XML. So despite the additional complexity, it would seem that having the interface be aware of XML parts, and not just entities, would be highly advantageous.
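As a rough illustration of the trade-off, the following Python sketch parses an entity on the DM side and returns only the requested element. Everything in it is an assumption made for the example: the function name `fetch_element`, the element names in the sample entity, and the use of the limited path syntax of Python's standard `xml.etree.ElementTree` module in place of real XPointers.

```python
import xml.etree.ElementTree as ET

# A made-up entity; the element names are hypothetical, not a GST format.
ENTITY = """<sequence id="g349f7">
  <name>example</name>
  <residues>ATGGCC</residues>
  <annotation><feature type="cds" start="1" end="6"/></annotation>
</sequence>"""


def fetch_element(entity_xml, path):
    """DM-side lookup: parse the entity and return only the element at `path`.

    Returns the element serialized as an XML string, or None if no element
    matches. `path` uses ElementTree's simple path syntax, not XPointer.
    """
    root = ET.fromstring(entity_xml)
    element = root.find(path)
    if element is None:
        return None
    element.tail = None  # drop whitespace that followed the element
    return ET.tostring(element, encoding="unicode")


name_only = fetch_element(ENTITY, "name")
residues_only = fetch_element(ENTITY, "residues")
```

Here the client receives a few dozen characters for `name` instead of the whole entity, at the cost of the DM having to understand the XML structure.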
So what are some of the necessary interface functions?

1) Reserve/request a unique ID/URI
2) Lock/unlock a URI
3) Revoke a URI lock
4) Initialize/Login
5) Close/Shutdown
6) Delete Entity/Element
7) Update Entity/Element
8) Add Element
9) Search, which returns an XML document with extended XLinks for hits
10) Ability to return browse lists
11) Retrieve alias
12) Retrieve DM info:
    * Name (URI)
    * Description
    * Read-only or read/write
    * ???

Sorry about the lengthy posting, but hopefully it will stimulate some profitable discussion.

-Alan

************************************************************************
Alan Williams (finger alan at avocado.ucr.edu for pgp public key)
------------------------------------------------------------------------
University of California, Riverside   "Where observation is concerned,
Dept. of Botany and Plant Sciences     chance favors the prepared mind."
Alan at TheWilliamsFamily.org                          -- Louis Pasteur
************************************************************************