[Biococoa-dev] Annotation

Mon Feb 21 14:16:10 EST 2005

>This is nice, and we should try for compatibility, but it is a bit difficult to make work as a dictionary. The nice part is that it has a uniqueID, name, and class. The bad part is that they're all part of the same compound field, so they don't work nicely as a dictionary key.
>
>A related issue is that it would be really nice to be able to get annotations for every exon or every ORF without having to enumerate through the keys of the whole dictionary and check a field in each. There are two ways I can think of doing this. The first is to keep, within the annotation wrapper, an array for each feature type and put things into the appropriate one as they're added. The alternative would be to make sure we write the appropriate code to do the enumeration. Personally, for performance reasons, I'd favor the first.
>
>JT

I think if we go for the BSML format, we should stick to its structure as much as possible, and not try to outsmart it for the sake of performance or readability. A smart structure like the one you suggest could be better than a simple dictionary, I agree. But we want to make sure we don't give up flexibility by restricting queries to a few keywords, or by assuming that every feature will have a feature type or that new tags will never be added. I suppose you have this flexibility issue in mind, but I just want to be sure! With my limited experience in data structures for XML, I know I would design a BCFeatureSet class that mimics the BSML/XML format as closely as possible. I don't know if this is really feasible or how to do it... Well, I guess I would use a dictionary, as proposed by Alex. But again, there might be nice data structures already designed for XML out there, and maybe this is what you had in mind, John.

In any case, I don't think we should worry too much about performance at this point. Enumerating through dictionaries of dictionaries or arrays of dictionaries will only create performance issues when going through tens of thousands of sequences. And even if you do go through tens of thousands of sequences, there might be many other places that turn out to be the performance bottleneck, like simply loading the sequences from disk.

Last point. It seems that Features will be complicated enough that they deserve a special class, even if it just wraps a dictionary. We then want to encapsulate the inner workings as much as possible and expose as few public methods as we can. If and when performance becomes an issue, we can redesign the inside and leave the outside unchanged. And with the unit testing, it will be a breeze to check the consistency of a new design ;-)
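
To make the encapsulation idea concrete, here is a minimal sketch, written in Python rather than Objective-C only to keep it short. Everything in it is an assumption for illustration: the class name FeatureSet (a stand-in for a possible BCFeatureSet), the method names, and the use of a "class" key for the feature type. None of it is existing BioCocoa API. The per-type index John suggests is kept as a private detail, so it could be dropped or replaced later without touching callers.

```python
from collections import defaultdict


class FeatureSet:
    """Hypothetical wrapper around a dictionary of features.

    Callers only see the three public methods below; the storage
    layout is private and free to change.
    """

    def __init__(self):
        # Primary storage: unique ID -> dictionary of attributes.
        self._features = {}
        # Private index: feature type -> list of unique IDs.
        # Purely an optimization; callers never see it.
        self._ids_by_type = defaultdict(list)

    def add_feature(self, unique_id, attributes):
        """Store a feature; a feature type ("class" key) is optional."""
        if unique_id in self._features:
            raise ValueError("duplicate feature ID: %r" % (unique_id,))
        self._features[unique_id] = dict(attributes)
        feature_type = attributes.get("class")
        if feature_type is not None:
            self._ids_by_type[feature_type].append(unique_id)

    def feature(self, unique_id):
        """Return the attributes stored under one unique ID."""
        return self._features[unique_id]

    def features_of_type(self, feature_type):
        """Return all features of one type, in insertion order."""
        ids = self._ids_by_type.get(feature_type, [])
        return [self._features[i] for i in ids]


if __name__ == "__main__":
    features = FeatureSet()
    features.add_feature("f1", {"name": "exon 1", "class": "exon"})
    features.add_feature("f2", {"name": "orf A", "class": "ORF"})
    features.add_feature("f3", {"name": "free-form note"})  # no type
    print(features.features_of_type("exon"))
```

Features without a type are still accepted, which keeps the flexibility discussed above. Swapping the index for a plain enumeration over the dictionary would change only the body of features_of_type, and the same unit tests would still apply.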

So here are my 'bottom line' questions and comments:
* I think the annotations could stay separate, as they are now. That way we get something that works well for annotations, which is a first step toward a useful framework. Features are a completely different world with many implications; in other words, I am talking about BioCocoa 1.0 vs 1.5 ;-) If the public methods are carefully designed, it should be easy to connect the new Feature design to the current implementation of annotations when and if needed.
* Do we need a BCFeature class? Do we need a BCFeatureSet class? Probably only the latter. It seems the BSML format would require a BCFeatureSet class that cannot be broken down into BCFeature elements, because it has a nested structure. If any of you has experience with data structures for XML, speak up!

charles

-- 
Help science go fast forward:
http://cmgm.stanford.edu/~cparnot/xgrid-stanford/

Charles Parnot
charles.parnot at stanford.edu

Room B157 in Beckman Center
279, Campus Drive
Stanford University
Stanford, CA 94305 (USA)

Tel +1 650 725 7754
Fax +1 650 725 8021