[Biococoa-dev] Annotation

Mon Feb 21 10:26:31 EST 2005

Nice work on the annotations, guys. It looks good, and a dictionary is 
indeed the obvious way to go.
I'd like to share two issues with the annotations that came to mind.
First, it would be nice to have a standard set defined for common 
annotations, like author, organism etc. The question of course is which 
list we should adhere to: the EMBL format, the NCBI format?

Second, while searching the web a bit, I came across the BSML XML 
format, which seems to be becoming a kind of standard for new sequence 
formats. It would perhaps be nice (and wise) to have a look at the 
documents they made, because they (obviously) studied the 
annotation/feature issue very well.
You can find more info at: http://www.bsml.org/
Now, just to be clear, BSML is a file format, and one we could 
implement of course. Internally the dictionary approach is the way to 
go for us, but it might be an idea to adhere to their nomenclature 
and/or tree/hierarchy. I already came across some nice ideas to keep in 
mind:

>  As research proceeds on a given biological molecule, certain segments 
> of the sequence become interesting for a variety of reasons. Sequence 
> annotation is used to capture this extra information about the 
> sequence data. Positional annotation refers to annotations that are 
> specific to a portion of a sequence. In BSML, positional annotation is 
> captured through Feature tags. Feature tags are child tags of a 
> sequence tag, and therefore a Feature is related to a single sequence. 
> For example, the following tag indicates that the region between 1513 
> and 1962 encodes a particular gene:
>
>  <Feature id="FTR4" title="Leucine TNRA" class="GENE">
>  <Qualifier value-type="gene"/>
>  <Interval-loc startpos="1513" endpos="1962"
>  complement="0"/>
>  </Feature>

So a feature is defined as a "positional annotation", which is a nice 
definition that I had in mind as well. Of course features give the 
extra problem that their positions have to be kept in sync with the 
sequence during editing. Therefore it's perhaps better to keep two 
separate dictionaries internally: one for annotations and one for 
features.
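
To make the syncing problem concrete, here is a minimal sketch of a 
feature object that adjusts its range when symbols are inserted into 
the sequence. SketchFeature and shiftForInsertionAt:length: are 
hypothetical names, not existing BioCocoa API, and the boundary rules 
(an insertion at or before the start moves the feature, an insertion 
inside it makes it grow) are just one possible policy:

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical feature object: a positional annotation tied to a range.
@interface SketchFeature : NSObject
@property (nonatomic, copy) NSString *title;
@property (nonatomic) NSRange range;   // 0-based location and length
- (instancetype)initWithTitle:(NSString *)title range:(NSRange)range;
// Adjust the range after `length` symbols were inserted at `position`.
- (void)shiftForInsertionAt:(NSUInteger)position length:(NSUInteger)length;
@end

@implementation SketchFeature
- (instancetype)initWithTitle:(NSString *)title range:(NSRange)range {
    if ((self = [super init])) {
        _title = [title copy];
        _range = range;
    }
    return self;
}

- (void)shiftForInsertionAt:(NSUInteger)position length:(NSUInteger)length {
    if (position <= _range.location) {
        // Insertion at or before the feature start: move it downstream.
        _range.location += length;
    } else if (position < NSMaxRange(_range)) {
        // Insertion inside the feature: the feature grows.
        _range.length += length;
    }
    // Insertion after the feature: nothing to do.
}
@end
```

The sequence would call this on every feature after each edit; 
plain annotations such as the author need no such update, which is the 
argument for storing the two kinds separately.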

>  A given DNA sequence could have many features associated with it. 
> Rather than simply encoding all of these flatly, in BSML related 
> feature tags can be aggregated into Feature-Tables. Feature-Tables are 
> intended to provide a logical grouping to features, such as grouping 
> all gene expression features together.

This is nice as well: it allows nested annotations and features, which 
is perfectly possible with a dictionary of course. What do you guys 
think of this, too complicated or a desired feature?
Basically, the dictionary approach we have right now allows us to store 
anything (including resources, data etc) as an annotation; we're not 
limited to strings and such. The question arises whether we should go 
for an annotation/feature object (which I kind of like, because it 
allows much more standardisation and makes it easier to add e.g. 
sorting/updating logic), or not, and let the user be free to add 
anything he/she likes (which is still possible with annotation objects 
as well of course).
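
As a rough illustration of the nesting, a feature table can be 
modelled as a dictionary of dictionaries: group name -> (feature id -> 
attributes). The helper SketchAddFeature and the group and attribute 
keys used with it are made up for this sketch and are not part of the 
current code:

```objective-c
#import <Foundation/Foundation.h>

// Adds a feature to a named group, creating the group on first use.
// `tables` maps a group name to a dictionary of feature id -> attributes.
static void SketchAddFeature(NSMutableDictionary *tables,
                             NSString *group,
                             NSString *featureID,
                             NSDictionary *attributes) {
    NSMutableDictionary *table = [tables objectForKey:group];
    if (table == nil) {
        table = [[NSMutableDictionary alloc] init];
        [tables setObject:table forKey:group];
    }
    [table setObject:attributes forKey:featureID];
}
```

With annotation/feature objects instead of bare dictionaries the 
nesting stays the same; only the innermost values change type.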

>  An annotation can also take the form of a comparison between two 
> sequences. Perhaps two segments are equivalent to one another. In 
> order to achieve this in BSML, a <segment-set> tag can be used to 
> enclose a set of segments represented by <segment> tags. For example, 
> the tag shown in Listing 2 expresses that a region from sequence 
> AB1432 and sequence NZ5723 are equivalent.
This is also a very nice thing to keep in mind. For instance, it would 
let us trace back how a construct was built...

>  One of the core strengths of BSML, however, is the availability of 
> public converters to translate from other formats into BSML. This 
> allows consumers of bioinformatics data to pull together information 
> from disparate sources into a single common language for their 
> research. Surprisingly enough, many of these converters were not 
> developed by LabBook, the company driving BSML as a standard, but 
> rather from third-party adopters and supporters of BSML. For example, 
> Bristol-Myers Squibb has released an open-source adapter into the 
> BioPerl project that translates between the SeqIO format and BSML. 
> Similarly, Cold Spring Harbor Laboratory has released a translator 
> between the ASN.1 format used by GenBank and BSML. The European 
> Bioinformatics Institute provides a translation between EMBL documents 
> and BSML. Every day more and more translators become available, making 
> it possible for researchers and application developers to build tools 
> around BSML while accessing a variety of data sources.
This is nice as well of course: by mirroring the setup of BSML 
internally to some extent, we can use these converters to implement the 
reader/writer classes more easily instead of reinventing the wheel...

Finally, just thinking out loud here: if we go for a number of 
often-used predefined tags for annotations and features, how do we then 
"define" them? Perhaps it would be nice to add a category to the 
BCAbstractSequence class, e.g. annotation-extensions, that declares 
accessors for these predefined annotations, like:
- (NSString *)authorname;
- (void)setAuthorname:(NSString *)author;
- (NSCalendarDate *)creationdate;
- (void)setCreationdate:(NSCalendarDate *)date;
(Note the possibility of returning a calendar date instead of a string; 
this way we ensure that all dates are created equally, instead of one 
person entering 20-2-2003 and another 2/20/2003.)
And so on, including things like predefined position-specific 
annotations (aka features).
Although I'm not a fan of categories in frameworks, here it might be a 
nice way to separate the code instead of adding all these things to the 
abstract sequence class. Just an idea though...
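
A minimal sketch of how such a category could sit on top of the 
annotations dictionary. Everything here is an assumption for 
illustration: SketchSequence stands in for BCAbstractSequence (only the 
dictionary storage is modelled), the two key constants are invented, 
and NSDate is used in place of NSCalendarDate to keep the example 
self-contained:

```objective-c
#import <Foundation/Foundation.h>

// Stand-in for BCAbstractSequence; only the annotation storage is modelled.
@interface SketchSequence : NSObject
@property (nonatomic, readonly) NSMutableDictionary *annotations;
@end

@implementation SketchSequence
- (instancetype)init {
    if ((self = [super init])) {
        _annotations = [[NSMutableDictionary alloc] init];
    }
    return self;
}
@end

// Hypothetical keys for the predefined annotations.
static NSString * const SketchAuthorNameKey = @"author";
static NSString * const SketchCreationDateKey = @"creationdate";

// The category only adds typed accessors; the data stays in the dictionary.
@interface SketchSequence (AnnotationExtensions)
- (NSString *)authorname;
- (void)setAuthorname:(NSString *)author;
- (NSDate *)creationdate;
- (void)setCreationdate:(NSDate *)date;
@end

@implementation SketchSequence (AnnotationExtensions)
- (NSString *)authorname {
    return [self.annotations objectForKey:SketchAuthorNameKey];
}
- (void)setAuthorname:(NSString *)author {
    // setValue:forKey: removes the entry when the value is nil.
    [self.annotations setValue:[author copy] forKey:SketchAuthorNameKey];
}
- (NSDate *)creationdate {
    return [self.annotations objectForKey:SketchCreationDateKey];
}
- (void)setCreationdate:(NSDate *)date {
    [self.annotations setValue:date forKey:SketchCreationDateKey];
}
@end
```

Because the accessors read and write the same dictionary, free-form 
annotations and the predefined ones can coexist without extra storage.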

Cheers,
Alex

On 21-feb-05, at 7:42, Charles PARNOT wrote:

> At 9:46 PM -0500 2/20/05, Koen van der Drift wrote:
>> On Feb 20, 2005, at 2:05 PM, Charles PARNOT wrote:
>>
>>> It is because you have not #import-ed the BCSequenceProtein header, 
>>> so the compiler does not know it is a subclass of 
>>> BCAbstractSequence.
>>
>>
>> Thanks - I added some more code and fixes. If everyone agrees this is 
>> the right approach for the annotations, I will start adding more 
>> code.
>>
>>
>> cheers,
>>
>> - Koen.
>
> I do think the NSMutableDictionary is very appropriate for 
> annotations. Regarding the current implementation, it looks good to 
> me. My only comment is I don't think we need a 'setAnnotations:' 
> method. This is a bit dangerous, particularly with a mutable 
> dictionary as argument. Instead, methods 'removeAnnotationWithKey:', 
> 'removeAllAnnotations' and 'addAnnotationsFromDictionary:' will do the 
> job.
>
> BTW, I corrected some of the code because I needed the compiled 
> framework for the testing unit thing and there was a compiler error.
>
> Let me know, guys, if and when you want me to incorporate the tests in 
> the cvs?
>
> Also, Koen, I have one question about the symbolSet: it seems that all 
> instances of one sequence type use the same symbol set. Is that right? 
> Do you think it is going to stay like this, or are there special cases 
> where we will want to change that? If this is true, we could leverage 
> that knowledge to simplify the init methods of the sequence classes. 
> Let me know, I can explain better what I mean.
>
> charles
>
> -- 
> Help science go fast forward:
> http://cmgm.stanford.edu/~cparnot/xgrid-stanford/
>
> Charles Parnot
> charles.parnot at stanford.edu
>
> Room  B157 in Beckman Center
> 279, Campus Drive
> Stanford University
> Stanford, CA 94305 (USA)
>
> Tel +1 650 725 7754
> Fax +1 650 725 8021
> _______________________________________________
> Biococoa-dev mailing list
> Biococoa-dev at bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/biococoa-dev
>
>
*********************************************************
                     ** Alexander Griekspoor **
*********************************************************
               The Netherlands Cancer Institute
               Department of Tumorbiology (H4)
          Plesmanlaan 121, 1066 CX, Amsterdam
                     Tel:  + 31 20 - 512 2023
                     Fax:  + 31 20 - 512 2029
                     AIM: mekentosj at mac.com
                     E-mail: a.griekspoor at nki.nl
                 Web: http://www.mekentosj.com

Windows is a 32-bit patch to a 16-bit shell for an 8-bit
operating system, written for a 4-bit processor by a 2-
bit company without 1 bit of sense.

*********************************************************
