<HTML>

<HEAD>

<TITLE>Re: [Biococoa-dev] Annotation</TITLE>

</HEAD>

<BODY>

<BLOCKQUOTE><FONT FACE="Verdana, Helvetica, Arial"><SPAN STYLE='font-size:12.0px'><BR>

Second, while searching the web a bit, I came across the BSML XML format, which seems to be becoming a kind of standard for new sequence formats. It would perhaps be nice (and wise) to have a look at the documents they made, because they have (obviously) studied the annotation/feature issue very well.  <BR>

You can find more info at: <a href="http://www.bsml.org/">http://www.bsml.org/</a> <BR>

Now, just to make sure: BSML is a file format, and of course one we could implement. Internally the dictionary approach is the way to go for us, but it might be an idea to adhere to their nomenclature and/or tree/hierarchy. I have already come across some nice ideas to keep in mind: <BR>

<BR>

</SPAN></FONT><BLOCKQUOTE><FONT FACE="Verdana, Helvetica, Arial"><SPAN STYLE='font-size:12.0px'> As research proceeds on a given biological molecule, certain segments of the sequence become interesting for a variety of reasons. Sequence annotation is used to capture this extra information about the sequence data. Positional annotation refers to annotations that are specific to a portion of a sequence. In BSML, positional annotation is captured through Feature tags. Feature tags are child tags of a sequence tag, and therefore a Feature is related to a single sequence. For example, the following tag indicates that the region between 1513 and 1962 encodes a particular gene: <BR>

</SPAN></FONT><SPAN STYLE='font-size:12.0px'><FONT FACE="Arial"><B> <BR>

</B></FONT><FONT COLOR="#006312"><FONT FACE="Verdana, Helvetica, Arial"> &lt;Feature id="FTR4" title="Leucine TNRA" class="GENE"&gt; <BR>

 &lt;Qualifier value-type="gene"/&gt; <BR>

 &lt;Interval-loc startpos="1513" endpos="1962" <BR>

 complement="0"/&gt; <BR>

 &lt;/Feature&gt;</FONT></FONT><FONT COLOR="#008000"><FONT FACE="Arial"><B> <BR>

</B></FONT></FONT></SPAN></BLOCKQUOTE><SPAN STYLE='font-size:12.0px'><FONT FACE="Verdana, Helvetica, Arial"> <BR>

So a feature is defined as a "positional annotation", which is a nice definition that I had in mind as well. Of course, features bring the extra problem that their positions have to be kept in sync with the sequence during editing. Therefore it is perhaps better to keep a dictionary of annotations and a separate dictionary of features internally.  </FONT><FONT COLOR="#008000"><FONT FACE="Arial"><B> <BR>

</B></FONT></FONT></SPAN></BLOCKQUOTE><SPAN STYLE='font-size:12.0px'><FONT FACE="Verdana, Helvetica, Arial"> <BR>

<BR>

This is nice, and we should try for compatibility, but it is a bit difficult to make it work as a dictionary.  The nice part is that it has a uniqueID, name, and class.  The bad part is that they’re all part of the same compound field, so they don’t work nicely as the dictionary key.  <BR>
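<BR>

One possible way around the compound field, sketched in Python purely for illustration (this is not BioCocoa code): split the Feature attributes into separate entries and use only the unique ID as the dictionary key. The field names "start", "end" and "complement" and the helper add_feature are assumptions made up for this sketch; the values are taken from the BSML example quoted above. <BR>

<PRE>
```python
# Hypothetical in-memory form of the quoted BSML Feature tag.
# Only the unique id is used as the dictionary key; title and class
# become ordinary fields of the stored record.
feature = {
    "id": "FTR4",
    "title": "Leucine TNRA",
    "class": "GENE",
    "start": 1513,
    "end": 1962,
    "complement": False,
}

features_by_id = {}


def add_feature(table, record):
    """Store a feature record under its id, refusing duplicate ids."""
    key = record["id"]
    if key in table:
        raise KeyError("duplicate feature id: " + key)
    table[key] = record


add_feature(features_by_id, feature)
print(features_by_id["FTR4"]["class"])
```
</PRE>

With this layout a lookup by ID is a single dictionary access, while the name and class stay available for display or filtering. <BR>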

<BR>

A related issue is that it would be really nice to be able to get the annotations for every exon or every ORF without having to enumerate through the keys of the whole dictionary and check a field in each.  There are two ways I can think of to do this.  One is to keep, within the annotation wrapper, an array for each feature type and put each feature into the appropriate array as it is added.  The alternative would be to make sure we write the appropriate code to do the enumeration.  Personally, for performance reasons, I’d favor the first.<BR>
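<BR>

A rough sketch of the first option, again in Python only as an illustration and not as the actual wrapper: the main dictionary is keyed by feature ID, and a second table holds one list per feature type that is updated on every add and remove. The class name FeatureStore, its method names, and the sample IDs FTR5 and FTR6 are all hypothetical. <BR>

<PRE>
```python
from collections import defaultdict


class FeatureStore:
    """Features keyed by id, plus one list per feature type."""

    def __init__(self):
        self._by_id = {}
        self._by_type = defaultdict(list)

    def add(self, feature_id, feature_type, start, end):
        """Add a feature and file it under its type as well."""
        if feature_id in self._by_id:
            raise KeyError("duplicate feature id: " + feature_id)
        record = {"id": feature_id, "type": feature_type,
                  "start": start, "end": end}
        self._by_id[feature_id] = record
        self._by_type[feature_type].append(record)

    def remove(self, feature_id):
        """Remove a feature from both tables so they stay in step."""
        record = self._by_id.pop(feature_id)
        self._by_type[record["type"]].remove(record)

    def of_type(self, feature_type):
        """Return all features of one type without scanning every key."""
        return list(self._by_type.get(feature_type, []))


store = FeatureStore()
store.add("FTR4", "GENE", 1513, 1962)
store.add("FTR5", "EXON", 1513, 1600)  # made-up exon for the example
store.add("FTR6", "EXON", 1700, 1962)  # made-up exon for the example
print([f["id"] for f in store.of_type("EXON")])
```
</PRE>

The cost of this design is that every add and remove has to touch both tables; in exchange, fetching all features of one type no longer depends on the total number of features. <BR>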

<BR>

JT<BR>

<BR>

<BR>

</FONT><FONT FACE="Georgia, Times New Roman"><BR>

</FONT><FONT FACE="Verdana, Helvetica, Arial">_______________________________________________<BR>

This mind intentionally left blank<BR>

</FONT></SPAN>

</BODY>

</HTML>