[Biococoa-dev] SequenceIO
Charles Parnot
charles.parnot at gmail.com
Wed Jun 29 18:35:18 EDT 2005
> Changing the code was not that difficult and I will commit the
> files soon, so everyone can see what is going on. That being said,
> I am running into the following problem. Some file formats have many
> lines with annotations, e.g. the test2.txt file in the Translation
> example. As you can see some lines have the same identifier (DT,
> OC, etc). If I use that as the key, the final dictionary will only
> contain the last line, because it will override existing keys. I
> can think of a few solutions. The first, which is what I do now, is to
> append the values to the existing one, leaving only one line with each
> identifier. This works fine, but could give problems if we want to
> write the files out, because we don't know where the different
> lines begin and end. We could of course put some kind of marker
> in between the strings, so we know where each next one begins.
> Another solution could be to assign numbers to identifiers with
> multiple lines, ID1, ID2, ID3, etc. The problem here is that this
> will cause problems when searching for a specific key. My preference
> for now would be the first solution, but if anyone has a better
> suggestion, please shout.
Yes, concatenating all the lines, separated by a newline, seems very
reasonable, and easy to revert. You can use
'componentsJoinedByString:' and 'componentsSeparatedByString:', using
@"\n" as the separator (...or @"\r"???).
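
A minimal sketch of that round trip, written in Python rather than
Cocoa just to keep it short. The split/join calls play the role of
'componentsSeparatedByString:' and 'componentsJoinedByString:'. The
helper names and the sample lines are made up for illustration; they
are not taken from the BioCocoa code or from test2.txt.

```python
def add_annotation(annotations, key, value, separator="\n"):
    """Store value under key, appending to any existing entry.

    Repeated identifiers (DT, OC, ...) end up as one string whose
    original lines are joined by the separator.
    """
    if key in annotations:
        annotations[key] = annotations[key] + separator + value
    else:
        annotations[key] = value


def annotation_lines(annotations, key, separator="\n"):
    """Recover the original lines for key, e.g. to write the file back out."""
    return annotations[key].split(separator)


# Hypothetical (identifier, content) pairs, as a parser might produce them.
raw_lines = [
    ("DT", "first date line"),
    ("DT", "second date line"),
    ("OC", "Eukaryota; Metazoa;"),
    ("OC", "Chordata."),
]

annotations = {}
for key, value in raw_lines:
    add_annotation(annotations, key, value)
```

This only reverts cleanly if the separator never occurs inside a
single annotation line, which should hold for a newline when the
input is read line by line.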
> Another issue is nested annotations. Again see the test2.txt file
> and look for RN (for reference). It is followed by a set of
> identifiers for the references, and then is followed by another
> reference. I guess I could put the subannotations in a new
> dictionary, and put those in the content of the RN annotation. A
> similar issue can be found in NCBI files (see test4.txt).
Nested annotations are a big issue, particularly regarding sequence
position. We have to come up with something good...
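
As a starting point for discussion, here is a rough sketch (in Python,
for brevity) of the sub-dictionary idea: every RN line opens a new
dictionary, and the lines that follow it are collected there until
the block ends. Two assumptions are mine and not from the test
files: that the sub-identifiers of a reference all start with "R",
and the sample lines themselves. It does not address the sequence
position problem at all.

```python
def group_references(lines, start_key="RN"):
    """Group flat (identifier, content) pairs into one dict per reference.

    A start_key line opens a new reference. Following lines whose
    identifier starts with "R" (an assumption) are stored in it.
    Any other line closes the current reference and is ignored here.
    """
    references = []
    current = None
    for key, value in lines:
        if key == start_key:
            current = {start_key: value}
            references.append(current)
        elif current is not None and key.startswith("R"):
            # Repeated sub-identifiers are joined with a newline.
            if key in current:
                current[key] = current[key] + "\n" + value
            else:
                current[key] = value
        else:
            current = None
    return references


# Hypothetical input, not copied from test2.txt.
sample = [
    ("RN", "[1]"),
    ("RA", "Author A."),
    ("RA", "Author B."),
    ("RT", "Title one"),
    ("RN", "[2]"),
    ("RA", "Author C."),
    ("DT", "some date"),
]
references = group_references(sample)
```

The resulting list could then be stored as the content of the RN
annotation, one dictionary per reference.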
thanks, Koen, for all the work!
charles
--
Xgrid-at-Stanford
Help science move fast forward:
http://cmgm.stanford.edu/~cparnot/xgrid-stanford
Charles Parnot
charles.parnot at gmail.com