[Biococoa-dev] SequenceIO

Tue Jun 28 20:15:30 EDT 2005

Hi,

Where is everyone? Enjoying a vacation, or hard at work, or passed out 
from the heatwave :)

I started thinking again about the IO classes. Right now, 
BCSequenceReader returns a dictionary containing one or more sequences 
as the values, with either a description or a title as the keys. This 
allows files containing multiple sequences to be read into a single 
dictionary, but accessing the sequences is not so straightforward. 
Basically, the user first needs to get the key for the sequence from 
an array of keys, and then use that key to obtain the sequence from 
the dictionary. This seems rather cumbersome, I think.

Therefore I propose that BCSequenceReader simply return an array of 
objects. We can either store BCSequence objects in the array or create 
some kind of wrapper for each sequence, e.g. a new SequenceIO class. 
Annotations and features are now handled in the BCSequence class, so 
they can be added in the IO code.
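
As a rough illustration of the difference between the two return 
types, here is a small sketch. It is written in Python rather than 
Objective-C purely to keep it short, and the data and names are made 
up for the example; none of it is BioCocoa API.

```python
# Hypothetical data standing in for what a reader might return.
# Current behaviour: a dictionary keyed by description or title.
as_dict = {"seq1 human": "ATGC", "seq2 mouse": "GGCA"}

# Two steps: fetch the keys, then look up the sequence by key.
keys = list(as_dict.keys())
first_from_dict = as_dict[keys[0]]

# Proposed behaviour: an array of sequence objects (plain dicts here),
# each carrying its own annotations.
as_array = [
    {"sequence": "ATGC", "annotations": {">": "seq1 human"}},
    {"sequence": "GGCA", "annotations": {">": "seq2 mouse"}},
]

# One step: index straight into the array.
first_from_array = as_array[0]["sequence"]
```

With the array, the description is still available through the 
sequence's own annotations, so nothing is lost by dropping the 
dictionary keys.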

So for a simple FASTA file we would have an array of sequences, each 
with one annotation, with the key @">" and as the value whatever string 
follows the ">" on the first line.  For a more complicated sequence 
format, e.g. SwissProt, basically all annotations are read in line by 
line, using the file-specific keys (@"ID", @"AC", @"DT" etc.). Then 
when it hits the sequence, we can create a BCSequence object, and at 
the end store the annotations in the BCSequence. I suggest the keys 
should be whatever the file format uses, but for some common 
annotations, like author or organism, we could supply some more 
human-readable accessor methods.
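
The reading scheme described above could look roughly like the 
following sketch. It is in Python rather than Objective-C for brevity, 
and everything in it is an assumption made for illustration: the 
Sequence class is only a stand-in for BCSequence, and read_fasta, 
read_swissprot and organism are hypothetical names, not existing 
BioCocoa methods. The SwissProt handling is deliberately simplified 
(two-letter line codes, an "SQ" line before the residues, "//" ending 
each entry).

```python
from dataclasses import dataclass, field


@dataclass
class Sequence:
    """Hypothetical stand-in for BCSequence: residues plus annotations."""
    residues: str = ""
    annotations: dict = field(default_factory=dict)

    def organism(self):
        # Illustrative human-readable accessor; "OS" is the SwissProt
        # line code for the organism. Returns None if it is absent.
        return self.annotations.get("OS")


def read_fasta(text):
    """Return a list of Sequence objects, one per '>' header line."""
    sequences = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The single annotation: key ">" and the rest of the line.
            sequences.append(Sequence(annotations={">": line[1:].strip()}))
        elif sequences:
            sequences[-1].residues += line
    return sequences


def read_swissprot(text):
    """Collect annotations line by line, then attach them to the sequence."""
    sequences = []
    annotations, residues, in_sequence = {}, [], False
    for line in text.splitlines():
        if line.startswith("//"):
            # End of entry: create the sequence and store its annotations.
            sequences.append(Sequence("".join(residues), annotations))
            annotations, residues, in_sequence = {}, [], False
        elif in_sequence:
            residues.append(line.replace(" ", ""))
        elif line.startswith("SQ"):
            in_sequence = True
        elif len(line) > 2:
            # Keys are the file format's own two-letter codes;
            # repeated codes are joined with a space.
            key, value = line[:2], line[2:].strip()
            annotations[key] = (annotations.get(key, "") + " " + value).strip()
    return sequences
```

Both readers return a plain array, so a caller can take the first 
sequence with an index, read format-specific annotations by their 
original keys, and use an accessor such as organism() where a common 
name makes sense.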

cheers,

- Koen.