Koen van der Drift
kvddrift at earthlink.net
Tue Jun 28 20:15:30 EDT 2005
Where is everyone? Enjoying a vacation, or hard at work, or passed out
from the heatwave :)
I started thinking again about the IO classes. Right now,
BCSequenceReader returns a dictionary with one or more sequences as the
values and either a description or a title as the keys. This allows
files containing multiple sequences to be read into the dictionary, but
accessing the sequences is not so straightforward. The user first
needs to get the key for a sequence from an array of keys, and then use
that key to obtain the sequence from the dictionary. This seems rather
cumbersome, I think.
Therefore I propose that BCSequenceReader simply returns an array of
objects. We can either store BCSequence objects in the array or create
some kind of wrapper for each sequence, e.g. a new SequenceIO class.
Annotations and features are now handled in the BCSequence class, so
they can be added in the IO code.
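
To make the difference in access patterns concrete, here is a rough
sketch in Python rather than Objective-C, purely for brevity. The
variable names and the sample sequences are made up for illustration
and are not part of BioCocoa.

```python
# Hypothetical stand-ins for what a reader might return; not BioCocoa API.

# Current design: a dictionary keyed by description or title.
sequences_by_title = {
    "seq1 human": "MKTAYIAK",
    "seq2 mouse": "MKTAYLAK",
}

# Two steps: fetch the array of keys, then look the sequence up by key.
keys = list(sequences_by_title.keys())
first_from_dict = sequences_by_title[keys[0]]

# Proposed design: a plain array of sequences, indexed directly.
sequences = ["MKTAYIAK", "MKTAYLAK"]
first_from_array = sequences[0]
```

With the array, the caller no longer needs to know or care what the
keys are in order to reach the first (or only) sequence in a file.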
So for a simple FASTA file we would have an array of sequences, each
with one annotation: the key @">" and, as the value, whatever string
follows the ">" on the first line. For a more complicated sequence
format, e.g. SwissProt, basically all annotations are read in line by
line, using the file-specific keys (@"ID", @"AC", @"DT" etc). Then when
it hits the sequence, we can create a BCSequence object, and at the end
store the annotations in the BCSequence. I suggest the keys should be
whatever the file format uses, but for some common annotations, like
author or organism, we could supply some more human-readable accessor
methods.
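
The idea above could look roughly like the following. This is only a
sketch under several assumptions: it is written in Python for brevity,
the Sequence class, read_fasta, read_swissprot and the organism
accessor are hypothetical names standing in for BCSequence and the
reader code, and the SwissProt handling covers just the basics (a
two-letter line code with data starting at column 6, an "SQ" header
before the residues, and "//" ending an entry).

```python
from dataclasses import dataclass, field


@dataclass
class Sequence:
    """Hypothetical stand-in for BCSequence: residues plus annotations."""
    residues: str = ""
    annotations: dict = field(default_factory=dict)

    @property
    def organism(self):
        # Human-readable accessor on top of the file-specific "OS" key.
        return self.annotations.get("OS")


def read_fasta(text):
    """Return a list of Sequence objects, one per '>' header line."""
    sequences = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The single annotation: key ">" and the rest of the header line.
            sequences.append(Sequence(annotations={">": line[1:].strip()}))
        elif sequences:
            sequences[-1].residues += line
    return sequences


def read_swissprot(text):
    """Collect annotations line by line, then attach them to the sequence."""
    sequences, annotations, residues = [], {}, []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):
            # End of entry: build the sequence and store its annotations.
            sequences.append(Sequence("".join(residues), annotations))
            annotations, residues, in_sequence = {}, [], False
        elif in_sequence:
            residues.append(line.replace(" ", ""))
        elif line.startswith("SQ"):
            in_sequence = True
        elif len(line) > 5:
            # Keep the file's own two-letter code as the annotation key.
            key, value = line[:2], line[5:].strip()
            annotations[key] = (annotations.get(key, "") + " " + value).strip()
    return sequences
```

Both readers return the same kind of array, so calling code can treat
a FASTA file and a SwissProt file alike and only look at the
annotation keys when it needs format-specific details.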