jtimmer at bellatlantic.net
Thu Nov 11 11:05:42 EST 2004
A couple of additional thoughts -
I agree with Alex that simply calculating the percentage of G, C, A and T
should give you a strong sense of what the sequence is in the majority of
cases. There might be a very quick way of doing that, though I'm not sure.
T vs. U should also be considered: maybe count the Us first, then decide
whether to count GCAT or GCAU.
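A rough sketch of that count, written in Python only to keep it short (it
is not BioCocoa code). The function name is made up for illustration, and
the rule "use GCAU when there are more Us than Ts" is just one assumed way
of making the T vs. U decision:

```python
def nucleotide_fraction(sequence):
    """Return (fraction, alphabet) for a raw sequence string.

    fraction is the share of letters that fall in the chosen four-letter
    nucleotide alphabet; alphabet is "GCAT" or "GCAU".
    """
    # Ignore whitespace, digits and gap characters; compare in upper case.
    letters = [c for c in sequence.upper() if c.isalpha()]
    if not letters:
        return 0.0, "GCAT"

    # Count the Us first, then decide which alphabet to count against.
    # Assumption: more Us than Ts means we should count GCAU.
    alphabet = "GCAU" if letters.count("U") > letters.count("T") else "GCAT"

    hits = sum(1 for c in letters if c in alphabet)
    return hits / len(letters), alphabet
```

A typical protein sequence still scores above zero here, since A, C, G and
T are also amino acid codes, so the caller has to pick a cutoff rather than
expect an all-or-nothing answer.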
Another thing is that FASTA provides a comment line, which often indicates
what the sequence is, though I'm pretty sure this isn't standardized (i.e.,
people are probably free to call a DNA sequence a "protein coding region",
so picking out basic terms from the comments will probably fail).
Most important, I think, is that there be a defined order of assumptions.
Explicitly state which conditions are tested, in which order, and what the
fallback is, so people know what they're getting into.
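To make the idea of an explicit test order concrete, here is a minimal
sketch in Python (again, not BioCocoa code). The function name, the 0.9
cutoff, the "more Us than Ts" rule and the default fallback of "protein"
are all assumptions chosen for illustration. The only hard fact relied on
is that E, F, I, L, P and Q are amino acid codes but not IUPAC nucleotide
codes:

```python
# Letters that are amino acid codes but not IUPAC nucleotide codes.
PROTEIN_ONLY = set("EFILPQ")

# Assumed cutoff for "mostly nucleotide"; tune to taste.
NUCLEOTIDE_THRESHOLD = 0.9


def guess_sequence_type(sequence, fallback="protein"):
    """Guess "dna", "rna" or "protein" using a fixed, documented order.

    Order of tests:
      1. No letters at all           -> fallback
      2. Any protein-only letter     -> "protein"
      3. GCAT/GCAU share >= cutoff   -> "dna" or "rna"
      4. Anything else               -> fallback
    """
    letters = [c for c in sequence.upper() if c.isalpha()]

    # 1. Nothing to inspect.
    if not letters:
        return fallback

    # 2. A single unambiguous amino acid code settles it.
    if any(c in PROTEIN_ONLY for c in letters):
        return "protein"

    # 3. Count the Us, pick the alphabet, then measure its share.
    use_rna = letters.count("U") > letters.count("T")
    alphabet = "GCAU" if use_rna else "GCAT"
    fraction = sum(1 for c in letters if c in alphabet) / len(letters)
    if fraction >= NUCLEOTIDE_THRESHOLD:
        return "rna" if use_rna else "dna"

    # 4. Ambiguous input: return the stated fallback.
    return fallback
```

With the order written down like this, a caller can see that a nucleotide
sequence made up mostly of N or other ambiguity codes ends up at the
fallback, and can pass a different fallback if "protein" is the wrong
default for their data.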
One last thing: I think FASTA defines an alignment format, too. Does the
existing code account for this?
> I have added an initial attempt for a new class BCSequenceReader. I
> also added some code to the translation demo to test this. I am using
> the original code from Peter, so the code figures out what the format
> of the data is. For now I have only added a readFasta method. Fasta
> files (and other formats as well) can contain DNA sequences or protein
> sequences. But how do I figure out which of the two I am dealing with,
> so I can return the proper subclass of BCSequence? Any suggestions how
> to approach this?
> - Koen.
This mind intentionally left blank