[Biococoa-dev] BCSequenceReader

Thu Nov 11 11:05:42 EST 2004

A couple of additional thoughts -

I agree with Alex that simply calculating the percentage of G, C, A and T
should give you a strong sense of what the sequence is in the majority of
situations: a nucleotide sequence should consist almost entirely of those
four letters, while a protein will not.  There might be a very quick way of
doing that, though I'm not sure.  T vs. U should also be considered - maybe
count the Us first, then decide whether to count GCAT or GCAU.
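
To make the counting idea concrete, here is a rough sketch in Python (not
BioCocoa code).  The function name, the 0.9 cutoff and the "more Us than Ts"
rule are my own assumptions for illustration; ambiguity codes such as N are
not handled and would count against the nucleotide fraction.

```python
def guess_alphabet(sequence, threshold=0.9):
    """Guess 'dna', 'rna' or 'protein' from letter composition.

    Hypothetical helper: the threshold is an assumed cutoff, not a
    value anyone on the list has agreed on.
    """
    # Ignore gaps, digits and whitespace; compare case-insensitively.
    letters = [c for c in sequence.upper() if c.isalpha()]
    if not letters:
        return "unknown"

    # Count the Us first, then decide whether to count GCAT or GCAU.
    alphabet = "GCAU" if letters.count("U") > letters.count("T") else "GCAT"

    # Fraction of letters that belong to the chosen nucleotide alphabet.
    fraction = sum(c in alphabet for c in letters) / len(letters)
    if fraction >= threshold:
        return "rna" if alphabet == "GCAU" else "dna"
    return "protein"
```

Calling guess_alphabet("ATGCGTACGTTAGC") gives "dna", and a typical peptide
string such as "MKTAYIAKQRQISFVKSHFSRQ" falls through to "protein".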

Another thing is that FASTA provides a comment line, which often indicates
what the sequence is, though I'm pretty sure this isn't standardized (i.e.,
people are free to call a DNA sequence a "protein coding region", so
matching on basic terms in the comment will probably fail).

The third thought is that what matters most is having a defined order of
assumptions.  Explicitly state which conditions are tested in which order
and what the fallback is, so people know what they're getting into.
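
One way to keep the order explicit is to write the checks down as an ordered
list, with the fallback named separately.  This is only a sketch in Python;
the rule names, the 0.9 cutoff and the choice of "protein" as the fallback
are assumptions for illustration, not a description of the existing code.

```python
def mostly(sequence, alphabet, cutoff=0.9):
    """True if at least `cutoff` of the letters come from `alphabet`."""
    letters = [c for c in sequence.upper() if c.isalpha()]
    if not letters:
        return False
    return sum(c in alphabet for c in letters) / len(letters) >= cutoff


# Checks run top to bottom and the first match wins.
# Both the rules and their order are hypothetical.
CHECKS = [
    ("rna", lambda s: "U" in s.upper() and mostly(s, "GCAU")),
    ("dna", lambda s: mostly(s, "GCAT")),
]

# Used when no check matches (including empty input).
FALLBACK = "protein"


def classify(sequence):
    """Return (label, reason) so the caller can see which rule applied."""
    for label, test in CHECKS:
        if test(sequence):
            return label, "matched rule '%s'" % label
    return FALLBACK, "no rule matched; used fallback"
```

Returning the reason along with the label means the caller can always tell
whether a result came from a real match or from the fallback.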

The last thing is that I think FASTA defines an alignment format, too - does
the existing code account for this?
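
For what it's worth, aligned FASTA as I understand it is just several records
padded to the same length with '-' gap characters.  Here is a rough Python
sketch of reading the records and spotting that case; the function names and
the heuristic are my own assumptions, and I don't know what the existing
reader does with such input.

```python
def read_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def looks_aligned(records):
    """Heuristic: two or more records, equal lengths, at least one gap."""
    if len(records) < 2:
        return False
    lengths = {len(seq) for _, seq in records}
    has_gap = any("-" in seq for _, seq in records)
    return len(lengths) == 1 and has_gap
```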

Cheers,

JT

> I have added an initial attempt for a new class BCSequenceReader. I
> also added some code to the translation demo to test this. I am using
> the original code from Peter, so the code figures out what the format
> of the data is. For now I have only added a readFasta method. Fasta
> files (and other formats as well) can contain DNA sequences or protein
> sequences. But how do I figure out which of the two I am dealing with,
> so I can return the proper subclass of BCSequence? Any suggestions how
> to approach this?
> 
> thanks,
> 
> - Koen.
