Koen van der Drift
kvddrift at earthlink.net
Sat Nov 13 19:49:01 EST 2004
On Nov 13, 2004, at 9:44 AM, John Timmer wrote:
> All the sequence classes use [symbol undefined] of the appropriate
> if they hit a character they can't recognize. Koen also put the
> sequenceCountedSet code in. Simply send the string to each of the
> sequence classes, then use the counted set to determine the one which
> results in the fewest undefined symbols. If the number turns out to be
> equal, use DNA > RNA > protein to decide which sequence to use so that
> we can stay within the central dogma.
> The code should be very clean and easy to follow, though it may not be
> as fast as I'd like, given there's three sequence objects created and
That's a problem, I agree. But this situation is not going to happen
that often, because in most cases the user probably knows what format
is used. However, we should be prepared for such cases. I suggest we
use a sequence factory class that creates sequences in one central
location, instead of scattering that code throughout the framework in
classes that might encounter such situations. I will look into this
over the weekend to see if I can get it to work.
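
To make the idea concrete, here is a rough sketch of the guessing step such a factory could run: count the unrecognized symbols for each sequence type, keep the type with the fewest, and fall back to DNA > RNA > protein on a tie. It is written in Python only for brevity and is not BioCocoa code. The names `ALPHABETS`, `count_undefined` and `guess_sequence_type` are hypothetical, and the alphabets are simplified assumptions that leave out ambiguity codes.

```python
from collections import Counter

# Assumed, simplified alphabets (no ambiguity codes), listed in the
# tie-break order from the thread: DNA > RNA > protein.
ALPHABETS = [
    ("DNA", set("ACGT")),
    ("RNA", set("ACGU")),
    ("protein", set("ACDEFGHIKLMNPQRSTVWY")),
]


def count_undefined(sequence, alphabet):
    """Count the characters in `sequence` that `alphabet` does not recognize."""
    counts = Counter(sequence.upper())
    return sum(n for symbol, n in counts.items() if symbol not in alphabet)


def guess_sequence_type(sequence):
    """Return the name of the type with the fewest undefined symbols.

    The strict '<' comparison keeps the earlier entry on a tie, which
    gives the DNA > RNA > protein preference.
    """
    best_name, best_undefined = None, None
    for name, alphabet in ALPHABETS:
        undefined = count_undefined(sequence, alphabet)
        if best_undefined is None or undefined < best_undefined:
            best_name, best_undefined = name, undefined
    return best_name
```

With these assumed alphabets, a string such as "ACG" is valid for all three types, so the tie-break decides and it is treated as DNA. That is exactly the short-sequence weakness raised further down in the thread.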
> My previous thoughts follow - disregard them unless you think the
> above is a bad idea:
I don't know yet :)
> Where this is going to work poorly is very short sequences, like
> sites - I think we should only enter this code if the sequence is over
> or so. Maybe we should just treat anything under 10 characters as a
I would call it a peptide then ;-)
> One other thought - I know the nucleotides have a non-base character,
> you also have code for
Actually, proteins can have ambiguous symbols as well; I still need to
update the BCSymbolAminoAcid class. I will post another message on this
subject in a new thread.
> And that guy who answered your email was VERY optimistic in assuming
> an accession number in the comment field....
Yeah, that's not going to work.