[Biococoa-dev] NSData

Wed Jul 20 13:54:18 EDT 2005

> The whole translation stuff was set up by John in one flight over the
> Atlantic (or was it the Channel?). It needed some work anyway, but
> the disappearance of BCSequenceCodon complicates things quite a bit.
> We could keep a private BCSequenceCodon class for now, or move the
> relevant code to the translation tool. I like the idea of a
> BCSequenceCodon type, though. Well, some more debate to come!

It was definitely transatlantic - I couldn't type fast enough to do it
across the Channel, much less think fast enough.

Some of the design decisions behind it:

Initially I didn't want to make a codon class, since they're a bit odd (a
sequence and two types of symbols all wrapped in a single class), but Alex
convinced me they'd be a good idea.  In the end, I got them to work
reasonably efficiently.  They encapsulate the codon matching pretty well,
and the ambiguous symbols handle the wobble base very cleanly.  Also,
having them encapsulate symbols and contain sequences meant that many of the
optimizations we made on these classes sped up translation "for free", and
translation in turn made a nice test case to identify slow code in these
classes.
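
To illustrate the wobble handling, here is a rough Python sketch.  It is
not the BioCocoa code: the names Codon and symbol_matches are made up for
this example, and the only facts it relies on are the IUPAC ambiguity
codes and the standard genetic code (GCN is alanine, GAY is aspartate).

```python
# IUPAC nucleotide codes mapped to the concrete bases each one can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def symbol_matches(pattern, base):
    """True if every base that `base` may represent is allowed by `pattern`."""
    return set(IUPAC[base]) <= set(IUPAC[pattern])


class Codon:
    """Hypothetical stand-in: three nucleotide symbols plus one amino acid."""

    def __init__(self, triplet, amino_acid):
        self.triplet = triplet
        self.amino_acid = amino_acid

    def matches(self, triplet):
        # Compare position by position; the ambiguous third symbol
        # absorbs the wobble base without any special-case code.
        return len(triplet) == 3 and all(
            symbol_matches(p, b) for p, b in zip(self.triplet, triplet)
        )


# One codon object covers a whole family of triplets.
ALANINE = Codon("GCN", "A")    # GCT, GCC, GCA, GCG
ASPARTATE = Codon("GAY", "D")  # GAT, GAC
```

With this, ALANINE.matches("GCT") is true and ASPARTATE.matches("GAA") is
false, since GAA encodes glutamate rather than aspartate.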

Once we had codons, which represent an intermediate in translation, a codon
sequence to represent the intermediate state seemed like an obvious choice.
This allows you to retain the results of a translation and extract different
bits of information (how many ORFs?  What's the longest?  Give me its
description for debugging purposes) from it without re-translating every
time.  This didn't seem like a waste of memory, since anyone not caring
about the translation intermediary would probably just grab the resulting
protein sequence anyway and dispose of the codon sequence.
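
The idea of keeping the intermediate around can be sketched roughly as
follows, in Python rather than Objective-C.  The class and method names
are invented for the example, and an ORF is assumed here to run from the
first ATG to the next in-frame stop; the real classes may define it
differently.

```python
from itertools import product

# Standard genetic code, listed in the conventional TCAG codon order.
BASES = "TCAG"
STANDARD = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), STANDARD)}


class CodonSequence:
    """Hypothetical sketch: translate once, then answer several questions."""

    def __init__(self, dna, frame=0):
        dna = dna.upper()[frame:]
        # Split into complete triplets; a trailing partial codon is dropped.
        self.codons = [dna[i:i + 3] for i in range(0, len(dna) - 2, 3)]
        self.residues = [TABLE[c] for c in self.codons]

    def orfs(self):
        """(start, stop) codon indices of every ORF that ends in a stop codon."""
        found, start = [], None
        for i, aa in enumerate(self.residues):
            if aa == "M" and start is None:
                start = i
            elif aa == "*" and start is not None:
                found.append((start, i))
                start = None
        return found

    def longest_orf(self):
        return max(self.orfs(), key=lambda se: se[1] - se[0], default=None)

    def protein(self, orf):
        start, stop = orf
        return "".join(self.residues[start:stop])

    def describe(self):
        """Readable dump of each codon and its residue, for debugging."""
        return " ".join(f"{c}({aa})" for c, aa in zip(self.codons, self.residues))


cs = CodonSequence("ATGGCCTAAATGAAATTTTGA")
```

Here cs.orfs() finds two ORFs and cs.protein(cs.longest_orf()) gives
"MKF", all from the single translation done in the constructor.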

As structured, however, I don't think it cached translations when you did
things like switch reading frames or genetic codes - adding that feature
would make it even more useful (you can ask it more questions, like what's
the longest ORF in any reading frame?  Could this be a mitochondrial gene?),
although it would require it to wrap multiple sequences, rather than being a
sequence object.  I might have tried adding this to the tool class, but I
can't remember.
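
A wrapper along those lines might look something like this Python sketch.
It is only an illustration of the caching idea, not anything that exists
in the framework: TranslationCache and its methods are hypothetical, and
the two tables are the standard and vertebrate mitochondrial genetic
codes.

```python
from itertools import product

BASES = "TCAG"
# Amino acids in the conventional TCAG codon order for each genetic code.
CODES = {
    "standard":
        "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
    "vertebrate_mito":
        "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG",
}


def code_table(name):
    return {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), CODES[name])}


class TranslationCache:
    """Hypothetical wrapper holding one translation per (frame, code) pair."""

    def __init__(self, dna):
        self.dna = dna.upper()
        self._cache = {}
        self.translations_run = 0  # counts real translations, for illustration

    def protein(self, frame=0, code="standard"):
        key = (frame, code)
        if key not in self._cache:
            table = code_table(code)
            seq = self.dna[frame:]
            self._cache[key] = "".join(
                table[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3)
            )
            self.translations_run += 1
        return self._cache[key]

    def longest_orf(self, code="standard"):
        """(frame, residues) of the longest stop-terminated ORF in any frame."""
        best = (None, "")
        for frame in range(3):
            # Every piece except the last was terminated by a stop codon.
            for piece in self.protein(frame, code).split("*")[:-1]:
                start = piece.find("M")
                if start >= 0 and len(piece) - start > len(best[1]):
                    best = (frame, piece[start:])
        return best


cache = TranslationCache("ATGGCCTGAAAATAA")
```

Asking cache.longest_orf() and then cache.longest_orf("vertebrate_mito")
gives "MA" and "MAWK" respectively, because TGA is a stop in the standard
code but tryptophan in the mitochondrial one.  Repeating either question
runs no further translations.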

Incidentally, regarding the deletion of symbols from the initial .plist
dictionary you asked about:  The idea is that a user could add a custom
symbol (say they were playing with synthetic nucleotides) simply by adding
the information to the .plist file, rather than by coding it.  Since such a
symbol has no code of its own, it can't exist as a hard-coded singleton;
instead we retained a singleton dictionary built from the .plist file.  Any
non-standard symbol can be retrieved from that at any time (there's a
method for it), so they're effectively singletons as well.

The symbol deletions are simply getting rid of the standard bases from this
dictionary.  This should cut down memory use a tiny bit, and should also
speed the lookup of custom symbols, by having many fewer key/value pairs in
the dictionary.  I have no idea if it really does help in a significant way,
so it's a bit of a premature optimization, but it was one line of copy/paste
code, so it seemed like a reasonable choice.
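
For anyone who wants to see the shape of this, here is a minimal Python
sketch using the standard plistlib module.  Everything specific in it is
an assumption made for the example: the .plist layout, the Symbol class,
the symbol_for function and the synthetic base "X" are all invented, and
the real file and method names will differ.

```python
import plistlib

# Invented .plist content: four standard bases plus one custom symbol.
PLIST = b"""<?xml version="1.0" encoding="UTF-8"?>
<plist version="1.0">
<dict>
  <key>A</key><dict><key>name</key><string>adenine</string></dict>
  <key>C</key><dict><key>name</key><string>cytosine</string></dict>
  <key>G</key><dict><key>name</key><string>guanine</string></dict>
  <key>T</key><dict><key>name</key><string>thymine</string></dict>
  <key>X</key><dict><key>name</key><string>synthetic base X</string></dict>
</dict>
</plist>"""


class Symbol:
    """Hypothetical symbol: a one-letter code and a descriptive name."""

    def __init__(self, code, name):
        self.code = code
        self.name = name


raw = plistlib.loads(PLIST)

# Standard bases become ordinary singletons, and their entries are
# removed from the dictionary as they are created.
STANDARD = {code: Symbol(code, raw.pop(code)["name"]) for code in "ACGT"}

# Whatever is left is user-defined; one shared dictionary holds these.
CUSTOM = {code: Symbol(code, info["name"]) for code, info in raw.items()}


def symbol_for(code):
    """Return the shared Symbol for `code`, or None if it is unknown."""
    if code in STANDARD:
        return STANDARD[code]
    return CUSTOM.get(code)
```

After loading, CUSTOM holds only "X", and repeated calls to symbol_for("X")
return the same object, which is what makes the custom symbols behave like
singletons.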

JT

_______________________________________________
This mind intentionally left blank