[Biococoa-dev] BCNucleotideRNA

Sun Sep 26 05:38:41 EDT 2004

Part two ;-)

First the email of Koen:

>>> - Is it necessary to duplicate all the code from BCNucleotideDNA,
>>> except for the uridine instead of thymidine? Why not have a class
>>> BCNucleotide that takes care of everything, and just have two
>>> subclasses with the small differences?
>> That was my long-term intention.  I just wanted to get things working
>> quickly so that I could get translation and codons handled.  Once I was
>> happy with how they worked, I figured I could go back and sort out a
>> better class structure.  Even so, all the base methods have to be
>> separate for DNA and RNA, as they need to return something different so
>> that == works as expected, so the classes won't shrink as much as I'd
>> have hoped.
>
> I still think we can add a BCNucleotide class, because there are a 
> couple of methods that are exactly the same in both classes.

Yep, I saw that as well. For instance, one of the problems encountered 
was that the +unmatched and -matchesTriplet calls aren't there in 
BCCodon but do exist in BCCodonDNA and BCCodonRNA. If we had a general 
BCNucleotide class, the BCCodonDNA and BCCodonRNA classes would become 
completely unnecessary, as all BCCodon methods could then be the same 
for both types of nucleotides. A hierarchy of BCSymbol --> BCNucleotide 
--> BCNucleotideDNA / BCNucleotideRNA makes more sense if it saves a 
lot of divergence in other classes like BCCodon, and probably in a lot 
of tools.

> I had another look at BioJava to see how they solve this. Actually, 
> they only have a Symbol class, no SymbolDNA, SymbolRNA, 
> SymbolAminoAcid, etc. The same for Sequence, no subclassing. What they 
> do is pass the right Alphabet (I think this is what Alex is working on 
> in BCSymbolSet) to create a sequence, and you know you're dealing with 
> DNA, RNA, Protein, etc.  It uses a couple of more classes (Alphabet, 
> AlphabetManager, SequenceFactory, etc), but is a very elegant 
> solution, IMO.

Yes, it is. Still, that doesn't mean ours isn't elegant as well ;-) 
Seriously, we discussed this way back, and I pointed to the way BioJava 
does things. The alphabet stuff is indeed quite nice, but overall I 
like the way we have done things just as much. In fact, using class 
methods to generate shared instances, instead of factory objects, is 
the more elegant approach in my opinion.

I think the question is not so much whether type checking is a problem. 
What we should try to do is design our methods so that they work as 
generally as possible. The BCNucleotide discussion is a nice example. 
You could have two codon translators, one for DNA and one for RNA, but 
with the nucleotide class we can make a single translator for 
nucleotides. It's key to keep this kind of solution in mind all the 
time.

The BCSymbolSet is indeed a way to create Alphabets, which can come in 
handy during all kinds of situations, but the way we have different 
nucleotides has advantages as well.

Somehow I missed or didn't receive John's reply, and I'm not sure 
whether the parts Koen quotes contain the entire mail, so apologies in 
advance if I miss anything critical in my comments...

>> Just to let you guys know, the grant I run a database for is up for
>> renewal, and I've got a huge rebuild coming up over the next few days.
>> I probably won't be able to do anything code-wise until the end of
>> next week.
>>
>
> Good luck with the grant!
Let me second that!

>> Looking up methods in an ObjC superclass is significantly slower
>> than having them in the object being used.
>
> I didn't know that. Is this only for ObjC, or also for other OO 
> languages, such as C++ and Java?

I was curious as well, and did some searching on the web on the 
Objective-C runtime. I found this snippet:

The method is looked up BY NAME, first in the class and then in each 
superclass in order. Once a successful lookup has occurred (or failed 
to occur) the associated function pointer (or pointer to the error 
function) is cached so that future invocations are fast (~ 2-3x a 
simple function call).

So it seems you're right in principle, but given that most of the 
methods we're talking about are only one superclass up, and that the 
function pointer is cached after the first call, I can hardly imagine 
that you would notice any of this in practice.

>> Elegant implementation != elegant use.  A single class looks cleaner,
>> but all the complementation/representation abilities in the bases are
>> extremely useful, but make no sense applied to amino acids.
>
> That's where wrappers or decorators come in handy, like we already do 
> for translation and mass calculation, and later for digesting and 
> other stuff.
Yep.

>> There's also going to be a bunch of situations where you'd have to
>> waste time asking what type of sequence something is, which I
>> personally would find extraordinarily inelegant ;).
>
> If you do it just once before each calculation, I don't see how much 
> extra time that would take.  Check the sequence once, then do 
> something with 1000 symbols. You already are using a lot of type 
> testing in your nucleotide classes, so one more test wouldn't hurt, I 
> guess. In fact it could speed things up: if we know we are dealing 
> with a DNA sequence (because it was created by using a strict DNA 
> BCSymbolSet), then there is no need to test each symbol as well.
This is certainly true. As I said, I'm not particularly in favour of 
switching to an alphabet-based system, but it would certainly be handy 
to have a quick and inexpensive way of knowing whether a sequence is 
strict, as that would let us skip the per-symbol tests.
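
As a rough sketch of the "check once, then process all symbols" idea, 
again in plain C: the sequence is validated a single time when it is 
created, and the complement routine then tests one flag instead of 
every symbol. The Sequence struct, sequence_make() and 
complement_strict() are hypothetical names and have nothing to do with 
the real BCSequence or BCSymbolSet code; the strict DNA alphabet is 
assumed to be just A, C, G and T.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sequence type; not the real BCSequence. */
typedef struct {
    const char *symbols;
    size_t      length;
    bool        strict_dna;   /* true if only A, C, G, T occur */
} Sequence;

/* Validate against the strict DNA alphabet once, at creation time. */
Sequence sequence_make(const char *symbols)
{
    assert(symbols != NULL);
    Sequence s = { symbols, strlen(symbols), true };
    for (size_t i = 0; i < s.length; i++)
        if (strchr("ACGT", symbols[i]) == NULL) {
            s.strict_dna = false;
            break;
        }
    return s;
}

/* Complement a strict DNA sequence into out, which must have room for
   length + 1 chars. Returns false, leaving out untouched, if the
   sequence is not strict DNA. */
bool complement_strict(const Sequence *s, char *out)
{
    static const char pair[256] = {
        ['A'] = 'T', ['T'] = 'A', ['C'] = 'G', ['G'] = 'C'
    };

    if (!s->strict_dna)       /* one test for the whole sequence */
        return false;

    for (size_t i = 0; i < s->length; i++)   /* no per-symbol checks */
        out[i] = pair[(unsigned char)s->symbols[i]];
    out[s->length] = '\0';
    return true;
}
```

For a strict sequence such as "ACGTTGCA" the loop runs without any 
per-symbol type test; a sequence containing anything else, such as a U, 
is rejected up front and would need a slower, checking code path.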

>> Although we can't prevent stupidity (and I'm an occasional
>> practitioner), I think we should try to design our classes to make
>> some of these things a bit harder to do.  Separating classes, some
>> basic error testing, and using ID as little as possible should help a
>> great deal.
>
> I agree :)
>
>> In most cases, it is probably a balancing act where there's no single
>> right answer, and the elegance/practicality decision will just be made
>> by whoever writes the class.
> I'm not sure if I agree here. I think we should agree on a similar 
> approach for all classes. That would make things more transparent not 
> only for us, but also for users.

I definitely second this. Consistency rules! It may lead to a lot of 
discussion, but in the end we have to decide how to do something and 
then do it that way in all similar situations. Also, if we later find a 
better solution, we should go back and change all classes to work the 
same way. That makes it much easier to learn how things work and for a 
developer to implement things, and it is also the only way in which we 
can optimize things together. If all implementations were different, it 
would be impossible to improve those made by others.

Cheers,
Alex

*********************************************************
                     ** Alexander Griekspoor **
*********************************************************
               The Netherlands Cancer Institute
               Department of Tumorbiology (H4)
          Plesmanlaan 121, 1066 CX, Amsterdam
                   Tel:  + 31 20 - 512 2023
                   Fax:  + 31 20 - 512 2029
                   E-mail: a.griekspoor at nki.nl
                   AIM: mekentosj at mac.com
               Web: http://www.mekentosj.com

                  EnzymeX - To cut or not to cut
              http://www.mekentosj.com/enzymex

*********************************************************