[Biococoa-dev] Sequence Structure

Mon Jul 11 14:08:05 EDT 2005

> I seem to be doing my best thinking on the subway these days - on the
> commute in, I thought about how to possibly handle this, and here's a
> potential solution:
>
> We do create a lightweight, high performance sequence object that's
> untyped. Basically, it acts as a specialized NSArray for sequences.
> The tools focus on working with this object, since they will be
> performing the processor intensive operations, and this is designed
> for performance. I rework the existing sequence subclasses to be
> holders for this. Convenience calls through to the tools put a
> "smart" interface on the otherwise stupid sequence object.
>
> This is not ideal, as it creates a lot more call-throughs to another
> class. That's not such a problem, though, as most of those
> call-throughs would have gone to NSArray or tool classes in the
> current structure anyway. It also creates design decisions - when a
> file is read, does it create a sequence or a typed sequence holder?
> Should we create methods to do both? What about annotated sequences -
> should they hold one or both types of sequence objects? Fortunately,
> the option of creating the appropriate type of sequence object on the
> fly should let us keep both around, as needed.

Lately, the consensus was that we should not have both untyped and
typed sequence classes at the same time, because it is confusing for
the user and even for the developers of the framework. I personally
don't think it would be that confusing if things are clearly explained
and/or exposed at different levels. For instance, typed sequences could
be for "the experts". Kind of like CFArray and NSArray (which, BTW, are
toll-free bridged).

Then there are different ways to implement this. The structure that
we have now is one. What you are proposing is another, and might be
easier to understand, at least from the BioCocoa developer point of
view. The important thing is that the two worlds (typed and untyped)
are separate from the user and compiler perspective, BUT that we avoid
code duplication in the implementation. This is a hard challenge. The
only way to do it is indeed to either wrap one of the objects inside
the other like you propose, or use the placeholder trick I set up in
the current design; in other words, one of the objects is the "real"
one, the "master" implementation, and the other is just using it and
putting a fake interface in front of it. So in the end, the public
interface looks like there are two different kinds of objects. But
internally, there is really only one, so that any change in the
implementation of the 'master' object is automatically picked up by
the other one.

To come back to your proposal, it is the mirror image of the current
implementation. Currently, the typed classes are the "real" objects,
while BCSequence is just a placeholder that internally generates
these typed objects. You are proposing the opposite: the single,
one-for-all BCSequence class is where the implementation lives, and
the typed classes would just be wrappers around instances of it.

I do think the concept would be easier to understand than the way it  
is now. Maybe we could work it out to be like an "extension" not  
included in the BCFoundation header, but just as an additional header  
(the binary could still be part of BCFoundation, so the user would
not have to link against an additional framework, but would simply
#import an additional header, purely for the compiler's benefit). This
header would declare the following classes: BCTypedSequence (root,  
inherits from NSObject), BCDNASequence, BCRNASequence,... It is  
important that these classes are not subclasses of BCSequence,  
because type-specific methods such as '-complement' are declared in  
the BCSequence header and will be recognized as valid for all  
subclasses. And you don't want the compiler to think that  
BCProteinSequence can respond to the message.

This would work with the wrapper design you propose. I see two
problems with the wrapper design, though:
* you add an additional layer to the call stack, which you mention;
in most cases, it should be OK and won't have much effect on
performance; but it is still there
* more problematic is that for every method of BCSequence, such as
'-complement', '-reverse', 'subsequence',... you need to write a method
for the wrapper that calls the BCSequence method. This is a lot of
code. One way around it is to use the -forward trick (message
forwarding via -forwardInvocation:), but that adds a lot of overhead
and may not be that easy to set up (we could certainly consider it,
though).
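
To make the forwarding option concrete, here is a rough sketch of what
a typed wrapper could look like. It is only an illustration under
several assumptions: it is written for a current clang with ARC rather
than what we build with today, BCSequence is reduced to a toy stand-in,
and the method names (-initWithString:, -sequenceString, -reverse) and
the category name are made up for the example, not the real BioCocoa
API.

```objc
#import <Foundation/Foundation.h>

// Toy stand-in for the untyped "master" class (hypothetical API).
@interface BCSequence : NSObject
@property (nonatomic, copy, readonly) NSString *sequenceString;
- (instancetype)initWithString:(NSString *)string;
- (BCSequence *)reverse;
@end

@implementation BCSequence
- (instancetype)initWithString:(NSString *)string {
    if ((self = [super init])) {
        _sequenceString = [string copy];
    }
    return self;
}
- (BCSequence *)reverse {
    NSMutableString *reversed = [NSMutableString string];
    NSUInteger length = self.sequenceString.length;
    for (NSUInteger i = 0; i < length; i++) {
        // Walk the string from its last character to its first.
        unichar c = [self.sequenceString characterAtIndex:length - 1 - i];
        [reversed appendFormat:@"%C", c];
    }
    return [[BCSequence alloc] initWithString:reversed];
}
@end

// Typed wrapper: NOT a BCSequence subclass, it only holds one.
@interface BCDNASequence : NSObject
- (instancetype)initWithString:(NSString *)string;
@end

// Declarations for the compiler only; the wrapper never implements them.
@interface BCDNASequence (ForwardedToBCSequence)
- (NSString *)sequenceString;
- (BCSequence *)reverse;
@end

@implementation BCDNASequence {
    BCSequence *_sequence;  // the wrapped "real" object
}
- (instancetype)initWithString:(NSString *)string {
    if ((self = [super init])) {
        _sequence = [[BCSequence alloc] initWithString:string];
    }
    return self;
}
// The runtime asks for a signature before it builds the NSInvocation.
- (NSMethodSignature *)methodSignatureForSelector:(SEL)selector {
    NSMethodSignature *signature = [super methodSignatureForSelector:selector];
    return signature ? signature : [_sequence methodSignatureForSelector:selector];
}
// Anything the wrapper does not implement is re-sent to the wrapped object.
- (void)forwardInvocation:(NSInvocation *)invocation {
    if ([_sequence respondsToSelector:invocation.selector]) {
        [invocation invokeWithTarget:_sequence];
    } else {
        [super forwardInvocation:invocation];
    }
}
@end

int main(void) {
    @autoreleasepool {
        BCDNASequence *dna = [[BCDNASequence alloc] initWithString:@"ATGC"];
        // Neither message is implemented by the wrapper; both are forwarded.
        NSLog(@"%@", [dna sequenceString]);
        NSLog(@"%@", [[dna reverse] sequenceString]);
    }
    return 0;
}
```

On a Mac this should build with something like
`clang -fobjc-arc -framework Foundation wrapper.m`. Two things show up
even in this toy version: every forwarded call pays for building an
NSInvocation, and a forwarded method such as -reverse hands back the
untyped BCSequence rather than a new typed wrapper, so re-wrapping
results would still need hand-written code.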

Rather than a wrapper, I propose we use the placeholder trick ;-) All  
you have to write are the init methods, and return a BCSequence  
object from these. All the code can be in the superclass  
'BCTypedSequence', and the subclasses BCDNASequence,... are just  
empty shells, only there for the headers (actually, they might just  
need a trivial '-sequenceType' method that the superclass can call to  
do the right init). So, the instance returned by the init methods  
would in fact be a BCSequence object, ready to respond to all the  
methods implemented there. And it would respond to any method we add  
in the future without additional code (we would just have to keep the  
header in sync). Of course, one thing you can't do this way is to  
throw an exception when you call the wrong method on the wrong type  
of sequence, like calling '-complement' on a protein. But you get a  
compiler warning, which is the most important part. If you ignore it,  
you only get what you deserve if your app has a weird behavior!!
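
For comparison, here is a minimal sketch of the placeholder idea
described above. Same caveats as any quick illustration: it assumes a
current clang with ARC, BCSequence is a toy stand-in, and the
BCSequenceType enum, -initWithString:type:, -sequenceString and the
category names are invented for the example. The only names taken from
the proposal are the class names and '-sequenceType'.

```objc
#import <Foundation/Foundation.h>

typedef NS_ENUM(NSInteger, BCSequenceType) {  // hypothetical enum
    BCSequenceTypeOther,
    BCSequenceTypeDNA,
    BCSequenceTypeProtein
};

// Toy stand-in for the single "real" class holding all the code.
@interface BCSequence : NSObject
@property (nonatomic, readonly) BCSequenceType sequenceType;
@property (nonatomic, copy, readonly) NSString *sequenceString;
- (instancetype)initWithString:(NSString *)string type:(BCSequenceType)type;
- (BCSequence *)complement;
@end

@implementation BCSequence
- (instancetype)initWithString:(NSString *)string type:(BCSequenceType)type {
    if ((self = [super init])) {
        _sequenceString = [string copy];
        _sequenceType = type;
    }
    return self;
}
- (BCSequence *)complement {
    // Toy DNA complement; other symbols are copied through unchanged.
    NSDictionary<NSString *, NSString *> *pairs =
        @{@"A": @"T", @"T": @"A", @"G": @"C", @"C": @"G"};
    NSMutableString *result = [NSMutableString string];
    for (NSUInteger i = 0; i < self.sequenceString.length; i++) {
        NSString *base = [self.sequenceString substringWithRange:NSMakeRange(i, 1)];
        [result appendString:(pairs[base] ?: base)];
    }
    return [[BCSequence alloc] initWithString:result type:self.sequenceType];
}
@end

// Placeholder root: separate from BCSequence in the class hierarchy.
@interface BCTypedSequence : NSObject
- (BCSequenceType)sequenceType;  // subclasses override this
- (id)initWithString:(NSString *)string;
@end

// Methods valid for every type, declared for the compiler only.
@interface BCTypedSequence (ImplementedByBCSequence)
- (NSString *)sequenceString;
@end

@implementation BCTypedSequence
- (BCSequenceType)sequenceType { return BCSequenceTypeOther; }
- (id)initWithString:(NSString *)string {
    // Drop the placeholder and hand back the real object instead.
    return (id)[[BCSequence alloc] initWithString:string type:[self sequenceType]];
}
@end

// Empty shells: they exist only to give the compiler a type to check.
@interface BCDNASequence : BCTypedSequence
@end
@interface BCDNASequence (ImplementedByBCSequence)
- (BCDNASequence *)complement;  // declared for DNA only
@end
@implementation BCDNASequence
- (BCSequenceType)sequenceType { return BCSequenceTypeDNA; }
@end

@interface BCProteinSequence : BCTypedSequence
@end
@implementation BCProteinSequence
- (BCSequenceType)sequenceType { return BCSequenceTypeProtein; }
@end

int main(void) {
    @autoreleasepool {
        BCDNASequence *dna = [[BCDNASequence alloc] initWithString:@"ATGC"];
        NSLog(@"%@ -> %@", [dna sequenceString], [[dna complement] sequenceString]);
        NSLog(@"real class: %@", NSStringFromClass([dna class]));

        BCProteinSequence *protein = [[BCProteinSequence alloc] initWithString:@"MKV"];
        // [protein complement];  // flagged by the compiler: not declared for proteins
        NSLog(@"%@", [protein sequenceString]);
    }
    return 0;
}
```

Here the typed classes contain no forwarding code at all: the object
that comes back from init really is a BCSequence, so any method added
there later works as soon as the typed header declares it. The flip
side is the one already mentioned: nothing stops the call at runtime
if someone ignores the compiler and sends -complement to a protein.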

Let me add the mandatory OmniGraffle thingie:
http://cmgm.stanford.edu/~cparnot/temp/typed-sequences.png

What do you think?

charles

--
Xgrid-at-Stanford
Help science move fast forward:
http://cmgm.stanford.edu/~cparnot/xgrid-stanford

Charles Parnot
charles.parnot at gmail.com