[Biococoa-dev] BCSequence class cluster

Wed Jan 5 02:56:13 EST 2005

It seems the class cluster possibility has raised some interest. So I 
took some time to think it through and write some code. I got carried 
away and wrote a lot of it, and also I wrote this long email, but now 
you are used to those long emails:-)

Note that I am just proposing an implementation of a class cluster, 
and some solutions to potential pitfalls, but I am not saying that 
you should absolutely go with the class cluster design. I am a little 
biased in favor of it, but you should really decide if (1) you want 
to discuss it further and (2) discuss it further! Note that I mostly 
say 'you' when I talk about the developers, but maybe at some point, 
I should really start saying 'we' ;-) Anyway, for every sentence you 
read below, mentally add at the beginning "I may very well be wrong 
or missing something but it seems to me that maybe...".

Like I said before, several of the issues raised here apply to the 
existing code and you will have to deal with them at some point. The 
main point boils down to the question of using a weakly typed object 
BCSequence vs using strongly typed objects belonging to one of the 
subclasses BCSequenceDNA/RNA/etc... Some of the code is a bit 
schizophrenic right now and tries to deal with both cases... The 
class cluster would favor the weakly typed route, and would make the 
design more consistent and simpler.

To follow the discussion, you can download a zipped Xcode project 
with some real code here:
http://cmgm.stanford.edu/~cparnot/temp/BCSequenceClassCluster.zip
Don't try to compile, it probably won't succeed. It is just easier to 
navigate the code in this familiar format.

OK, so what would a class cluster look like?

1. The user point of view
----------------------

For the user, there is only one class, called BCSequence. Instances 
are immutable and can be obtained with a number of factory methods, 
or using alloc followed by init methods. These are defined in the 
only header file accessible to the user, BCSequence.h (see attached 
project).

From the user point of view, the usage is very simple: just create a 
sequence with one of the numerous factory or init methods, including 
reading from files. The instance you get back is immutable, but you 
can create new instances from it by removing/adding pieces, or 
transforming it to another type. You can always check the type and 
length, get the sequence back into a string or array of symbols. You 
can feed tools with that BCSequence instance and get the results, 
potentially getting back other instances of BCSequence.

There are 2 things the user could complain about:
   a- Some of the methods are only relevant for certain sequence types
   b- Sequence objects are immutable

About complaint (a)
In the header file BCSequence.h of the attached project, there are 2 
methods that are only relevant to a subset of the BCSequence types: 
-complement and -reverseComplement. This is not a really big concern 
at this point, because these are just 2 methods and it is quite easy 
to return something for all cases (for a protein, probably just return 
itself). But more methods in BCSequence or in the BCTools could raise 
the same issue. For instance, BCToolDigest would only make sense on a 
DNA sequence when using restriction enzymes.
The class BCSequence would always return something, empty sequences 
in the worst case, leaving the troubles to the runtime. This is the 
only appropriate way to handle it with the class cluster design, 
maybe together with some error codes/handling mechanism.
But the user may want to be more specific about the BCSequence type 
and get some compiler warnings when appropriate, instead of leaving 
it to the runtime. The user might be ready to give up the simplicity 
of a unique class and use more specific types. This is the issue of 
weak vs strong typing, which relates to the issue of compiler vs 
runtime errors/warnings.
One possible answer is to say to the user: this is the way it is, 
just accept it!! And I believe as a first version, it is really OK. 
But there are also some ways to give the user the possibility to 
choose between strong and weak typing and keep the class cluster 
design, that I will explain later, below.

About complaint (b)
I thought of enforcing immutability as a starting point, as this is 
easier on the developer side to deal with immutable objects. Giving 
the option of immutability to the user is anyway a good thing, as it 
allows a number of optimizations, that could really pay off in a real 
application with lots of copying, ref passing,...
Of course, it is nice to also have mutable objects. I will address 
that on the developer point of view (see below). Note that 
ultimately, one thing would probably always be immutable: the 
sequence type.

2. Implementing the class cluster
------------------------------

The class cluster that I implement in the attached project looks very 
much like what you have already done. There is a superclass 
BCSequence, and then subclasses, BCSequenceDNA, 
BCSequenceRNA,...etc... plus a new special subclass 
BCSequenceFactory. Now the purpose of a class cluster is that the 
user just does everything using the public interface for BCSequence, 
and as far as the user is concerned, every object is an instance of 
BCSequence. But under the hood, you actually return instances of one 
of the subclasses so that some operations can be optimized for the 
particular type of sequence you are dealing with.

The problem for the developer of a class cluster is that you know 
which subclass to use only once you call one of the init methods, but 
you still have to do the 'alloc' before the init. There is no way 
BCSequence will know what subclass it should use at the time 'alloc' 
is called. So the trick is to alloc a temporary instance of a 
particular subclass, a 'placeholder' class. Look at the 
implementation of 'alloc' in BCSequence.m. What this method returns 
is actually an instance of BCSequenceFactory when called on the 
superclass (when called on one of the subclasses, though, it just 
passes the message up to NSObject). The bottom line is: you never 
create an instance of BCSequence, but an instance of 
BCSequenceFactory (you still alloc instances of BCSequence 
subclasses, of course). In fact, that BCSequenceFactory instance 
could be a singleton and never deallocated if we changed the code a 
little bit.

Then one of the init methods is called on that new 
BCSequenceFactory instance. This method actually allocs and inits a 
new object, an instance of the appropriate subclass. It then releases 
self and returns a pointer to the new object created. Because the user 
should always use the value returned by init to set her pointers, 
she will get the right object in the end.

To summarize, what happens when the user runs the following command:
BCSequence *mySeq = [[BCSequence alloc] initWithDNAString:aString];

You have the following happening
* [BCSequence alloc] returns an instance of BCSequenceFactory
* the message initWithDNAString:aString is sent to the 
BCSequenceFactory instance
* in the method, a second object is created by calling
	finalObject=[[BCSequenceDNA alloc] initWithString:aString]
* then the method calls [self release] to destroy the original 
BCSequenceFactory instance
* then the method returns the finalObject
* so now mySeq=finalObject and is an instance of BCSequenceDNA

You get the same process when the user calls:
BCSequence *mySeq = [[BCSequence alloc] initWithString:aString];
except BCSequenceFactory first figures out to what subclass it should 
send the 'initWithString' message (using the same code as the 
original BCSequenceFactory).
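
To make the steps above concrete, here is a minimal sketch of the 
alloc/init trick. It is not the code from the attached project: the 
storage is reduced to a plain NSString, and the rule used to guess the 
type in initWithString: (only A, C, G, T, N means DNA, anything else 
means protein) is an assumption made up for the illustration. It 
assumes manual retain/release, i.e. no ARC.

```objc
#import <Foundation/Foundation.h>

// Public abstract superclass: the only class the user is meant to see.
@interface BCSequence : NSObject
- (id)initWithString:(NSString *)aString;    // sequence type is guessed
- (id)initWithDNAString:(NSString *)aString; // sequence type is forced
- (NSString *)sequenceString;
@end

// Private concrete subclasses; storage is reduced to one NSString here.
@interface BCSequenceDNA : BCSequence { NSString *string; }
@end
@interface BCSequenceProtein : BCSequence { NSString *string; }
@end

// Private placeholder class: what [BCSequence alloc] really returns.
@interface BCSequenceFactory : BCSequence
@end

@implementation BCSequence
+ (id)alloc
{
    // Only the abstract superclass is swapped for the placeholder;
    // subclasses fall through to the normal NSObject allocation.
    if (self == [BCSequence class])
        return [BCSequenceFactory alloc];
    return [super alloc];
}
- (id)initWithString:(NSString *)aString { return [self init]; }
- (id)initWithDNAString:(NSString *)aString { return [self initWithString:aString]; }
- (NSString *)sequenceString { return @""; }
@end

@implementation BCSequenceDNA
- (id)initWithString:(NSString *)aString
{
    if ((self = [super init]))
        string = [aString copy];
    return self;
}
- (NSString *)sequenceString { return string; }
- (void)dealloc { [string release]; [super dealloc]; }
@end

@implementation BCSequenceProtein
- (id)initWithString:(NSString *)aString
{
    if ((self = [super init]))
        string = [aString copy];
    return self;
}
- (NSString *)sequenceString { return string; }
- (void)dealloc { [string release]; [super dealloc]; }
@end

@implementation BCSequenceFactory
- (id)initWithDNAString:(NSString *)aString
{
    id finalObject = [[BCSequenceDNA alloc] initWithString:aString];
    [self release];     // destroy the placeholder
    return finalObject; // the caller gets the concrete instance
}
- (id)initWithString:(NSString *)aString
{
    // Hypothetical guessing rule, for illustration only
    NSCharacterSet *notDNA = [[NSCharacterSet
        characterSetWithCharactersInString:@"ACGTNacgtn"] invertedSet];
    BOOL looksLikeDNA =
        ([aString rangeOfCharacterFromSet:notDNA].location == NSNotFound);
    Class finalClass = looksLikeDNA ? [BCSequenceDNA class]
                                    : [BCSequenceProtein class];
    id finalObject = [[finalClass alloc] initWithString:aString];
    [self release];
    return finalObject;
}
@end
```

With this in place, [[BCSequence alloc] initWithDNAString:aString] 
hands back a BCSequenceDNA even though the user only ever names 
BCSequence. Making the placeholder a singleton would mean returning a 
shared instance from alloc and dropping the [self release] calls.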

Then all the other methods are just convenience methods calling these 
building blocks.

Like for any superclass/subclass pattern, it is important to define 
what methods the subclasses should, may or should not override, and I 
have a summary of that in the attached project. It is very similar to 
what you have already done.

3. Pros and cons
---------------

What are the potential pitfalls and limitations:
(a) how to still provide the user with some more static typing when 
she wants more control over it? This is complaint (a) of part (1) 
above.
(b) how to provide mutable/immutable versions? This is complaint (b) 
of part (1) above.
(c) the class cluster assumes all the methods can be called on all 
the subclasses. Will that always be relevant? The case of 
'complement' is already a bit troublesome, and how about even worse 
cases, like 'digestWithRestrictionEnzyme:'. It does not make any 
sense for a protein, does it? The question is really: how does that 
fit with the BCTools? Could problems arise as we define more and more 
tools? Will it be that easy to add more private subclasses without 
breaking the existing code?
(d) What about the recent developments: does BCSymbolList fit in the 
picture? how do you add the annotation stuff to that?

I have answers to all of these, and I will come back to these 
different points below, in other parts of my email. And there might 
be other pitfalls I don't see yet.

But first, while writing the code and thinking about the whole 
concept, I also realized the potential benefits of a class cluster, 
and there are more than what I anticipated. Some of these benefits 
are really the benefits you get from OO, but are even more apparent 
with such a simple interface where things are even more encapsulated 
because it is almost like you have just one class:
* super simple interface for the user; she also gets the benefit of 
polymorphism without the need to know the existence of all the 
subclasses;
* because the public interface is reduced, the developer can make 
plenty of changes without breaking existing code developed by the user
* in particular, it allows the addition of new types of sequences or 
optimized subclasses for particular uses, that may in most cases 
already work with the code developed by the user; so the user can get 
new functionality for free
* the same is true for code developed by the developers of the framework:
- developers can work on other parts of the framework without knowing 
too much about the guts of BCSequence
- by relying on just one class for interactions between the different 
pieces of BioCocoa, it simplifies the development and minimizes 
disruptions as modifications are made to BCSequence

I remember in the discussions, there was some disagreement about 
having subclasses (Alex's choice) or just one class which would 
decide what to do depending on the symbolSet used (Koen's choice); 
maybe a class cluster is a way to have many of the benefits of the 2 
systems without too many of the problems.
More about pros and cons of class cluster on the Apple web site:
http://developer.apple.com/documentation/Cocoa/Conceptual/CocoaObjects/Articles/ClassClusters.html

For me, the bottom line is still unclear. At present, I feel that a 
class cluster would work really well. But we have to anticipate now 
all the potential problems, and we should decide if it is worth it.

4. Compile vs runtime errors
--------------------------
This is a discussion about complaint (a) of part (1) and pitfall (a) 
of part (3). What if the user wants more control over the type of 
sequence she is using and wants some compiler warnings when trying to 
cut a protein with EcoRI, or get its complementary sequence?

At this point, the class cluster does not allow that. All the methods 
are valid for all the sequence types. In this context, an invalid 
call will only be revealed at runtime, and a BCProtein object would 
have to decide at runtime to return something when sent an irrelevant 
message. What should it send back? This issue is actually slightly 
different from the discussion here and is discussed in part 6 (sorry 
this whole email is quite large and complicated; I am trying to keep 
it readable!). The question here is really: can we prevent that from 
even happening when the user knows what type of sequence she is 
dealing with and could get compiler warnings?

One way to help with that is to provide an additional set of headers 
defining some public classes named BCSequenceDNA, BCSequenceRNA,.... 
These classes would just be placeholders, and would be completely 
distinct from the subclasses of BCSequence (I will come back to the 
name conflict). They would have some init methods, but when the user 
uses these classes and alloc/init an instance, she would get in fact 
one of the BCSequence subclasses. The compiler would not know and 
would trust the headers to generate warnings. For instance, the header 
for the BCSequenceProtein placeholder class would not define the 
methods 'complement' or 'cutWithRestrictionEnzyme:', and you would 
get a compiler warning even though the object would in fact respond 
to the methods at runtime (but would have to return some dummy 
values). So these headers would really define completely virtual 
classes. One problem is that the names of these placeholder classes 
conflict with the names of the BCSequence private subclasses that are 
defined in the project I sent. We could rename the latter to 
BCSeqDNA/RNA/... for example, and keep the nice full names 
'BCSequenceDNA/RNA/...' for the placeholder public classes.

An alternative is to define protocols, and so the user would have to 
use (id <BCSequenceDNA>) in the code. The BCSequence would provide 
methods to return objects typed this way. It is a bit of a pain to 
type id <BCSequenceDNA> all the time and reduces readability, though.
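
As a rough sketch of the protocol alternative: the protocol name 
BCSequenceDNA comes from the paragraph above, but BCSequenceAny, the 
two factory methods and the string-based storage are all hypothetical 
names invented for the example. The placeholder trick is left out to 
keep it short, and manual retain/release is assumed.

```objc
#import <Foundation/Foundation.h>

// Hypothetical protocols: one for every sequence, one for DNA only.
@protocol BCSequenceAny <NSObject>
- (NSString *)sequenceString;
@end

@protocol BCSequenceDNA <BCSequenceAny>
- (id <BCSequenceDNA>)complement;
@end

// Simplified public class.
@interface BCSequence : NSObject <BCSequenceAny>
{
    NSString *string;
}
- (id)initWithString:(NSString *)aString;
// Hypothetical methods returning protocol-typed objects
+ (id <BCSequenceDNA>)DNASequenceWithString:(NSString *)aString;
+ (id <BCSequenceAny>)proteinSequenceWithString:(NSString *)aString;
@end

// Private subclasses, with the short BCSeq... names.
@interface BCSeqDNA : BCSequence <BCSequenceDNA>
@end
@interface BCSeqProtein : BCSequence
@end

@implementation BCSequence
- (id)initWithString:(NSString *)aString
{
    if ((self = [super init]))
        string = [aString copy];
    return self;
}
- (void)dealloc { [string release]; [super dealloc]; }
- (NSString *)sequenceString { return string; }
+ (id <BCSequenceDNA>)DNASequenceWithString:(NSString *)aString
{
    return [[[BCSeqDNA alloc] initWithString:aString] autorelease];
}
+ (id <BCSequenceAny>)proteinSequenceWithString:(NSString *)aString
{
    return [[[BCSeqProtein alloc] initWithString:aString] autorelease];
}
@end

@implementation BCSeqDNA
- (id <BCSequenceDNA>)complement
{
    NSString *from = @"ACGT", *to = @"TGCA";
    NSMutableString *result = [NSMutableString string];
    NSUInteger i;
    for (i = 0; i < [string length]; i++) {
        NSString *base = [string substringWithRange:NSMakeRange(i, 1)];
        NSRange hit = [from rangeOfString:base];
        // symbols other than A, C, G, T are copied unchanged
        [result appendString:(hit.location == NSNotFound)
            ? base : [to substringWithRange:NSMakeRange(hit.location, 1)]];
    }
    return [BCSequence DNASequenceWithString:result];
}
@end

@implementation BCSeqProtein
@end
```

With these declarations, sending complement to a variable typed 
id <BCSequenceAny> makes the compiler warn that the method is not 
found in the protocol, while the same message to an 
id <BCSequenceDNA> compiles silently.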

So there are ways to solve the problem. Note that the problem is not 
really tied to the class cluster implementation and is already partly 
a problem that the current code is facing, as I talked about at the 
very beginning of the email (OK, now is a good time to reread 
everything!!).

Of course, the interface then becomes a bit schizophrenic, so it may 
not be such a good idea to allow all of that. At least in the 
beginning, there may be not such a high need for stronger typing, and 
this goes a bit against the whole idea of a simple interface and a 
class cluster.

5. Mutable and immutable instances
--------------------------------
This is a discussion about complaint (b) of part (1) and pitfall (b) 
of part (3).

Why impose immutable objects? Not sure.
This is not something I had thought of at first, but it is anyway an 
important issue that goes beyond the idea of class cluster. Immutable 
objects allow very important and basic optimizations, particularly 
when copying objects, and are sufficient for most uses. A smart user 
will use immutable objects whenever she can and will only go to 
mutable objects if really necessary. This is something we may have to 
think about for the BioCocoa project anyway. I am not saying it is 
absolutely necessary but it should be discussed (and maybe it has 
been??).

To implement mutable objects in the class cluster could be a bit 
tricky, because there are two conflicting subclass organizations 
here: mutable/immutable and dna/rna/protein/codon. To get all the 
combinations, it seems that we need 8 subclasses!!

I am not completely sure how to deal with it, or if we should deal 
with it or just give up and stick to mutable only. One possibility is 
to not have distinct subclasses for mutable/immutable. Instead, there 
could be simply a BOOL flag 'isMutable' as one of the instance 
variables. The object would then return different results in key 
methods such as 'copy' depending on the value of the flag. Also, at 
creation, it would create mutable or immutable instance variables 
(NSArray or NSMutableArray) depending on the value of that flag. It 
is OK to declare the instance variable with the mutable type and then 
actually store an immutable object in it, as long as we are 
consistent in the methods called to avoid runtime errors (and we 
should use some casts to avoid compiler warnings).
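
A minimal sketch of the flag idea, on a single simplified class 
rather than the full cluster. The initializer initWithSymbols:mutable: 
and the appendSymbol: method are hypothetical names, and the choice 
that copying a mutable sequence gives an immutable snapshot is an 
assumption borrowed from the usual Cocoa convention. Manual 
retain/release is assumed.

```objc
#import <Foundation/Foundation.h>

@interface BCSequence : NSObject <NSCopying>
{
    BOOL isMutable;
    // Declared with the mutable type, but holds a plain NSArray
    // when isMutable is NO.
    NSMutableArray *symbols;
}
- (id)initWithSymbols:(NSArray *)someSymbols mutable:(BOOL)flag;
- (BOOL)isMutable;
- (NSUInteger)length;
- (void)appendSymbol:(id)aSymbol;
@end

@implementation BCSequence
- (id)initWithSymbols:(NSArray *)someSymbols mutable:(BOOL)flag
{
    if ((self = [super init])) {
        isMutable = flag;
        if (flag)
            symbols = [someSymbols mutableCopy];
        else
            symbols = (NSMutableArray *)[someSymbols copy]; // cast, see above
    }
    return self;
}
- (void)dealloc { [symbols release]; [super dealloc]; }
- (BOOL)isMutable { return isMutable; }
- (NSUInteger)length { return [symbols count]; }

- (id)copyWithZone:(NSZone *)zone
{
    // Immutable: sharing the same object is safe, so just retain it.
    if (!isMutable)
        return [self retain];
    // Mutable: return an independent, immutable snapshot.
    return [[[self class] allocWithZone:zone] initWithSymbols:symbols
                                                      mutable:NO];
}

- (void)appendSymbol:(id)aSymbol
{
    // The flag guards every mutating method, so that the immutable
    // NSArray never receives an NSMutableArray message.
    if (!isMutable)
        [NSException raise:NSInternalInconsistencyException
                    format:@"attempt to mutate an immutable sequence"];
    [symbols addObject:aSymbol];
}
@end
```

The cheap copy of the immutable case is the kind of optimization 
mentioned earlier: copying costs one retain instead of duplicating the 
whole symbol array.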

6. Potential clashes in the future
--------------------------------
This is a discussion about pitfall (c) of part (3).
The problem is: will the class cluster ever become a problem in the 
future and force us to rewrite everything and lose our sleep?
The short answer is: I don't know!

I guess any pattern can get in the way in some unpredicted way at 
some unpredicted point in the future. We can try to anticipate those 
issues. In the case of the class cluster, some of the questions to 
answer are obviously: how do we deal with irrelevant messages sent to 
inappropriate subclasses, such as sending 'complement' to a 
BCSequenceProtein? how frequent will these messages be? how do we 
deal with new sequence types that could be introduced later? how 
frequently will new sequence types be needed?

The answer to that is to list as many as we can of the methods that 
would have to go in the final implementation of BCSequence and see 
how the current sequence types could deal with it. Also, we would 
have to think about what other types of sequences could be added in 
the future (which could be inspired by other BioX projects) and hope 
that a future BCSequenceExtraterrestrial won't break everything. This 
may have already been discussed earlier on the mailing list?

Some examples of how to deal with irrelevant methods:
* complement of a protein: return the same sequence; return an empty 
sequence; return nil??
* cut a protein with EcoRI: OK, this is easy, you just get the same 
protein!! Or do you get the sequence of the EcoRI protein!!!
* etc...
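
For the first option (return the same sequence), a minimal sketch 
could put the harmless default in the superclass and let only the 
relevant subclass override it. The string-based storage and the helper 
function BCComplementOfBase are assumptions made for the example, not 
the existing BCSymbol machinery, and the placeholder trick is left 
out. Manual retain/release is assumed.

```objc
#import <Foundation/Foundation.h>

@interface BCSequence : NSObject
{
    NSString *string;
}
- (id)initWithString:(NSString *)aString;
- (NSString *)sequenceString;
- (BCSequence *)complement; // only meaningful for some types
@end

@interface BCSequenceDNA : BCSequence
@end
@interface BCSequenceProtein : BCSequence
@end

// Hypothetical helper: complement of a single DNA base.
static unichar BCComplementOfBase(unichar c)
{
    switch (c) {
        case 'A': return 'T';
        case 'T': return 'A';
        case 'C': return 'G';
        case 'G': return 'C';
        default:  return c; // unknown symbols are left untouched
    }
}

@implementation BCSequence
- (id)initWithString:(NSString *)aString
{
    if ((self = [super init]))
        string = [aString copy];
    return self;
}
- (void)dealloc { [string release]; [super dealloc]; }
- (NSString *)sequenceString { return string; }
- (BCSequence *)complement
{
    // Default for every type where the message is irrelevant:
    // hand back the same sequence.
    return self;
}
@end

@implementation BCSequenceDNA
- (BCSequence *)complement
{
    NSUInteger i, n = [string length];
    NSMutableString *result = [NSMutableString stringWithCapacity:n];
    for (i = 0; i < n; i++)
        [result appendFormat:@"%C",
            BCComplementOfBase([string characterAtIndex:i])];
    return [[[BCSequenceDNA alloc] initWithString:result] autorelease];
}
@end

@implementation BCSequenceProtein
// Nothing to override: a protein inherits the "return self" default.
@end
```

Switching to another policy (empty sequence, nil) would then be a 
one-line change in the superclass.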

The existing code will have to deal with this anyway. When I look at 
the present code, I see you can return BCSequence objects without 
knowing the type, as returned by 'sequenceWithString:' in the 
BCSequenceFactory class. And then, this is allowed to get in the 
BCToolComplement with the method 'complementToolWithSequence:'. What 
if the BCSequence created is a protein? The abstraction that you did 
encode in BCSymbol already allows you to deal with it, you did a 
great job!

7. Full incorporation of the present implementation
----------------------------------------------
This is a discussion about pitfall (d) of part (3).

The implementation I attached to the email is quite basic and could 
be further refined to incorporate the features and organization of 
the current implementation and the short-term planned additions. The 
current class tree can probably be used as is. One problem is the 
name BCSequence would be taken for the superclass; this is probably 
the name that should be public. Then we could have the following:
* BCSymbolList = subclass of BCSequence
* BCSeq = subclass of BCSymbolList with annotations
* BCSeqDNA, BCSeqRNA, etc... = subclasses of BCSeq with optimized 
methods for the different types of sequences

The additional benefit is that the instance variables would not even 
be in the public header anymore, but in the subclass BCSymbolList 
(and BCSequenceFactory would then be even lighter, with no instance 
variable at all). An alternative is to decide that BCSymbolList would 
actually be BCSequence, and the annotated BCSequence would become 
BCSeq.

It is thus mostly a problem of naming, which is somewhat secondary, 
but is still quite important because it would be here to stay and has 
to be easy to remember and logical...

An additional problem is that if you instantiate BCSymbolList (in the 
case of non-annotated sequences), you want to make sure that it can 
handle ALL the messages declared in the header. It is not clear to me 
yet that it can do it.

8. Happy new year!
------------------

... and thanks for reading this up to that point!

Charles

-- 
Charles Parnot
charles.parnot at stanford.edu

Help science go fast forward:
http://cmgm.stanford.edu/~cparnot/xgrid-stanford/

Room  B157 in Beckman Center
279, Campus Drive
Stanford University
Stanford, CA 94305 (USA)

Tel +1 650 725 7754
Fax +1 650 725 8021