[Biococoa-dev] BCSequence class cluster
Charles PARNOT
charles.parnot at stanford.edu
Wed Jan 5 02:56:13 EST 2005
It seems the class cluster possibility has raised some interest. So I
took some time to think it through and write some code. I got carried
away and wrote a lot of it, and also I wrote this long email, but now
you are used to those long emails:-)
Note that I am just proposing an implementation of a class cluster,
and some solutions to potential pitfalls, but I am not saying that
you should absolutely go with the class cluster design. I am a little
biased in favor of it, but you should really decide if (1) you want
to discuss it further and (2) discuss it further! Note that I mostly
say 'you' when I talk about the developers, but maybe at some point,
I should really start saying 'we' ;-) Anyway, for every sentence you
read below, mentally add at the beginning "I may very well be wrong
or missing something but it seems to me that maybe...".
Like I said before, several of the issues raised here apply to the
existing code and you will have to deal with them at some point. The
main point boils down to the question of using a weakly typed object
BCSequence vs using strongly typed objects belonging to one of the
subclasses BCSequenceDNA/RNA/etc... Some of the code is a bit
schizophrenic right now and tries to deal with both cases... The
class cluster would favor the weakly typed route, and would make the
design more consistent and simpler.
To follow the discussion, you can download a zipped Xcode project
with some real code here:
http://cmgm.stanford.edu/~cparnot/temp/BCSequenceClassCluster.zip
Don't try to compile, it probably won't succeed. It is just easier to
navigate the code in this familiar format.
OK, so what would a class cluster look like?
1. The user point of view
----------------------
For the user, there is only one class, called BCSequence. Instances
are immutable and can be obtained with a number of factory methods,
or using alloc followed by init methods. These are defined in the
only header file accessible to the user, BCSequence.h (see attached
project).
From the user point of view, the usage is very simple: just create a
sequence with one of the numerous factory or init methods, including
reading from files. The instance you get back is immutable, but you
can create new instances from it by removing/adding pieces, or
transforming it to another type. You can always check the type and
length, get the sequence back into a string or array of symbols. You
can feed tools with that BCSequence instance and get the results,
potentially getting back other instances of BCSequence.
There are 2 things the user could complain about:
a- Some of the methods are only relevant for certain sequence types
b- Sequence objects are immutable
About complaint (a)
In the header file BCSequence.h of the attached project, there are 2
methods that are only relevant to a subset of the BCSequence types:
-complement and -reverseComplement. This is not a really big concern
at this point, because these are just 2 methods and it is quite easy
to return something for all cases (for a protein, probably just return
itself). But more methods in BCSequence or in the BCTools could raise
the same issue. For instance, BCToolDigest: that would only make
sense on a DNA sequence when using restriction enzymes.
The class BCSequence would always return something, empty sequences
in the worst case, leaving the troubles to the runtime. This is the
only appropriate way to handle it with the class cluster design,
maybe together with some error codes/handling mechanism.
But the user may want to be more specific about the BCSequence type
and get some compiler warnings when appropriate, instead of leaving
it to the runtime. The user might be ready to give up the simplicity
of a unique class and use more specific types. This is the issue of
weak vs strong typing, which relates to the issue of compiler vs
runtime errors/warnings.
One possible answer is to say to the user: this is the way it is,
just accept it!! And I believe as a first version, it is really OK.
But there are also some ways to give the user the possibility to
choose between strong and weak typing and keep the class cluster
design, that I will explain later, below.
About complaint (b)
I thought of enforcing immutability as a starting point, as this is
easier on the developer side to deal with immutable objects. Giving
the option of immutability to the user is anyway a good thing, as it
allows a number of optimizations, that could really pay off in a real
application with lots of copying, ref passing,...
Of course, it is nice to also have mutable objects. I will address
that on the developer point of view (see below). Note that
ultimately, one thing would probably always be immutable: the
sequence type.
2. Implementing the class cluster
------------------------------
The class cluster that I implement in the attached project looks very
much like what you have already done. There is a superclass
BCSequence, and then subclasses, BCSequenceDNA,
BCSequenceRNA,...etc... plus a new special subclass
BCSequenceFactory. Now the purpose of a class cluster is that the
user just does everything using the public interface for BCSequence,
and as far as the user is concerned, every object is an instance of
BCSequence. But under the hood, you actually return instances of one
of the subclasses so that some operations can be optimized for the
particular type of sequence you are dealing with.
The problem for the developer of a class cluster is that you know
which subclass to use only once you call one of the init methods, but
you still have to do the 'alloc' before the init. There is no way
BCSequence will know what subclass it should use at the time 'alloc'
is called. So the trick is to alloc a temporary instance of a
particular subclass, a 'placeholder' class. Look at the
implementation of 'alloc' in BCSequence.m. What this method returns
is actually an instance of BCSequenceFactory when called on the
superclass (when called on one of the subclasses, though, it just
passes the message up to NSObject). The bottom line is: you never
create an instance of BCSequence, but an instance of
BCSequenceFactory (you still alloc instances of BCSequence
subclasses, of course). In fact, that BCSequenceFactory instance
could be a singleton and never deallocated if we changed the code a
little bit.
Then one of the init methods is called on that new
BCSequenceFactory instance. This method actually allocs and inits a
new object, an instance of the appropriate subclass. It then releases
self and returns a pointer to the new object created. Because the
user should always use the value returned by init to set her pointers,
she will get the right object in the end.
To summarize, what happens when the user runs the following command:
BCSequence *mySeq = [[BCSequence alloc] initWithDNAString:aString];
You have the following happening:
* [BCSequence alloc] returns an instance of BCSequenceFactory
* the message initWithDNAString:aString is sent to the
BCSequenceFactory instance
* in the method, a second object is created by calling
finalObject=[[BCSequenceDNA alloc] initWithString:aString]
* then the method calls [self release] to destroy the original
BCSequenceFactory instance
* then the method returns the finalObject
* so now mySeq = finalObject and is an instance of BCSequenceDNA
You get the same process when the user calls:
BCSequence *mySeq = [[BCSequence alloc] initWithString:aString];
except BCSequenceFactory first figures out to what subclass it should
send the 'initWithString' message (using the same code as the
original BCFactorySequence).
Then all the other methods are just convenience methods calling these
building blocks.
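
The sequence of events above can be sketched as follows. This is only a
minimal illustration, not the code from the attached project: it assumes
compilation with ARC (so the explicit [self release] described above is
done by the compiler), it overrides allocWithZone: rather than alloc
itself, and the accessor sequenceString is a hypothetical name added for
the example.

```objc
#import <Foundation/Foundation.h>

// Public abstract class: the only interface the user sees.
@interface BCSequence : NSObject
- (instancetype)initWithDNAString:(NSString *)aString;
- (NSString *)sequenceString;  // hypothetical accessor, for illustration
@end

// Private concrete subclass, the place for DNA-specific optimizations.
@interface BCSequenceDNA : BCSequence
- (instancetype)initWithString:(NSString *)aString;
@end

// Private placeholder class returned by [BCSequence alloc].
@interface BCSequenceFactory : BCSequence
@end

@implementation BCSequence
+ (id)allocWithZone:(NSZone *)zone {
    if (self == [BCSequence class]) {
        // Called on the abstract superclass: hand back a placeholder.
        return [BCSequenceFactory allocWithZone:zone];
    }
    // Called on a subclass: ordinary allocation through NSObject.
    return [super allocWithZone:zone];
}
// Defaults, overridden by the subclasses.
- (instancetype)initWithDNAString:(NSString *)aString { return [self init]; }
- (NSString *)sequenceString { return @""; }
@end

@implementation BCSequenceFactory
- (instancetype)initWithDNAString:(NSString *)aString {
    // Create the real object; under ARC the placeholder (self) is
    // released automatically because it is not the returned value.
    BCSequence *finalObject = [[BCSequenceDNA alloc] initWithString:aString];
    return (id)finalObject;
}
@end

@implementation BCSequenceDNA {
    NSString *_string;
}
- (instancetype)initWithString:(NSString *)aString {
    self = [super init];
    if (self) {
        _string = [aString copy];
    }
    return self;
}
- (instancetype)initWithDNAString:(NSString *)aString {
    return [self initWithString:aString];
}
- (NSString *)sequenceString { return _string; }
@end
```

With these definitions, [[BCSequence alloc] initWithDNAString:@"ATGC"]
gives back an object whose class is BCSequenceDNA, although the user
only ever names BCSequence.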
Like for any superclass/subclass pattern, it is important to define
what methods the subclasses should, may or should not override, and I
have a summary of that in the attached project. It is very similar to
what you have already done.
3. Pros and cons
---------------
What are the potential pitfalls and limitations:
(a) how to still provide the user with some more static typing when
she wants more control over it? This is complaint (a) of part (1)
above.
(b) how to provide mutable/immutable versions? This is complaint (b)
of part (1) above.
(c) the class cluster assumes all the methods can be called on all
the subclasses. Will that always be relevant? The case of
'complement' is already a bit troublesome, and how about even worse
cases, like 'digestWithRestrictionEnzyme:'. It does not make any
sense for a protein, does it? The question is really: how does that
fit with the BCTools? Could problems arise as we define more and more
tools? Will it be that easy to add more private subclasses without
breaking the existing code?
(d) What about the recent developments: does BCSymbolList fit in the
picture? How do you add the annotation stuff to that?
I have answers to all of these, and I will come back to these
different points below, in other parts of my email. And there might
be other pitfalls I don't see yet.
But first, while writing the code and thinking about the whole
concept, I also realized the potential benefits of a class cluster,
and there are more than what I anticipated. Some of these benefits
are really the benefits you get from OO, but are even more apparent
with such a simple interface where things are even more encapsulated
because it is almost like you have just one class:
* super simple interface for the user; she also gets the benefit of
polymorphism without the need to know the existence of all the
subclasses;
* because the public interface is reduced, the developer can make
plenty of changes without breaking existing code developed by the user
* in particular, it allows the addition of new types of sequences or
optimized subclasses for particular uses, that may in most cases
already work with the code developed by the user; so the user can get
new functionality for free
* the same is true for code developed by the developers of the framework:
- developers can work on other parts of the framework without knowing
too much about the guts of BCSequence
- by relying on just one class for interactions between the different
pieces of BioCocoa, it simplifies the development and minimizes
disruptions as modifications are made to BCSequence
I remember in the discussions, there was some disagreement about
having subclasses (Alex's choice) or just one class which would
decide what to do depending on the symbolSet used (Koen's choice);
maybe a class cluster is a way to have many of the benefits of the 2
systems without too many of the problems.
More about pros and cons of class cluster on the Apple web site:
http://developer.apple.com/documentation/Cocoa/Conceptual/CocoaObjects/Articles/ClassClusters.html
For me, the bottom line is still unclear. At present, I feel that a
class cluster would work really well. But we have to anticipate now
all the potential problems, and we should decide if it is worth it.
4. Compile vs runtime errors
--------------------------
This is a discussion about complaint (a) of part (1) and pitfall (a)
of part (3). What if the user wants more control over the type of
sequence she is using and wants some compiler warnings when trying to
cut a protein with EcoRI, or get its complementary sequence?
At this point, the class cluster does not allow that. All the methods
are valid for all the sequence types. In this context, an invalid
call will only be revealed at runtime, and a BCProtein object would
have to decide at runtime to return something when sent an irrelevant
message. What should it send back? This issue is actually slightly
different from the discussion here and is discussed in part 6 (sorry
this whole email is quite large and complicated; I am trying to keep
it readable!). The question here is really: can we prevent that from
even happening when the user knows what type of sequence she is
dealing with and could get compiler warnings?
One way to help with that is to provide an additional set of headers
defining some public classes named BCSequenceDNA, BCSequenceRNA,....
These classes would just be placeholders, and would be completely
distinct from the subclasses of BCSequence (I will come back to the
name conflict). They would have some init methods, but when the user
uses these classes and alloc/inits an instance, she would in fact get
one of the BCSequence subclasses. The compiler would not know and
would trust the headers to generate warnings. For instance, the header
for the BCSequenceProtein placeholder class would not define the
methods 'complement' or 'cutWithRestrictionEnzyme:', and you would
get a compiler warning even though the object would in fact respond
to the methods at runtime (but would have to return some dummy
values). So these headers would really define completely virtual
classes. One problem is that the names of these placeholder classes
conflict with the names of the BCSequence private subclasses that are
defined in the project I sent. We could rename the latter to
BCSeqDNA/RNA/... for example, and keep the nice full names
'BCSequenceDNA/RNA/...' for the placeholder public classes.
An alternative is to define protocols, and so the user would have to
use (id <BCSequenceDNA>) in the code. The BCSequence would provide
methods to return objects typed this way. It is a bit of a pain to
type id <BCSequenceDNA> all the time and reduces readability, though.
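
To make the protocol alternative concrete, here is a rough sketch,
assuming ARC. The protocol name BCSequenceDNA and the private class
name BCSeqDNA come from the discussion above; the base protocol
BCSequenceCommon, the BCSequenceProtein protocol and the accessor
sequenceString are hypothetical names used only for this example.

```objc
#import <Foundation/Foundation.h>

// Hypothetical base protocol: what every sequence type can do.
@protocol BCSequenceCommon <NSObject>
- (NSString *)sequenceString;
@end

// Only nucleotide sequences promise -complement.
@protocol BCSequenceDNA <BCSequenceCommon>
- (id <BCSequenceDNA>)complement;
@end

// Proteins deliberately declare nothing beyond the common methods.
@protocol BCSequenceProtein <BCSequenceCommon>
@end

// Private concrete class; the user would only see id <BCSequenceDNA>.
@interface BCSeqDNA : NSObject <BCSequenceDNA>
- (instancetype)initWithString:(NSString *)aString;
@end

@implementation BCSeqDNA {
    NSString *_string;
}
- (instancetype)initWithString:(NSString *)aString {
    self = [super init];
    if (self) {
        _string = [aString copy];
    }
    return self;
}
- (NSString *)sequenceString { return _string; }
- (id <BCSequenceDNA>)complement {
    NSMutableString *result = [NSMutableString stringWithCapacity:[_string length]];
    for (NSUInteger i = 0; i < [_string length]; i++) {
        unichar c = [_string characterAtIndex:i];
        switch (c) {
            case 'A': c = 'T'; break;
            case 'T': c = 'A'; break;
            case 'C': c = 'G'; break;
            case 'G': c = 'C'; break;
            default: break;  // leave any other symbol unchanged
        }
        [result appendFormat:@"%C", c];
    }
    return [[BCSeqDNA alloc] initWithString:result];
}
@end
```

A variable typed id <BCSequenceDNA> can be sent -complement without any
complaint, whereas sending -complement to a variable typed
id <BCSequenceProtein> is flagged by the compiler because the method is
not part of that protocol. That is the compile-time check the weakly
typed BCSequence cannot give.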
So there are ways to solve the problem. Note that the problem is not
really tied to the class cluster implementation and is already partly
a problem that the current code is facing, as I talked about at the
very beginning of the email (OK, now is a good time to reread
everything!!).
Of course, the interface then becomes a bit schizophrenic, so it may
not be such a good idea to allow all of that. At least in the
beginning, there may not be much need for stronger typing, and
this goes a bit against the whole idea of a simple interface and a
class cluster.
5. Mutable and immutable instances
--------------------------------
This is a discussion about complaint (b) of part (1) and pitfall (b)
of part (3).
Why impose immutable objects? Not sure.
This is not something I had thought of at first, but it is anyway an
important issue that goes beyond the idea of class cluster. Immutable
objects allow very important and basic optimizations, particularly
when copying objects, and are sufficient for most uses. A smart user
will use immutable objects whenever she can and will only go to
mutable objects if really necessary. This is something we may have to
think about for the BioCocoa project anyway. I am not saying it is
absolutely necessary but it should be discussed (and maybe it has
been??).
To implement mutable objects in the class cluster could be a bit
tricky, because there are two conflicting subclass organizations
here: mutable/immutable and dna/rna/protein/codon. To get all the
combinations, it seems that we need 8 subclasses!!
I am not completely sure how to deal with it, or if we should deal
with it or just give up and stick to immutable only. One possibility
is to not have distinct subclasses for mutable/immutable. Instead, there
could be simply a BOOL flag 'isMutable' as one of the instance
variables. The object would then return different results in key
methods such as 'copy' depending on the value of the flag. Also, at
creation, it would create mutable or immutable instance variables
(NSArray or NSMutableArray) depending on the value of that flag. It
is OK to declare a mutable object as the instance variable and then
actually use it to allocate an immutable object, as long as we are
consistent in the methods called to avoid runtime errors (and we
should use some casts to avoid compiler warnings).
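
Here is a rough sketch of the flag idea, assuming ARC and a single
class for brevity (in the cluster the same ivars would live in the
private subclasses). The method names initWithSymbols:isMutable:,
symbols and appendSymbol: are hypothetical, and raising an exception on
an illegal mutation is just one possible choice.

```objc
#import <Foundation/Foundation.h>

@interface BCSequence : NSObject <NSCopying>
- (instancetype)initWithSymbols:(NSArray *)symbols isMutable:(BOOL)flag;
- (BOOL)isMutable;
- (NSArray *)symbols;
- (void)appendSymbol:(id)symbol;  // raises if the sequence is immutable
@end

@implementation BCSequence {
    NSMutableArray *_symbols;  // declared mutable; holds an NSArray when immutable
    BOOL _isMutable;
}

- (instancetype)initWithSymbols:(NSArray *)symbols isMutable:(BOOL)flag {
    self = [super init];
    if (self) {
        _isMutable = flag;
        // The cast only keeps the compiler quiet; the flag guards every mutation.
        _symbols = flag ? [symbols mutableCopy] : (NSMutableArray *)[symbols copy];
    }
    return self;
}

- (BOOL)isMutable { return _isMutable; }

- (NSArray *)symbols { return [_symbols copy]; }

- (void)appendSymbol:(id)symbol {
    if (!_isMutable) {
        [NSException raise:NSInternalInconsistencyException
                    format:@"attempt to mutate an immutable BCSequence"];
    }
    [_symbols addObject:symbol];
}

- (id)copyWithZone:(NSZone *)zone {
    if (!_isMutable) {
        // Immutable: sharing the same instance is safe and costs nothing.
        return self;
    }
    // Mutable: hand out an immutable snapshot of the current symbols.
    return [[[self class] allocWithZone:zone] initWithSymbols:_symbols isMutable:NO];
}
@end
```

Copying an immutable sequence then returns the very same object, which
is the kind of optimization mentioned above, while copying a mutable
one returns an independent immutable snapshot.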
6. Potential clashes in the future
--------------------------------
This is a discussion about pitfall (c) of part (3).
The problem is: will the class cluster ever become a problem in the
future and force us to rewrite everything and lose our sleep?
The short answer is: I don't know!
I guess any pattern can get in the way in some unpredicted way at
some unpredicted point in the future. We can try to anticipate those
issues. In the case of the class cluster, some of the questions to
answer are obviously: how do we deal with irrelevant messages sent to
inappropriate subclasses, such as sending 'complement' to a
BCSequenceProtein? How frequent will these messages be? How do we
deal with new sequence types that could be introduced later? How
frequently will new sequence types be needed?
The answer to that is to list, as far as we can, all the methods that
would have to go in the final implementation of BCSequence and see
how the current sequence types could deal with them. Also, we would
have to think about what other types of sequences could be added in
the future (which could be inspired by other BioX projects) and hope
that a future BCSequenceExtraterrestrial won't break everything. This
may have already been discussed earlier on the mailing list?
Some examples of how to deal with irrelevant methods:
* complement of a protein: return the same sequence; return an empty
sequence; return nil??
* cut a protein with EcoRI: OK, this is easy, you just get the same
protein!! Or do you get the sequence of the EcoRI protein!!!
* etc...
The existing code will have to deal with this anyway. When I look at
the present code, I see you can return BCSequence objects without
knowing the type, as returned by 'sequenceWithString:' in the
BCSequenceFactory class. And then, this is allowed to get in the
BCToolComplement with the method 'complementToolWithSequence:'. What
if the BCSequence created is a protein? The abstraction that you did
encode in BCSymbol already allows you to deal with it, you did a
great job!
7. Full incorporation of the present implementation
----------------------------------------------
This is a discussion about pitfall (d) of part (3).
The implementation I attached to the email is quite basic and could
be further refined to incorporate the features and organization of
the current implementation and the short-term planned additions. The
current class tree can probably be used as is. One problem is that the
name BCSequence would be taken for the superclass; this is probably
the name that should be public. Then we could have the following:
* BCSymbolList = subclass of BCSequence
* BCSeq = subclass of BCSymbolList with annotations
* BCSeqDNA, BCSeqRNA, etc... = subclasses of BCSeq with optimized
methods for the different types of sequences
The additional benefit is that the instance variables would not even
be in the public header anymore, but in the subclass BCSymbolList
(and BCSequenceFactory would then be even lighter, with no instance
variable at all). An alternative is to decide that BCSymbolList would
actually be BCSequence, and the annotated BCSequence would become
BCSeq.
It is thus mostly a problem of naming, which is somewhat secondary,
but is still quite important because it would be here to stay and has
to be easy to remember and logical...
An additional problem is that if you instantiate BCSymbolList (in the
case of non-annotated sequences), you want to make sure that it can
handle ALL the messages declared in the header. It is not clear to me
yet that it can do it.
8. Happy new year!
------------------
... and thanks for reading this up to that point!
Charles
--
Charles Parnot
charles.parnot at stanford.edu
Help science go fast forward:
http://cmgm.stanford.edu/~cparnot/xgrid-stanford/
Room B157 in Beckman Center
279, Campus Drive
Stanford University
Stanford, CA 94305 (USA)
Tel +1 650 725 7754
Fax +1 650 725 8021