Fwd: [Biococoa-dev] BCSequence class cluster

Alexander Griekspoor mek at mekentosj.com
Wed Jan 5 06:46:47 EST 2005

Wow, you have been quite busy Charles, brilliant!  Happy new year  

Just one thing I found recently by pure coincidence, but very related:

On 5 Jan 2005, at 8:56, Charles PARNOT wrote:

> It seems the class cluster possibility has raised some interest. So I  
> took some time to think it through and write some code. I got carried  
> away and wrote a lot of it, and also I wrote this long email, but now  
> you are used to those long emails:-)
I like those ;-)
*** ADDED: just got my email back saying that it's too big to post on the
list, so I'll cut some things away..
> Note that I am just proposing an implementation of a class cluster,  
> and some solutions to potential pitfalls, but I am not saying that you  
> should absolutely go with the class cluster design. I am a little  
> biased in favor of it, but you should really decide if (1) you want to  
> discuss it further and (2) discuss it further! Note that I mostly say  
> 'you' when I talk about the developers, but maybe at some point, I  
> should really start saying 'we' ;-)
Yep, you definitely got stuck in here, haha.

> Like I said before, several of the issues raised here apply to the  
> existing code and you will have to deal with it at some point. The  
> main point boils down to the question of using a weakly typed object
> BCSequence vs using strongly typed objects belonging to one of the
> subclasses BCSequenceDNA/RNA/etc... Some of the code is a bit  
> schizophrenic right now and tries to deal with both cases... The class  
> cluster would favor the weakly typed route, and would make the design  
> more consistent and simpler.
Which certainly is a good thing.

> OK, so what would a class cluster look like?
> 1. The user point of view
> ----------------------

> There are 2 things the user could complain about:
>   a- Some of the methods are only relevant for certain sequence types
>   b- Sequence objects are immutable
> About complaint (a)
> In the header file BCSequence.h of the attached project, there are 2
> methods that are only relevant to a subset of the BCSequence types:
> -complement and -reverseComplement. This is not a really big concern
> at this point, because these are just 2 methods and it is quite easy to
> return something for all cases (for a protein, probably just return
> itself). But more methods in BCSequence or in the BCTools could raise
> the same issue. For instance, BCToolDigest: that would only make
> sense on a DNA sequence when using restriction enzymes.

> The class BCSequence would always return something, empty sequences in  
> the worst case, leaving the troubles to the runtime. This is the only  
> appropriate way to handle it with the class cluster design, maybe  
> together with some error codes/handling mechanism.
> But the user may want to be more specific about the BCSequence type  
> and get some compiler warnings when appropriate, instead of leaving it  
> to the runtime. The user might be ready to give up the simplicity of a  
> unique class and use more specific types. This is the issue of weak vs  
> strong typing, which relates to the issue of compiler vs runtime  
> errors/warnings.
True, that's the big issue.

> One possible answer is to say to the user: this is the way it is, just  
> accept it!! And I believe as a first version, it is really OK.
I agree, some rules simply come with the system.

>  But there are also some ways to give the user the possibility to  
> choose between strong and weak typing and keep the class cluster  
> design, that I will explain later, below.
> About complaint (b)
> I thought of enforcing immutability as a starting point, as it is
> easier on the developer side to deal with immutable objects. Giving
> the option of immutability to the user is anyway a good thing, as it  
> allows a number of optimizations, that could really pay off in a real  
> application with lots of copying, ref passing,...
Yes, this is exactly how the mutable variants of NSData, NSString etc
are set up, as I discovered in the devnote I mentioned above. Indeed, it
would be very nice to have a mutable and immutable variant of  
BCSequence objects.

> Of course, it is nice to also have mutable objects.
Definitely! With large sequences you certainly don't want to copy them  
all the time to new objects.

> I will address that on the developer point of view (see below). Note  
> that ultimately, one thing would probably always be immutable: the  
> sequence type.
> 2. Implementing the class cluster
> ------------------------------
> The class cluster that I implement in the attached project looks very  
> much like what you have already done. There is a superclass  
> BCSequence, and then subclasses, BCSequenceDNA,  
> BCSequenceRNA,...etc... plus a new special subclass BCSequenceFactory.  
> Now the purpose of a class cluster is that the user just does  
> everything using the public interface for BCSequence, and as far as  
> the user is concerned, every object is an instance of BCSequence. But  
> under the hood, you actually return instances of one of the
> subclasses so that some operations can be optimized for the particular  
> type of sequence you are dealing with.
In other words the subclasses are private, only BCSequence.h is public.
> The problem for the developer of a class cluster is that you know  
> which subclass to use only once you call one of the init methods, but  
> you still have to do the 'alloc' before the init. There is no way  
> BCSequence will know what subclass it should use at the time 'alloc'  
> is called. So the trick is to alloc a temporary instance of a  
> particular subclass, a 'placeholder' class. Look at the implementation  
> of 'alloc' in BCSequence.m.
+ (id)alloc
{
	if (self == [BCKSequence class])  // Should this be [BCSequence class]?
		return [BCKSequencePlaceholder alloc];  // So this would be [BCSequenceFactory alloc]?
	else
		return [super alloc];
}

> What this method returns is actually an instance of BCSequenceFactory  
> when called on the superclass (when called on one of the subclasses,
> though, it just passes the message up to NSObject). The bottom line  
> is: you never create an instance of BCSequence, but an instance of  
> BCSequenceFactory (you still alloc instances of BCSequence subclasses,  
> of course). In fact, that BCSequenceFactory instance could be a  
> singleton and never deallocated if we changed the code a little bit.
> Then one of the init methods is called on that new
> BCSequenceFactory instance. This method actually allocs and inits a
> new object, an instance of the appropriate subclass. It then releases
> self and returns a pointer to the new object created. Because she
> should always use the value returned by init to set her pointers, the
> user will get the right object in the end.
OK, I get it, looks very nice!
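
A minimal sketch of the placeholder trick described above, assuming ARC (under manual retain/release the factory's init would also release self, as Charles describes). The type enum, the initWithString:type: initializer and the sequenceString property are illustrative assumptions, not the attached project's actual interface.

```objc
#import <Foundation/Foundation.h>

// Hypothetical type tag, used only for this sketch.
typedef NS_ENUM(NSInteger, BCSequenceType) {
    BCSequenceTypeDNA,
    BCSequenceTypeProtein
};

// The only class the user is supposed to see.
@interface BCSequence : NSObject
@property (nonatomic, copy, readonly) NSString *sequenceString;
- (instancetype)initWithString:(NSString *)string type:(BCSequenceType)type;
@end

// Private concrete subclasses, plus the placeholder returned by +alloc.
@interface BCSequenceDNA : BCSequence
@end
@interface BCSequenceProtein : BCSequence
@end
@interface BCSequenceFactory : BCSequence
@end

@implementation BCSequence
+ (id)alloc
{
    // Called on the abstract superclass: hand back a placeholder instead.
    if (self == [BCSequence class])
        return [BCSequenceFactory alloc];
    // Called on a subclass (including the factory): normal allocation.
    return [super alloc];
}

- (instancetype)initWithString:(NSString *)string type:(BCSequenceType)type
{
    if ((self = [super init])) {
        _sequenceString = [string copy];
    }
    return self;
}
@end

@implementation BCSequenceDNA
@end

@implementation BCSequenceProtein
@end

@implementation BCSequenceFactory
- (instancetype)initWithString:(NSString *)string type:(BCSequenceType)type
{
    // Replace the placeholder with an instance of the concrete subclass.
    Class concrete = (type == BCSequenceTypeDNA) ? [BCSequenceDNA class]
                                                 : [BCSequenceProtein class];
    return [[concrete alloc] initWithString:string type:type];
}
@end
```

The caller only ever writes [[BCSequence alloc] initWithString:... type:...] and keeps the pointer returned by init, which is why the swap stays invisible.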
> Like for any superclass/subclass pattern, it is important to define  
> what methods the subclasses should, may or should not override, and I  
> have a summary of that in the attached project. It is very similar to  
> what you have already done.
Yep, guess that's easy to headerdoc along with every method
> 3. Pros and cons
> ---------------
> But first, while writing the code and thinking about the whole  
> concept, I also realized the potential benefits of a class cluster,  
> and there are more than what I anticipated. Some of these benefits are  
> really the benefits you get from OO, but are even more apparent with  
> such a simple interface where things are even more encapsulated  
> because it is almost like you have just one class:
> * super simple interface for the user; she also gets the benefit of  
> polymorphism without the need to know the existence of all the  
> subclasses;
That's even a big advantage for us ;-) Think in terms of tutorials and  

> * because the public interface is reduced, the developer can make  
> plenty of changes without breaking existing code developed by the user

> * in particular, it allows the addition of new types of sequences or  
> optimized subclasses for particular uses, that may in most cases  
> already work with the code developed by the user; so the user can get  
> new functionality for free
Exactly, like adding the mutable variants

> I remember in the discussions, there was some disagreement about  
> having subclasses (Alex's choice) or just one class which would decide  
> what to do depending on the symbolSet used (Koen's choice); maybe a  
> class cluster is a way to have many of the benefits of the 2 systems  
> without too many of the problems.
> More about pros and cons of class cluster on the Apple web site:
> http://developer.apple.com/documentation/Cocoa/Conceptual/CocoaObjects/Articles/ClassClusters.html
Aha, maybe should have read the whole thing first. I like to approach  
these long emails more as conversations, commenting along the way so  
everyone can follow my (sometimes twisted) thoughts ;-)
> For me, the bottom line is still unclear. At present, I feel that a  
> class cluster would work really well. But we have to anticipate now  
> all the potential problems, and we should decide if it is worth it.
That's exactly my thought at the moment, indeed it fits nicely in
between the two opposite choices in the subclassing debate and
satisfies most arguments. The only problem is that I don't have a real
overview to see potential problems coming, but that's simply because
of my inexperience with programming. Perhaps we just have to take the
jump and see where it ends, at least it has proven very effective in
the Cocoa framework (wow, that's a biased opinion ;-).

> 4. Compile vs runtime errors
> --------------------------
> This is a discussion about complaint (a) of part (1) and pitfall (a)  
> of part (3). What if the user wants more control over the type of
> sequence she is using and wants some compiler warnings when trying to
> cut a protein with EcoRI, or get its complementary sequence?
> At this point, the class cluster does not allow that. All the methods  
> are valid for all the sequence types. In this context, an invalid call  
> will only be revealed at runtime, and a BCProtein object would have to  
> decide at runtime to return something when sent an irrelevant message.  
> What should it send back? This issue is actually slightly different  
> from the discussion here and is discussed in part 6 (sorry this whole  
> email is quite large and complicated; I am trying to keep it  
> readable!).
Hanging in here.. Still around.. ;-)

> The question here is really: can we prevent that from even happening  
> when the user knows what type of sequence she is dealing with and  
> could get compiler warnings?
> One way to help with that is to provide an additional set of headers  
> defining some public classes named BCSequenceDNA, BCSequenceRNA,....  
> These classes would just be placeholders, and would be completely  
> distinct from the subclasses of BCSequence (I will come back to the
> name conflict).
Good idea.

> They would have some init methods, but when the user uses these  
> classes and alloc/init an instance, she would get in fact one of the  
> BCSequence subclasses. The compiler would not know and would trust the  
> headers to generate warnings. For instance, the header for the
> BCSequenceProtein placeholder class would not define the methods  
> 'complement' or 'cutWithRestrictionEnzyme:', and you would get a  
> compiler warning even though the object would in fact respond to the  
> methods at runtime (but would have to return some dummy values). So  
> these headers would really define completely virtual classes. One of
> the problems is that the names of these placeholder classes conflict with
> the names of the BCSequence private subclasses that are defined in the  
> project I sent. We could rename the latter to BCSeqDNA/RNA/... for  
> example, and keep the nice full names 'BCSequenceDNA/RNA/...' for the  
> placeholder public classes.
Seems feasible, although having separate names for internal vs public  
representations might be troublesome.
> An alternative is to define protocols, and so the user would have to  
> use (id <BCSequenceDNA>) in the code. The BCSequence would provide  
> methods to return objects typed this way. It is a bit of a pain to  
> type id <BCSequenceDNA> all the time and reduces readability, though.
Yes, that's painful.
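
A small sketch of the protocol alternative, under assumptions: the protocol and class names below (BCDemoSequence, BCDemoDNA, BCDemoConcreteDNA) are hypothetical and only illustrate how an id <Protocol> type lets the compiler check which messages are allowed.

```objc
#import <Foundation/Foundation.h>

// Messages every sequence understands (hypothetical protocol).
@protocol BCDemoSequence <NSObject>
- (NSString *)sequenceString;
@end

// Messages that only make sense for DNA (hypothetical protocol).
@protocol BCDemoDNA <BCDemoSequence>
- (NSString *)complement;
@end

// A concrete class that could stay private behind the protocols.
@interface BCDemoConcreteDNA : NSObject <BCDemoDNA>
- (instancetype)initWithString:(NSString *)string;
@end

@implementation BCDemoConcreteDNA {
    NSString *_string;
}

- (instancetype)initWithString:(NSString *)string
{
    if ((self = [super init])) {
        _string = [string copy];
    }
    return self;
}

- (NSString *)sequenceString
{
    return _string;
}

- (NSString *)complement
{
    // Base-by-base complement; symbols other than A/C/G/T pass through.
    NSMutableString *result = [NSMutableString stringWithCapacity:_string.length];
    for (NSUInteger i = 0; i < _string.length; i++) {
        unichar c = [_string characterAtIndex:i];
        unichar out;
        switch (c) {
            case 'A': out = 'T'; break;
            case 'T': out = 'A'; break;
            case 'C': out = 'G'; break;
            case 'G': out = 'C'; break;
            default:  out = c;   break;
        }
        [result appendFormat:@"%C", out];
    }
    return result;
}
@end
```

With a variable declared as id <BCDemoDNA>, sending -complement compiles cleanly; with one declared as id <BCDemoSequence>, the compiler should flag the same message, which is the compile-time check discussed above.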
> So there are ways to solve the problem. Note that the problem is not  
> really tied to the class cluster implementation and is already partly  
> a problem that the current code is facing, as I talked about at the  
> very beginning of the email (OK, now is a good time to reread  
> everything!!).
> Of course, the interface then becomes a bit schizophrenic, so it may  
> not be such a good idea to allow all of that. At least in the  
> beginning, there may not be such a high need for stronger typing, and
> this goes a bit against the whole idea of a simple interface and a  
> class cluster.

Perhaps you're right, but what I was thinking is to implement a way to
better return the reason why something doesn't work instead of a simple
nil. For instance, calling cutInPiecesWithThisRestrictionEnzyme on a
DNA would return the pieces, while it would also work on proteins, but
return nil, right? Of course you could also let the method raise an
exception; it then becomes the developer's responsibility to call
methods on the right object. The downside is that this might lead too
easily to program halts/crashes if the developer doesn't pay attention.
But think in terms of NSArray's objectAtIndex: method: it doesn't
return nil if you ask for an object out of bounds, it raises an
exception (NSRangeException).
I'm still wondering a bit how we're going to implement these kinds of
methods, as we now have to start ALL methods with a test of what the
sequence type is.
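
One possible way around testing the type in every method, sketched under assumptions: the superclass supplies a default "not applicable" answer that reports a reason through an NSError, and only the DNA subclass overrides it. The class names, the error domain, the error code and the fragmentsByCuttingAtSite:error: method are all hypothetical, and splitting on the recognition site is a deliberate simplification of what a real digest does.

```objc
#import <Foundation/Foundation.h>

// Hypothetical error domain for this sketch.
static NSString * const BCDemoErrorDomain = @"BCDemoErrorDomain";

@interface BCDemoSequence : NSObject
@property (nonatomic, copy, readonly) NSString *sequenceString;
- (instancetype)initWithString:(NSString *)string;
- (NSArray *)fragmentsByCuttingAtSite:(NSString *)site error:(NSError **)error;
@end

@interface BCDemoDNA : BCDemoSequence
@end

@implementation BCDemoSequence
- (instancetype)initWithString:(NSString *)string
{
    if ((self = [super init])) {
        _sequenceString = [string copy];
    }
    return self;
}

// Default for every sequence type: nil plus a reason, no type test needed.
- (NSArray *)fragmentsByCuttingAtSite:(NSString *)site error:(NSError **)error
{
    if (error) {
        *error = [NSError errorWithDomain:BCDemoErrorDomain
                                     code:1
                                 userInfo:@{ NSLocalizedDescriptionKey :
                                     @"Restriction digests only apply to DNA sequences" }];
    }
    return nil;
}
@end

@implementation BCDemoDNA
// Only the DNA subclass knows how to answer.
- (NSArray *)fragmentsByCuttingAtSite:(NSString *)site error:(NSError **)error
{
    // Simplification: split on the site and drop it from the fragments.
    return [self.sequenceString componentsSeparatedByString:site];
}
@end
```

The caller checks the return value first and reads the NSError only when it is nil, which is the usual Cocoa convention.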
> 5. Mutable and immutable instances
> --------------------------------
> This is a discussion about complaint (b) of part (1) and pitfall (b)  
> of part (3).
> Why impose immutable objects? Not sure.
> This is not something I had thought of at first, but it is anyway an  
> important issue that goes beyond the idea of class cluster. Immutable  
> objects allow very important and basic optimizations, particularly
> when copying objects, and are sufficient for most uses. A smart user
> will use immutable objects whenever she can and will only go to mutable
> objects if really necessary. This is something we may have to think  
> about for the BioCocoa project anyway. I am not saying it is  
> absolutely necessary but it should be discussed (and maybe it has  
> been??).
I've been in favour of both mutable and immutable BCSequences from the
beginning, didn't know how to implement it in a simple way however ;-)
> To implement mutable objects in the class cluster could be a bit  
> tricky, because there are two conflicting subclass organizations here:  
> mutable/immutable and dna/rna/protein/codon. To get all the  
> combinations, it seems that we need 8 subclasses!!
Oops, Koen won't like this, LOL ;-) On the other hand, look at the  
number of NSNumber subclasses...
> I am not completely sure how to deal with it, or if we should deal  
> with it or just give up and stick to mutable only. One possibility is  
> to not have distinct subclasses for mutable/immutable. Instead, there  
> could be simply a BOOL flag 'isMutable' as one of the instance  
> variables. The object would then return different results in key  
> methods such as 'copy' depending on the value of the flag.
But then we could just as well do the subclasses right?

> Also, at creation, it would create mutable or immutable instance  
> variables (NSArray or NSMutableArray) depending on the value of that  
> flag. It is OK to declare a mutable object as the instance variable  
> and then actually use it to allocate an immutable object, as long as  
> we are consistent in the methods called to avoid runtime errors (and  
> we should use some casts to avoid compiler warnings).
I think the choice in this system is simple, either the subclass or a  
mutable variant only.
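
A rough sketch of the flag-based option Charles mentions, assuming ARC. The class name, the initWithString:isMutable: initializer and the choice to raise an exception on an illegal mutation are assumptions made for illustration; the point is only that -copy can behave differently depending on the flag.

```objc
#import <Foundation/Foundation.h>

@interface BCDemoFlaggedSequence : NSObject <NSCopying>
- (instancetype)initWithString:(NSString *)string isMutable:(BOOL)flag;
- (BOOL)isMutable;
- (NSString *)sequenceString;
- (void)appendString:(NSString *)string;
@end

@implementation BCDemoFlaggedSequence {
    NSMutableString *_storage;  // always mutable storage, guarded by the flag
    BOOL _isMutable;
}

- (instancetype)initWithString:(NSString *)string isMutable:(BOOL)flag
{
    if ((self = [super init])) {
        _storage = [string mutableCopy];
        _isMutable = flag;
    }
    return self;
}

- (BOOL)isMutable
{
    return _isMutable;
}

- (NSString *)sequenceString
{
    // Return a snapshot so callers never hold the internal storage.
    return [_storage copy];
}

- (void)appendString:(NSString *)string
{
    if (!_isMutable) {
        [NSException raise:NSInternalInconsistencyException
                    format:@"Attempt to mutate an immutable sequence"];
    }
    [_storage appendString:string];
}

- (id)copyWithZone:(NSZone *)zone
{
    // Immutable: sharing the same object is safe, so no real copy is made.
    if (!_isMutable)
        return self;
    // Mutable: take an immutable snapshot of the current contents.
    return [[[self class] allocWithZone:zone] initWithString:_storage
                                                   isMutable:NO];
}
@end
```

This keeps the number of subclasses down, at the price of a runtime check in every mutating method.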
> 6. Potential clashes in the future
> --------------------------------
> This is a discussion about pitfall (c) of part (3).
> The problem is: will the class cluster ever become a problem in the  
> future and force us to rewrite everything and lose our sleep?
> The short answer is: I don't know!
Me neither.
> I guess any pattern can get in the way in some unpredicted way at some  
> unpredicted point in the future.
Or now already, look at the discussion about subclassing.

> 7. Full incorporation of the present implementation
> ----------------------------------------------

> It is thus mostly a problem of naming, which is somewhat secondary,  
> but is still quite important because it would be here to stay and has  
> to be easy to remember and logical...
> An additional problem is that if you instantiate BCSymbolList (in the  
> case of non-annotated sequences), you want to make sure that it can  
> handle ALL the messages declared in the header. It is not clear to me  
> yet that it can do it.

Let's first decide if we all like the idea of the class cluster, and  
then see how to implement it and the naming. Just one thing you might  
have thought about as well Charles, how do you see the annotations  
stuff fitting in this scheme? The nice thing is that it applies to all  
subclasses, but can it still be implemented in the superclass? Perhaps  
not, as the mutable vs immutable implementation will be quite  
different. And that's where my major doubt is: as you mentioned, you
have both a divergence in the direction of mutable vs immutable, as
well as in DNA/RNA/Protein. This automatically leads to duplication of  
the code in one of the two directions I'm afraid...
There's plenty to discuss ;-)

> 8. Happy new year!
> ------------------
> ... and thanks for reading this up to that point!

It was a pleasure!

                     ** Alexander Griekspoor **
               The Netherlands Cancer Institute
               Department of Tumorbiology (H4)
          Plesmanlaan 121, 1066 CX, Amsterdam
                   Tel:  + 31 20 - 512 2023
                   Fax:  + 31 20 - 512 2029
                   AIM: mekentosj at mac.com
                   E-mail: a.griekspoor at nki.nl
               Web: http://www.mekentosj.com

                           Windows vs Mac
	65 million years ago, there were more
                      dinosaurs than humans.
	     Where are the dinosaurs now?
