[Biococoa-dev] BCSequence class cluster

Fri Jan 7 05:27:48 EST 2005

Hi Charles,

First of all, happy New Year to you too!

Thanks a lot for all the work, both the coding and the research you did  
about the future directions of BioCocoa. I'm a big fan of the Class  
cluster approach as this keeps the interface very simple. The biggest  
problem with this approach - as I see it now - is that some BCTools  
will only work with/on some sequence types. In that respect, I'd prefer  
your proposal to provide an additional set of headers defining some  
public classes as placeholders over the protocol approach. The  
placeholder approach will make/keep code much more readable indeed.

It seems that the mutability problem can be solved by either the  
subclasses or the mutable variant. While the mutable variant will  
reduce the number of classes, it will make the code in these classes  
less readable (depending on the number of optimizations we decide to  
implement). I think something could be said for either solution; I  
don't really have an opinion about this one.

Thanks again for your valuable contribution to BioCocoa!

Best wishes,

Peter

On 05 Jan 2005, at 08:56, Charles PARNOT wrote:

> It seems the class cluster possibility has raised some interest. So I  
> took some time to think it through and write some code. I got carried  
> away and wrote a lot of it, and also I wrote this long email, but now  
> you are used to those long emails:-)
>
> Note that I am just proposing an implementation of a class cluster,  
> and some solutions to potential pitfalls, but I am not saying that you  
> should absolutely go with the class cluster design. I am a little  
> biased in favor of it, but you should really decide if (1) you want to  
> discuss it further and (2) discuss it further! Note that I mostly say  
> 'you' when I talk about the developers, but maybe at some point, I  
> should really start saying 'we' ;-) Anyway, for every sentence you  
> read below, mentally add at the beginning "I may very well be wrong or  
> missing something but it seems to me that maybe...".
>
>
>
> Like I said before, several of the issues raised here apply to the  
> existing code and you will have to deal with it at some point. The  
> main point boils down to the question of using a weakly typed object  
> BCSequence vs using strongly typed objects belonging to one of the  
> subclasses BCSequenceDNA/RNA/etc... Some of the code is a bit  
> schizophrenic right now and tries to deal with both cases... The class  
> cluster would favor the weakly typed route, and would make the design  
> more consistent and simpler.
>
> To follow the discussion, you can download a zipped Xcode project with  
> some real code here:
> http://cmgm.stanford.edu/~cparnot/temp/BCSequenceClassCluster.zip
> Don't try to compile, it probably won't succeed. It is just easier to  
> navigate the code in this familiar format.
>
> OK, so what would a class cluster look like?
>
>
> 1. The user point of view
> ----------------------
>
> For the user, there is only one class, called BCSequence. Instances  
> are immutable and can be obtained with a number of factory methods, or  
> using alloc followed by init methods. These are defined in the only  
> header file accessible to the user, BCSequence.h (see attached  
> project).
>
> From the user point of view, the usage is very simple: just create a  
> sequence with one of the numerous factory or init methods, including  
> reading from files. The instance you get back is immutable, but you  
> can create new instances from it by removing/adding pieces, or  
> transforming it to another type. You can always check the type and  
> length, get the sequence back into a string or array of symbols. You  
> can feed tools with that BCSequence instance and get the results,  
> potentially getting back other instances of BCSequence.
>
> There are 2 things the user could complain about:
>   a- Some of the methods are only relevant for certain sequence types
>   b- Sequence objects are immutable
>
> About complaint (a)
> In the header file BCSequence.h of the attached project, there are 2  
> methods that are only relevant to a subset of the BCSequence type:  
> -complement and -reverseComplement. This is not a really big concern  
> at this point, because this is just 2 methods and it is quite easy to  
> return something for all cases (for a protein, probably just return  
> itself). But more methods in BCSequence or in the BCTools could give  
> the same issues. For instance, BCToolDigest would only make  
> sense on a DNA sequence when using restriction enzymes.
> The class BCSequence would always return something, empty sequences in  
> the worst case, leaving the troubles to the runtime. This is the only  
> appropriate way to handle it with the class cluster design, maybe  
> together with some error codes/handling mechanism.
> But the user may want to be more specific about the BCSequence type  
> and get some compiler warnings when appropriate, instead of leaving it  
> to the runtime. The user might be ready to give up the simplicity of a  
> unique class and use more specific types. This is the issue of weak vs  
> strong typing, which relates to the issue of compiler vs runtime  
> errors/warnings.
> One possible answer is to say to the user: this is the way it is, just  
> accept it!! And I believe as a first version, it is really OK. But  
> there are also some ways to give the user the possibility to choose  
> between strong and weak typing and keep the class cluster design, that  
> I will explain later, below.
>
> About complaint (b)
> I thought of enforcing immutability as a starting point, as it is  
> easier on the developer side to deal with immutable objects. Giving  
> the option of immutability to the user is anyway a good thing, as it  
> allows a number of optimizations, that could really pay off in a real  
> application with lots of copying, ref passing,...
> Of course, it is nice to also have mutable objects. I will address  
> that from the developer point of view (see below). Note that ultimately,  
> one thing would probably always be immutable: the sequence type.
>
>
> 2. Implementing the class cluster
> ------------------------------
>
> The class cluster that I implement in the attached project looks very  
> much like what you have already done. There is a superclass  
> BCSequence, and then subclasses, BCSequenceDNA,  
> BCSequenceRNA,...etc... plus a new special subclass BCSequenceFactory.  
> Now the purpose of a class cluster is that the user just does  
> everything using the public interface for BCSequence, and as far as  
> the user is concerned, every object is an instance of BCSequence. But  
> under the hood, you actually return instances of one of the  
> subclasses so that some operations can be optimized for the particular  
> type of sequence you are dealing with.
>
> The problem for the developer of a class cluster is that you know  
> which subclass to use only once you call one of the init methods, but  
> you still have to do the 'alloc' before the init. There is no way  
> BCSequence will know what subclass it should use at the time 'alloc'  
> is called. So the trick is to alloc a temporary instance of a  
> particular subclass, a 'placeholder' class. Look at the implementation  
> of 'alloc' in BCSequence.m. What this method returns is actually an  
> instance of BCSequenceFactory when called on the superclass (when  
> called on one of the subclasses, though, it just passes the message up  
> to NSObject). The bottom line is: you never create an instance of  
> BCSequence, but an instance of BCSequenceFactory (you still alloc  
> instances of BCSequence subclasses, of course). In fact, that  
> BCSequenceFactory instance could be a singleton and never deallocated  
> if we changed the code a little bit.
>
> Then one of the init methods is called on that new  
> BCSequenceFactory instance. This method actually allocs and inits a  
> new object, an instance of the appropriate subclass. It then releases  
> self and returns a pointer to the new object created. Because the  
> user should always use the value returned by init to set her pointers,  
> she will get the right object in the end.
>
> To summarize, what happens when the user runs the following command:
> BCSequence *mySeq = [[BCSequence alloc] initWithDNAString:aString];
>
> You have the following happening
> * [BCSequence alloc] returns an instance of BCSequenceFactory
> * the message initWithDNAString:aString is sent to the  
> BCSequenceFactory instance
> * in the method, a second object is created by calling
> 	finalObject=[[BCSequenceDNA alloc] initWithString:aString]
> * then the method calls [self release] to destroy the original  
> BCSequenceFactory instance
> * then the method returns the finalObject
> * so now mySeq=finalObject and is an instance of BCSequenceDNA
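
A minimal sketch of the steps listed above, not the code from the attached project. It assumes compilation under ARC, so the placeholder is released implicitly where the email calls [self release]; the -sequenceString accessor and the string-based storage are hypothetical simplifications, and +allocWithZone: is overridden rather than +alloc since +alloc funnels through it.

```objc
#import <Foundation/Foundation.h>

// Public abstract class: the only one the user is meant to see.
@interface BCSequence : NSObject
- (instancetype)initWithDNAString:(NSString *)aString;
- (NSString *)sequenceString; // hypothetical accessor for this sketch
@end

// Private concrete subclass; string storage is a simplification.
@interface BCSequenceDNA : BCSequence
- (instancetype)initWithString:(NSString *)aString;
@end

// Private placeholder class handed out when alloc is sent to BCSequence.
@interface BCSequenceFactory : BCSequence
@end

@implementation BCSequence
+ (instancetype)allocWithZone:(NSZone *)zone {
    // Only the abstract superclass returns a placeholder; subclasses
    // pass the message up to NSObject as usual.
    if (self == [BCSequence class]) {
        return (id)[BCSequenceFactory allocWithZone:zone];
    }
    return [super allocWithZone:zone];
}
- (instancetype)initWithDNAString:(NSString *)aString {
    // Abstract here; the placeholder overrides it.
    [self doesNotRecognizeSelector:_cmd];
    return nil;
}
- (NSString *)sequenceString { return @""; }
@end

@implementation BCSequenceDNA {
    NSString *_string;
}
- (instancetype)initWithString:(NSString *)aString {
    self = [super init];
    if (self) {
        _string = [aString copy];
    }
    return self;
}
- (NSString *)sequenceString { return _string; }
@end

@implementation BCSequenceFactory
- (instancetype)initWithDNAString:(NSString *)aString {
    // Create the real object and return it in place of the placeholder.
    // With manual reference counting, [self release] would go here; ARC
    // releases the abandoned placeholder on its own.
    return (id)[[BCSequenceDNA alloc] initWithString:aString];
}
@end
```

With these definitions, [[BCSequence alloc] initWithDNAString:aString] hands back a BCSequenceDNA instance even though the caller only ever names BCSequence.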
>
> You get the same process when the user calls:
> BCSequence *mySeq = [[BCSequence alloc] initWithString:aString];
> except BCSequenceFactory first figures out to what subclass it should  
> send the 'initWithString' message (using the same code as the original  
> BCSequenceFactory).
>
> Then all the other methods are just convenience methods calling these  
> building blocks.
>
> Like for any superclass/subclass pattern, it is important to define  
> what methods the subclasses should, may or should not override, and I  
> have a summary of that in the attached project. It is very similar to  
> what you have already done.
>
>
> 3. Pros and cons
> ---------------
>
> What are the potential pitfalls and limitations:
> (a) how to still provide the user with some more static typing when  
> she wants more control over it? This is complaint (a) of part (1)  
> above.
> (b) how to provide mutable/immutable versions? This is complaint (b)  
> of part (1) above.
> (c) the class cluster assumes all the methods can be called on all the  
> subclasses. Will that always be relevant? The case of 'complement' is  
> already a bit troublesome, and how about even worse cases, like  
> 'digestWithRestrictionEnzyme:'. It does not make any sense for a  
> protein, does it? The question is really: how does that fit with the  
> BCTools? Could problems arise as we define more and more tools? Will it  
> be that easy to add more private subclasses without breaking the  
> existing code?
> (d) What about the recent developments: does BCSymbolList fit in the  
> picture? how do you add the annotation stuff to that?
>
> I have answers to all of these, and I will come back to these  
> different points below, in other parts of my email. And there might be  
> other pitfalls I don't see yet.
>
> But first, while writing the code and thinking about the whole  
> concept, I also realized the potential benefits of a class cluster,  
> and there are more than what I anticipated. Some of these benefits are  
> really the benefits you get from OO, but are even more apparent with  
> such a simple interface where things are even more encapsulated  
> because it is almost like you have just one class:
> * super simple interface for the user; she also gets the benefit of  
> polymorphism without the need to know the existence of all the  
> subclasses;
> * because the public interface is reduced, the developer can make  
> plenty of changes without breaking existing code developed by the user
> * in particular, it allows the addition of new types of sequences or  
> optimized subclasses for particular uses, that may in most cases  
> already work with the code developed by the user; so the user can get  
> new functionality for free
> * the same is true for code developed by the developers of the  
> framework:
> - developers can work on other parts of the framework without knowing  
> too much about the guts of BCSequence
> - by relying on just one class for interactions between the different  
> pieces of BioCocoa, it simplifies the development and minimizes  
> disruptions as modifications are made to BCSequence
>
> I remember in the discussions, there was some disagreement about  
> having subclasses (Alex's choice) or just one class which would decide  
> what to do depending on the symbolSet used (Koen's choice); maybe a  
> class cluster is a way to have many of the benefits of the 2 systems  
> without too many of the problems.
> More about pros and cons of class cluster on the Apple web site:
> http://developer.apple.com/documentation/Cocoa/Conceptual/CocoaObjects/Articles/ClassClusters.html
>
> For me, the bottom line is still unclear. At present, I feel that a  
> class cluster would work really well. But we have to anticipate now  
> all the potential problems, and we should decide if it is worth it.
>
>
>
> 4. Compile vs runtime errors
> --------------------------
> This is a discussion about complaint (a) of part (1) and pitfall (a)  
> of part (3). What if the user wants more control over the type of  
> sequence she is using and wants some compiler warnings when trying to  
> cut a protein with EcoRI, or get its complementary sequence?
>
> At this point, the class cluster does not allow that. All the methods  
> are valid for all the sequence types. In this context, an invalid call  
> will only be revealed at runtime, and a BCProtein object would have to  
> decide at runtime to return something when sent an irrelevant message.  
> What should it send back? This issue is actually slightly different  
> from the discussion here and is discussed in part 6 (sorry this whole  
> email is quite large and complicated; I am trying to keep it  
> readable!). The question here is really: can we prevent that from even  
> happening when the user knows what type of sequence she is dealing  
> with and could get compiler warnings?
>
> One way to help with that is to provide an additional set of headers  
> defining some public classes named BCSequenceDNA, BCSequenceRNA,....  
> These classes would just be placeholders, and would be completely  
> distinct from the subclasses of BCSequence (I will come back to the  
> name conflict). They would have some init methods, but when the user  
> uses these classes and alloc/init an instance, she would get in fact  
> one of the BCSequence subclasses. The compiler would not know and  
> would trust the headers to generate warnings. For instance, the header  
> for the BCSequenceProtein placeholder class would not define the  
> methods 'complement' or 'cutWithRestrictionEnzyme:', and you would get  
> a compiler warning even though the object would in fact respond to the  
> methods at runtime (but would have to return some dummy values). So  
> these headers would really define completely virtual classes. One  
> problem is that the names of these placeholder classes conflict with  
> the names of the BCSequence private subclasses that are defined in the  
> project I sent. We could rename the latter to BCSeqDNA/RNA/... for  
> example, and keep the nice full names 'BCSequenceDNA/RNA/...' for the  
> placeholder public classes.
>
> An alternative is to define protocols, and so the user would have to  
> use (id <BCSequenceDNA>) in the code. The BCSequence would provide  
> methods to return objects typed this way. It is a bit of a pain to  
> type id <BCSequenceDNA> all the time and reduces readability, though.
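
A rough sketch of the protocol alternative, assuming ARC. Only the protocol name BCSequenceDNA and the private class name BCSeqDNA come from the email; the BCSequenceCommon and BCSequenceProtein protocols, the -sequenceString accessor and the BCMakeDNASequence function are hypothetical stand-ins for whatever BCSequence would actually provide.

```objc
#import <Foundation/Foundation.h>

// Messages every sequence type answers (hypothetical protocol).
@protocol BCSequenceCommon <NSObject>
- (NSString *)sequenceString;
@end

// DNA-only messages live in their own protocol.
@protocol BCSequenceDNA <BCSequenceCommon>
- (id <BCSequenceDNA>)complement;
@end

// Protein sequences deliberately do not declare -complement.
@protocol BCSequenceProtein <BCSequenceCommon>
@end

// Private concrete class, using the BCSeqDNA renaming suggested above.
@interface BCSeqDNA : NSObject <BCSequenceDNA>
- (instancetype)initWithString:(NSString *)aString;
@end

@implementation BCSeqDNA {
    NSString *_string;
}
- (instancetype)initWithString:(NSString *)aString {
    self = [super init];
    if (self) {
        _string = [aString copy];
    }
    return self;
}
- (NSString *)sequenceString { return _string; }
- (id <BCSequenceDNA>)complement {
    NSMutableString *result = [NSMutableString stringWithCapacity:[_string length]];
    for (NSUInteger i = 0; i < [_string length]; i++) {
        unichar c = [_string characterAtIndex:i];
        switch (c) {
            case 'A': c = 'T'; break;
            case 'T': c = 'A'; break;
            case 'G': c = 'C'; break;
            case 'C': c = 'G'; break;
            default: break; // leave any other symbol untouched
        }
        [result appendFormat:@"%C", c];
    }
    return [[BCSeqDNA alloc] initWithString:result];
}
@end

// Hypothetical stand-in for a BCSequence method returning a typed object.
static id <BCSequenceDNA> BCMakeDNASequence(NSString *aString) {
    return [[BCSeqDNA alloc] initWithString:aString];
}
```

A variable typed id <BCSequenceProtein> would then be expected to draw a compiler diagnostic when sent -complement, while the same message on id <BCSequenceDNA> compiles cleanly.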
>
> So there are ways to solve the problem. Note that the problem is not  
> really tied to the class cluster implementation and is already partly  
> a problem that the current code is facing, as I talked about at the  
> very beginning of the email (OK, now is a good time to reread  
> everything!!).
>
> Of course, the interface then becomes a bit schizophrenic, so it may  
> not be such a good idea to allow all of that. At least in the  
> beginning, there may not be much need for stronger typing, and  
> this goes a bit against the whole idea of a simple interface and a  
> class cluster.
>
>
>
> 5. Mutable and immutable instances
> --------------------------------
> This is a discussion about complaint (b) of part (1) and pitfall (b)  
> of part (3).
>
> Why impose immutable objects? Not sure.
> This is not something I had thought of at first, but it is anyway an  
> important issue that goes beyond the idea of class cluster. Immutable  
> objects allow very important and basic optimizations, particularly  
> when copying objects, and are sufficient for most uses. A smart user  
> will use immutable objects whenever she can and will only go to mutable  
> objects if really necessary. This is something we may have to think  
> about for the BioCocoa project anyway. I am not saying it is  
> absolutely necessary but it should be discussed (and maybe it has  
> been??).
>
> To implement mutable objects in the class cluster could be a bit  
> tricky, because there are two conflicting subclass organizations here:  
> mutable/immutable and dna/rna/protein/codon. To get all the  
> combinations, it seems that we need 8 subclasses!!
>
> I am not completely sure how to deal with it, or if we should deal  
> with it or just give up and stick to immutable only. One possibility is  
> to not have distinct subclasses for mutable/immutable. Instead, there  
> could be simply a BOOL flag 'isMutable' as one of the instance  
> variables. The object would then return different results in key  
> methods such as 'copy' depending on the value of the flag. Also, at  
> creation, it would create mutable or immutable instance variables  
> (NSArray or NSMutableArray) depending on the value of that flag. It is  
> OK to declare a mutable object as the instance variable and then  
> actually use it to allocate an immutable object, as long as we are  
> consistent in the methods called to avoid runtime errors (and we  
> should use some casts to avoid compiler warnings).
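
One way the flag idea might look, as a sketch under assumptions rather than a proposal for the real classes: it assumes ARC, and the class name BCSeqFlagged, its initializer and the -appendSymbol: method are all hypothetical. A single class serves both variants, and -copy behaves differently depending on the flag.

```objc
#import <Foundation/Foundation.h>

// Hypothetical class: one class covers both variants through a flag.
@interface BCSeqFlagged : NSObject <NSCopying, NSMutableCopying>
- (instancetype)initWithSymbols:(NSArray *)symbols isMutable:(BOOL)flag;
- (BOOL)isMutable;
- (NSArray *)symbols;
- (void)appendSymbol:(id)symbol;
@end

@implementation BCSeqFlagged {
    BOOL _isMutable;
    // Declared mutable, but holds a plain NSArray when the flag is NO.
    NSMutableArray *_symbols;
}

- (instancetype)initWithSymbols:(NSArray *)symbols isMutable:(BOOL)flag {
    self = [super init];
    if (self) {
        _isMutable = flag;
        if (flag) {
            _symbols = [symbols mutableCopy];
        } else {
            _symbols = (NSMutableArray *)[symbols copy]; // really immutable
        }
    }
    return self;
}

- (BOOL)isMutable { return _isMutable; }

// Always hand out an immutable snapshot of the storage.
- (NSArray *)symbols { return [_symbols copy]; }

- (void)appendSymbol:(id)symbol {
    // Mutating methods must check the flag before touching the storage.
    if (!_isMutable) {
        [NSException raise:NSInternalInconsistencyException
                    format:@"attempt to mutate an immutable sequence"];
    }
    [_symbols addObject:symbol];
}

// Copying an immutable sequence can simply hand back the same object.
- (id)copyWithZone:(NSZone *)zone {
    if (!_isMutable) {
        return self;
    }
    return [[BCSeqFlagged allocWithZone:zone] initWithSymbols:_symbols isMutable:NO];
}

- (id)mutableCopyWithZone:(NSZone *)zone {
    return [[BCSeqFlagged allocWithZone:zone] initWithSymbols:_symbols isMutable:YES];
}
@end
```

The cheap -copy of the immutable variant is the kind of optimization mentioned in part 1; the cost is that every mutating method has to check the flag at runtime.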
>
>
> 6. Potential clashes in the future
> --------------------------------
> This is a discussion about pitfall (c) of part (3).
> The problem is: will the class cluster ever become a problem in the  
> future and force us to rewrite everything and lose our sleep?
> The short answer is: I don't know!
>
> I guess any pattern can get in the way in some unpredicted way at some  
> unpredicted point in the future. We can try to anticipate those  
> issues. In the case of the class cluster, some of the questions to  
> answer are obviously: how do we deal with irrelevant messages sent to  
> inappropriate subclasses, such as sending 'complement' to a  
> BCSequenceProtein? how frequent will these messages be? how do we deal  
> with new sequence types that could be introduced later? how frequently  
> will new sequence types be needed?
>
> The answer to that is to list as much as we can all the methods that  
> would have to go in the final implementation of BCSequence and see how  
> the current sequence types could deal with it. Also, we would have to  
> think about what other types of sequences could be added in the future  
> (which could be inspired by other BioX projects) and hope that a  
> future BCSequenceExtraterrestrial won't break everything. This may  
> have already been discussed earlier on the mailing list?
>
> Some examples of how to deal with irrelevant methods:
> * complement of a protein: return the same sequence; return an empty  
> sequence; return nil??
> * cut a protein with EcoRI: OK, this is easy, you just get the same  
> protein!! Or do you get the sequence of the EcoRI protein!!!
> * etc...
>
> The existing code will have to deal with this anyway. When I look at  
> the present code, I see you can return BCSequence objects without  
> knowing the type, as returned by 'sequenceWithString:' in the  
> BCSequenceFactory class. And then, this is allowed to get in the  
> BCToolComplement with the method 'complementToolWithSequence:'. What  
> if the BCSequence created is a protein? The abstraction that you did  
> encode in BCSymbol already allows you to deal with it, you did a great  
> job!
>
>
> 7. Full incorporation of the present implementation
> ----------------------------------------------
> This is a discussion about pitfall (d) of part (3).
>
> The implementation I attached to the email is quite basic and could be  
> further refined to incorporate the features and organization of the  
> current implementation and the short-term planned additions. The  
> current class tree can probably be used as is. One problem is the  
> name BCSequence would be taken for the superclass; this is probably  
> the name that should be public. Then we could have the following:
> * BCSymbolList = subclass of BCSequence
> * BCSeq = subclass of BCSymbolList with annotations
> * BCSeqDNA, BCSeqRNA, etc... = subclasses of BCSeq with optimized  
> methods for the different types of sequences
>
> The additional benefit is that the instance variables would not even  
> be in the public header anymore, but in the subclass BCSymbolList (and  
> BCSequenceFactory would then be even lighter, with no instance  
> variable at all). An alternative is to decide that BCSymbolList would  
> actually be BCSequence, and the annotated BCSequence would become  
> BCSeq.
>
> It is thus mostly a problem of naming, which is somewhat secondary,  
> but is still quite important because it would be here to stay and has  
> to be easy to remember and logical...
>
> An additional problem is that if you instantiate BCSymbolList (in the  
> case of non-annotated sequences), you want to make sure that it can  
> handle ALL the messages declared in the header. It is not clear to me  
> yet that it can do it.
>
>
> 8. Happy new year!
> ------------------
>
> ... and thanks for reading this up to that point!
>
> Charles
>
>
>
> -- 
> Charles Parnot
> charles.parnot at stanford.edu
>
> Help science go fast forward:
> http://cmgm.stanford.edu/~cparnot/xgrid-stanford/
>
> Room  B157 in Beckman Center
> 279, Campus Drive
> Stanford University
> Stanford, CA 94305 (USA)
>
> Tel +1 650 725 7754
> Fax +1 650 725 8021
> _______________________________________________
> Biococoa-dev mailing list
> Biococoa-dev at bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/biococoa-dev
>