[Biococoa-dev] more ramblings

Wed Nov 17 17:51:00 EST 2004

On Nov 16, 2004, at 10:24 PM, John Timmer wrote:

> Once again, I have to say I think this is a really bad idea.  Let me
> count the ways...
>
> For starters, nucleotide and protein sequences have some things in
> common, but in general, they're very different.  They have different
> information content.  You do different things with them.  Why try to
> squish them together?  Treating them as the same object reduces the
> object's information content without gaining any clear benefit.
>

For me, the differences between nucleotide and protein sequences lie in
the building blocks. That's where all the information is. Other than
that, a sequence is 'just a black box with an array of symbols',
regardless of the nature of the symbols. Adding, removing, and replacing
symbols is now handled through BCSequence, and works the same way for
all types of sequences. When obtaining a complement/reverse complement
of a sequence, you actually do this for each individual symbol. The
sequence is just a convenient way to iterate over the symbols. This is
why we store information about complements, etc. in the symbols, not in
the sequence. When searching a sequence, you actually compare symbols,
not the sequence itself. The same goes for translating, digesting, MW
calculations, etc. (And each symbol knows which subclass of
BCSymbol it is, even though we use selfSymbol = (BCSymbol
*)CFArrayGetValueAtIndex( (CFArrayRef) selfArray, loopCounter), so
there is no need to explicitly cast them to BCAminoAcid, etc.)
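
To make the idea concrete, here is a rough sketch of "the symbol does
the work, the sequence just iterates". It is written in Python only to
keep it short; the class and method names are made up for illustration
and are not the actual BioCocoa (Objective-C) API.

```python
class Symbol:
    """Hypothetical stand-in for BCSymbol: wraps a one-letter code."""

    def __init__(self, code):
        self.code = code

    def complement(self):
        # Symbols without a complement (e.g. amino acids) refuse the request.
        raise NotImplementedError(f"{type(self).__name__} has no complement")


class Nucleotide(Symbol):
    """Hypothetical stand-in for a nucleotide symbol subclass."""

    PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}

    def complement(self):
        # The pairing rule lives with the symbol, not with the sequence.
        return Nucleotide(self.PAIRS[self.code])


class AminoAcid(Symbol):
    """Hypothetical stand-in for an amino acid symbol subclass."""


class Sequence:
    """One sequence class for every symbol type: just an array of symbols."""

    def __init__(self, symbols):
        self.symbols = list(symbols)

    def complement(self):
        # No type check and no cast: each symbol answers for itself.
        return Sequence(s.complement() for s in self.symbols)

    def reverse_complement(self):
        return Sequence(s.complement() for s in reversed(self.symbols))

    def __str__(self):
        return "".join(s.code for s in self.symbols)


dna = Sequence(Nucleotide(c) for c in "ATGC")
print(dna.complement())          # TACG
print(dna.reverse_complement())  # GCAT
```

Note that Sequence never asks which kind of symbol it holds. Asking a
protein sequence for its complement fails inside the symbol, which is
where that knowledge belongs.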

> One of the whole ideas of object oriented programming is to group the
> data with the methods that act on it.

That's very true, but I don't think that I am trying to separate the
methods from the data, because the actual method to get a complement,
MW, etc. is in BCSymbol and its subclasses, together with the data.

> The rangeOfSubsequence isn't the horrible situation you make it out
> to be.

I tried to be milder this time :)

> We've got a set of related methods that work on all sequences in the
> superclass.  When I get free time (hah!), I'll move the other set
> (handling ambiguous sequences) into the superclass too - I'll just
> have it check for the sequence type at the start.

Right now, you have several very similar methods in BCSequence (and its
subclasses). As I said before, in OOP this is usually a sign that one
has to rethink the design and try to find a way to avoid duplicating
code. This is what I tried to do by introducing the BCFindSequence
class. If you look at that code, you'll see that I didn't even have to
check which BCSymbol I was dealing with. The real comparing is done by
the symbol itself. By setting a flag, you can search for ambiguous
symbols as well, all with the same method. But if you can find a way to
simplify the code within BCSequence, I'll be all for it. And you'll see
that when the code is only in BCSequence, all sequences are treated
equally: the search algorithm is the same for protein and nucleotide
sequences, only the symbols are different.
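
As an illustration of what I mean, here is a rough sketch of a search
in which the symbols do the comparing and a flag switches on ambiguous
matching. It is in Python for brevity and is not the real
BCFindSequence code; the names (matches, find_subsequence,
allow_ambiguous) and the small ambiguity table are assumptions for the
example.

```python
class Symbol:
    """Hypothetical symbol: a code plus the plain codes it can stand for."""

    def __init__(self, code, represents=None):
        self.code = code
        # An unambiguous symbol represents only itself; N represents ACGT.
        self.represents = frozenset(represents or code)

    def matches(self, other, allow_ambiguous=False):
        if allow_ambiguous:
            # Two symbols match if they can stand for a common plain code.
            return bool(self.represents & other.represents)
        return self.code == other.code


def find_subsequence(sequence, query, allow_ambiguous=False):
    """Return the index of the first match of query in sequence, or -1."""
    for start in range(len(sequence) - len(query) + 1):
        if all(sequence[start + i].matches(q, allow_ambiguous)
               for i, q in enumerate(query)):
            return start
    return -1


# A few IUPAC nucleotide ambiguity codes, enough for the example.
AMBIGUOUS = {"N": "ACGT", "R": "AG", "Y": "CT"}


def make(text):
    return [Symbol(c, AMBIGUOUS.get(c)) for c in text]


dna = make("ATGCGT")
print(find_subsequence(dna, make("GCG")))                        # 2
print(find_subsequence(dna, make("GNG")))                        # -1
print(find_subsequence(dna, make("GNG"), allow_ambiguous=True))  # 2
```

The search function never checks the symbol type, so the same loop
would serve a protein sequence whose symbols implement matches.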

> The FASTA situation is also a bad example to support your case.  Some
> file formats contain information regarding the type of sequence,
> others don't.  Why should we make a sequence object handle that, or
> create a new class to act as an intermediary - dealing with
> differences in file format is the job of the object that knows about
> file formats, not a sequence.
>

I agree with you that a sequence should not be responsible for dealing
with the file format, that would only complicate things more :) Dealing
with differences in file format is what BCSequenceReader already does.
Its first method tests the first line for specific characters and then
passes it on to the appropriate method, readFasta, readSwissProt, etc.
This method then extracts the necessary information, including a string
of symbols. But should the reader methods be concerned with whether it
is a protein or nucleotide sequence? I don't think they should. The
introduction of a factory class is a well-established design pattern in
OOP that deals with this sort of situation. An advantage is that if
you ever decide to change the way a sequence is created, or introduce a
new type of sequence, you only have to modify the code once in the
factory, not in each readXXX method. Or maybe later on we decide to
implement a new read method, or introduce a class to obtain a sequence
from a database. Maybe the user types in a sequence in an NSTextView
and wants to make a BCSequence. Should each of these classes then try
to figure out whether it's a protein or nucleotide sequence? If we keep
that code in one place (factory or whatever you want to call it), it
becomes much easier to maintain.
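
Here is a rough sketch of that arrangement, again in Python only to
keep it short. The names (make_sequence, read_fasta, from_text_view)
are hypothetical, and the "only nucleotide letters" test is a
deliberately crude assumption standing in for whatever detection we
would really want.

```python
NUCLEOTIDE_CODES = set("ACGTUN")


class Sequence:
    """Single sequence class; `kind` records which symbols it holds."""

    def __init__(self, text, kind):
        self.text = text
        self.kind = kind


def make_sequence(raw_text):
    """Factory: the one place that decides protein versus nucleotide."""
    cleaned = "".join(raw_text.split()).upper()
    # Crude heuristic (an assumption for this sketch): a sequence made up
    # only of nucleotide letters is treated as a nucleotide sequence.
    if cleaned and set(cleaned) <= NUCLEOTIDE_CODES:
        return Sequence(cleaned, "nucleotide")
    return Sequence(cleaned, "protein")


def read_fasta(lines):
    """Reader: knows the file format, knows nothing about sequence types."""
    body = "".join(line.strip() for line in lines if not line.startswith(">"))
    return make_sequence(body)


def from_text_view(typed_text):
    """Another source of sequences (e.g. user input) reuses the factory."""
    return make_sequence(typed_text)


print(read_fasta([">seq1 example", "ATGC", "GGTA"]).kind)  # nucleotide
print(from_text_view("mkvl aagw").kind)                    # protein
```

If the detection rule changes, or a third kind of sequence shows up,
only make_sequence has to be touched; the readers stay as they are.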

> Given all these things I view as negatives, I still don't understand
> what advantages a single sequence class would provide.  The concrete
> examples you provide seem to me to be causing more organizational
> issues than they solve, and not following good OOP design.

I have added some more examples in this reply, and hopefully shown
that this is also a good OOP design. I was very guilty of supporting the
BCSequence subclasses myself when we had just started. But now that
BioCocoa is growing, I have come to the realization that we may have to
shuffle things around to make the code easier to use and maintain. I
have enough experience with OOP to know that when a project becomes
larger, you're glad that you kept the code modular. If you ever decide
you have to change a method, it's much easier to just fix it in one
place, instead of having to remember all the places where this code was
added.

>  My first instinct would be to take
> anything in BCFindSequence and work it back in to BCSequence.

Please do so, but leave the BCFindSequence code as an alternative :)

>
> Another way to think about this - let's assume that Apple knows what
> they're doing in designing their classes.  The most analogous item in
> Cocoa's Foundation is NSMutableString.  There is only one utility
> class that's directly related to strings (NSScanner - maybe two with
> NSCharacterSet).  Just about all the methods needed for handling the
> contents of strings are either in NSMutableString or its superclass.
> It's good design.
>

NSString indeed maintains a list of characters, and also does some 
basic character manipulation, and substring searching. But it doesn't 
translate a string to another language!

cheers,

- Koen.