[Biococoa-dev] more ramblings
Koen van der Drift
kvddrift at earthlink.net
Wed Nov 17 17:51:00 EST 2004
On Nov 16, 2004, at 10:24 PM, John Timmer wrote:
> Once again, I have to say I think this is a really bad idea. Let me
> count the ways...
>
> For starters, nucleotide and protein sequences have some things in
> common, but in general, they're very different. They have different
> information content. You do different things with them. Why try to
> squish them together? Treating them as the same object reduces the
> object's information content without gaining any clear benefit.
>
For me, the differences between nucleotide and protein sequences lie in
the building blocks. That's where all the information is. Other than
that, a sequence is 'just a black box with an array of symbols',
regardless of the nature of the symbols. Adding, removing, and replacing
symbols is now handled through BCSequence, and is similar for
all types of sequences. When obtaining a complement or reverse complement
of a sequence, you actually do this for each individual symbol. The
sequence is just a convenient way to iterate over the symbols. This is
why we store information about complements, etc. in the symbols, not in
the sequence. When searching a sequence, you actually compare symbols,
not the sequence itself. The same goes for translating, digesting, MW
calculations, etc. (And each symbol knows what kind of subclass of
BCSymbol it is, even though we use selfSymbol = (BCSymbol
*)CFArrayGetValueAtIndex( (CFArrayRef) selfArray, loopCounter), so
there is no need to explicitly cast them as BCAminoAcid, etc.)
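
To make the "the sequence just iterates, the symbol does the work" point
concrete, here is a rough sketch. It is written in Python for brevity and
is not BioCocoa code: the Symbol and Sequence classes, the complement
attribute and make_dna_symbols() are all made-up stand-ins for BCSymbol
and BCSequence.

```python
class Symbol:
    """Hypothetical stand-in for BCSymbol: one building block of a sequence."""

    def __init__(self, name):
        self.name = name
        self.complement = None  # filled in for nucleotides; stays None otherwise


class Sequence:
    """Hypothetical stand-in for BCSequence: an ordered list of symbols."""

    def __init__(self, symbols):
        self.symbols = list(symbols)

    def complement(self):
        # The sequence only iterates; each symbol supplies its own complement.
        return Sequence(s.complement for s in self.symbols)

    def reverse_complement(self):
        return Sequence(s.complement for s in reversed(self.symbols))

    def __str__(self):
        return "".join(s.name for s in self.symbols)


def make_dna_symbols():
    """Build the four DNA symbols and link each one to its complement."""
    table = {letter: Symbol(letter) for letter in "ACGT"}
    for a, b in (("A", "T"), ("C", "G")):
        table[a].complement = table[b]
        table[b].complement = table[a]
    return table


DNA = make_dna_symbols()
seq = Sequence(DNA[letter] for letter in "AACG")
print(seq.complement())          # TTGC
print(seq.reverse_complement())  # CGTT
```

Note that Sequence never asks what kind of symbol it holds. All the
nucleotide-specific knowledge sits in the symbol table.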
> One of the whole ideas of object oriented programming is to group the
> data with the methods that act on it.
That's very true, but I don't think that I am trying to separate the
methods from the data, because the actual method to get a complement,
MW, etc. is in BCSymbol and its subclasses, together with the data.
> The rangeOfSubsequence isn't the horrible situation you make it out to be.
I tried to be milder this time :)
> We've got a set of related methods that work on all sequences in the
> superclass. When I get free time (hah!), I'll move the other set
> (handling ambiguous sequences) into the superclass too - I'll just have
> it check for the sequence type at the start.
Right now, you have several very similar methods in BCSequence (and its
subclasses). As I said before, in OOP this is usually a sign that
one has to rethink the design and try to find a way to avoid
duplicating code. This is what I tried to do by introducing the
BCFindSequence class. If you look at that code, you'll see that I
didn't even have to check which BCSymbol I was dealing with. The real
comparing is done by the symbol itself. By setting a flag, you can
search for ambiguous symbols as well, all with the same method. But if you
can find a way to simplify the code within BCSequence, I'll be all for it.
And you'll see that when the code is only in BCSequence, all sequences
are treated equally: the search algorithm is the same for protein and
nucleotide sequences, only the symbols are different.
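
Here is roughly what I mean by one search method with a flag, again as a
Python sketch rather than the actual BCFindSequence code. The Symbol
class, its represents set, matches() and find_subsequence() are names I
made up for the example, and the naive scan is only there to show where
the comparison happens.

```python
class Symbol:
    """Hypothetical stand-in for BCSymbol."""

    def __init__(self, name, represents=None):
        self.name = name
        # Names of the concrete symbols this one can stand for,
        # e.g. R represents A and G. Defaults to just itself.
        self.represents = frozenset(represents) if represents else frozenset([name])

    def matches(self, other, ambiguous=False):
        if not ambiguous:
            return self.name == other.name
        # Ambiguous match: the two symbols share at least one meaning.
        return bool(self.represents & other.represents)


def find_subsequence(sequence, query, ambiguous=False):
    """Return the index of the first match of query in sequence, or -1."""
    last_start = len(sequence) - len(query)
    for start in range(last_start + 1):
        # The search never checks the symbol type; the symbols compare themselves.
        if all(sequence[start + i].matches(q, ambiguous)
               for i, q in enumerate(query)):
            return start
    return -1


# Nucleotides, including the ambiguous symbol R (A or G).
nt = {n: Symbol(n) for n in "ACGT"}
nt["R"] = Symbol("R", represents="AG")
dna = [nt[c] for c in "ACGTGA"]
query = [nt[c] for c in "TRA"]
print(find_subsequence(dna, query))                  # -1
print(find_subsequence(dna, query, ambiguous=True))  # 3

# The very same function works for amino acids.
aa = {n: Symbol(n) for n in "MKV"}
protein = [aa[c] for c in "MKVK"]
print(find_subsequence(protein, [aa["V"], aa["K"]]))  # 2
```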
> The FASTA situation is also a bad example to support your case. Some
> file formats contain information regarding the type of sequence, others
> don't. Why should we make a sequence object handle that, or create a new
> class to act as an intermediary - dealing with differences in file
> format is the job of the object that knows about file formats, not a
> sequence.
>
I agree with you that we should not make a sequence responsible for
dealing with the file format; that would only complicate things more :)
Dealing with differences in file format is what BCSequenceReader already
does. Its first method tests the first line for specific characters and
then passes it on to the appropriate method, readFasta, readSwissProt, etc.
This method then extracts the necessary information, including a string
of symbols. But should the reader methods be concerned with whether it
is a protein or nucleotide sequence? I don't think they should. The
introduction of a factory class is a well-established design pattern in
OOP that deals with this sort of situation. An advantage is that if
you ever decide to change the way a sequence is created, or introduce a
new type of sequence, you only have to modify the code once in the
factory, not in each readXXX method. Or maybe later on we decide to
implement a new read method, or introduce a class to obtain a sequence
from a database. Maybe the user types a sequence into an NSTextView
and wants to make a BCSequence. Should each of these classes then try
to figure out whether it's a protein or nucleotide sequence? If we keep
that code in one place (factory or whatever you want to call it), it
becomes much easier to maintain.
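
A rough sketch of the factory idea, once more in Python and with made-up
names (make_sequence, read_fasta and from_user_input are not BioCocoa
methods). The "only nucleotide letters" test is a deliberately naive
assumption; the point is only that the guess lives in one place.

```python
class Sequence:
    """Hypothetical stand-in for BCSequence."""

    def __init__(self, text, kind):
        self.text = text
        self.kind = kind  # "nucleotide" or "protein"


NUCLEOTIDE_LETTERS = set("ACGTUN")  # assumed alphabet, for illustration only


def make_sequence(text):
    """Factory: the single place that decides what kind of sequence to build."""
    cleaned = "".join(text.split()).upper()
    if set(cleaned) <= NUCLEOTIDE_LETTERS:
        kind = "nucleotide"
    else:
        kind = "protein"
    return Sequence(cleaned, kind)


def read_fasta(contents):
    """Reader: knows the file format, but nothing about sequence types."""
    lines = contents.strip().splitlines()
    body = "".join(line for line in lines if not line.startswith(">"))
    return make_sequence(body)


def from_user_input(typed_text):
    """Text typed by the user goes through the same factory."""
    return make_sequence(typed_text)


print(read_fasta(">example\nACGT\nAACC\n").kind)  # nucleotide
print(from_user_input("mkv lla").kind)            # protein
```

If the guessing rule changes, or a new type of sequence is added, only
make_sequence() has to be touched.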
> Given all these things I view as negatives, I still don't understand
> what advantages a single sequence class would provide. The concrete
> examples you provide seem to me to be causing more organizational
> issues than they solve, and not following good OOP design.
I have added some more examples in this reply, and hopefully showed
that this is also a good OOP design. I am very guilty of supporting the
BCSequence subclasses myself when we first started. But now that
BioCocoa is growing, I have come to the realization that we may have to
shuffle things around to make the code easier to use and maintain. I
have enough experience with OOP to know that when a project becomes
larger, you're glad that you kept the code modular. If you ever decide
you have to change a method, it's much easier to just fix it in one
place, instead of having to remember all the places where this code was
added.
> My first instinct would be to take
> anything in BCFindSequence and work it back in to BCSequence.
Please do so, but leave the BCFindSequence code as an alternative :)
>
> Another way to think about this - let's assume that Apple knows what
> they're doing in designing their classes. The most analogous item in
> Cocoa's Foundation is NSMutableString. There is only one utility class
> that's directly related to strings (NSScanner - maybe two with
> NSCharacterSet). Just about all the methods needed for handling the
> contents of strings are either in NSMutableString or its superclass.
> It's good design.
>
NSString indeed maintains a list of characters, and also does some
basic character manipulation, and substring searching. But it doesn't
translate a string to another language!
cheers,
- Koen.