[Biococoa-dev] more ramblings

Wed Nov 17 22:26:44 EST 2004

Koen -

> For me, the differences between nucleotide and protein sequences are
> the building blocks. That's where all the information is. Other than
> that a sequence is 'just a black box with an array of symbols',
> regardless of the nature of the symbols. Adding, removing, replacing
> symbols is now indeed handled through BCSequence, and is similar for
> all types of sequences. When obtaining a complement/reverse complement
> of a sequence, you actually do this for each individual symbol. The
> sequence is just a convenient way to iterate over the symbols. This is
> why we store information about complements, etc in the symbols, not in
> the sequence. When searching a sequence, you actually compare symbols
> not the sequence itself. The same for translating, digesting, MW
> calculations, etc. (And each symbol knows what kind of subclass of
> BCSymbol it is, even though we use selfSymbol = (BCSymbol
> *)CFArrayGetValueAtIndex( (CFArrayRef) selfArray, loopCounter), so
> there is no need to explicitly cast them as BCAminoAcid, etc).

This is an interesting point, and may come down to style more than anything
else.  You're absolutely right that things like a complement are handled
primarily at the symbol level.  There are two reasons I'd argue that they're
worth keeping at the sequence level too, though.  The first is simply
convenience - a lot of people are going to want it, so why make them write
their own?  The second is that we can optimize it heavily (as we've already
done partly) in a way that not everybody is going to be interested in doing,
so a lot more people will have access to better code.  If you accept that
the method should exist at all, then it makes the most sense to put it with
the data it operates on (i.e., as part of some sort of sequence class).
Otherwise, anybody using the Framework has to figure out what class handles
that type of method, and then dig through its docs to find the appropriate
method.
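
To make that concrete, here's a rough sketch of what I mean.  It's in Python
rather than Objective-C just to keep it short, and none of these names are
real BioCocoa classes or methods - they're made up for illustration.  The
symbol still owns the knowledge of its complement; the sequence just offers
the convenience method so nobody has to write the loop themselves.

```python
class Base:
    """Hypothetical stand-in for a nucleotide symbol that knows its own complement."""

    _COMPLEMENTS = {"A": "T", "T": "A", "C": "G", "G": "C"}

    def __init__(self, letter):
        self.letter = letter

    def complement(self):
        # The per-symbol knowledge lives here, not in the sequence.
        return Base(self._COMPLEMENTS[self.letter])


class DNASequence:
    """Hypothetical sequence class: an ordered list of Base symbols."""

    def __init__(self, text):
        self.symbols = [Base(ch) for ch in text]

    def complement(self):
        # Convenience method: delegates to each symbol in turn.
        result = DNASequence("")
        result.symbols = [s.complement() for s in self.symbols]
        return result

    def reverse_complement(self):
        result = self.complement()
        result.symbols.reverse()
        return result

    def __str__(self):
        return "".join(s.letter for s in self.symbols)
```

With that in place, a user just asks the sequence, e.g.
str(DNASequence("ATGC").reverse_complement()), and any optimization we do
inside those two methods benefits everyone who calls them.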

> Right now, you have several very similar methods in BCSequence (and its
> subclasses). As I said before, this is usually a situation in OOP when
> one has to rethink the design, and try to find a way to avoid
> duplicating code.
Right, and last time this came up, I mentioned that I had every intention of
fixing it.  It's not a fundamental class structure problem - it was a
problem with me trying to put something in place first, and fix it later.  I
don't know how else to possibly say that this situation is temporary, and
doesn't say anything informative about the class structure.  I'd also like
to point out that having 2 methods vs. 1 method with a boolean flag, as
yours apparently does, doesn't make any argument about class complexity at
all.  I went back and forth on which to do for a while, and settled on 2.
If people prefer 1, it can be changed.

Anyway, it's probably good that I ran out of time.  I'm pretty sure that all
individual sequence elements now have the notion of ambiguity (if codons
don't have it, they should!  And I know who to blame if they don't!), so it
should be easy to implement in the top-level class.

> But should the reader methods be concerned with whether it
> is a protein or nucleotide sequence? I don't think they should.
My point was that in some cases, they absolutely have to.  If it's a
protein-specific file format, the file reader has to specifically produce a
protein sequence, even if it's got illegal DNA characters in it.  Good design would
also dictate that we have a defined way of how things should behave when
there is no way of determining what type of sequence it is from the file or
metadata.  

Now, I'm not arguing heavily about where this specificity should be provided
- a factory type object is fine by me.  The point was primarily that this
situation doesn't argue for or against having sequence subclasses.
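
For what it's worth, here's a minimal sketch of the factory idea (Python for
brevity).  Everything in it is an assumption for illustration: the class
names, the "protein-only-format" key, and the guess-from-the-letters rule are
all invented, not anything in the Framework.  The only points it's meant to
show are that a protein-specific format forces the type, and that the
"can't tell" case has one defined fallback.

```python
class Sequence:
    """Hypothetical generic sequence: used when the type can't be determined."""

    def __init__(self, text):
        self.text = text


class DNASequence(Sequence):
    pass


class ProteinSequence(Sequence):
    pass


# Hypothetical table of file formats that dictate the sequence type.
FORCED_TYPES = {"protein-only-format": ProteinSequence}


def make_sequence(text, file_format=None):
    """Pick a sequence class for `text`, given an optional format name."""
    forced = FORCED_TYPES.get(file_format)
    if forced is not None:
        # The format wins, even if the letters happen to look like DNA.
        return forced(text)
    if text and set(text.upper()) <= set("ACGTN"):
        # Assumed heuristic: nothing but DNA letters means DNA.
        return DNASequence(text)
    # Defined fallback when neither the format nor the content settles it.
    return Sequence(text)
```

So make_sequence("ACGT", file_format="protein-only-format") gives back a
protein sequence, while make_sequence("ACGT") with no format information
falls through to the guess.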

>>  My first instinct would be to take
>> anything in BCFindSequence and work it back in to BCSequence.
> 
> Please do so, but leave the BCFindSequence code as an alternative :)
Don't worry, I'd never intentionally delete someone else's work.  Where's
that located in the CVS directory structure, anyway?  It's not showing up as
an added file in Xcode on my machine, so I'd like to download it at some
point.

> NSString indeed maintains a list of characters, and also does some
> basic character manipulation, and substring searching. But it doesn't
> translate a string to another language!
Funny you should mention that - it actually does, to a degree.  Look at the
locale methods.  Changing case is also a form of translation.  Just to prove
that I'm not interested in mindlessly adding stuff to a class, though, if I
were left in charge, I'd move all the path methods over to NSFileManager
immediately ;).

Anyway, my main argument is that there are a lot of things that are going to
be specific for one type of sequence or another.  Hydrophobicity, charge,
etc. are all protein specific, while complements, melting temperature,
hairpin possibilities, GC% and such are all DNA/RNA specific.  There are
also going to be a lot of useful things that are sequence-type specific that
none of us here have thought about yet, and will only be revealed to us if
we get more developers onto the project who need that feature.  We're going
to want sequence-type specific methods to do all these things.
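
Roughly what I have in mind, sketched in Python to keep it short (the class
and method names are hypothetical, and the charge calculation is a deliberate
oversimplification that only counts K/R as +1 and D/E as -1, ignoring
histidine and the termini):

```python
class Sequence:
    """Hypothetical base class: behaviour shared by every sequence type."""

    def __init__(self, text):
        self.text = text.upper()

    def __len__(self):
        return len(self.text)


class DNASequence(Sequence):
    def gc_percent(self):
        # Only meaningful for nucleotides, so it lives on the DNA subclass.
        if not self.text:
            return 0.0
        gc = sum(1 for ch in self.text if ch in "GC")
        return 100.0 * gc / len(self.text)


class ProteinSequence(Sequence):
    def rough_net_charge(self):
        # Crude estimate: +1 for each K or R, -1 for each D or E.
        positive = sum(1 for ch in self.text if ch in "KR")
        negative = sum(1 for ch in self.text if ch in "DE")
        return positive - negative
```

The shared stuff (length here, but also adding, removing and replacing
symbols) stays in the parent class, and a protein sequence simply has no
gc_percent method to call by mistake.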

It comes down to the design decision of whether you want to send the
sequence off somewhere else to get information back on it, or whether you
want to ask the sequence to tell you something about itself.  I'd say that
for the most part, for someone trying to use this framework, it's much
easier to ask the sequence, instead of trying to figure out what
object/method they need to send the sequence to.  I also don't think that it
leads to a painful burden on us developers in terms of organization.

I think the individual symbols are great examples of this approach - they
are incredibly powerful because, unlike a character, they know things about
themselves.  You don't have to dig around to find out which class/method is
needed to find out what the complement of a base is - the base already knows
what its complement is.  I'd love to see the same power extended to
sequences as a whole.

Anyway, I think I've blathered enough on this topic for one day -

Cheers,

John

_______________________________________________
This mind intentionally left blank