[Biococoa-dev] Design question
Alexander Griekspoor
mek at mekentosj.com
Mon Aug 9 18:12:50 EDT 2004
Hi guys,
Back from a sunny and warm long weekend, let me continue our discussion
on framework design and implementation based on Koen's input. Let me
start by proposing an idea which again is derived from the BioJava
framework. I think they must have had a similar discussion to the one
we are having here, and I find their solution to the problem a very
nice one. First an explanation of the idea, then my thoughts in this
light as a reply to the things brought up.
Basically we have two options:
either go for a string based solution (as sequences are kind of long
strings in the end), or go for a specific sequence class approach. As
outlined in the BioJava tutorial linked below, the string based
approach has some clear disadvantages:
1 One would constantly need to validate strings, as they allow
non-existent characters. I use strings in my programs to store
sequences (I bet everyone does) and I constantly strip "foreign"
characters upon editing, copying, dragging etc; in fact that's what I
call my method ;-)
2 Ambiguity is hard to support, quoted: "The meaning of each symbol is
not necessarily clear. The `T' which means thymidine in DNA is the same
`T' which is a threonine residue in a protein sequence"
3 Limited alphabet, Koen already mentioned that glycans are hard to
express in single letter codes.
This is all solved by a class based approach where all nucleotides,
amino acids, glycans etc are represented by their own class. However,
John already pointed to the weak spot here: instantiating so many
objects quickly results in big memory problems.
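To illustrate disadvantage 1, here is a rough sketch in plain C (which also compiles as Objective-C) of the kind of "strip foreign characters" routine a string based design keeps needing. The function name strip_foreign_dna and the choice to accept only A, C, G and T are assumptions made for this example, not existing BioCocoa code.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: keeps only A, C, G and T (case-insensitive),
   upper-cases them and compacts the string in place.
   Returns the length of the cleaned string. */
size_t strip_foreign_dna(char *seq)
{
    assert(seq != NULL);

    size_t out = 0;
    for (size_t in = 0; seq[in] != '\0'; in++) {
        char c = (char)toupper((unsigned char)seq[in]);
        if (strchr("ACGT", c) != NULL) {
            seq[out++] = c;   /* valid symbol: keep it */
        }                     /* anything else is dropped */
    }
    seq[out] = '\0';
    return out;
}
```

Every edit, paste or drag operation would have to pass through something like this, which is exactly the overhead a symbol based design avoids.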
The guys at BioJava came up with a nice solution, the best of both
worlds so to speak: http://www.biojava.org/tutorials/chap1.html
What we do is create singleton objects (think "sharedDefaultManager")
for each class of "symbol", then refer to these using pointers. A
sequence like "ATGC" would be an array of the form: pointer to shared
"A" object, pointer to shared "T" object, pointer to shared "G" object,
pointer to shared "C" object, etc. Each object is present in memory
only once, and the sequence is an array of pointers, which is very
cheap memory-wise. To highlight some of the things in this approach
which I like very much:
- Great performance memory wise
- The "symbol" classes can store all additional data like name, pi, etc
- Solution to the ambiguity problem (see the getMatches() method)
I have quite some experience using singleton classes in the form of
shared controllers. They are easy to implement and work very, very well.
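The shared symbol idea can be sketched roughly as follows in plain C (valid Objective-C as well). The type BCSymbolRec, its fields and the two functions are hypothetical names chosen for this example; a real framework would presumably use Objective-C classes with shared instances instead of structs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical symbol record; the fields are illustrative only. */
typedef struct {
    char        code;   /* one-letter code */
    const char *name;   /* long name       */
} BCSymbolRec;

/* One shared instance per nucleotide, present in memory only once. */
static const BCSymbolRec kAdenine  = { 'A', "adenine"  };
static const BCSymbolRec kThymine  = { 'T', "thymine"  };
static const BCSymbolRec kGuanine  = { 'G', "guanine"  };
static const BCSymbolRec kCytosine = { 'C', "cytosine" };

/* Returns the shared symbol for a DNA letter, or NULL if there is none. */
const BCSymbolRec *shared_dna_symbol(char code)
{
    switch (code) {
    case 'A': return &kAdenine;
    case 'T': return &kThymine;
    case 'G': return &kGuanine;
    case 'C': return &kCytosine;
    default:  return NULL;
    }
}

/* Fills `out` with pointers to shared symbols. Stops at the first letter
   that is not a DNA symbol or when `cap` entries have been stored.
   Returns the number of pointers stored. */
size_t symbols_from_string(const char *text, const BCSymbolRec **out,
                           size_t cap)
{
    assert(text != NULL && out != NULL);

    size_t n = 0;
    while (n < cap && text[n] != '\0') {
        const BCSymbolRec *sym = shared_dna_symbol(text[n]);
        if (sym == NULL) {
            break;          /* foreign character: cannot be represented */
        }
        out[n++] = sym;
    }
    return n;
}
```

Note that the sequence only ever holds pointers, so every "A" in it is the very same object, and an invalid character simply has no symbol to point to.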
Replying in the light of this idea:
> In the last few messages I missed any mention of a BCSequence class
> that more or less functions as the center of the framework. Its main
> member is then probably an NSString representing the sequence (DNA,
> RNA, protein). This is easy because they all are single character
> based sequences (unlike eg glycans).
Again, the glycans can now be easily implemented in the form of a set
of glycan symbols (shared objects), so glycan sequences can be
expressed as "sequences" too.
> Additionally this class could have an NSMutableArray member consisting
> of objects that represent each single base, amino acid. Whenever the
> NSString is edited, the NSMutableArray is updated, and vice versa. We
> could have a BCRootObject from which BCAminoAcid, BCNucleotide, etc
> derive. These nucleotide and amino acid classes can then store more
> info about themselves, eg long name, pI, MW, modifications,
> annotations, etc. Also BCFunctionalGroup (methyl, phosphate) could be
> based on BCRootObject.
As said, John already mentioned the big pitfall of this approach: it's
never wise to have these things present in parallel. First of all, it
requires careful synchronization, and what happens if they do get out
of sync? Second, it requires at least twice the amount of memory, as
you have both a string and a symbol list around. This is all prevented
with the "shared symbol list approach", as we then have one
"datasource".
To convert to the string based world, I could very well imagine that
the BCRootObject will have a
- (NSString *)stringRepresentation; method that converts the symbol list
and spits out an NSString for you (based on how the string part is
defined in the symbol classes, which also define long name, pI, MW,
modifications, annotations, etc). We can also implement a number of
"stringRepresentationForRange" methods. I think we should discuss how
exactly the functional groups should be worked out: either as separate
symbols, or as possible "properties" of the base class. Example: should
phosphorylated serine be a separate "BCSymbol", or should
phosphorylation be a "BCFunctionalGroup" that can be added to a symbol?
Probably the first option if we go for shared symbols, as in this
approach you can only add a property to all serines or to none. The
alternative option is to keep a modification dictionary (modification
and position) associated at the sequence level instead of the symbol
level.
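As a rough idea of what a "stringRepresentationForRange" style conversion could do, here is a sketch in plain C. The function string_for_range, the minimal BCSymbolRec type (repeated here so the example stands alone) and the 0-based start index are all assumptions for the example; the real method would return an NSString.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal hypothetical symbol record, just enough for this example. */
typedef struct {
    char        code;   /* the "string part" of the symbol */
    const char *name;   /* long name                       */
} BCSymbolRec;

/* Writes the one-letter codes of up to `count` symbols, beginning at the
   0-based index `start`, into `buf` and NUL-terminates the result.
   Output is cut short if the sequence ends or the buffer is full.
   Returns the number of characters written, excluding the terminator. */
size_t string_for_range(const BCSymbolRec *const *symbols, size_t length,
                        size_t start, size_t count,
                        char *buf, size_t bufsize)
{
    assert(symbols != NULL && buf != NULL && bufsize > 0);

    size_t written = 0;
    for (size_t i = start;
         i < length && written < count && written + 1 < bufsize;
         i++) {
        buf[written++] = symbols[i]->code;
    }
    buf[written] = '\0';
    return written;
}
```

The string is produced on demand from the single symbol list, so there is never a second copy of the sequence to keep in sync.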
> Regarding the question whether the sequences should be 0-based or
> 1-based, I suggest we use both :) The BCSequence can have an NSRange
> member that is 1-based (or two ints indicating the start and end
> position), and the NSString and NSMutableArray are both 0-based.
The tutorial above mentions an interesting choice: "Note that numbering
of Symbols within the SymbolList runs from 1 to length, not from 0 to
length-1 as is the case with Java strings. This is consistent with the
coordinate system found in files of annotated biological sequences."
Maybe we should do the same here.
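A minimal sketch of how 1-based positions could sit on top of 0-based storage, again in plain C. The BCSeqView type and the symbol_at function are hypothetical, and returning '\0' for an out-of-range position is just one possible convention.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical read-only view on a sequence. */
typedef struct {
    const char *codes;    /* stored 0-based, like any C array */
    size_t      length;   /* number of symbols                */
} BCSeqView;

/* Returns the symbol at a 1-based position (1..length), as used in
   annotated sequence files, or '\0' if the position is out of range. */
char symbol_at(const BCSeqView *seq, size_t position)
{
    assert(seq != NULL);

    if (position < 1 || position > seq->length) {
        return '\0';
    }
    return seq->codes[position - 1];   /* translate to 0-based storage */
}
```

Keeping the translation in one accessor means the rest of the code never has to add or subtract 1 by hand.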
> Another thing is that we should try to make the enzyme class (or any
> class that acts on a sequence) universal so it works both for DNA and
> proteins. Or at least have a base class and put specific functionality
> in a DNAEnzyme and ProteinEnzyme class.
I agree, but as they might be very different, we should indeed go for a
general enzyme superclass and define the specifics in subclasses. I
guess that's something we will find out rapidly during coding. Another
note here is that the shared symbols approach nicely allows defining
the recognition site for both types: for instance, the fact that a T
stands for both thymidine and threonine poses no problem in this
system.
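Here is a rough sketch, in plain C, of how shared symbols could keep the two T's apart and handle ambiguity when matching a recognition site. The bit-set encoding, the BCAlphabet enum and all names are assumptions made for this example; BioJava's getMatches() works with alphabets of symbols and is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { BCAlphabetDNA, BCAlphabetProtein } BCAlphabet;

/* One bit per unambiguous nucleotide (hypothetical encoding). */
enum { kBitA = 1, kBitC = 2, kBitG = 4, kBitT = 8 };

typedef struct {
    char       code;       /* one-letter code                       */
    BCAlphabet alphabet;   /* which alphabet the symbol belongs to  */
    unsigned   matches;    /* set of plain symbols it can stand for */
} BCSymbolRec;

static const BCSymbolRec kDNA_A = { 'A', BCAlphabetDNA, kBitA };
static const BCSymbolRec kDNA_C = { 'C', BCAlphabetDNA, kBitC };
static const BCSymbolRec kDNA_G = { 'G', BCAlphabetDNA, kBitG };
static const BCSymbolRec kDNA_T = { 'T', BCAlphabetDNA, kBitT };
static const BCSymbolRec kDNA_N = { 'N', BCAlphabetDNA,
                                    kBitA | kBitC | kBitG | kBitT };
/* Same letter as the DNA T, but a different shared symbol. */
static const BCSymbolRec kProtein_T = { 'T', BCAlphabetProtein, 1 };

/* True if `s` is in the same alphabet as `pattern` and everything `s`
   can stand for is also covered by `pattern`. */
bool symbol_matches(const BCSymbolRec *pattern, const BCSymbolRec *s)
{
    return pattern->alphabet == s->alphabet
        && (s->matches & ~pattern->matches) == 0;
}

/* True if the whole site matches the sequence at 0-based `offset`. */
bool site_matches_at(const BCSymbolRec *const *site, size_t site_len,
                     const BCSymbolRec *const *seq, size_t seq_len,
                     size_t offset)
{
    if (offset > seq_len || site_len > seq_len - offset) {
        return false;   /* site would run past the end */
    }
    for (size_t i = 0; i < site_len; i++) {
        if (!symbol_matches(site[i], seq[offset + i])) {
            return false;
        }
    }
    return true;
}
```

With this, a site such as G-A-N-T-C matches G-A-A-T-C in a DNA sequence, while a DNA T never matches a protein T even though both print as "T".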
> Here are some links to naming conventions:
>
> <http://developer.apple.com/documentation/Cocoa/Conceptual/
> CodingGuidelines/Articles/NamingBasics.html>
> <http://developer.apple.com/documentation/Cocoa/Conceptual/
> CodingGuidelines/Articles/NamingIvarsAndTypes.html>
Great articles, I guess we should stick to those conventions as closely
as possible. By the way, John, it also shows you how to do enumerations:
typedef enum {
    NSRadioModeMatrix = 0,
    NSHighlightModeMatrix = 1,
    NSListModeMatrix = 2,
    NSTrackModeMatrix = 3
} NSMatrixMode;
You place them in the header file of the specific class you plan to use
them in. Alternatively, we can add a file called BCConstants.h where
more general enumerations and constants can be placed.
One other remark: I guess what John meant were these enumerations, not
to be confused with the enumerators for arrays. That's something
completely different. Their big advantage is that they are very
lightweight and give names to commonly used values. In addition, the
integer value allows you to do math and nice comparisons.
Example:
typedef enum {
    BCLowPriority = 0,
    BCNormalPriority = 1,
    BCHighPriority = 2,
    BCVeryHighPriority = 3
} BCPriority;
Given an integer called priority assigned one of these, you can
now do things like priority++ to increase the priority, or things like
if (priority > 1) ... to check priorities or compare them. Very handy,
and the Cocoa frameworks are filled with these (NSNotFound anyone? Or
NSPortraitOrientation?) Tip: make sure you leave plenty of room for
extension.
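A small sketch of the math and comparisons mentioned above, using the BCPriority example. The helper functions raise_priority and is_urgent are made up for illustration; the saturation at the highest value is one way to keep priority++ from running off the end of the enumeration.

```c
#include <assert.h>

typedef enum {
    BCLowPriority = 0,
    BCNormalPriority = 1,
    BCHighPriority = 2,
    BCVeryHighPriority = 3
} BCPriority;

/* Raises a priority by one step, but never beyond the highest value. */
BCPriority raise_priority(BCPriority p)
{
    return (p < BCVeryHighPriority) ? (BCPriority)(p + 1)
                                    : BCVeryHighPriority;
}

/* Named comparison instead of a bare "priority > 1". */
int is_urgent(BCPriority p)
{
    return p > BCNormalPriority;
}
```

Comparing against the named constant rather than the number 1 keeps the check correct if values are inserted into the enumeration later.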
> BTW, I am writing an app that among other things digests proteins
> (could you guess ;-), and can provide
> code for that.
Yes! Very nice!
I'm curious about your thoughts on this, guys. I know it's easier to
talk about implementation than to do the coding (for which I do not
have much time right now, luckily that should change soon), but
deciding on the right foundation may save a lot of effort later!
Looking forward to your replies!
Cheers,
Alex
Ps. Again, I encourage everyone to read the tutorial I linked to, and
if you have time to dive further into the BioJava docs (further than I
did), I'm sure there are plenty more design decisions they made from
which we can learn and take advantage of...
*********************************************************
** Alexander Griekspoor **
*********************************************************
The Netherlands Cancer Institute
Department of Tumorbiology (H4)
Plesmanlaan 121, 1066 CX, Amsterdam
Tel: + 31 20 - 512 2023
Fax: + 31 20 - 512 2029
AIM: mekentosj at mac.com
E-mail: a.griekspoor at nki.nl
Web: http://www.mekentosj.com
LabAssistant - Get your life organized!
http://www.mekentosj.com/labassistant
*********************************************************