[Biococoa-dev] Design question

Mon Aug 9 18:12:50 EDT 2004

Hi guys,

Back from a sunny and warm long weekend, let me continue our discussion  
on framework design and implementation based on Koen's input. Let me  
start by proposing an idea which again is derived from the BioJava  
framework. I think they must have had a discussion similar to the one
we are having here, and I see their solution to the problem as a very
nice one. First an explanation of the idea, then my thoughts, in that
light, on the points raised so far.

Basically we have two options:
either go for a string based solution (as sequences are kind of long  
strings in the end), or go for a specific sequence class approach. As
outlined in the BioJava tutorial linked below, the string based
approach has some clear disadvantages:
1. One would constantly need to validate strings, as they allow
non-existent characters. I use strings in my programs to store
sequences (I bet everyone does) and I constantly strip "foreign"
characters upon editing, copying, dragging etc. In fact that's what I
call my method ;-)
2. Ambiguity is hard to support, quoted: "The meaning of each symbol is
not necessarily clear. The `T' which means thymidine in DNA is the same
`T' which is a threonine residue in a protein sequence"
3. Limited alphabet: Koen already mentioned that glycans are hard to
express in single letter codes.
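
To make point 1 concrete, here is a rough sketch in plain C (which also compiles as Objective-C) of the kind of stripping a string based design has to repeat on every edit. The function name and the choice of allowed letters are made up for illustration; this is not code from any existing program.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies only the characters of `input` that occur
   in `allowed` into `output`, which must hold at least
   strlen(input) + 1 bytes. Returns the number of characters kept. */
size_t strip_foreign_characters(const char *input, const char *allowed,
                                char *output)
{
    assert(input != NULL && allowed != NULL && output != NULL);
    size_t kept = 0;
    for (const char *p = input; *p != '\0'; p++) {
        if (strchr(allowed, *p) != NULL) {
            output[kept++] = *p;
        }
    }
    output[kept] = '\0';
    return kept;
}
```

Note that nothing in the string itself records which alphabet it was checked against, so the check has to be repeated wherever the string can change.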

This is all solved by a class based approach where all nucleotides,
amino acids, glycans etc. are represented by their own class. However,
John already pointed to the weak spot here: instantiating one object
per residue quickly results in big memory problems.

The guys at BioJava came up with a nice solution, the best of both
worlds so to speak: http://www.biojava.org/tutorials/chap1.html
What we do is create singleton objects (think "sharedDefaultManager")
for each class of "symbol", then refer to these using pointers. A
sequence like "ATGC" would be an array of the form: pointer to the
shared "A" object, pointer to the shared "T" object, pointer to the
shared "G" object, pointer to the shared "C" object. Each symbol object
is present in memory only once, and the sequence itself is just an
array of pointers, which is far cheaper than allocating a separate
object per residue. To highlight some of the things I like very much
about this approach:
- Great performance memory-wise
- The "symbol" classes can store all additional data like name, pI, etc.
- Solution to the ambiguity problem (see the getMatches() method)
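
A rough sketch of the shared-symbol idea, written in plain C so it also compiles as Objective-C. Every name here (BCSymbol, BCDNASymbolForCode, BCSequenceFromString, BCSymbolMatches) is an assumption for illustration, and the matching rule is my own simplification, not BioJava's getMatches() API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One shared, read-only record per symbol. All names are hypothetical. */
typedef struct BCSymbol {
    char        code;     /* single-letter code */
    const char *name;     /* long name; pI, MW, etc. could live here too */
    const char *matches;  /* codes of the concrete bases this symbol covers */
} BCSymbol;

static const BCSymbol kAdenine   = { 'A', "adenine",   "A" };
static const BCSymbol kCytosine  = { 'C', "cytosine",  "C" };
static const BCSymbol kGuanine   = { 'G', "guanine",   "G" };
static const BCSymbol kThymidine = { 'T', "thymidine", "T" };
static const BCSymbol kAnyBase   = { 'N', "any base",  "ACGT" };

/* Returns the shared symbol for a DNA letter, or NULL for a foreign one. */
const BCSymbol *BCDNASymbolForCode(char code)
{
    switch (code) {
        case 'A': return &kAdenine;
        case 'C': return &kCytosine;
        case 'G': return &kGuanine;
        case 'T': return &kThymidine;
        case 'N': return &kAnyBase;
        default:  return NULL;
    }
}

/* Fills `out` with pointers to shared symbols, one per letter of `text`.
   Returns the number of symbols, or -1 if a letter is foreign or `out`
   is too small. No symbol object is copied or allocated. */
int BCSequenceFromString(const char *text, const BCSymbol **out,
                         size_t capacity)
{
    assert(text != NULL && out != NULL);
    size_t count = 0;
    for (const char *p = text; *p != '\0'; p++) {
        const BCSymbol *symbol = BCDNASymbolForCode(*p);
        if (symbol == NULL || count == capacity) {
            return -1;
        }
        out[count++] = symbol;
    }
    return (int)count;
}

/* Nonzero if `pattern` covers every base that `symbol` can stand for,
   so N matches A, but A does not match N. */
int BCSymbolMatches(const BCSymbol *pattern, const BCSymbol *symbol)
{
    assert(pattern != NULL && symbol != NULL);
    for (const char *p = symbol->matches; *p != '\0'; p++) {
        if (strchr(pattern->matches, *p) == NULL) {
            return 0;
        }
    }
    return 1;
}
```

In this sketch validation happens once, when the sequence is built, and two positions holding the same base hold the very same pointer.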

I have quite some experience using singleton classes in the form of
shared controllers. They are easy to implement and work very, very
well. Replying in the light of this idea:

> In the last few messages I missed any mention of a BCSequence class  
> that more or less functions as the center of the framework. It's main  
> member is then probably an NSString representing the sequence (DNA,  
> RNA, protein). This is easy because they all are single character  
> based sequences (unlike eg glycans).
Again, the glycans can now be easily implemented as a set of glycan
symbols (shared objects), so glycan sequences can be expressed as
"sequences" like any other.

> Additionally this class could have an NSMutableArray member consisting  
> of objects that represent each single base, amino acid. Whenever the  
> NSString is edited, the NSMutableArray is updated, and vice versa. We  
> could have a BCRootObject from which BCAminoAcid, BCNucleotide, etc  
> derive. These nucleotide and amino acid classes can then store more  
> info about themselves, eg long name, pI, MW, modifications,  
> annotations, etc. Also BCFunctionalGroup (methyl, phosphate) could be  
> based on BCRootObject.
As said, John already mentioned the big pitfall of this approach: it's
never wise to keep the same data in two places in parallel. First of
all, it requires careful synchronization, and what happens if the two
do get out of sync? Second, it requires at least twice the amount of
memory, as you have both a string and a symbol list around. The "shared
symbol list" approach prevents all of this, because there is only one
"datasource".

To convert to the string based world, I could very well imagine that
the BCRootObject will have a
- (NSString *)stringRepresentation;
method that converts the symbol list and spits out an NSString for you
(based on how the string part is defined in the symbol classes, which
also define long name, pI, MW, modifications, annotations, etc.). We
can also implement a number of "stringRepresentationForRange" methods.

I think we should discuss how exactly the functional groups should be
worked out: either as separate symbols, or as possible "properties" of
the base class. Example: should phosphorylated serine be a separate
"BCSymbol", or should phosphorylation be a "BCFunctionalGroup" that can
be added to a symbol? Probably the first option if we go for shared
symbols, because a property set on the shared serine object applies to
every serine in every sequence or to none. The alternative is to keep a
modification dictionary (modification and position) at the sequence
level instead of the symbol level.
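
To make the string conversion and the sequence-level modification idea a bit more tangible, here is a minimal sketch in plain C (valid Objective-C too). The struct layout, the function names and the 0-based positions are all assumptions of mine, not a proposal for the final API.

```c
#include <assert.h>
#include <stddef.h>

typedef struct BCSymbol {
    char        code;  /* the "string part" of the symbol */
    const char *name;
} BCSymbol;

/* A modification lives with the sequence, keyed by position (0-based
   here), so the shared symbol objects stay untouched. */
typedef struct BCModification {
    size_t      position;
    const char *name;
} BCModification;

typedef struct BCSequence {
    const BCSymbol       **symbols;  /* pointers to shared symbols */
    size_t                 length;
    const BCModification  *modifications;
    size_t                 modificationCount;
} BCSequence;

/* Writes the codes of `count` symbols starting at `start` into `buffer`
   as a C string and returns the number of characters written. */
size_t BCSequenceStringForRange(const BCSequence *seq, size_t start,
                                size_t count, char *buffer,
                                size_t bufferSize)
{
    assert(seq != NULL && buffer != NULL);
    assert(start <= seq->length && count <= seq->length - start);
    assert(bufferSize > count);
    for (size_t i = 0; i < count; i++) {
        buffer[i] = seq->symbols[start + i]->code;
    }
    buffer[count] = '\0';
    return count;
}

/* Returns the name of the modification at `position`, or NULL if that
   position is unmodified. */
const char *BCSequenceModificationAt(const BCSequence *seq, size_t position)
{
    assert(seq != NULL);
    for (size_t i = 0; i < seq->modificationCount; i++) {
        if (seq->modifications[i].position == position) {
            return seq->modifications[i].name;
        }
    }
    return NULL;
}
```

With this layout two serines can share one symbol object while only one of them carries a phosphorylation, which is exactly what a property on the shared symbol cannot express.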

> Regarding the question whether the sequences should be 0-based or  
> 1-based, I suggest we use both :) The BCSequence can have an NSRange  
> member that is 1-based (or two ints indicating the start and end  
> position), and the NSString and NSMutableArray are both 0-based.
The tutorial above mentions an interesting choice: "Note that numbering  
of Symbols within the SymbolList runs from 1 to length, not from 0 to  
length-1 as is the case with Java strings. This is consistent with the  
coordinate system found in files of annotated biological sequences."  
Maybe we should do the same here.
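
If we follow that choice, the offset could be confined to a single accessor. A minimal sketch in plain C (also valid Objective-C), with hypothetical names and a NULL return for out-of-range positions as my own assumption:

```c
#include <assert.h>
#include <stddef.h>

typedef struct BCSymbol {
    char        code;
    const char *name;
} BCSymbol;

typedef struct BCSymbolList {
    const BCSymbol **symbols;  /* stored 0-based, like any C array */
    size_t           length;
} BCSymbolList;

/* Biological coordinates: valid positions run from 1 to length.
   Returns NULL for position 0 or anything past the end. */
const BCSymbol *BCSymbolListSymbolAt(const BCSymbolList *list,
                                     size_t position)
{
    assert(list != NULL);
    if (position < 1 || position > list->length) {
        return NULL;
    }
    return list->symbols[position - 1];  /* the only place the offset appears */
}
```

Callers would then only ever see 1-based positions, matching annotated sequence files, while the storage stays an ordinary 0-based array.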

> Another thing is that we should try to make the enzyme class (or any  
> class that acts on a sequence) universal so it works both for DNA and  
> proteins. Or at least have a base clase and put specific functionality  
> in a DNAEnzyme and ProteinEnzyme class.
I agree, but as they might be very different, we should indeed go for a
general enzyme superclass and define the specifics in subclasses. I
guess that's something we will find out rapidly during coding. Another
note here is that the shared symbols approach nicely allows defining
the recognition site for both types: the fact that a T stands for both
thymidine and threonine poses no problem in this system, because the
two are different symbol objects.
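
A small sketch of why the shared T is harmless, in plain C (also valid Objective-C). The symbol constants, the function name and the two-residue "sites" are invented for illustration and do not correspond to any real enzyme.

```c
#include <assert.h>
#include <stddef.h>

typedef struct BCSymbol {
    char        code;
    const char *name;
} BCSymbol;

/* Two alphabets that share letters but not objects (hypothetical names). */
const BCSymbol kAdenine   = { 'A', "adenine" };
const BCSymbol kThymidine = { 'T', "thymidine" };
const BCSymbol kAlanine   = { 'A', "alanine" };
const BCSymbol kThreonine = { 'T', "threonine" };

/* Returns the 0-based index of the first place where `site` occurs in
   `sequence`, comparing shared symbols by identity, or -1 if absent. */
long BCFindRecognitionSite(const BCSymbol **sequence, size_t length,
                           const BCSymbol **site, size_t siteLength)
{
    assert(sequence != NULL && site != NULL);
    if (siteLength == 0 || siteLength > length) {
        return -1;
    }
    for (size_t start = 0; start + siteLength <= length; start++) {
        size_t i = 0;
        while (i < siteLength && sequence[start + i] == site[i]) {
            i++;
        }
        if (i == siteLength) {
            return (long)start;
        }
    }
    return -1;
}
```

As strings, both sites would read "AT" and both would match a DNA sequence; compared as symbols, a protein site can never match DNA by accident, so one search routine could serve both enzyme subclasses.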

> Here are some liks to naming conventions:
>
> <http://developer.apple.com/documentation/Cocoa/Conceptual/ 
> CodingGuidelines/Articles/NamingBasics.html>
> <http://developer.apple.com/documentation/Cocoa/Conceptual/ 
> CodingGuidelines/Articles/NamingIvarsAndTypes.html>
Great articles, I guess we should stick to those conventions as closely
as possible. By the way, John, they also show you how to do enumerations:
typedef enum {
     NSRadioModeMatrix     = 0,
     NSHighlightModeMatrix = 1,
     NSListModeMatrix      = 2,
     NSTrackModeMatrix     = 3
} NSMatrixMode;

You place them in the header file of the specific class you plan to use  
them in. Alternatively, we can add a file called BCConstants.h where  
more general enumerations and constants can be placed.
One other remark: I guess what John meant were these enumerations, not
to be confused with the enumerators for arrays. Those are something
completely different. The big advantage of enumerations is that they
are very lightweight and give names to commonly used values. In
addition, the integer value allows you to do math and nice comparisons.
Example:
typedef enum {
     BCLowPriority      = 0,
     BCNormalPriority   = 1,
     BCHighPriority     = 2,
     BCVeryHighPriority = 3
} BCPriority;

Given an integer called priority assigned one of these values, you can
now do things like priority++ to increase the priority, or things like
if (priority > BCNormalPriority) ... to check priorities or compare
them. Very handy, and the Cocoa frameworks are filled with these
(NSNotFound anyone? Or NSPortraitOrientation?) Tip: make sure you leave
plenty of room for extension.
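
A short sketch of that arithmetic in plain C (valid Objective-C as well). The two helper functions are hypothetical; note that a bare priority++ would happily walk past the last defined value, so this version clamps.

```c
#include <assert.h>

typedef enum {
    BCLowPriority      = 0,
    BCNormalPriority   = 1,
    BCHighPriority     = 2,
    BCVeryHighPriority = 3
} BCPriority;

/* Moves one step up, stopping at the highest defined value. */
BCPriority BCPriorityRaised(BCPriority priority)
{
    assert(priority <= BCVeryHighPriority);
    if (priority < BCVeryHighPriority) {
        return (BCPriority)(priority + 1);
    }
    return BCVeryHighPriority;
}

/* Compares against the named constant instead of a bare 1. */
int BCPriorityIsElevated(BCPriority priority)
{
    return priority > BCNormalPriority;
}
```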

> BTW, I am writing an app that among other things digests proteins  
> (could you guess ;-), and can provide
> code for that.
Yes! Very nice!

I'm curious about your thoughts on this, guys. I know it's easier to
talk about implementation than to do the coding (for which I do not
have much time right now; luckily that should change soon), but
deciding on the right foundation may save a lot of effort later!
Looking forward to your replies!
Cheers,
Alex

PS. Again, I encourage everyone to read the tutorial I linked to, and
if you have time, to dive further into the BioJava docs (further than I
did). I'm sure there are plenty more design decisions they made that
we can learn from and take advantage of...

*********************************************************
                       ** Alexander Griekspoor **
*********************************************************
                 The Netherlands Cancer Institute
                 Department of Tumorbiology (H4)
           Plesmanlaan 121, 1066 CX, Amsterdam
                     Tel:  + 31 20 - 512 2023
                     Fax:  + 31 20 - 512 2029
                    AIM: mekentosj at mac.com
                     E-mail: a.griekspoor at nki.nl
                 Web: http://www.mekentosj.com

           LabAssistant - Get your life organized!
           http://www.mekentosj.com/labassistant

*********************************************************
