BioCocoa : Main / Design

This page describes the design of the new BioCocoa framework. The old framework represents a sequence as a string.

The main class of the BioCocoa framework is BCSequence. This is a generic class that can hold any type of sequence, together with additional information about the sequence stored in annotations and features (not yet implemented). After long and heated discussions between the developers on the mailing list, it was agreed not to subclass BCSequence, but to use a separate class, BCSymbolSet, to define what type of sequence a BCSequence contains, e.g. DNA or protein. A symbol set contains only those symbols that are allowed for a specific type of sequence. For instance, the symbol set for strict DNA contains four BCSymbol objects, for A, C, G, and T. The symbol set for non-strict DNA also contains BCSymbol objects for the ambiguous symbols R, Y, M, K, S, W, H, B, V, D, and N.

The main advantage of this design is that only one sequence class needs to be maintained, instead of one for each type of sequence that may exist. Another advantage is that users can define their own symbol set by simply declaring which symbols are allowed and associating the symbol set with a BCSequence.

The main disadvantage of an untyped BCSequence is that any operation can be executed on any type of sequence, for instance translating a protein, which makes no sense biologically. This problem can be solved by using the symbol set as a data filter: other classes first test whether an operation on a particular sequence is allowed, and refuse to execute operations such as the one above. Another example of a class that could use the symbol set as a data filter is a sequence editor, which can verify that only sequences of a particular type are opened and edited.
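The idea of a single sequence class that is typed by its symbol set, with the symbol set doubling as a data filter, can be sketched as follows. This is a language-neutral illustration in Python, not BioCocoa's Objective-C API: the names SymbolSet, Sequence, and complement are hypothetical stand-ins, and complementing is used as the guarded operation because it is short enough to show in full.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SymbolSet:
    """Hypothetical stand-in for BCSymbolSet: a named set of allowed symbols."""
    name: str
    symbols: frozenset

    def allows(self, text: str) -> bool:
        # True when every character of text belongs to this symbol set.
        return set(text) <= self.symbols


STRICT_DNA = SymbolSet("strict DNA", frozenset("ACGT"))
DNA = SymbolSet("DNA", frozenset("ACGT" + "RYMKSWHBVDN"))
PROTEIN = SymbolSet("protein", frozenset("ACDEFGHIKLMNPQRSTVWY"))


class Sequence:
    """Hypothetical stand-in for BCSequence: one class, typed by its symbol set."""

    def __init__(self, text: str, symbol_set: SymbolSet):
        # The symbol set filters the data that may enter the sequence.
        if not symbol_set.allows(text):
            raise ValueError(f"symbols not allowed in {symbol_set.name}")
        self.text = text
        self.symbol_set = symbol_set


def complement(seq: Sequence) -> Sequence:
    # The symbol set also filters operations: only DNA can be complemented.
    if seq.symbol_set not in (STRICT_DNA, DNA):
        raise TypeError(f"cannot complement a {seq.symbol_set.name} sequence")
    table = str.maketrans("ACGTRYMKSWHBVDN", "TGCAYRKMSWDVBHN")
    return Sequence(seq.text.translate(table), seq.symbol_set)
```

In this sketch a user-defined sequence type needs no new class, only a new SymbolSet value, and an operation that makes no biological sense for a given type is rejected before it runs.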

Another main design feature is the use of BCSymbol objects for each symbol. A BCSymbol object contains additional information about its symbol, such as the molecular weight and the long name (Cysteine instead of C). In contrast to BCSequence, the BCSymbol class is subclassed for DNA nucleotides, RNA nucleotides, and amino acids. To avoid memory issues, each BCSymbol object is created only once, as a shared instance, during the execution of a program. This design is based on the Singleton and Flyweight design patterns.
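A minimal sketch of the shared-instance idea, again in Python rather than Objective-C. The class names Symbol, Nucleotide, and AminoAcid and the method shared are assumptions made for illustration; only a handful of symbols are listed.

```python
class Symbol:
    """Hypothetical stand-in for BCSymbol."""

    # One cache for all subclasses, keyed by (subclass, letter).
    _shared = {}
    NAMES = {}

    def __init__(self, letter, name):
        self.letter = letter
        self.name = name

    @classmethod
    def shared(cls, letter):
        # Create the symbol on first request, then always return that instance.
        key = (cls, letter)
        if key not in cls._shared:
            cls._shared[key] = cls(letter, cls.NAMES[letter])
        return cls._shared[key]


class Nucleotide(Symbol):
    NAMES = {"A": "Adenine", "C": "Cytosine", "G": "Guanine", "T": "Thymine"}


class AminoAcid(Symbol):
    NAMES = {"A": "Alanine", "C": "Cysteine", "G": "Glycine", "T": "Threonine"}


# Four thousand references, but only four Nucleotide objects exist.
symbols = [Nucleotide.shared(letter) for letter in "ACGT" * 1000]
```

The same letter maps to different objects in different subclasses (C is Cytosine as a nucleotide and Cysteine as an amino acid), while repeated requests within one subclass always return the same instance.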

Finally, instead of using an NSArray of BCSymbol objects, BioCocoa stores a sequence as an array of chars held in an NSData object. NSData makes the char array easy to use: it takes care of the memory management, so there is no need for old-fashioned manual handling of char buffers. Many sequence manipulations in bioinformatics are based on string algorithms, so storing the sequence this way is a more logical choice than an NSArray of BCSymbol objects, which would be much slower. An array of BCSymbol objects will still be available, however, for instance when a user wants to calculate the molecular weight or isoelectric point of a sequence. Because BCSymbol objects are shared instances, BCSymbol-based operations have only a small memory footprint, even when a sequence contains thousands of symbols. Storing the sequence in an NSData object may also prove useful in combination with Cocoa's Core Data.
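The storage scheme can be illustrated with a short Python sketch in which an immutable bytes object plays the role of the NSData buffer. The class and method names (Sequence, find, symbols) are hypothetical and do not reflect the actual BCSequence interface.

```python
class Symbol:
    """Hypothetical stand-in for BCSymbol."""

    def __init__(self, letter, name):
        self.letter = letter
        self.name = name


# One shared Symbol per byte value, created once.
SHARED = {
    ord(letter): Symbol(letter, name)
    for letter, name in [("A", "Adenine"), ("C", "Cytosine"),
                         ("G", "Guanine"), ("T", "Thymine")]
}


class Sequence:
    """Hypothetical stand-in for BCSequence backed by a byte buffer."""

    def __init__(self, text):
        # Compact storage: one byte per symbol, managed by the bytes object.
        self.data = text.encode("ascii")

    def find(self, pattern):
        # String-style algorithms work directly on the byte buffer.
        return self.data.find(pattern.encode("ascii"))

    def symbols(self):
        # Symbol view built on demand; every entry points at a shared instance.
        return [SHARED[byte] for byte in self.data]


seq = Sequence("GAATTCGAATTC")
site = seq.find("TTC")
view = seq.symbols()
```

Searching runs on the raw bytes, while the symbol view is only built when symbol-level information such as names or weights is needed, and it adds references rather than new symbol objects.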