[Biococoa-dev] WWDC 2005 BioCocoa meeting

Tue May 31 11:35:28 EDT 2005

Hi, all -

Great news on the presentation.  I'm just disappointed I won't be able to go
to California to meet with the rest of you.  If any of you have stopovers in
NYC on the way to California, though, get in touch.

I spent the weekend too loaded up on flu medication to actually do any
coding, but I threw together a quick start of a description of the
Foundation classes.  If any of you find it useful for getting the
presentation started, please use anything you like.  If you find it
incoherent, I blame the medicine.

Enjoy the trip, and send me a full report of WWDC.  Hope they give you a
neat gift at the keynote -

John  

Design of BCFoundation

Biological information is generally conveyed as an ordered sequence of
fundamental units, typically nucleotide bases or amino acids.
Interpretation and transformation of these units is typically carried out by
enzymes or collections of enzymes (e.g., ribosomes).  We have attempted to
make BioCocoa's Foundation reflect this organization.  The fundamental
information carrying units belong to the BCSymbol class, which has
subclasses for all the individual types of unit.  Specific collections of
related symbols (e.g., all amino acids) are available through the BCSymbolSet
class.  Ordered arrays of these units are managed through BCSequence and its
subclasses.  The transformation or extraction of information from these
sequences is managed by subclasses of BCTools.

The function and design philosophy of each class is discussed in detail
below.  Several additional functions are provided by other groups of
classes, such as BCAlignment and BCGeneticCode; these are discussed
following the description of these core classes.

Intelligent Objects:  BCSymbols

In many biological frameworks, amino acids and nucleotides are stored as
chars.  Although very lightweight, a char cannot provide any information
about the biological object it represents; that information must be
obtained by looking it up based on the char value.  This limits the
application of object-oriented design and adds to code complexity, because
each lookup occurs in a different section of code and may be accomplished
by different means in each place.

We have taken the opposite approach:  each nucleotide and amino acid is a
full-featured object which conveys relevant information about its
properties, ranging from the complement(s) of a nucleotide to the pKa of an
amino acid.  To allow the rapid addition of new information to these
symbols, the properties are stored in an Apple-standard property list file.
Each symbol retains its entire entry in this file throughout its lifetime,
and its properties are accessible via a "valueForKey:" message using the
appropriate key.  Thus, adding a new property (for example, the frequency at
which a residue occurs in an alpha-helix) to all amino acids is as simple as
editing an XML file.  That value can then be retrieved with
[aSymbol valueForKey: @"alpha helicity"].  For efficiency, many of the
basic properties are also stored as ivars and are accessible through
standard accessor methods.
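
A minimal sketch of the idea, not the framework's actual code:  SketchSymbol
is a hypothetical stand-in for a BCSymbol subclass, the key names are
assumptions, and the dictionary stands in for one entry read from the
property list file.  Memory management assumes ARC.

```objc
#import <Foundation/Foundation.h>

// Hypothetical stand-in for a BCSymbol subclass (illustration only).
@interface SketchSymbol : NSObject
{
    NSDictionary *properties;   // the symbol's entire property list entry
    double molecularWeight;     // frequently used value cached in an ivar
}
- (id)initWithProperties:(NSDictionary *)entry;
- (double)molecularWeight;
@end

@implementation SketchSymbol

- (id)initWithProperties:(NSDictionary *)entry
{
    self = [super init];
    if (self != nil) {
        properties = [entry copy];
        // "molecular weight" is an assumed key name.
        molecularWeight = [[entry objectForKey:@"molecular weight"] doubleValue];
    }
    return self;
}

// Fast path: a plain accessor for the cached ivar.
- (double)molecularWeight
{
    return molecularWeight;
}

// Generic path: answer any key straight from the property list entry.
// This override bypasses the usual key-value coding search and returns
// nil for keys that are absent from the entry.
- (id)valueForKey:(NSString *)key
{
    return [properties objectForKey:key];
}

@end
```

With an entry that contains an "alpha helicity" value, the message
[aSymbol valueForKey: @"alpha helicity"] returns the stored NSNumber, and
no code changes are needed when a new key is added to the file.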

This design lets each symbol act as a repository for relevant information,
with a standardized method of looking up that information.  It also greatly
simplifies the writing of methods that retrieve information from all symbols
in a sequence.  For example, a molecular weight calculation can be as simple
as the following:
    for ( i = 0; i < [theSequence length]; i++ )
        molecularWeight += [[theSequence symbolAtIndex: i] molecularWeight];

To keep memory use to a minimum, each symbol is maintained as a singleton.
In other words, every sequence that contains an alanine holds a pointer to
the single alanine instance at that position.  Symbols are typically
accessed in one of two ways:  either via a call to the appropriate class
method using the unichar symbol, or via a named class method, such as
[BCAminoAcid alanine].  The named methods allow the relationships between
symbols (e.g., complement) to be stored by name in the property list file:
a symbol pointer can be generated by using a string such as "alanine" as a
selector.  Alternatively, a string-formatted sequence can be translated
into a symbol array simply by passing each of its characters to the
appropriate class method.
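
The two access routes and the selector lookup might look roughly like the
sketch below.  SketchBase, symbolForChar: and SketchSymbolsFromString are
hypothetical names chosen for illustration, only two bases are shown, and
skipping unrecognized characters is an assumption (the framework may
substitute an undefined symbol instead).

```objc
#import <Foundation/Foundation.h>

// Hypothetical miniature of a nucleotide symbol class (illustration only).
@interface SketchBase : NSObject
{
    unichar symbolChar;
    NSString *complementName;   // name of the class method for the complement
}
+ (SketchBase *)adenosine;
+ (SketchBase *)thymidine;
+ (SketchBase *)symbolForChar:(unichar)aChar;
- (unichar)symbolChar;
- (SketchBase *)complement;
@end

@implementation SketchBase

- (id)initWithChar:(unichar)aChar complementName:(NSString *)name
{
    self = [super init];
    if (self != nil) {
        symbolChar = aChar;
        complementName = [name copy];
    }
    return self;
}

// Named class methods: each creates its single shared instance on first use.
// (Not thread-safe; kept simple for the sketch.)
+ (SketchBase *)adenosine
{
    static SketchBase *shared = nil;
    if (shared == nil)
        shared = [[SketchBase alloc] initWithChar:'A' complementName:@"thymidine"];
    return shared;
}

+ (SketchBase *)thymidine
{
    static SketchBase *shared = nil;
    if (shared == nil)
        shared = [[SketchBase alloc] initWithChar:'T' complementName:@"adenosine"];
    return shared;
}

// Access by character: maps a unichar onto the shared instance.
+ (SketchBase *)symbolForChar:(unichar)aChar
{
    switch (aChar) {
        case 'A': case 'a': return [self adenosine];
        case 'T': case 't': return [self thymidine];
        default:            return nil;
    }
}

- (unichar)symbolChar
{
    return symbolChar;
}

// The complement is stored as a name and turned into a selector on demand.
- (SketchBase *)complement
{
    SEL selector = NSSelectorFromString(complementName);
    if (![SketchBase respondsToSelector:selector])
        return nil;
    return [SketchBase performSelector:selector];
}

@end

// Translates a string into an array of shared symbol instances.
// Assumption: characters without a matching symbol are skipped.
static NSArray *SketchSymbolsFromString(NSString *string)
{
    NSMutableArray *symbols = [NSMutableArray array];
    NSUInteger i;
    for (i = 0; i < [string length]; i++) {
        SketchBase *symbol = [SketchBase symbolForChar:[string characterAtIndex:i]];
        if (symbol != nil)
            [symbols addObject:symbol];
    }
    return symbols;
}
```

Because every occurrence of a base is the same object, symbols can be
compared by pointer, and a sequence costs one pointer per position no
matter how much information each symbol carries.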

Each symbol type also has gap and undefined symbols.  The symbols are also
grouped within Symbol Sets.  Several such sets are pre-made singletons - for
example, all non-ambiguous ribose nucleotides, or all amino acids other than
the gap and undefined symbols.  Symbol sets are primarily used to provide
internal consistency when symbols are combined into a sequence.

As You Like It:  the BCSequence class cluster

The BCSequence class cluster is an effort to balance a set of competing
issues.  In many cases (file format conversions, mass calculations), the
specific type of sequence is irrelevant to the developer, and a generic
sequence class provides all the functionality needed.  In other cases
(complementation, translation), providing the appropriate sequence type - a
nucleotide sequence - will provide developers finer control and prevent
errors.  To provide for both needs, BCSequence is implemented as a class
cluster.  Any sequence can be created and used either as an instance of the
generic sequence class or as a specific subclass - headers for each are
included in the framework.  The sequence type and its composition can also
be controlled by setting its Symbol Set.  A sequence's Symbol Set both
defines the sequence type (e.g., DNA or amino acid) and potentially limits
its composition (e.g., to all non-ambiguous RNA bases).
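
To illustrate how a Symbol Set could limit composition, here is a rough
sketch that models a set as an NSCharacterSet.  Both function names are
hypothetical, and dropping non-member characters is an assumption:  this
note does not say how the framework treats symbols outside the set.

```objc
#import <Foundation/Foundation.h>

// Hypothetical stand-in for a strict RNA symbol set (no ambiguity codes).
static NSCharacterSet *SketchStrictRNASet(void)
{
    return [NSCharacterSet characterSetWithCharactersInString:@"ACGU"];
}

// Keeps only the characters that belong to the given symbol set.
// Assumption: non-members are dropped rather than replaced or rejected.
static NSString *SketchFilterWithSymbolSet(NSString *raw, NSCharacterSet *symbolSet)
{
    NSMutableString *kept = [NSMutableString stringWithCapacity:[raw length]];
    NSUInteger i;
    for (i = 0; i < [raw length]; i++) {
        unichar c = [raw characterAtIndex:i];
        if ([symbolSet characterIsMember:c])
            [kept appendFormat:@"%C", c];
    }
    return kept;
}
```

Passing "AUGNC-U" through the strict RNA set would leave "AUGCU", since
the ambiguity code N and the gap character are not members of the set.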

This design creates one issue, however:  how should methods that only act on
a specific sequence type be organized and structured when they may be handed
a generic sequence?  We are grouping the methods that act on one or
more sequence types in the BCTools section.  A tool is always initialized
using a generic sequence.  If it is initialized with a sequence it can't
operate on (say, a BCComplementTool initialized with a protein sequence),
it returns an unmodified copy of the sequence whenever a transformation is
requested of it.  If it can operate on the sequence, everything works
normally.  Wherever possible, convenience methods have been added to the
appropriate subclasses; these initialize any tools that can act on that
sequence type in a way that's guaranteed to work.
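
The fallback behaviour described above could be sketched as follows.  The
Sketch-prefixed classes and their methods are hypothetical simplifications
(a string plus a type flag rather than symbol arrays and Symbol Sets), and
memory management assumes ARC.

```objc
#import <Foundation/Foundation.h>

typedef enum { SketchDNA, SketchProtein } SketchSequenceType;

// Hypothetical generic sequence: residues as a string plus a type flag.
@interface SketchSequence : NSObject
{
    NSString *residues;
    SketchSequenceType type;
}
- (id)initWithString:(NSString *)string type:(SketchSequenceType)aType;
- (NSString *)residues;
- (SketchSequenceType)type;
@end

@implementation SketchSequence
- (id)initWithString:(NSString *)string type:(SketchSequenceType)aType
{
    self = [super init];
    if (self != nil) {
        residues = [string copy];
        type = aType;
    }
    return self;
}
- (NSString *)residues { return residues; }
- (SketchSequenceType)type { return type; }
@end

// Hypothetical tool: always initialized with a generic sequence.
@interface SketchComplementTool : NSObject
{
    SketchSequence *sequence;
}
- (id)initWithSequence:(SketchSequence *)aSequence;
- (SketchSequence *)complementedSequence;
@end

@implementation SketchComplementTool
- (id)initWithSequence:(SketchSequence *)aSequence
{
    self = [super init];
    if (self != nil)
        sequence = aSequence;
    return self;
}

- (SketchSequence *)complementedSequence
{
    NSString *source = [sequence residues];

    // Unsupported type: hand back an unmodified copy instead of failing.
    if ([sequence type] != SketchDNA)
        return [[SketchSequence alloc] initWithString:source type:[sequence type]];

    // Supported type: complement each unambiguous base, leave others as-is.
    NSMutableString *result = [NSMutableString stringWithCapacity:[source length]];
    NSUInteger i;
    for (i = 0; i < [source length]; i++) {
        unichar c = [source characterAtIndex:i];
        switch (c) {
            case 'A': c = 'T'; break;
            case 'T': c = 'A'; break;
            case 'G': c = 'C'; break;
            case 'C': c = 'G'; break;
            default: break;
        }
        [result appendFormat:@"%C", c];
    }
    return [[SketchSequence alloc] initWithString:result type:SketchDNA];
}
@end
```

Returning a copy keeps calling code free of type checks:  a protein
sequence written as "ACGT" comes back unchanged, while a DNA sequence
"ATGC" comes back as "TACG".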
