<HTML>

<HEAD>

<TITLE>Re: [Biococoa-dev] WWDC 2005 BioCocoa meeting</TITLE>

</HEAD>

<BODY>

<FONT FACE="Verdana, Helvetica, Arial"><SPAN STYLE='font-size:12.0px'>Hi, all -<BR>

<BR>

Great news on the presentation – I’m just disappointed I won’t be able to go to California to meet with the rest of you.  If any of you have stopovers in NYC on the way to California, though, get in touch.<BR>

<BR>

I spent the weekend too loaded up on flu medication to actually do any coding, but I threw together a quick start of a description of the Foundation classes.  If any of you find it useful for getting the presentation started, please use anything you like.  If you find it incoherent, I blame the medicine.<BR>

<BR>

Enjoy the trip, and send me a full report of WWDC.  Hope they give you a neat gift at the keynote -<BR>

<BR>

John  <BR>

<BR>

<BR>

<BR>

<BR>

</SPAN></FONT><SPAN STYLE='font-size:12.0px'><FONT FACE="Helvetica, Verdana, Arial"><B>Design of BCFoundation<BR>

</B><BR>

Biological information is generally conveyed as an ordered sequence of fundamental units, typically nucleotide bases or amino acids.  Interpretation and transformation of these units are typically carried out by enzymes or collections of enzymes (e.g., ribosomes).  We have attempted to make BioCocoa's Foundation reflect this organization.  The fundamental information-carrying units belong to the BCSymbol class, which has subclasses for each individual type of unit.  Specific collections of related symbols (e.g., all amino acids) are available through the BCSymbolSet class.  Ordered arrays of these units are managed through BCSequence and its subclasses.  The transformation of these sequences, and the extraction of information from them, is managed by subclasses of BCTools. <BR>

<BR>

The function and design philosophy of each class are discussed in detail below.  Several additional functions are provided by other groups of classes, such as BCAlignment and BCGeneticCode; these are discussed after the core classes.<BR>

<BR>

<BR>

<B>Intelligent Objects:  BCSymbols<BR>

</B><BR>

In many biological frameworks, amino acids and nucleotides are stored as chars.  Although very lightweight, chars cannot provide any information about the biological object they represent; that information must be looked up based on the char value.  This limits the application of object-oriented design and adds to code complexity, because the lookups end up scattered across different sections of code and may be implemented differently in each place.<BR>

<BR>

We have taken the opposite approach:  each nucleotide and amino acid is a full-featured object which conveys relevant information about its properties, ranging from the complement(s) of a nucleotide to the pKa of an amino acid.  To allow the rapid addition of new information to these Symbols, the properties are stored in an Apple-standard property list file.  Each symbol retains its entire entry in this file throughout its lifetime, and its properties are accessible via a "valueForKey:" message using the appropriate key.  Thus, adding a new property (for example, the frequency at which it occurs in an alpha-helix) to all amino acids is as simple as editing an XML file.  Retrieving that value can then be done with [aSymbol valueForKey: @"alpha helicity"];.  For efficiency, many of the basic properties are also stored as ivars and are accessible through standard methods.  <BR>
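
<BR>

As a rough illustration of the lookup described above, the sketch below shows one way a symbol could keep its whole property list entry and answer valueForKey: from it.  DemoSymbol, DemoAlanine() and the key strings are names invented for this example, not BioCocoa's actual API; the entry is built inline rather than read from a file, and the alpha helicity value is only a placeholder.  It assumes ARC, so memory management calls are omitted.<BR>

<BR>

```objc
#import "Foundation/Foundation.h"

// Hypothetical stand-in for a BioCocoa symbol; not the framework's actual class.
@interface DemoSymbol : NSObject
{
    NSDictionary *properties;   // the symbol's whole property list entry
    double molecularWeight;     // frequently used value cached in an ivar
}
- (id)initWithProperties:(NSDictionary *)entry;
- (double)molecularWeight;
@end

@implementation DemoSymbol

- (id)initWithProperties:(NSDictionary *)entry
{
    self = [super init];
    if (self != nil) {
        properties = [entry copy];
        molecularWeight = [[entry objectForKey:@"molecular weight"] doubleValue];
    }
    return self;
}

// Fast path for a basic property.
- (double)molecularWeight
{
    return molecularWeight;
}

// Look the key up in the stored entry first, then fall back to
// ordinary key-value coding for anything the entry does not hold.
- (id)valueForKey:(NSString *)key
{
    id value = [properties objectForKey:key];
    return (value != nil) ? value : [super valueForKey:key];
}

@end

// Builds a demo alanine from an inline dictionary (assumption: a real
// entry would come from the property list file instead).
static DemoSymbol *DemoAlanine(void)
{
    NSDictionary *entry = [NSDictionary dictionaryWithObjectsAndKeys:
        @"alanine", @"name",
        [NSNumber numberWithDouble:89.09], @"molecular weight",
        [NSNumber numberWithDouble:1.0], @"alpha helicity",   // placeholder value
        nil];
    return [[DemoSymbol alloc] initWithProperties:entry];
}
```

With this in place, [DemoAlanine() valueForKey: @"alpha helicity"] returns the NSNumber stored in the entry, while [DemoAlanine() molecularWeight] reads the cached ivar.  A key found in neither the entry nor the class falls through to standard key-value coding, which raises an exception.<BR>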

<BR>

This allows each symbol to be a repository for relevant information, with a standardized method of looking up that information.  It also greatly simplifies the writing of methods that retrieve information from all symbols in a sequence.  For example, a molecular weight calculation can be as simple as the following:<BR>

</FONT><FONT FACE="Courier New">for ( i = 0; i < [theSequence length]; i++ ) <BR>

    molecularWeight = molecularWeight + [[theSequence symbolAtIndex: i] molecularWeight];<BR>

</FONT><FONT FACE="Helvetica, Verdana, Arial"><BR>

To keep memory use to a minimum, each symbol is maintained as a singleton.  In other words, every sequence that contains an alanine holds a pointer to the single alanine instance at that position.  Symbols are typically accessed in one of two ways:  either via a call to the appropriate class method using the unichar symbol, or via a named class method, such as [BCAminoAcid alanine].  The named methods allow relationships between symbols (e.g., complement) to be stored in the property list file as names, because a symbol pointer can be generated by using a string such as "alanine" as a selector.  Alternatively, a string-formatted sequence can be translated into a symbol array simply by passing each of its characters to the appropriate class method.<BR>
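
<BR>

To make the two access routes concrete, here is a minimal sketch of a symbol class with shared instances, a lookup by character, and a lookup that turns a name into a selector.  DemoAminoAcid, symbolForChar:, symbolForName: and DemoSymbolsFromString() are hypothetical names chosen for illustration and are not claimed to match BioCocoa's methods.  The lazy creation shown is not thread-safe, and the code assumes ARC.<BR>

<BR>

```objc
#import "Foundation/Foundation.h"

// Hypothetical miniature of a symbol class; names are illustrative only.
@interface DemoAminoAcid : NSObject
{
    NSString *name;
}
+ (DemoAminoAcid *)alanine;
+ (DemoAminoAcid *)glycine;
+ (DemoAminoAcid *)symbolForChar:(unichar)aChar;
+ (DemoAminoAcid *)symbolForName:(NSString *)aName;
- (NSString *)name;
@end

@implementation DemoAminoAcid

- (id)initWithName:(NSString *)aName
{
    self = [super init];
    if (self != nil) {
        name = [aName copy];
    }
    return self;
}

- (NSString *)name { return name; }

// Each named class method creates its instance once and hands back
// the same pointer on every later call (simple, not thread-safe).
+ (DemoAminoAcid *)alanine
{
    static DemoAminoAcid *shared = nil;
    if (shared == nil) {
        shared = [[DemoAminoAcid alloc] initWithName:@"alanine"];
    }
    return shared;
}

+ (DemoAminoAcid *)glycine
{
    static DemoAminoAcid *shared = nil;
    if (shared == nil) {
        shared = [[DemoAminoAcid alloc] initWithName:@"glycine"];
    }
    return shared;
}

// Route 1: map a one-letter code to the shared instance.
+ (DemoAminoAcid *)symbolForChar:(unichar)aChar
{
    switch (aChar) {
        case 'A': case 'a': return [self alanine];
        case 'G': case 'g': return [self glycine];
        default:            return nil;   // a fuller version might return an "undefined" symbol
    }
}

// Route 2: treat a name read from a property list as a selector.
// Caveat: any class method name would be accepted here, so a real
// version should restrict the lookup to known symbol names.
+ (DemoAminoAcid *)symbolForName:(NSString *)aName
{
    SEL selector = NSSelectorFromString(aName);
    if (selector == NULL || ![self respondsToSelector:selector]) {
        return nil;
    }
    // Under ARC this line draws a leak warning because the selector
    // is not known at compile time.
    return [self performSelector:selector];
}

@end

// Translates a string into an array of shared symbols, skipping
// characters the class does not recognize.
static NSArray *DemoSymbolsFromString(NSString *text)
{
    NSMutableArray *symbols = [NSMutableArray arrayWithCapacity:[text length]];
    NSUInteger i;
    for (i = 0; i < [text length]; i++) {
        DemoAminoAcid *symbol = [DemoAminoAcid symbolForChar:[text characterAtIndex:i]];
        if (symbol != nil) {
            [symbols addObject:symbol];
        }
    }
    return symbols;
}
```

Because every route returns the same shared instance, [DemoAminoAcid symbolForName: @"alanine"] and [DemoAminoAcid symbolForChar: 'A'] can be compared with == against [DemoAminoAcid alanine].<BR>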

<BR>

Each symbol type also has gap and undefined symbols.  Symbols are also grouped into Symbol Sets.  Several such sets are pre-made singletons: for example, all non-ambiguous ribose nucleotides, or all amino acids excluding the gap and undefined symbols.  Symbol sets are primarily used to provide internal consistency when symbols are combined into a sequence.<BR>

<BR>

<BR>

<B>As You Like It:  the BCSequence class cluster<BR>

</B><BR>

The BCSequence class cluster is an effort to balance a set of competing issues.  In many cases (file format conversions, mass calculations), the specific type of sequence is irrelevant to the developer, and a generic sequence class provides all the functionality needed.  In other cases (complementation, translation), providing the appropriate sequence type - a nucleotide sequence - gives developers finer control and prevents errors.  To provide for both needs, BCSequences are implemented as a class cluster.  Any sequence can be created and used as either a generic sequence class or a specific subclass - headers for each are included in the framework.  The sequence type and its composition can also be controlled by setting its Symbol Set.  A sequence's Symbol Set both defines the sequence type (e.g., DNA, amino acid) and potentially limits its composition (e.g., all non-ambiguous RNA bases).  <BR>
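
<BR>

The role of the Symbol Set can be sketched roughly as follows.  DemoSymbolSet and its methods are hypothetical names, single characters stand in for symbol objects, and silently dropping disallowed symbols is an assumed policy (rejecting the input would be an equally plausible one).  The code assumes ARC.<BR>

<BR>

```objc
#import "Foundation/Foundation.h"

// Hypothetical symbol set: a name for the sequence type plus the
// one-letter symbols that type is allowed to contain.
@interface DemoSymbolSet : NSObject
{
    NSString *typeName;
    NSCharacterSet *allowed;
}
+ (DemoSymbolSet *)strictRNASet;
- (NSString *)typeName;
- (BOOL)containsSymbol:(unichar)aChar;
- (NSString *)stringByFilteringString:(NSString *)text;
@end

@implementation DemoSymbolSet

- (id)initWithTypeName:(NSString *)aName symbols:(NSString *)symbolString
{
    self = [super init];
    if (self != nil) {
        typeName = [aName copy];
        allowed = [[NSCharacterSet characterSetWithCharactersInString:symbolString] copy];
    }
    return self;
}

// A pre-made shared set: the four non-ambiguous RNA bases.
+ (DemoSymbolSet *)strictRNASet
{
    static DemoSymbolSet *shared = nil;
    if (shared == nil) {
        shared = [[DemoSymbolSet alloc] initWithTypeName:@"RNA" symbols:@"ACGU"];
    }
    return shared;
}

- (NSString *)typeName { return typeName; }

- (BOOL)containsSymbol:(unichar)aChar
{
    return [allowed characterIsMember:aChar];
}

// Keeps only the symbols the set allows (assumed policy: drop the rest).
- (NSString *)stringByFilteringString:(NSString *)text
{
    NSMutableString *kept = [NSMutableString stringWithCapacity:[text length]];
    NSUInteger i;
    for (i = 0; i < [text length]; i++) {
        unichar c = [text characterAtIndex:i];
        if ([self containsSymbol:c]) {
            [kept appendFormat:@"%C", c];
        }
    }
    return kept;
}

@end
```

Here the set answers both questions at once:  its typeName says what kind of sequence is being built, and stringByFilteringString: keeps an ambiguous base such as N or a DNA-only T out of a strict RNA sequence.<BR>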

<BR>

This design creates one issue, however:  how should methods that only act on a specific sequence type be organized and structured when they may be handed a generic sequence type?  We are grouping the methods that act on one or more sequence types in the BCTools section.  A tool is always initialized using a generic sequence.  If it's initialized with a sequence it can't operate on (say, a BCComplementTool initialized with a protein sequence), it returns an unmodified copy of the sequence whenever a transformation is requested of it.  If it can operate on the sequence, everything works normally.  Wherever possible, convenience methods have been added to the appropriate subclasses; these initialize any tools that can act on that sequence type in a way that's guaranteed to work.<BR>
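
<BR>

The fallback behaviour can be sketched along these lines.  DemoSequence and DemoComplementTool are hypothetical classes written for this example:  the sequence is reduced to a type tag plus a string of one-letter codes, and unrecognized bases are mapped to N, which is an assumption rather than BioCocoa's documented behaviour.  The code assumes ARC.<BR>

<BR>

```objc
#import "Foundation/Foundation.h"

typedef enum { DemoSequenceTypeDNA, DemoSequenceTypeProtein } DemoSequenceType;

// Hypothetical generic sequence: a type tag plus one-letter symbols.
@interface DemoSequence : NSObject
{
    DemoSequenceType type;
    NSString *symbols;
}
- (id)initWithType:(DemoSequenceType)aType symbols:(NSString *)aString;
- (DemoSequenceType)type;
- (NSString *)symbols;
@end

@implementation DemoSequence

- (id)initWithType:(DemoSequenceType)aType symbols:(NSString *)aString
{
    self = [super init];
    if (self != nil) {
        type = aType;
        symbols = [aString copy];
    }
    return self;
}

- (DemoSequenceType)type { return type; }
- (NSString *)symbols { return symbols; }

@end

// Hypothetical tool: always initialized with a generic sequence.
@interface DemoComplementTool : NSObject
{
    DemoSequence *sequence;
}
- (id)initWithSequence:(DemoSequence *)aSequence;
- (DemoSequence *)complement;
@end

@implementation DemoComplementTool

- (id)initWithSequence:(DemoSequence *)aSequence
{
    self = [super init];
    if (self != nil) {
        sequence = aSequence;
    }
    return self;
}

- (DemoSequence *)complement
{
    NSString *source = [sequence symbols];
    if ([sequence type] != DemoSequenceTypeDNA) {
        // Not a sequence this tool can operate on: return an unchanged copy.
        return [[DemoSequence alloc] initWithType:[sequence type] symbols:source];
    }
    NSMutableString *result = [NSMutableString stringWithCapacity:[source length]];
    NSUInteger i;
    for (i = 0; i < [source length]; i++) {
        unichar paired;
        switch ([source characterAtIndex:i]) {
            case 'A': paired = 'T'; break;
            case 'T': paired = 'A'; break;
            case 'C': paired = 'G'; break;
            case 'G': paired = 'C'; break;
            default:  paired = 'N'; break;   // assumption: anything else becomes N
        }
        [result appendFormat:@"%C", paired];
    }
    return [[DemoSequence alloc] initWithType:DemoSequenceTypeDNA symbols:result];
}

@end
```

A tool built around a DNA sequence returns its base-by-base complement, while the same tool built around a protein sequence returns a new sequence with identical contents, so calling code never has to check the sequence type before asking for a transformation.<BR>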

<BR>

<BR>

</FONT></SPAN>

</BODY>

</HTML>