[Biococoa-dev] Base design

Alexander Griekspoor mek at mekentosj.com
Wed Aug 11 15:44:58 EDT 2004


John,

Before you start, it might be nice to read the pieces of documentation
included below, which describe the logic and architecture of the BioJava
setup we discussed earlier. I like their idea of Symbols that make up
Alphabets, from which you can generate SymbolLists. Have a look at their
docs, starting at the org.biojava.bio.symbol package, to see which
methods each class implements. I like the idea of sticking as close as
possible to their setup in terms of classes and class methods; the
implementation is up to you, of course. I also suggest downloading the
BioJava source to see how they implemented things.
Some remarks about the methods you mention:

> (NSString *) symbol; // A, T, etc.
This would be BioJava's "getToken()" method. They chose a char instead
of an object like NSString; perhaps that would indeed be wiser if this
method is used in memory-sensitive code.

> (BOOL) representsSingleNucleotide; // YES if G, NO if W, etc.
BioJava neatly distinguishes between an atomic symbol (which represents
only a single symbol) and basis symbols (like N, purine, w, g, etc.). I
guess that removes a lot of checking, as it's already defined in the
class (see the documentation snippets below).
>

I still wonder how we should implement the specific classes (you
mention the shared headers, which I don't quite understand yet).
I haven't found out how BioJava does that either, but it seems you have
a plan already. Anyway, maybe it's time to plunge into the coding waters
;-)

To switch to a somewhat different subject, it might be a good idea to
ask everyone to document their code using the HeaderDoc system.
It's quite easy, and later on it allows us to quickly generate
documentation in a format familiar to most developers. Here's a link to
the exact details:
http://developer.apple.com/documentation/DeveloperTools/Conceptual/HeaderDoc/index.html?http://developer.apple.com/documentation/DeveloperTools/Conceptual/HeaderDoc/tags/chapter_2_section_2.html#//apple_ref/doc/uid/TP40001215-CH346-DontLinkElementID_6129

Click on "show TOC" to get to the complete documentation for
HeaderDoc. As an example I have attached a file from the AGRegex
framework; it nicely shows what the result will look like. In case the
attachment gets removed, the file is also included inline at the end of
this message.

The really cool thing is that you can use Xcode's AppleScript menu to
quickly insert templates; it's as easy as it can get!

Let me know what you think of all this...
Cheers,
Alex



  Package org.biojava.bio.symbol Description

Representation of the Symbols that make up a sequence, and locations  
within them.

This package is not intended to have strong biological ties. It is here  
to make programming things like dynamic-programming much easier. It  
also handles serialization of well-known alphabets so that applicable  
singleton properties of alphabets and Symbols are maintained.

All coordinates are in 'bio-coordinates' - that is - legal indexes  
start from 1 and a range is inclusive (4 to 7 includes 4, 5, 6 and 7).
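
As a rough illustration of the coordinate convention above, here is a minimal Java sketch that maps a 1-based, inclusive range onto Java's 0-based, end-exclusive substring(). The class BioCoordinates and its methods are hypothetical names made up for this example; they are not part of BioJava.

```java
// Hypothetical helper (not a BioJava class): translates 1-based, inclusive
// "bio-coordinates" into Java's 0-based, end-exclusive string indexes.
final class BioCoordinates {
    private BioCoordinates() {}

    // Returns the symbols from start to end, both included, counting from 1.
    static String subSequence(String sequence, int start, int end) {
        if (start < 1 || end > sequence.length() || start > end) {
            throw new IndexOutOfBoundsException(
                "range " + start + ".." + end + " is outside 1.." + sequence.length());
        }
        // start - 1 converts to a 0-based index; end is left unchanged because
        // substring() excludes its end index while bio-coordinates include it.
        return sequence.substring(start - 1, end);
    }

    // Number of positions covered by an inclusive range.
    static int length(int start, int end) {
        return end - start + 1;
    }
}
```

With this helper, the range 4 to 7 of "acgtacgt" covers four positions and yields "tacg".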

A Symbol is a single token. The Symbol maintains a name, a token  
(char), and an Annotation bundle. A set of Symbols is represented by an  
Alphabet instance. If the Alphabet can guarantee that there are only
ever a finite number of Symbols contained within it, then it must
implement FiniteAlphabet. The Symbol objects within a FiniteAlphabet  
can be tested for equality by comparing their references directly. A  
SymbolList is a string over the Symbols from a single Alphabet  
instance. This allows you to represent a sequence of tokens, such as  
DNA nucleotides, or stock-market prices.

Locations within a SymbolList can be represented by a Location object.  
This interface defines a sub-set of points that are within the  
Location. This uses bio-coordinates, and defines all the operations  
that you are likely to need to build your own Locations (union,  
intersection and the like).

public interface Symbol
extends Annotatable

A single symbol.

  This is the atomic unit of a SymbolList, or a sequence. It allows for
fine-grain fly-weighting, so that there can be one instance of each
symbol that is referenced multiple times.

  Symbols from finite alphabets are identifiable using the == operator.
Symbols from infinite alphabets may have some specific API to test for
equality, but should really override the equals() method.

  Some symbols represent a single token in the sequence. For example,
there is a Symbol instance for adenine in DNA, and another one for
cytosine. Symbols can potentially represent sets of Symbols. For
example, n represents any DNA Symbol, and X any protein Symbol. Gap
represents the knowledge that there is no Symbol. In addition, some
symbols represent ordered lists of other Symbols. For example, the
codon agt can be represented by a single Symbol from the Alphabet
DNAxDNAxDNA. Symbols can represent ambiguity over these complex
symbols. For example, you could construct a Symbol instance that
represents the codons atn. This matches the codons {ata, att, atg,
atc}. It is also possible to build a Symbol instance that represents
all stop codons {taa, tag, tga}, which cannot be represented in terms
of a single ambiguous n'tuple.

  There are three Symbol interfaces. Symbol is the most generic. It has
the methods getToken and getName so that the Symbol can be textually
represented. In addition, it defines getMatches that returns an
Alphabet over all the AtomicSymbol instances that match the Symbol (N
would return an Alphabet containing {A, G, C, T}, and Gap would return
{}).

  BasisSymbol instances can always be represented by an n'tuple of
BasisSymbol instances. It adds the method getSymbols so that you can
retrieve this list. For example, the tuple [ant] is a BasisSymbol, as
it is uniquely specified with those three BasisSymbol instances a, n
and t. n is a BasisSymbol instance as it is uniquely represented by
itself.

  AtomicSymbol instances specialize BasisSymbol by guaranteeing that
getMatches returns a set containing only that instance. That is, they
are indivisible. The DNA nucleotides are instances of AtomicSymbol, as
are individual codons. The stop codon {tag} will have a getMatches
method that returns {tag}, a getBases method that also returns {tag}
and a getSymbols method that returns the List [t, a, g]. {tna} is a
BasisSymbol but not an AtomicSymbol as it matches four AtomicSymbol
instances {taa, tga, tca, tta}. It follows that each symbol in
getSymbols for an AtomicSymbol instance will also be AtomicSymbol
instances.
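
The relationship between the three interfaces can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not BioJava's actual code: the Sketch* interfaces, Nucleotide and Ambiguity are hypothetical names, annotations are left out, and getMatches returns a plain Set where the description above says the real method returns an Alphabet.

```java
import java.util.List;
import java.util.Set;

// Simplified, hypothetical mirror of the three interfaces described above.
// getMatches() returns a Set here only to keep the sketch short.
interface SketchSymbol {
    char getToken();
    String getName();
    Set<SketchAtomicSymbol> getMatches();
}

interface SketchBasisSymbol extends SketchSymbol {
    List<SketchBasisSymbol> getSymbols();
}

// Marker for symbols whose getMatches() contains only themselves.
interface SketchAtomicSymbol extends SketchBasisSymbol {}

// An unambiguous nucleotide such as a or t.
final class Nucleotide implements SketchAtomicSymbol {
    private final char token;
    private final String name;

    Nucleotide(char token, String name) {
        this.token = token;
        this.name = name;
    }

    public char getToken() { return token; }
    public String getName() { return name; }
    public Set<SketchAtomicSymbol> getMatches() { return Set.of(this); }
    public List<SketchBasisSymbol> getSymbols() { return List.of(this); }
}

// An ambiguity symbol such as n: uniquely represented by itself,
// but matching several atomic symbols.
final class Ambiguity implements SketchBasisSymbol {
    private final char token;
    private final String name;
    private final Set<SketchAtomicSymbol> matches;

    Ambiguity(char token, String name, Set<SketchAtomicSymbol> matches) {
        this.token = token;
        this.name = name;
        this.matches = Set.copyOf(matches);
    }

    public char getToken() { return token; }
    public String getName() { return name; }
    public Set<SketchAtomicSymbol> getMatches() { return matches; }
    public List<SketchBasisSymbol> getSymbols() { return List.of(this); }
}
```

With a hierarchy like this, a check such as representsSingleNucleotide reduces to a type test (symbol instanceof SketchAtomicSymbol) instead of a per-symbol lookup.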

public interface AtomicSymbol
extends BasisSymbol

  A symbol that is not ambiguous.

  Atomic symbols are the real underlying elements that a SymbolList is
meant to be composed of. DNA nucleotides are atomic, as are
amino-acids. The getMatches() method should return an alphabet
containing just the one Symbol.

  The Symbol instances for single codons would be instances of
AtomicSymbol as they can only be represented as a Set of symbols from
their alphabet that contains just that one symbol.

  AtomicSymbol instances guarantee that getMatches returns an Alphabet
containing just that Symbol and each element of the List returned by
getSymbols is also atomic.

public interface BasisSymbol
extends Symbol

  A symbol that can be represented as a string of Symbols.

  BasisSymbol instances can always be represented uniquely as a single
List of BasisSymbol instances. The symbol N is a BasisSymbol - it can
be uniquely represented by N. It matches {a, g, c, t}. Similarly, the
symbol atn is a BasisSymbol, as it can be uniquely represented with a
single list of symbols [a, t, n]. Its getMatches will return the set
{ata, att, atg, atc}.

  The getSymbols method returns the unique list of BasisSymbol instances
that this is composed from. For example, the codon ambiguity symbol
atn will have a getSymbols method that returns the list [a, t, n]. The
getMatches method will return an Alphabet containing each AtomicSymbol
that can be made by expanding the list of BasisSymbol instances.
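
The expansion step described above can be sketched with plain strings. This is only an illustration of the idea, assuming a tiny lookup table that covers a, c, g, t and n; TupleExpander and expand are hypothetical names, and BioJava itself works on Symbol objects, not on strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not a BioJava class): expands an ambiguous tuple such
// as "atn" into every unambiguous tuple it matches.
final class TupleExpander {
    // Assumed lookup table; only a, c, g, t and n are covered in this sketch.
    private static final Map<Character, String> MATCHES = Map.of(
        'a', "a", 'c', "c", 'g', "g", 't', "t", 'n', "acgt");

    private TupleExpander() {}

    static List<String> expand(String tuple) {
        List<String> results = new ArrayList<>();
        results.add("");
        for (char token : tuple.toCharArray()) {
            String options = MATCHES.get(token);
            if (options == null) {
                throw new IllegalArgumentException("unknown token: " + token);
            }
            // Extend every partial tuple with every symbol this position can match.
            List<String> next = new ArrayList<>();
            for (String prefix : results) {
                for (char option : options.toCharArray()) {
                    next.add(prefix + option);
                }
            }
            results = next;
        }
        return results;
    }
}
```

Calling TupleExpander.expand("atn") gives the four codons ata, atc, atg and att, the same set the documentation lists for atn.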

public interface Alphabet
extends Annotatable

  The set of AtomicSymbols which can be concatenated together to make a
SymbolList.

  A non-atomic symbol is considered to be contained within this alphabet
if all of the atomic symbols that it could match are members of this
alphabet.

public interface FiniteAlphabet
extends Alphabet

An alphabet over a finite set of Symbols.

  This interface makes the distinction between an alphabet over a finite
(and possibly small) number of symbols and an Alphabet over an
infinite (or extremely large) set of symbols. Within a FiniteAlphabet,
the == operator should be sufficient to decide upon equality for all
AtomicSymbol instances.

  The alphabet functions as the repository of objects in the fly-weight
design pattern. Only symbols within an alphabet should appear in
objects that claim to use the alphabet - otherwise something is in
error.
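
A minimal sketch of the fly-weight idea, assuming one shared instance per token: because the alphabet hands out the same object every time, == is enough to compare symbols. FlyweightSymbol and SketchFiniteAlphabet are hypothetical names for this example and do not correspond to BioJava classes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical symbol type; instances are meant to be created only by the alphabet.
final class FlyweightSymbol {
    private final char token;

    FlyweightSymbol(char token) { this.token = token; }

    char getToken() { return token; }
}

// Hypothetical finite alphabet acting as the fly-weight repository:
// it owns exactly one instance per token.
final class SketchFiniteAlphabet {
    private final Map<Character, FlyweightSymbol> symbols = new HashMap<>();

    SketchFiniteAlphabet(String tokens) {
        for (char token : tokens.toCharArray()) {
            symbols.put(token, new FlyweightSymbol(token));
        }
    }

    // Always returns the same shared instance for a given token.
    FlyweightSymbol symbolForToken(char token) {
        FlyweightSymbol symbol = symbols.get(token);
        if (symbol == null) {
            throw new IllegalArgumentException("not in alphabet: " + token);
        }
        return symbol;
    }

    // A symbol belongs to the alphabet only if it is the shared instance itself.
    boolean contains(FlyweightSymbol symbol) {
        return symbols.get(symbol.getToken()) == symbol;
    }
}
```

A symbol created outside the alphabet, even with the same token, is rejected by contains, which is one way to catch the "something is in error" case early.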

public interface SymbolList
extends Changeable

A sequence of symbols that belong to an alphabet.

  This uses biological coordinates (1 to length).

public interface GappedSymbolList
extends SymbolList

This extends SymbolList with API for manipulating, inserting and
deleting gaps.

  You could make a SymbolList that contains gaps directly. However, this
leaves you with a nasty problem if you wish to support gap-edit
operations. Also, the original SymbolList must either be copied or
lost.

  GappedSymbolList solves these problems. It will maintain a
data-structure that places gaps. You can add and remove the gaps by
using the public API.

  For gap-insert operations, the insert index is the position that will
become a gap. The symbol currently there will move to a higher index.
To insert leading gaps, add gaps at index 1. To insert trailing gaps,
add gaps at index length+1.
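
The index rules for gap insertion can be illustrated with a small sketch. Note the assumptions: GappedSketch, addGap and removeGap are hypothetical names, '-' stands in for the gap symbol, and the class stores a flat copy of the sequence, which is exactly what the real GappedSymbolList is designed to avoid. It is only meant to show the 1-based index semantics.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the gap-insert index rules; '-' marks a gap.
final class GappedSketch {
    private final List<Character> symbols = new ArrayList<>();

    GappedSketch(String ungapped) {
        for (char c : ungapped.toCharArray()) {
            symbols.add(c);
        }
    }

    int length() { return symbols.size(); }

    // index is 1-based; valid values run from 1 (leading gap)
    // to length() + 1 (trailing gap).
    void addGap(int index) {
        if (index < 1 || index > symbols.size() + 1) {
            throw new IndexOutOfBoundsException("gap index " + index);
        }
        // The symbol currently at this position moves to a higher index.
        symbols.add(index - 1, '-');
    }

    // Removes the gap at the given 1-based index; refuses to remove a real symbol.
    void removeGap(int index) {
        if (index < 1 || index > symbols.size()) {
            throw new IndexOutOfBoundsException("gap index " + index);
        }
        if (symbols.get(index - 1) != '-') {
            throw new IllegalArgumentException("no gap at index " + index);
        }
        symbols.remove(index - 1);
    }

    @Override
    public String toString() {
        StringBuilder text = new StringBuilder();
        for (char c : symbols) {
            text.append(c);
        }
        return text.toString();
    }
}
```

Starting from "acgt", addGap(1) gives "-acgt", a following addGap(6) (length+1) gives "-acgt-", and addGap(3) then gives "-a-cgt-".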


**********************
// AGRegex.h
//
// Copyright (c) 2002 Aram Greenman. All rights reserved.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
//    notice, this list of conditions and the following disclaimer.
// 2. Redistributions in binary form must reproduce the above copyright
//    notice, this list of conditions and the following disclaimer in the
//    documentation and/or other materials provided with the distribution.
// 3. The name of the author may not be used to endorse or promote
//    products derived from this software without specific prior written
//    permission.
//
// THIS SOFTWARE IS PROVIDED BY THE AUTHOR "AS IS" AND ANY EXPRESS OR
// IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
// WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
// DISCLAIMED. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT,
// INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
// (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
// SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
// HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
// STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
// IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
// POSSIBILITY OF SUCH DAMAGE.

#import <Foundation/NSObject.h>
#import <Foundation/NSRange.h>

@class AGRegex, NSArray, NSString;

/*!
@enum Options
Options defined for -initWithPattern:options:. Two or more options can  
be combined with the bitwise OR operator.
@constant AGRegexCaseInsensitive Matching is case insensitive.  
Equivalent to /i in Perl.
@constant AGRegexDotAll Dot metacharacter matches any character  
including newline. Equivalent to /s in Perl.
@constant AGRegexExtended Allow whitespace and comments in the pattern.  
Equivalent to /x in Perl.
@constant AGRegexLazy Makes greedy quantifiers lazy and lazy  
quantifiers greedy. No equivalent in Perl.
@constant AGRegexMultiline Caret and dollar anchors match at newline.  
Equivalent to /m in Perl.
*/
enum {
	AGRegexCaseInsensitive = 1,
	AGRegexDotAll = 2,
	AGRegexExtended = 4,
	AGRegexLazy = 8,
	AGRegexMultiline = 16
};

/*!
@class AGRegexMatch
@abstract A single occurrence of a regular expression.
@discussion An AGRegexMatch represents a single occurrence of a regular  
expression within the target string. The range of each subpattern  
within the target string is returned by -range, -rangeAtIndex:, or  
-rangeNamed:. The part of the target string that matched each  
subpattern is returned by -group, -groupAtIndex:, or -groupNamed:.
*/
@interface AGRegexMatch : NSObject {
	AGRegex *regex;
	NSString *string;
	int *matchv;
	int count;
}

/*!
@method count
The number of capturing subpatterns, including the pattern itself. */
- (int)count;

/*!
@method group
Returns the part of the target string that matched the pattern. */
- (NSString *)group;

/*!
@method groupAtIndex:
Returns the part of the target string that matched the subpattern at  
the given index or nil if it wasn't matched. The subpatterns are  
indexed in order of their opening parentheses, 0 is the entire pattern,  
1 is the first capturing subpattern, and so on. */
- (NSString *)groupAtIndex:(int)idx;

/*!
@method groupNamed:
Returns the part of the target string that matched the subpattern of  
the given name or nil if it wasn't matched. */
- (NSString *)groupNamed:(NSString *)name;

/*!
@method range
Returns the range of the target string that matched the pattern. */
- (NSRange)range;

/*!
@method rangeAtIndex:
Returns the range of the target string that matched the subpattern at  
the given index or {NSNotFound, 0} if it wasn't matched. The  
subpatterns are indexed in order of their opening parentheses, 0 is the  
entire pattern, 1 is the first capturing subpattern, and so on. */
- (NSRange)rangeAtIndex:(int)idx;

/*!
@method rangeNamed:
Returns the range of the target string that matched the subpattern of  
the given name or {NSNotFound, 0} if it wasn't matched. */
- (NSRange)rangeNamed:(NSString *)name;

/*!
@method string
Returns the target string. */
- (NSString *)string;

@end

/*!
@class AGRegex
@abstract A Perl-compatible regular expression class.
@discussion An AGRegex is created with -initWithPattern: or  
-initWithPattern:options: or the corresponding class methods  
+regexWithPattern: or +regexWithPattern:options:. These take a regular  
expression pattern string and the bitwise OR of zero or more option  
flags. For example:

<code>    AGRegex *regex = [[AGRegex alloc]  
initWithPattern:@"(paran|andr)oid"  
options:AGRegexCaseInsensitive];</code>

Matching is done with -findInString: or -findInString:range: which look  
for the first occurrence of the pattern in the target string and return  
an AGRegexMatch or nil if the pattern was not found.

<code>    AGRegexMatch *match = [regex  
findInString:@"paranoid android"];</code>

A match object returns a captured subpattern by -group, -groupAtIndex:,  
or -groupNamed:, or the range of a captured subpattern by -range,  
-rangeAtIndex:, or -rangeNamed:. The subpatterns are indexed in order  
of their opening parentheses, 0 is the entire pattern, 1 is the first  
capturing subpattern, and so on. -count returns the total number of  
subpatterns, including the pattern itself. The following prints the  
result of our last match case:

<code>    for (i = 0; i < [match count]; i++)<br  
/>
        NSLog(@"%d %@  
%@", i, NSStringFromRange([match rangeAtIndex:i]), [match  
groupAtIndex:i]);</code>

<code>    0 {0, 8} paranoid<br />
    1 {0, 5} paran</code>

If any of the subpatterns didn't match, -groupAtIndex: will  return  
nil, and -rangeAtIndex: will return {NSNotFound, 0}. For example, if we  
change our original pattern to "(?:(paran)|(andr))oid" we will get the  
following output:

<code>    0 {0, 8} paranoid<br />
    1 {0, 5} paran<br />
    2 {2147483647, 0} (null)</code>

-findAllInString: and -findAllInString:range: return an NSArray of all  
non-overlapping occurrences of the pattern in the target string.  
-findEnumeratorInString: and -findEnumeratorInString:range: return an  
NSEnumerator for all non-overlapping occurrences of the pattern in the  
target string. For example,

<code>    NSArray *all = [regex  
findAllInString:@"paranoid android"];</code>

The first object in the returned array is the match case for "paranoid"  
and the second object is the match case for "android".

AGRegex provides the methods -replaceWithString:inString: and  
-replaceWithString:inString:limit: to perform substitution on strings.

<code>    AGRegex *regex = [AGRegex  
regexWithPattern:@"remote"];<br />
    NSString *result = [regex  
replaceWithString:@"complete" inString:@"remote control"]; //  
result is "complete control"</code>

Captured subpatterns can be interpolated into the replacement string  
using the syntax $x or ${x} where x is the index or name of the  
subpattern. $0 and $& both refer to the entire pattern. Additionally,  
the case modifier sequences \U...\E, \L...\E, \u, and \l are allowed in  
the replacement string. All other escape sequences are handled  
literally.

<code>    AGRegex *regex = [AGRegex  
regexWithPattern:@"[usr]"];<br />
    NSString *result = [regex  
replaceWithString:@"\\u$&." inString:@"Back in the ussr"];  
// result is "Back in the U.S.S.R."</code>

Note that you have to escape a backslash to get it into an NSString  
literal.

Named subpatterns may also be used in the pattern and replacement  
strings, like in Python.

<code>    AGRegex *regex = [AGRegex  
regexWithPattern:@"(?P<who>\\w+) is a  
(?P<what>\\w+)"];<br />
&nbsp;&nbsp;&nbsp;&nbsp;NSString *result = [regex  
replaceWithString:@"Jackie is a $what, $who is a runt"  
inString:@"Judy is a punk"]; // result is "Jackie is a punk, Judy  
is a runt"</code>

Finally, AGRegex provides -splitString: and -splitString:limit: which  
return an NSArray created by splitting the target string at each  
occurrence of the pattern. For example:

<code>    AGRegex *regex = [AGRegex  
regexWithPattern:@"ea?"];<br />
    NSArray *result = [regex  
splitString:@"Repeater"]; // result is "R", "p", "t", "r"</code>

If there are captured subpatterns, they are returned in the array.

<code>    AGRegex *regex = [AGRegex  
regexWithPattern:@"e(a)?"];<br />
    NSArray *result = [regex  
splitString:@"Repeater"]; // result is "R", "p", "a", "t",  
"r"</code>

In Perl, this would return "R", undef, "p", "a", "t", undef, "r".  
Unfortunately, there is no convenient way to represent this in an  
NSArray. (NSNull could be used in place of undef, but then all members  
of the array couldn't be expected to be NSStrings.)
*/
@interface AGRegex : NSObject {
	void *regex;
	void *extra;
	int groupCount;
}

/*!
@method regexWithPattern:
Creates a new regex using the given pattern string. Returns nil if the  
pattern string is invalid. */
+ (id)regexWithPattern:(NSString *)pat;

/*!
@method regexWithPattern:options:
Creates a new regex using the given pattern string and option flags.  
Returns nil if the pattern string is invalid. */
+ (id)regexWithPattern:(NSString *)pat options:(int)opts;


/*!
@method initWithPattern:
Initializes the regex using the given pattern string. Returns nil if  
the pattern string is invalid. */
- (id)initWithPattern:(NSString *)pat;

/*!
@method initWithPattern:options:
Initializes the regex using the given pattern string and option flags.  
Returns nil if the pattern string is invalid. */
- (id)initWithPattern:(NSString *)pat options:(int)opts;

/*!
@method findInString:
Calls findInString:range: using the full range of the target string. */
- (AGRegexMatch *)findInString:(NSString *)str;

/*!
@method findInString:range:
Returns an AGRegexMatch for the first occurrence of the regex in the  
given range of the target string or nil if none is found. */
- (AGRegexMatch *)findInString:(NSString *)str range:(NSRange)r;

/*!
@method findAllInString:
Calls findAllInString:range: using the full range of the target string.  
*/
- (NSArray *)findAllInString:(NSString *)str;

/*!
@method findAllInString:range:
Returns an array of all non-overlapping occurrences of the regex in the  
given range of the target string. The members of the array are  
AGRegexMatches. */
- (NSArray *)findAllInString:(NSString *)str range:(NSRange)r;

/*!
@method findEnumeratorInString:
Calls findEnumeratorInString:range: using the full range of the target  
string. */
- (NSEnumerator *)findEnumeratorInString:(NSString *)str;

/*!
@method findEnumeratorInString:range:
Returns an enumerator for all non-overlapping occurrences of the regex  
in the given range of the target string. The objects returned by the  
enumerator are AGRegexMatches. */
- (NSEnumerator *)findEnumeratorInString:(NSString *)str  
range:(NSRange)r;

/*!
@method replaceWithString:inString:
Calls replaceWithString:inString:limit: with no limit. */
- (NSString *)replaceWithString:(NSString *)rep inString:(NSString  
*)str;

/*!
@method replaceWithString:inString:limit:
Returns the string created by replacing occurrences of the regex in the  
target string with the replacement string. If the limit is positive, no  
more than that many replacements will be made.

Captured subpatterns can be interpolated into the replacement string  
using the syntax $x or ${x} where x is the index or name of the  
subpattern. $0 and $& both refer to the entire pattern.  
Additionally, the case modifier sequences \U...\E, \L...\E, \u, and \l  
are allowed in the replacement string. All other escape sequences are  
handled literally. */
- (NSString *)replaceWithString:(NSString *)rep inString:(NSString  
*)str limit:(int)limit;

/*!
@method splitString:
Calls splitString:limit: with no limit. */
- (NSArray *)splitString:(NSString *)str;

/*!
@method splitString:limit:
Returns an array of strings created by splitting the target string at  
each occurrence of the pattern. If the limit is positive, no more than  
that many splits will be made. If there are captured subpatterns, they  
are returned in the array.  */
- (NSArray *)splitString:(NSString *)str limit:(int)lim;

@end

*********************************************************
                     ** Alexander Griekspoor **
*********************************************************
              The Netherlands Cancer Institute
              Department of Tumorbiology (H4)
         Plesmanlaan 121, 1066 CX, Amsterdam
                    Tel:  + 31 20 - 512 2023
                    Fax:  + 31 20 - 512 2029
                   AIM: mekentosj at mac.com
                    E-mail: a.griekspoor at nki.nl
                Web: http://www.mekentosj.com

	Claiming that the Macintosh is inferior to Windows
	because most people use Windows, is like saying
	that all other restaurants serve food that is
	inferior to McDonalds

*********************************************************