[Biococoa-dev] Base design
Alexander Griekspoor
mek at mekentosj.com
Wed Aug 11 15:44:58 EDT 2004
John,
Before you start off, perhaps it would be nice to read the attached pieces of
documentation that describe the logic and architecture of the BioJava
setup we discussed earlier. I like their idea of Symbols that
make up Alphabets, from which you can generate SymbolLists. Have a look
at their docs, starting at the org.biojava.bio.symbol package, to see
which methods each class implements. I like the idea of sticking as
close as possible to their setup in terms of classes and class methods;
the implementation is up to you, of course. I also suggest downloading
the BioJava source to see how they implemented things.
Some remarks about the methods you mention:
> (NSString *) symbol; // A, T, etc.
This would be BioJava's "getToken()" method. They chose a char
instead of an object like NSString; perhaps that would indeed be wiser
if this method is used in memory-sensitive code.
> (BOOL) representsSingleNucleotide; // YES if G, NO if W, etc.
BioJava neatly distinguishes between an AtomicSymbol (which
represents only one symbol) and BasisSymbols (like N, purine, w, g, etc.). I
guess that removes a lot of checking, as it's defined in the class
already (see the documentation snippets below).
>
I still wonder how we should implement the specific classes (you
mention the shared headers, which I don't quite understand yet).
I haven't found out how BioJava does that either, but it seems you have a
plan already. Anyway, maybe it's time to plunge into the coding waters
;-)
To switch to a somewhat different subject, it might be a good idea to
ask everyone to document their code using the HeaderDoc system.
It's quite easy, and later on it allows us to quickly generate
documentation in a format familiar to most developers. Here's a link to
the exact details:
http://developer.apple.com/documentation/DeveloperTools/Conceptual/HeaderDoc/index.html?http://developer.apple.com/documentation/DeveloperTools/Conceptual/HeaderDoc/tags/chapter_2_section_2.html#//apple_ref/doc/uid/TP40001215-CH346-DontLinkElementID_6129
Click on "show TOC" to get to the complete documentation for
HeaderDoc. As an example I have attached a file from the AGRegex
framework; it nicely shows what the result will look like. In case the
attachment gets removed, the file is also included inline at the end of
this message.
The really cool thing is that you can use Xcode's AppleScript menu to
quickly insert templates, as easy as it can get!
Let me know what you think of all this...
Cheers,
Alex
Package org.biojava.bio.symbol Description
Representation of the Symbols that make up a sequence, and locations
within them.
This package is not intended to have strong biological ties. It is here
to make programming things like dynamic-programming much easier. It
also handles serialization of well-known alphabets so that applicable
singleton properties of alphabets and Symbols are maintained.
All coordinates are in 'bio-coordinates' - that is - legal indexes
start from 1 and a range is inclusive (4 to 7 includes 4, 5, 6 and 7).
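To make the coordinate convention concrete, here is a minimal Java sketch of how a 1-based, inclusive range maps onto Java's 0-based, half-open substring indexes. The class and method names are invented for this illustration and are not part of BioJava.

```java
// Hypothetical helper, not a BioJava class: illustrates bio-coordinates only.
class BioCoordinates {
    /** Returns the symbols at bio-coordinates start..end (1-based, inclusive). */
    public static String subSequence(String seq, int start, int end) {
        if (start < 1 || end > seq.length() || start > end) {
            throw new IndexOutOfBoundsException(
                "bio-coordinates must satisfy 1 <= start <= end <= length");
        }
        // 1-based inclusive [start, end] maps to 0-based half-open [start - 1, end).
        return seq.substring(start - 1, end);
    }

    /** Number of positions covered by an inclusive range. */
    public static int rangeLength(int start, int end) {
        return end - start + 1;
    }

    public static void main(String[] args) {
        // Positions 4, 5, 6 and 7 of the sequence.
        System.out.println(subSequence("gattacagatt", 4, 7)); // prints "taca"
        System.out.println(rangeLength(4, 7));                // prints 4
    }
}
```

The off-by-one adjustment lives in exactly one place, which is probably the main benefit of wrapping the convention in a helper rather than repeating it at every call site.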
A Symbol is a single token. The Symbol maintains a name, a token
(char), and an Annotation bundle. A set of Symbols is represented by an
Alphabet instance. If the Alphabet can guarantee that there are only
ever a finite number of Symbols contained within it, then it must
implement FiniteAlphabet. The Symbol objects within a FiniteAlphabet
can be tested for equality by comparing their references directly. A
SymbolList is a string over the Symbols from a single Alphabet
instance. This allows you to represent a sequence of tokens, such as
DNA nucleotides, or stock-market prices.
Locations within a SymbolList can be represented by a Location object.
This interface defines a sub-set of points that are within the
Location. This uses bio-coordinates, and defines all the operations
that you are likely to need to build your own Locations (union,
intersection and the like).
public interface Symbol
extends Annotatable
A single symbol.
This is the atomic unit of a SymbolList, or a sequence. It allows for
fine-grain fly-weighting, so that there can be one instance of each
symbol that is referenced multiple times.
Symbols from finite alphabets are identifiable using the == operator.
Symbols from infinite alphabets may have some specific API to test for
equality, but should really over-ride the equals() method.
Some symbols represent a single token in the sequence. For example,
there is a Symbol instance for adenine in DNA, and another one for
cytosine. Symbols can potentially represent sets of Symbols. For
example, n represents any DNA Symbol, and X any protein Symbol. Gap
represents the knowledge that there is no Symbol. In addition, some
symbols represent ordered lists of other Symbols. For example, the
codon agt can be represented by a single Symbol from the Alphabet
DNAxDNAxDNA. Symbols can represent ambiguity over these complex
symbols. For example, you could construct a Symbol instance that
represents the codons atn. This matches the codons {ata, att, atg,
atc}. It is also possible to build a Symbol instance that represents
all stop codons {taa, tag, tga}, which can not be represented in terms
of a single ambiguous n'tuple.
There are three Symbol interfaces. Symbol is the most generic. It has
the methods getToken and getName so that the Symbol can be textually
represented. In addition, it defines getMatches that returns an
Alphabet over all the AtomicSymbol instances that match the Symbol (N
would return an Alphabet containing {A, G, C, T}, and Gap would return
{}).
BasisSymbol instances can always be represented by an n'tuple of
BasisSymbol instances. It adds the method getSymbols so that you can
retrieve this list. For example, the tuple [ant] is a BasisSymbol, as
it is uniquely specified with those three BasisSymbol instances a, n
and t. n is a BasisSymbol instance as it is uniquely represented by
itself.
AtomicSymbol instances specialize BasisSymbol by guaranteeing that
getMatches returns a set containing only that instance. That is, they
are indivisible. The DNA nucleotides are instances of AtomicSymbol, as
are individual codons. The stop codon {tag} will have a getMatches
method that returns {tag}, a getBases method that also returns {tag}
and a getSymbols method that returns the List [t, a, g]. {tna} is a
BasisSymbol but not an AtomicSymbol as it matches four AtomicSymbol
instances {taa, tga, tca, tta}. It follows that each symbol in
getSymbols for an AtomicSymbol instance will also be AtomicSymbol
instances.
public interface AtomicSymbol
extends BasisSymbol
A symbol that is not ambiguous.
Atomic symbols are the real underlying elements that a SymbolList is
meant to be composed of. DNA nucleotides are atomic, as are
amino-acids. The getMatches() method should return an alphabet
containing just the one Symbol.
The Symbol instances for single codons would be instances of
AtomicSymbol as they can only be represented as a Set of symbols from
their alphabet that contains just that one symbol.
AtomicSymbol instances guarantee that getMatches returns an Alphabet
containing just that Symbol and each element of the List returned by
getSymbols is also atomic.
public interface BasisSymbol
extends Symbol
A symbol that can be represented as a string of Symbols.
BasisSymbol instances can always be represented uniquely as a single
List of BasisSymbol instances. The symbol N is a BasisSymbol - it can
be uniquely represented by N. It matches {a, g, c, t}. Similarly, the
symbol atn is a BasisSymbol, as it can be uniquely represented with a
single list of symbols [a, t, n]. Its getMatches will return the set
{ata, att, atg, atc}.
The getSymbols method returns the unique list of BasisSymbol instances
that this is composed from. For example, the codon ambiguity symbol
atn will have a getSymbols method that returns the list [a, t, n]. The
getMatches method will return an Alphabet containing each AtomicSymbol
that can be made by expanding the list of BasisSymbol instances.
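The expansion described above (a list of basis symbols such as [a, t, n] yielding every atomic tuple it can match) can be sketched in plain Java. This is only an illustration working on strings of tokens; the class name, the lookup table and the method names are assumptions for this example and do not reproduce BioJava's implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not BioJava code: tokens are plain chars here.
class AmbiguityExpansion {
    // Assumed lookup table: token -> the atomic DNA tokens it matches.
    static final Map<Character, String> MATCHES = new LinkedHashMap<>();
    static {
        MATCHES.put('a', "a");
        MATCHES.put('c', "c");
        MATCHES.put('g', "g");
        MATCHES.put('t', "t");
        MATCHES.put('n', "acgt"); // n matches any nucleotide
    }

    /** Expands a tuple of basis tokens (e.g. "atn") into every atomic tuple it matches. */
    public static List<String> getMatches(String basisTokens) {
        List<String> results = new ArrayList<>();
        results.add("");
        for (char token : basisTokens.toCharArray()) {
            String atoms = MATCHES.get(token);
            if (atoms == null) {
                throw new IllegalArgumentException("unknown token: " + token);
            }
            // Extend every partial tuple with every atom the token can stand for.
            List<String> extended = new ArrayList<>();
            for (String prefix : results) {
                for (char atom : atoms.toCharArray()) {
                    extended.add(prefix + atom);
                }
            }
            results = extended;
        }
        return results;
    }

    /** A tuple is atomic when the only tuple it matches is itself. */
    public static boolean isAtomic(String basisTokens) {
        List<String> matches = getMatches(basisTokens);
        return matches.size() == 1 && matches.get(0).equals(basisTokens);
    }

    public static void main(String[] args) {
        System.out.println(getMatches("atn")); // prints [ata, atc, atg, att]
        System.out.println(isAtomic("tag"));   // prints true
        System.out.println(isAtomic("tna"));   // prints false
    }
}
```

Note that a set like the stop codons {taa, tag, tga} cannot be produced by this kind of per-position expansion, which matches the remark that it is not representable as a single ambiguous tuple.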
public interface Alphabet
extends Annotatable
The set of AtomicSymbols which can be concatenated together to make a
SymbolList.
A non-atomic symbol is considered to be contained within this alphabet
if all of the atomic symbols that it could match are members of this
alphabet.
public interface FiniteAlphabet
extends Alphabet
An alphabet over a finite set of Symbols.
This interface makes the distinction between an alphabet over a finite
(and possibly small) number of symbols and an Alphabet over an
infinite (or extremely large) set of symbols. Within a FiniteAlphabet,
the == operator should be sufficient to decide upon equality for all
AtomicSymbol instances.
The alphabet functions as the repository of objects in the fly-weight
design pattern. Only symbols within an alphabet should appear in
objects that claim to use the alphabet - otherwise something is in
error.
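A rough Java sketch of the fly-weight idea: the alphabet owns exactly one instance per symbol and hands out that shared instance, so reference comparison with == is enough. SimpleSymbol and SimpleDnaAlphabet are hypothetical names made up for this example; they are not BioJava classes and leave out annotations and ambiguity symbols.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for a symbol: just a token and a name. */
final class SimpleSymbol {
    private final char token;
    private final String name;

    public SimpleSymbol(char token, String name) {
        this.token = token;
        this.name = name;
    }

    public char getToken() { return token; }
    public String getName() { return name; }
}

/** Hypothetical finite alphabet acting as the repository of shared symbols. */
final class SimpleDnaAlphabet {
    private static final SimpleDnaAlphabet INSTANCE = new SimpleDnaAlphabet();
    private final Map<Character, SimpleSymbol> symbols = new LinkedHashMap<>();

    private SimpleDnaAlphabet() {
        symbols.put('a', new SimpleSymbol('a', "adenine"));
        symbols.put('c', new SimpleSymbol('c', "cytosine"));
        symbols.put('g', new SimpleSymbol('g', "guanine"));
        symbols.put('t', new SimpleSymbol('t', "thymine"));
    }

    public static SimpleDnaAlphabet getInstance() { return INSTANCE; }

    /** Always returns the same shared instance for a given token. */
    public SimpleSymbol symbolForToken(char token) {
        SimpleSymbol symbol = symbols.get(Character.toLowerCase(token));
        if (symbol == null) {
            throw new IllegalArgumentException("not in alphabet: " + token);
        }
        return symbol;
    }

    /** True only for the alphabet's own instance, not for a look-alike copy. */
    public boolean contains(SimpleSymbol symbol) {
        return symbols.get(symbol.getToken()) == symbol;
    }
}

class FlyweightDemo {
    public static void main(String[] args) {
        SimpleDnaAlphabet dna = SimpleDnaAlphabet.getInstance();
        // Same shared object, so reference equality holds.
        System.out.println(dna.symbolForToken('a') == dna.symbolForToken('A')); // prints true
        // A freshly built look-alike is not a member of the alphabet.
        System.out.println(dna.contains(new SimpleSymbol('a', "adenine")));     // prints false
    }
}
```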
public interface SymbolList
extends Changeable
A sequence of symbols that belong to an alphabet.
This uses biological coordinates (1 to length).
public interface GappedSymbolList
extends SymbolList
This extends SymbolList with API for manipulating, inserting and
deleting gaps.
You could make a SymbolList that contains gaps directly. However, this
leaves you with a nasty problem if you wish to support gap-edit
operations. Also, the original SymbolList must either be copied or
lost.
GappedSymbolList solves these problems. It will maintain a
data-structure that places gaps. You can add and remove the gaps by
using the public API.
For gap-insert operations, the insert index is the position that will
become a gap. The symbol currently there will move to a higher index.
To insert leading gaps, add gaps at index 1. To insert trailing gaps,
add gaps at index length+1.
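The gap-insert rules above can be tried out with a small Java sketch that keeps the original sequence untouched and stores gap placement in a separate structure. GappedView and its method names are hypothetical and chosen for this example; they are not the GappedSymbolList API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not BioJava code: a gapped view over an unmodified string.
class GappedView {
    private final String source; // the ungapped sequence, never modified
    // One entry per view position: a 0-based index into source, or -1 for a gap.
    private final List<Integer> view = new ArrayList<>();

    public GappedView(String source) {
        this.source = source;
        for (int i = 0; i < source.length(); i++) {
            view.add(i);
        }
    }

    public int length() { return view.size(); }

    public String getSource() { return source; }

    /** Inserts count gaps; bio-coordinate index (1..length+1) becomes the first gap. */
    public void addGaps(int index, int count) {
        if (index < 1 || index > length() + 1 || count < 0) {
            throw new IndexOutOfBoundsException("index must be in 1..length+1");
        }
        for (int i = 0; i < count; i++) {
            view.add(index - 1, -1); // whatever was at index moves to a higher index
        }
    }

    /** Removes the gap at bio-coordinate index; fails if a real symbol is there. */
    public void removeGap(int index) {
        if (index < 1 || index > length()) {
            throw new IndexOutOfBoundsException("index must be in 1..length");
        }
        if (view.get(index - 1) != -1) {
            throw new IllegalArgumentException("position " + index + " is not a gap");
        }
        view.remove(index - 1);
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        for (int pos : view) {
            sb.append(pos < 0 ? '-' : source.charAt(pos));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        GappedView v = new GappedView("gatt");
        v.addGaps(1, 2);              // leading gaps
        v.addGaps(v.length() + 1, 1); // trailing gap
        System.out.println(v);        // prints "--gatt-"
    }
}
```

Because the source string is only ever read, gaps can be added and removed freely without copying or losing the original sequence, which is the problem the text says a gapped list is meant to solve.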
>
**********************
// AGRegex.h
//
// Copyright (c) 2002 Aram Greenman. All rights reserved.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
// 3. The name of the author may not be used to endorse or promote
// products derived from this software without specific prior written
// permission.
//
// THIS SOFTWARE IS PROVIDED BY THE AUTHOR "AS IS" AND ANY EXPRESS OR
// IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
// WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
// DISCLAIMED. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT,
// INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
// (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
// SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
// HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
// STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
// IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
// POSSIBILITY OF SUCH DAMAGE.
#import <Foundation/NSObject.h>
#import <Foundation/NSRange.h>
@class AGRegex, NSArray, NSEnumerator, NSString;
/*!
@enum Options
Options defined for -initWithPattern:options:. Two or more options can
be combined with the bitwise OR operator.
@constant AGRegexCaseInsensitive Matching is case insensitive.
Equivalent to /i in Perl.
@constant AGRegexDotAll Dot metacharacter matches any character
including newline. Equivalent to /s in Perl.
@constant AGRegexExtended Allow whitespace and comments in the pattern.
Equivalent to /x in Perl.
@constant AGRegexLazy Makes greedy quantifiers lazy and lazy
quantifiers greedy. No equivalent in Perl.
@constant AGRegexMultiline Caret and dollar anchors match at newline.
Equivalent to /m in Perl.
*/
enum {
AGRegexCaseInsensitive = 1,
AGRegexDotAll = 2,
AGRegexExtended = 4,
AGRegexLazy = 8,
AGRegexMultiline = 16
};
/*!
@class AGRegexMatch
@abstract A single occurrence of a regular expression.
@discussion An AGRegexMatch represents a single occurrence of a regular
expression within the target string. The range of each subpattern
within the target string is returned by -range, -rangeAtIndex:, or
-rangeNamed:. The part of the target string that matched each
subpattern is returned by -group, -groupAtIndex:, or -groupNamed:.
*/
@interface AGRegexMatch : NSObject {
AGRegex *regex;
NSString *string;
int *matchv;
int count;
}
/*!
@method count
The number of capturing subpatterns, including the pattern itself. */
- (int)count;
/*!
@method group
Returns the part of the target string that matched the pattern. */
- (NSString *)group;
/*!
@method groupAtIndex:
Returns the part of the target string that matched the subpattern at
the given index or nil if it wasn't matched. The subpatterns are
indexed in order of their opening parentheses, 0 is the entire pattern,
1 is the first capturing subpattern, and so on. */
- (NSString *)groupAtIndex:(int)idx;
/*!
@method groupNamed:
Returns the part of the target string that matched the subpattern of
the given name or nil if it wasn't matched. */
- (NSString *)groupNamed:(NSString *)name;
/*!
@method range
Returns the range of the target string that matched the pattern. */
- (NSRange)range;
/*!
@method rangeAtIndex:
Returns the range of the target string that matched the subpattern at
the given index or {NSNotFound, 0} if it wasn't matched. The
subpatterns are indexed in order of their opening parentheses, 0 is the
entire pattern, 1 is the first capturing subpattern, and so on. */
- (NSRange)rangeAtIndex:(int)idx;
/*!
@method rangeNamed:
Returns the range of the target string that matched the subpattern of
the given name or {NSNotFound, 0} if it wasn't matched. */
- (NSRange)rangeNamed:(NSString *)name;
/*!
@method string
Returns the target string. */
- (NSString *)string;
@end
/*!
@class AGRegex
@abstract A Perl-compatible regular expression class.
@discussion An AGRegex is created with -initWithPattern: or
-initWithPattern:options: or the corresponding class methods
+regexWithPattern: or +regexWithPattern:options:. These take a regular
expression pattern string and the bitwise OR of zero or more option
flags. For example:
<code> AGRegex *regex = [[AGRegex alloc]
initWithPattern:@"(paran|andr)oid"
options:AGRegexCaseInsensitive];</code>
Matching is done with -findInString: or -findInString:range: which look
for the first occurrence of the pattern in the target string and return
an AGRegexMatch or nil if the pattern was not found.
<code> AGRegexMatch *match = [regex
findInString:@"paranoid android"];</code>
A match object returns a captured subpattern by -group, -groupAtIndex:,
or -groupNamed:, or the range of a captured subpattern by -range,
-rangeAtIndex:, or -rangeNamed:. The subpatterns are indexed in order
of their opening parentheses, 0 is the entire pattern, 1 is the first
capturing subpattern, and so on. -count returns the total number of
subpatterns, including the pattern itself. The following prints the
result of our last match case:
<code> for (i = 0; i < [match count]; i++)<br />
NSLog(@"%d %@ %@", i, NSStringFromRange([match rangeAtIndex:i]), [match groupAtIndex:i]);</code>
<code> 0 {0, 8} paranoid<br />
1 {0, 5} paran</code>
If any of the subpatterns didn't match, -groupAtIndex: will return
nil, and -rangeAtIndex: will return {NSNotFound, 0}. For example, if we
change our original pattern to "(?:(paran)|(andr))oid" we will get the
following output:
<code> 0 {0, 8} paranoid<br />
1 {0, 5} paran<br />
2 {2147483647, 0} (null)</code>
-findAllInString: and -findAllInString:range: return an NSArray of all
non-overlapping occurrences of the pattern in the target string.
-findEnumeratorInString: and -findEnumeratorInString:range: return an
NSEnumerator for all non-overlapping occurrences of the pattern in the
target string. For example,
<code> NSArray *all = [regex
findAllInString:@"paranoid android"];</code>
The first object in the returned array is the match case for "paranoid"
and the second object is the match case for "android".
AGRegex provides the methods -replaceWithString:inString: and
-replaceWithString:inString:limit: to perform substitution on strings.
<code> AGRegex *regex = [AGRegex
regexWithPattern:@"remote"];<br />
NSString *result = [regex
replaceWithString:@"complete" inString:@"remote control"]; //
result is "complete control"</code>
Captured subpatterns can be interpolated into the replacement string
using the syntax $x or ${x} where x is the index or name of the
subpattern. $0 and $& both refer to the entire pattern. Additionally,
the case modifier sequences \U...\E, \L...\E, \u, and \l are allowed in
the replacement string. All other escape sequences are handled
literally.
<code> AGRegex *regex = [AGRegex
regexWithPattern:@"[usr]"];<br />
NSString *result = [regex
replaceWithString:@"\\u$&." inString:@"Back in the ussr"];
// result is "Back in the U.S.S.R."</code>
Note that you have to escape a backslash to get it into an NSString
literal.
Named subpatterns may also be used in the pattern and replacement
strings, like in Python.
<code> AGRegex *regex = [AGRegex
regexWithPattern:@"(?P<who>\\w+) is a
(?P<what>\\w+)"];<br />
NSString *result = [regex
replaceWithString:@"Jackie is a $what, $who is a runt"
inString:@"Judy is a punk"]; // result is "Jackie is a punk, Judy is a runt"</code>
Finally, AGRegex provides -splitString: and -splitString:limit: which
return an NSArray created by splitting the target string at each
occurrence of the pattern. For example:
<code> AGRegex *regex = [AGRegex
regexWithPattern:@"ea?"];<br />
NSArray *result = [regex
splitString:@"Repeater"]; // result is "R", "p", "t", "r"</code>
If there are captured subpatterns, they are returned in the array.
<code> AGRegex *regex = [AGRegex
regexWithPattern:@"e(a)?"];<br />
NSArray *result = [regex
splitString:@"Repeater"]; // result is "R", "p", "a", "t",
"r"</code>
In Perl, this would return "R", undef, "p", "a", "t", undef, "r".
Unfortunately, there is no convenient way to represent this in an
NSArray. (NSNull could be used in place of undef, but then all members
of the array couldn't be expected to be NSStrings.)
*/
@interface AGRegex : NSObject {
void *regex;
void *extra;
int groupCount;
}
/*!
@method regexWithPattern:
Creates a new regex using the given pattern string. Returns nil if the
pattern string is invalid. */
+ (id)regexWithPattern:(NSString *)pat;
/*!
@method regexWithPattern:options:
Creates a new regex using the given pattern string and option flags.
Returns nil if the pattern string is invalid. */
+ (id)regexWithPattern:(NSString *)pat options:(int)opts;
/*!
@method initWithPattern:
Initializes the regex using the given pattern string. Returns nil if
the pattern string is invalid. */
- (id)initWithPattern:(NSString *)pat;
/*!
@method initWithPattern:options:
Initializes the regex using the given pattern string and option flags.
Returns nil if the pattern string is invalid. */
- (id)initWithPattern:(NSString *)pat options:(int)opts;
/*!
@method findInString:
Calls findInString:range: using the full range of the target string. */
- (AGRegexMatch *)findInString:(NSString *)str;
/*!
@method findInString:range:
Returns an AGRegexMatch for the first occurrence of the regex in the
given range of the target string or nil if none is found. */
- (AGRegexMatch *)findInString:(NSString *)str range:(NSRange)r;
/*!
@method findAllInString:
Calls findAllInString:range: using the full range of the target string.
*/
- (NSArray *)findAllInString:(NSString *)str;
/*!
@method findAllInString:range:
Returns an array of all non-overlapping occurrences of the regex in the
given range of the target string. The members of the array are
AGRegexMatches. */
- (NSArray *)findAllInString:(NSString *)str range:(NSRange)r;
/*!
@method findEnumeratorInString:
Calls findEnumeratorInString:range: using the full range of the target
string. */
- (NSEnumerator *)findEnumeratorInString:(NSString *)str;
/*!
@method findEnumeratorInString:range:
Returns an enumerator for all non-overlapping occurrences of the regex
in the given range of the target string. The objects returned by the
enumerator are AGRegexMatches. */
- (NSEnumerator *)findEnumeratorInString:(NSString *)str
range:(NSRange)r;
/*!
@method replaceWithString:inString:
Calls replaceWithString:inString:limit: with no limit. */
- (NSString *)replaceWithString:(NSString *)rep inString:(NSString
*)str;
/*!
@method replaceWithString:inString:limit:
Returns the string created by replacing occurrences of the regex in the
target string with the replacement string. If the limit is positive, no
more than that many replacements will be made.
Captured subpatterns can be interpolated into the replacement string
using the syntax $x or ${x} where x is the index or name of the
subpattern. $0 and $& both refer to the entire pattern.
Additionally, the case modifier sequences \U...\E, \L...\E, \u, and \l
are allowed in the replacement string. All other escape sequences are
handled literally. */
- (NSString *)replaceWithString:(NSString *)rep inString:(NSString
*)str limit:(int)limit;
/*!
@method splitString:
Calls splitString:limit: with no limit. */
- (NSArray *)splitString:(NSString *)str;
/*!
@method splitString:limit:
Returns an array of strings created by splitting the target string at
each occurrence of the pattern. If the limit is positive, no more than
that many splits will be made. If there are captured subpatterns, they
are returned in the array. */
- (NSArray *)splitString:(NSString *)str limit:(int)lim;
@end
*********************************************************
** Alexander Griekspoor **
*********************************************************
The Netherlands Cancer Institute
Department of Tumorbiology (H4)
Plesmanlaan 121, 1066 CX, Amsterdam
Tel: + 31 20 - 512 2023
Fax: + 31 20 - 512 2029
AIM: mekentosj at mac.com
E-mail: a.griekspoor at nki.nl
Web: http://www.mekentosj.com
Claiming that the Macintosh is inferior to Windows
because most people use Windows, is like saying
that all other restaurants serve food that is
inferior to McDonalds
*********************************************************