[Biococoa-dev] More on BCSymbolSets

Mon Feb 28 01:03:06 EST 2005

At 8:05 PM -0500 2/27/05, Koen van der Drift wrote:
>Hi,
>
>Again I was looking at the BCSymbolSet code to implement it more in the BCSequence code. However with the new BCSequence class structure in place I am not so sure yet how to do this. For instance, we have the following method in each subclass:
>
>- (id) initWithString:(NSString *)entry skippingUnknownSymbols:(BOOL)skipFlag;
>
>I guess these are intended to be the designated initializer, although they have not been labeled as such in all classes. Now in BCSymbolSet we have the following (eg for DNA):
>dnaStrictSymbolSet (for C G T A) and dnaSymbolSet (for all possible nucleotides, including the ambiguous ones).

When I updated the sequence initialization methods, I did some cleanup, mostly to remove redundant methods, in particular factory methods. However, I kept the init methods exactly the same. There is clearly one issue: there are 2 designated initializers that are independent of each other, 'initWithString:skippingUnknownSymbols:' and 'initWithSymbolArray:'.

In theory, 'initWithSymbolArray:' would be the best candidate for a designated initializer, as it is closer to the actual data structure. See code at the end of the email...

>Similar symbolsets are available for the other sequence types. Both symbolsets are possible in the method above, the skipFlag is not related to either symbolset. So what I can do is, is to test immediately for ambiguous symbols when creating the sequence (using containsAmbiguousSymbols), and based on that set the appropriate symbolset. Or even, to avoid a double iteration, test immediately for isCompoundSymbol when each symbol is added.
>
>I think this code should only go in the designated initializer, because that should be called by all other initializers. Would this be a reasonable approach?

It seems to me that we should use a default symbol set when the user doesn't want to bother. However, if we use symbol sets, it would make sense to pass their benefits on to the user. For instance, we could let the user decide how she wants to filter an input sequence by letting her choose a symbolSet in the init method. See code at the end of the email...

>Then of course we have the 'unknown symbols' flag. I still am not sure what the purpose of this is. Is it to prevent illegal characters from being converted to a symbol? This could happen if the string contains numbers, or other characters not defined to be symbols.

Like I said, I strictly used what was there. I still wonder when the user would NOT want to skip unknown symbols, and would instead want numbers or weird characters replaced with question marks (the undefined symbol).

Maybe this could be implemented in a particular symbolSet, e.g. 'dnaSymbolSetWhereUnknownSymbolsAreAllUndefined' (we should find a shorter name!). Symbol sets could implement 'symbolForCharacter:'. They would normally return nil for an unknown character, but this particular symbol set would return the undefined symbol for all unknown characters. See more code at the end of the email...
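
To make the idea concrete, here is a minimal sketch in plain C (the C subset of Objective-C, so it compiles on its own). The names SymbolSet, symbolForCharacter and filterString are hypothetical stand-ins, not BioCocoa API, and symbols are reduced to single characters with 0 playing the role of nil:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a symbol set. */
typedef struct {
    const char *allowed;  /* characters that map to a known symbol */
    char fallback;        /* 0 acts like nil; otherwise the "undefined" symbol */
} SymbolSet;

/* Returns the symbol for c, or the set's fallback when c is unknown. */
char symbolForCharacter(const SymbolSet *set, char c)
{
    if (c != '\0' && strchr(set->allowed, c) != NULL)
        return c;
    return set->fallback;
}

/* Copies the accepted symbols into out, which must hold strlen(input)+1
   bytes. Returns the number of symbols kept. */
size_t filterString(const SymbolSet *set, const char *input, char *out)
{
    size_t n = 0;
    for (size_t i = 0; input[i] != '\0'; i++) {
        char s = symbolForCharacter(set, input[i]);
        if (s != 0)
            out[n++] = s;
    }
    out[n] = '\0';
    return n;
}
```

With a set declared as { "ACGT", 0 }, unknown characters are dropped; with { "ACGT", '?' }, they are kept as '?'. The choice lives entirely in the set, so the caller needs no skip flag.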

>I noticed that the implementation for the skip flag is slightly different in the code for proteins versus that for DNA/RNA.
>
>For proteins it looks like:
>
>		if ( (skipFlag==NO) || (aminoAcid!=[BCAminoAcid undefined]) )
>			[tempSequence addObject: aminoAcid];
>
>For DNA/RNA it looks like:
>
>		if ( aBase != [BCNucleotideDNA undefined] )
>			[tempSequence addObject: aBase];
>		else {
>			if ( !skipFlag )
>				[tempSequence addObject: [BCNucleotideDNA undefined]];
>
>
>The protein adds the aminoAcid if skipFlag is NO, the DNA/RNA adds an undefined symbol. I guess we should settle on one, anyone has a preference?

I am responsible for the protein version. If you think more about the code, I believe the result is actually the same! In both cases, if skipFlag==NO, you get the undefined symbol added. The protein code reads: "if I don't care about undefined symbols, or if I care but the symbol is defined, then add the symbol".

Maybe the code is clearer with:
if (skipFlag==NO) {
    [tempSequence addObject: aBase];
} else {
    if (aBase != [BCNucleotideDNA undefined])
         [tempSequence addObject: aBase];
}

But we could drop the skipFlag and still provide the functionality.
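
For anyone who wants to double-check that the two versions agree, here is a small sketch in plain C (the C subset of Objective-C) that reduces each version to a yes/no decision and compares them over all four cases. The function names are made up for this check and do not exist in BioCocoa:

```c
#include <stdbool.h>

/* Protein version: add when skipFlag is NO, or when the symbol is defined. */
bool proteinAdds(bool skipFlag, bool isUndefined)
{
    return (!skipFlag) || (!isUndefined);
}

/* DNA/RNA version: a defined symbol is always added; the undefined
   symbol is added only when skipFlag is NO. */
bool nucleotideAdds(bool skipFlag, bool isUndefined)
{
    if (!isUndefined)
        return true;
    return !skipFlag;
}

/* Returns true when both versions make the same decision for every
   combination of skipFlag and isUndefined. */
bool versionsAgree(void)
{
    for (int s = 0; s <= 1; s++)
        for (int u = 0; u <= 1; u++)
            if (proteinAdds(s != 0, u != 0) != nucleotideAdds(s != 0, u != 0))
                return false;
    return true;
}
```

In the one case where the DNA/RNA code adds [BCNucleotideDNA undefined] explicitly, aBase is already that same undefined symbol, so the object added is identical too.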

At 10:26 PM -0500 2/27/05, John Timmer wrote:
>Okay, looking at things, we definitely have to try to make things a bit more
>consistent in terms of the init methods.  I think they're mostly holdovers
>from before amino acids had ambiguous members.
>
>I think the idea of using a symbol set to limit the possible options for
>initializing a sequence is a good one, provided we make the process very
>streamlined.  The class itself looks to be good in that regard.
>
>I've got a couple of ideas on the implementation.  One idea that suggests
>itself is to have a BCSequenceType variable for the symbol set - that way the
>sequence being initialized could pick up its sequence type from the set it
>gets passed during initialization.

One thing we have to be aware of is that the sequence class is a bit repetitive about the sequence type. We have the class, which tells what the type of sequence is (e.g. [BCSequenceDNA class]), the sequenceType ivar (e.g. BCDNASequence), and now possibly the symbolSet. So we have to be very clear on which of these 3 items is the 'leader'. It seems natural that the sequence class should be the leader and decide what the sequence type is and what the symbol set type should be. I thus agree that symbolSet could have a sequenceType, for that purpose: the sequence class could check that the symbolSet is the right type before using it. Otherwise, it would fall back to a default symbolSet.
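
The 'leader' rule can be sketched in a few lines of plain C (the C subset of Objective-C). SequenceKind, TypedSymbolSet and effectiveSymbolSet are hypothetical names used only for illustration; SequenceKind stands in for BCSequenceType:

```c
#include <stddef.h>

/* Hypothetical stand-in for BCSequenceType. */
typedef enum { KindDNA, KindRNA, KindProtein } SequenceKind;

/* Hypothetical symbol set that knows which kind of sequence it serves. */
typedef struct {
    SequenceKind kind;
    const char *name;
} TypedSymbolSet;

/* The sequence class leads: the requested set is used only when its kind
   matches the class; otherwise the class default is used. */
const TypedSymbolSet *effectiveSymbolSet(SequenceKind classKind,
                                         const TypedSymbolSet *requested,
                                         const TypedSymbolSet *defaultSet)
{
    if (requested != NULL && requested->kind == classKind)
        return requested;
    return defaultSet;
}
```

A DNA sequence handed a protein set (or no set at all) would silently end up with its own default, which is the behavior proposed above.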

>The other thing I'd do is take the methods like "baseForSymbol" and
>"aaForSymbol" and formalize them to be a single selector, like
>"symbolForCharacter".  That way, you could call the same selector on any
>class.  This would allow you to make code that looked something like this
>(given passedString and passedSet as the arguments):
>theClass = [passedSet anyObject];
>theChar = [passedString characterAtIndex: loopCounter];
>theSymbol = [theClass symbolForCharacter: theChar];
>if ( [passedSet containsObject: theSymbol] )
>    // add it to our sequence

Even better, 'symbolForCharacter:' could actually be a method of the symbolSet. That would allow for code like this:
	theChar = [passedString characterAtIndex: loopCounter];
	theSymbol = [passedSet symbolForCharacter: theChar];
	if ( theSymbol!=nil )
		// add it to our sequence

See more code at the end of the email...

>If we're making symbol sets this central to sequence creation, though, I'd
>make a lot of combinations, rather than the two we have for each type.  We
>don't want any of the commonly used sets more than a single call away.
>Basically, I'd do strict, ambiguous, those with gap, those with undefined,
>those with both, etc.  We may also want to make the standard ones
>singletons.

Very good point! It then really makes sense to let the user access them.

At 10:41 PM -0500 2/27/05, Koen van der Drift wrote:
><snip>
>>I've got a couple of ideas on the implementation.  One idea that suggests
>>itself is to have a BCSequenceType variable for the symbol set - that way the
>>sequence being initialized could pick up its sequence type from the set it
>>gets passed during initialization.
>
>Then we might have to extend the different BCSequenceTypes to include strict, ambiguous, etc.

Actually, the current sequence type is redundant with the sequence class. But it is convenient to make sure we deal with DNA or protein without having to use the inelegant and dangerous 'isKindOfClass:' (which relies on the class name never changing).

The sequence type you propose would be redundant with the symbolSet. It is already easy to check a symbolSet for equality with the class singleton, without the need for a sequenceType enum. We could also provide an 'isEqualToSymbolSet:' method. In addition, the user might create other symbol sets that would not be covered by the nomenclature we would put in place. This is probably even more problematic for a fixed enum, which could never take that into account.

OK, now, here is what I have in mind. Some real code! This is all that is needed to provide the initializers for ALL the classes (except BCSequenceCodon and BCSequence; for the latter, a slight modification will do). Most of the code is in the superclass.

@interface BCAbstractSequence:NSObject
{
    BCSymbolSet *symbolSet;
    BCSequenceType sequenceType;
    NSMutableArray *symbolArray;
}

//designated initializer
- (id)initWithSymbolArray:(NSArray *)anArray symbolSet:(BCSymbolSet *)aSet;

- (id)initWithString:(NSString *)aString;
- (id)initWithString:(NSString *)aString symbolSet:(BCSymbolSet *)aSet;

//methods to override in the subclasses
+ (BCSequenceType)sequenceType;
+ (BCSymbolSet *)defaultSymbolSet;

@end

@implementation BCAbstractSequence

- (id)initWithSymbolArray:(NSArray *)anArray symbolSet:(BCSymbolSet *)aSet
{
    self=[super init];
    if (self!=nil) {
        sequenceType=[[self class] sequenceType];
        //check that the symbol set is the right type, otherwise use default
        if ([aSet sequenceType]!=sequenceType)
            aSet=[[self class] defaultSymbolSet];
        //keep the symbol set that is actually used
        symbolSet=[aSet retain];
        //let the set check the symbols
        NSArray *finalArray=[aSet arrayByRemovingUnknownSymbolsFromArray:anArray];
        symbolArray=[[NSMutableArray alloc] initWithArray:finalArray];
    }
    return self;
}

- (id)initWithString:(NSString *)aString
{
    return [self initWithString:aString symbolSet:[[self class] defaultSymbolSet]];
}

- (id)initWithString:(NSString *)aString symbolSet:(BCSymbolSet *)aSet
{
    int i,n;
    NSMutableArray *anArray;
    BCSymbol *aSymbol;

    //check that the symbol set is the right type, otherwise use default
    if ([aSet sequenceType]!=[[self class] sequenceType])
        aSet=[[self class] defaultSymbolSet];

    //creates a symbol array, keeping only the characters known to the set
    n=[aString length];
    anArray=[NSMutableArray arrayWithCapacity:n];
    for (i=0;i<n;i++) {
        unichar aChar=[aString characterAtIndex:i];
        if ((aSymbol=[aSet symbolForCharacter:aChar])!=nil)
            [anArray addObject:aSymbol];
    }

    //calls the designated initializer
    return [self initWithSymbolArray:anArray symbolSet:aSet];
}

@end

@interface BCSequenceDNA: BCAbstractSequence
{
}

//no need to write any initializers!!!

//methods to override in the subclass
+ (BCSequenceType)sequenceType;
+ (BCSymbolSet *)defaultSymbolSet;

@end

@implementation BCSequenceDNA

+ (BCSequenceType)sequenceType
	{ return BCDNASequence; }

+ (BCSymbolSet *)defaultSymbolSet
	{ return [BCSymbolSet dnaSymbolSet]; }

@end

and similar for RNA and protein... So this is very little code. This way, the user can still decide to use any compatible symbol set to filter the sequence passed as argument. By default, the symbol set used is the least restrictive one. Also, the user can't use a non-compatible symbol set: if she tries, the initializer uses the default instead. We do get all the benefits of symbolSets, right?

Additional notes:
* If we implement the sequenceType in symbolSet, we could then check, when a symbol is added, that it is the right class; BCSymbol could have a sequenceType for that purpose!
* The symbolSet needs a method 'symbolForCharacter:' that returns nil if not found (very similar to NSSet)
* The symbolSet could have a method to filter an NSArray of symbols: 'arrayByRemovingUnknownSymbolsFromArray:'

What do you think of this approach and the code? I tried to use pieces and ideas from your different emails :-)

charles

Charles Parnot
charles.parnot at stanford.edu
