[Biococoa-dev] Digest tool

John Timmer jtimmer at bellatlantic.net
Sun Feb 27 10:51:32 EST 2005


Okay, as for digests, an NSScanner won¹t work because of all the ambiguity
issues.  Since Alex found that you can¹t load frameworks out of a plugin
bundle, I had to roll my own solution rather than using a REGEX library,
which means this code could be fairly informative.  The digest method itself
is a bit complex, as it formats the output as an attributed string, and
labels sites, positions, etc.  I¹ve cut out the actual site finding code
here.  

First, a rough overview:  I decided for performance reasons to split things
up so that the simplest cases could be handled quickly (and then I went and
used array enumerators because I didn¹t know they had awful performance ­ oh
well).  The first case handles no ambiguity, the second handles ambiguous
bases, the third is when the site includes a stretch of N¹s, and the final
case has ambiguity and N¹s.  In each case, several enzymes may recognize the
same sequence, so there¹s an test and a call to mark everything from the
³same sites² array.  The trick I use for recognizing sites with ambiguous
bases here comes from the use of strings ­ I simply call a method that
builds up an array of all possible sequences, then search for each of them.
For N¹s, I count how many there are and just skip forward that many bases.

Overall, I¹m not sure how well this is going to work as an example, given
our non-string based model.  I think rolling the digests into a single
object is a good idea, but I¹m pretty sure you¹re going to have to make
separate code for proteins and DNA.  On the plus side, the SequenceFinder
class seems pretty well designed for finding sites accounting for ambiguity,
though I haven¹t tested it with ambiguous sequences since the code was
stripped out of the sequence class.  We could start using that for now, and
use Shark to tell us where we could make some improvements.  Off the top of
my head, I¹d imagine inlining the symbol comparison method would help, and a
few of the conditional statements could be taken out in the case where the
sequence being searched had no ambiguous bases.



       /////////////////////////////////////////////////////
        // this is the easiest case -  all bases are defined
        /////////////////////////////////////////////////////
            if ( [validCodonCharacters isSupersetOfSet: theSitesBases] ) {
                siteRange = [theDNASequence rangeOfString: siteString];
                
                while ( siteRange.location != NSNotFound ) {
                    anySitesFound = YES;
                    numberOfCuts++;
                    [self _markSequenceWithName: [anEnzyme objectForKey:
@"enzyme name"] atRange: siteRange];
                    [cutterPositions addObject: [NSNumber numberWithInt:
(siteRange.location + 1)]];
                   
                    if ( markAllSequences && [[anEnzyme objectForKey: @"same
sites"] count] > 0 ) {
                        tempEnumerator = [[anEnzyme objectForKey: @"same
sites"] objectEnumerator];
                        while ( anEnzymeName = [tempEnumerator nextObject] )
{
                            [self _markSequenceWithName: anEnzymeName
atRange: siteRange];
                        }
                    }
                   
                   
                    siteRange = [theDNASequence rangeOfString: siteString
options: NSLiteralSearch range: NSMakeRange( siteRange.location + 1,
[theDNASequence length] - siteRange.location - 2 )];
                }
            }
            else if ( [siteString rangeOfString: @"N"].location ==
NSNotFound ) {
        /////////////////////////////////////////////////////
        // ambiguous bases - we get an array of possible sites and search
for each
        /////////////////////////////////////////////////////
                
                siteArray = [self _getAllSitesFromSequence: siteString];
                siteEnumerator = [siteArray objectEnumerator];
                while (  aPossibleSiteString = [siteEnumerator nextObject] )
{
                    siteRange = [theDNASequence rangeOfString:
aPossibleSiteString];
                   
                    while ( siteRange.location != NSNotFound ) {
                        anySitesFound = YES;
                        numberOfCuts++;
                        [self _markSequenceWithName: [anEnzyme objectForKey:
@"enzyme name"] atRange: siteRange];
                        [cutterPositions addObject: [NSNumber numberWithInt:
(siteRange.location + 1)]];
                   
                        if ( markAllSequences && [[anEnzyme objectForKey:
@"same sites"] count] > 0 ) {
                            tempEnumerator = [[anEnzyme objectForKey: @"same
sites"] objectEnumerator];
                            while ( anEnzymeName = [tempEnumerator
nextObject] ) {
                                [self _markSequenceWithName: anEnzymeName
atRange: siteRange];
                            }
                        }
                   
                        siteRange = [theDNASequence rangeOfString:
aPossibleSiteString options: NSLiteralSearch range: NSMakeRange(
siteRange.location + 1, [theDNASequence length] - siteRange.location - 2 )];
                    }
                }
                
            }
            else  {
                
                    /////////////////////////////////////////////////////
                    // we have N bases, how annoying
                    // first case, it's regular bases with some N's
                    /////////////////////////////////////////////////////
                if ( [normalBasesAndNCharacters isSupersetOfSet:
theSitesBases] ) {
                // thankfully, it's only N's
                    nonNComponentsArray = [siteString
componentsSeparatedByString: @"N"];
                    numberOfNs = [nonNComponentsArray count] - 1;
                   
                    // see if this is worth doing
                    if ( [siteString length] - numberOfNs < minimumSiteSize
) 
                        siteTooSmall = YES;
                    else {
                   
                        leftHalf = [nonNComponentsArray objectAtIndex: 0];
                        rightHalf = [nonNComponentsArray objectAtIndex:
([nonNComponentsArray count] - 1)];
                        locationOfNStart = [leftHalf length];
                   
                        siteRange = [theDNASequence rangeOfString:
leftHalf];
                   
                        while ( siteRange.location != NSNotFound ) {
                            fullSiteRange = NSMakeRange( siteRange.location,
[siteString length] );
                            if ( fullSiteRange.location +
fullSiteRange.length < [theDNASequence length] - 1 ) {
                                tempString = [theDNASequence
substringWithRange: fullSiteRange];
                   
                                if ( [tempString hasSuffix: rightHalf] ) {
                                    anySitesFound = YES;
                                    numberOfCuts++;
                                    [self _markSequenceWithName: [anEnzyme
objectForKey: @"enzyme name"] atRange: siteRange];
                                    [cutterPositions addObject: [NSNumber
numberWithInt: (siteRange.location + 1)]];
                   
                                    if ( markAllSequences && [[anEnzyme
objectForKey: @"same sites"] count] > 0 ) {
                                        tempEnumerator = [[anEnzyme
objectForKey: @"same sites"] objectEnumerator];
                                        while ( anEnzymeName =
[tempEnumerator nextObject] ) {
                                            [self _markSequenceWithName:
anEnzymeName atRange: siteRange];
                                        }
                                    }
                   
                                }
                            }
                            siteRange = [theDNASequence rangeOfString:
leftHalf options: NSLiteralSearch range: NSMakeRange( siteRange.location +
1, [theDNASequence length] - siteRange.location - 2 )];
                        }
                    }
                }
            // worst case - we've got N's and Y's and W's and such.
                else {
                    nonNComponentsArray = [siteString
componentsSeparatedByString: @"N"];
                    numberOfNs = [nonNComponentsArray count] - 1;
                   
                    if ( [siteString length] - numberOfNs < minimumSiteSize
) 
                        siteTooSmall = YES;
                    else {
                   
                        rightHalf = [nonNComponentsArray objectAtIndex:
([nonNComponentsArray count] - 1)];
                        leftHalf = [nonNComponentsArray objectAtIndex: 0];
                   
                        siteArray = [self _getAllSitesFromSequence:
leftHalf];
                        nonNComponentsArray = [self
_getAllSitesFromSequence: rightHalf];
                   
                        locationOfNStart = [leftHalf length];
                   
                   
                        siteEnumerator = [siteArray objectEnumerator];
                        while (  aPossibleSiteString = [siteEnumerator
nextObject] ) {
                   
                            siteRange = [theDNASequence rangeOfString:
aPossibleSiteString];
                            while ( siteRange.location != NSNotFound ) {
                   
                                fullSiteRange = NSMakeRange(
siteRange.location, [siteString length] );
                                if ( fullSiteRange.location +
fullSiteRange.length < [theDNASequence length] - 1 ) {
                                    tempString = [theDNASequence
substringWithRange: fullSiteRange];
                                    tempString = [tempString
substringWithRange: NSMakeRange(locationOfNStart + numberOfNs, [rightHalf
length])];
                   
                                    if ( [nonNComponentsArray
containsObject: tempString] ) {
                                        anySitesFound = YES;
                                        numberOfCuts++;
                                        [self _markSequenceWithName:
[anEnzyme objectForKey: @"enzyme name"] atRange: siteRange];
                                        [cutterPositions addObject:
[NSNumber numberWithInt: (siteRange.location + 1)]];
                   
                                        if ( markAllSequences && [[anEnzyme
objectForKey: @"same sites"] count] > 0 ) {
                                            tempEnumerator = [[anEnzyme
objectForKey: @"same sites"] objectEnumerator];
                                            while ( anEnzymeName =
[tempEnumerator nextObject] ) {
                                                [self _markSequenceWithName:
anEnzymeName atRange: siteRange];
                                            }
                                        }
                   
                                    }
                                }
                                siteRange = [theDNASequence rangeOfString:
aPossibleSiteString options: NSLiteralSearch range: NSMakeRange(
siteRange.location + 1, [theDNASequence length] - siteRange.location - 2 )];
                            }
                        }
                    }
                }
            }
        }




_______________________________________________
This mind intentionally left blank

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.bioinformatics.org/pipermail/biococoa-dev/attachments/20050227/c3edc6b3/attachment.html>


More information about the Biococoa-dev mailing list