<HTML>
<HEAD>
<TITLE>Re: [Biococoa-dev] Digest tool</TITLE>
</HEAD>
<BODY>
<FONT FACE="Verdana, Helvetica, Arial"><SPAN STYLE='font-size:12.0px'>Okay, as for digests, an NSScanner won’t work because of all the ambiguity issues.  Since Alex found that you can’t load frameworks out of a plugin bundle, I had to roll my own solution rather than using a regex library, which means this code could be fairly informative.  The digest method itself is a bit complex, as it formats the output as an attributed string and labels sites, positions, etc., so I’ve excerpted just the site-finding code here.  <BR>
<BR>
First, a rough overview:  for performance reasons, I decided to split things up so that the simplest cases could be handled quickly (and then I went and used array enumerators because I didn’t know they had awful performance; oh well).  The first case handles no ambiguity, the second handles ambiguous bases, the third is when the site includes a stretch of N’s, and the final case has both ambiguity and N’s.  In each case, several enzymes may recognize the same sequence, so there’s a test and a loop that marks everything from the “same sites” array.  The trick I use for recognizing sites with ambiguous bases comes from working with strings: I simply call a method that builds up an array of all possible sequences, then search for each of them.  For N’s, I count how many there are and just skip forward that many bases.<BR>
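<BR>
To make the "array of all possible sequences" idea concrete, here is a minimal sketch in plain C (which also compiles unchanged inside an Objective-C file). It is not the <code>_getAllSitesFromSequence:</code> method, which isn't shown in this message; the names <code>iupac_bases</code>, <code>choice_count</code>, <code>site_variant_count</code> and <code>site_variant</code> are hypothetical, and the only outside fact assumed is the standard IUPAC nucleotide code table.<BR>
<PRE>
```c
/* Bases each IUPAC nucleotide code can stand for (hypothetical helper). */
static const char *iupac_bases(char code)
{
    switch (code) {
    case 'A': return "A";
    case 'C': return "C";
    case 'G': return "G";
    case 'T': return "T";
    case 'R': return "AG";
    case 'Y': return "CT";
    case 'S': return "CG";
    case 'W': return "AT";
    case 'K': return "GT";
    case 'M': return "AC";
    case 'B': return "CGT";
    case 'D': return "AGT";
    case 'H': return "ACT";
    case 'V': return "ACG";
    case 'N': return "ACGT";
    default:  return "";      /* unknown symbol: nothing to expand to */
    }
}

/* How many concrete bases one symbol can stand for. */
static int choice_count(char code)
{
    const char *bases = iupac_bases(code);
    int n = 0;
    while (bases[n] != '\0')
        n++;
    return n;
}

/* Number of concrete sequences a site expands to
   (0 if the site contains an unknown symbol). */
static int site_variant_count(const char *site)
{
    int total = 1;
    for (int i = 0; site[i] != '\0'; i++)
        total *= choice_count(site[i]);
    return total;
}

/* Write the k-th concrete sequence into out, which must have room for the
   site plus a terminating NUL. Valid for 0 <= k < site_variant_count(site);
   the rightmost ambiguous position varies fastest. */
static void site_variant(const char *site, int k, char *out)
{
    int len = 0;
    while (site[len] != '\0')
        len++;
    for (int i = len - 1; i >= 0; i--) {
        int n = choice_count(site[i]);
        if (n == 0) {             /* unknown symbol: give back an empty string */
            out[0] = '\0';
            return;
        }
        out[i] = iupac_bases(site[i])[k % n];
        k /= n;
    }
    out[len] = '\0';
}
```
</PRE>
For a site such as GTYRAC this yields four strings (GTCAAC, GTCGAC, GTTAAC, GTTGAC), each of which can then be searched for literally. The count multiplies by four for every N, which is presumably why runs of N's are better handled by skipping than by expansion.<BR>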
<BR>
Overall, I’m not sure how well this is going to work as an example, given our non-string-based model.  I think rolling the digests into a single object is a good idea, but I’m pretty sure you’re going to have to write separate code for proteins and DNA.  On the plus side, the SequenceFinder class seems pretty well designed for finding sites while accounting for ambiguity, though I haven’t tested it with ambiguous sequences since the code was stripped out of the sequence class.  We could start using that for now, and use Shark to tell us where we could make some improvements.  Off the top of my head, I’d imagine inlining the symbol comparison method would help, and a few of the conditional statements could be taken out in the case where the sequence being searched has no ambiguous bases. <BR>
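<BR>
As a rough illustration of what an inlined symbol comparison could look like, here is a sketch in plain C (usable as-is from Objective-C). It is not SequenceFinder's actual code; <code>base_mask</code>, <code>symbol_matches</code> and <code>find_site</code> are hypothetical names, and the matching rule (a site symbol matches when it allows every base the sequence symbol could be) is an assumption about the behaviour we want.<BR>
<PRE>
```c
/* One bit per base: A=1, C=2, G=4, T=8. Ambiguity codes are unions of bits. */
static inline unsigned base_mask(char code)
{
    switch (code) {
    case 'A': return 1u;
    case 'C': return 2u;
    case 'G': return 4u;
    case 'T': return 8u;
    case 'R': return 1u | 4u;        /* A or G */
    case 'Y': return 2u | 8u;        /* C or T */
    case 'S': return 2u | 4u;        /* C or G */
    case 'W': return 1u | 8u;        /* A or T */
    case 'K': return 4u | 8u;        /* G or T */
    case 'M': return 1u | 2u;        /* A or C */
    case 'B': return 2u | 4u | 8u;   /* not A */
    case 'D': return 1u | 4u | 8u;   /* not C */
    case 'H': return 1u | 2u | 8u;   /* not G */
    case 'V': return 1u | 2u | 4u;   /* not T */
    case 'N': return 15u;            /* any base */
    default:  return 0u;             /* unknown symbol matches nothing */
    }
}

/* A site symbol matches a sequence symbol when every base the sequence
   symbol could be is allowed by the site symbol. */
static inline int symbol_matches(char siteCode, char seqCode)
{
    unsigned seq = base_mask(seqCode);
    return seq != 0u && (seq & ~base_mask(siteCode)) == 0u;
}

/* Index of the first match of site in seq at or after start, or -1.
   start must be >= 0; lengths are passed in so no headers are needed. */
static int find_site(const char *seq, int seqLen,
                     const char *site, int siteLen, int start)
{
    for (int pos = start; pos + siteLen <= seqLen; pos++) {
        int i = 0;
        while (i < siteLen && symbol_matches(site[i], seq[pos + i]))
            i++;
        if (i == siteLen)
            return pos;
    }
    return -1;
}
```
</PRE>
With this rule, ambiguity codes and N's in the site need no special cases, and when the sequence is known to contain only A, C, G and T the per-symbol check reduces to a single AND against the site's mask.<BR>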
<BR>
<BR>
<BR>
</SPAN></FONT><FONT SIZE="2"><FONT FACE="Monaco, Courier New"><SPAN STYLE='font-size:10.0px'>       <FONT COLOR="#006E24">/////////////////////////////////////////////////////<BR>
</FONT>        <FONT COLOR="#006E24">// this is the easiest case -  all bases are defined<BR>
</FONT>        <FONT COLOR="#006E24">/////////////////////////////////////////////////////<BR>
</FONT>            <FONT COLOR="#921050">if</FONT> ( [validCodonCharacters isSupersetOfSet: theSitesBases] ) {<BR>
                siteRange = [theDNASequence rangeOfString: siteString];<BR>
                <BR>
                <FONT COLOR="#921050">while</FONT> ( siteRange.location != NSNotFound ) {<BR>
                    anySitesFound = <FONT COLOR="#921050">YES</FONT>;<BR>
                    numberOfCuts++;<BR>
                    [<FONT COLOR="#921050">self</FONT> _markSequenceWithName: [anEnzyme objectForKey: <FONT COLOR="#A81514">@"enzyme name"</FONT>] atRange: siteRange];<BR>
                    [cutterPositions addObject: [NSNumber numberWithInt: (siteRange.location + <FONT COLOR="#1C00FF">1</FONT>)]];<BR>
                    <BR>
                    <FONT COLOR="#921050">if</FONT> ( markAllSequences && [[anEnzyme objectForKey: <FONT COLOR="#A81514">@"same sites"</FONT>] count] > <FONT COLOR="#1C00FF">0</FONT> ) {<BR>
                        tempEnumerator = [[anEnzyme objectForKey: <FONT COLOR="#A81514">@"same sites"</FONT>] objectEnumerator];<BR>
                        <FONT COLOR="#921050">while</FONT> ( anEnzymeName = [tempEnumerator nextObject] ) {<BR>
                            [<FONT COLOR="#921050">self</FONT> _markSequenceWithName: anEnzymeName atRange: siteRange];<BR>
                        }<BR>
                    }<BR>
                        <BR>
                    <BR>
                    siteRange = [theDNASequence rangeOfString: siteString options: NSLiteralSearch range: NSMakeRange( siteRange.location + <FONT COLOR="#1C00FF">1</FONT>, [theDNASequence length] - siteRange.location - <FONT COLOR="#1C00FF">1</FONT> )];<BR>
                }<BR>
            }<BR>
            <FONT COLOR="#921050">else</FONT> <FONT COLOR="#921050">if</FONT> ( [siteString rangeOfString: <FONT COLOR="#A81514">@"N"</FONT>].location == NSNotFound ) {<BR>
        <FONT COLOR="#006E24">/////////////////////////////////////////////////////<BR>
</FONT>        <FONT COLOR="#006E24">// ambiguous bases - we get an array of possible sites and search for each<BR>
</FONT>        <FONT COLOR="#006E24">/////////////////////////////////////////////////////<BR>
</FONT>                <BR>
                siteArray = [<FONT COLOR="#921050">self</FONT> _getAllSitesFromSequence: siteString];<BR>
                siteEnumerator = [siteArray objectEnumerator];<BR>
                <FONT COLOR="#921050">while</FONT> (  aPossibleSiteString = [siteEnumerator nextObject] ) {<BR>
                    siteRange = [theDNASequence rangeOfString:  aPossibleSiteString];<BR>
                    <BR>
                    <FONT COLOR="#921050">while</FONT> ( siteRange.location != NSNotFound ) {<BR>
                        anySitesFound = <FONT COLOR="#921050">YES</FONT>;<BR>
                        numberOfCuts++;<BR>
                        [<FONT COLOR="#921050">self</FONT> _markSequenceWithName: [anEnzyme objectForKey: <FONT COLOR="#A81514">@"enzyme name"</FONT>] atRange: siteRange];<BR>
                        [cutterPositions addObject: [NSNumber numberWithInt: (siteRange.location + <FONT COLOR="#1C00FF">1</FONT>)]];<BR>
                        <BR>
                        <FONT COLOR="#921050">if</FONT> ( markAllSequences && [[anEnzyme objectForKey: <FONT COLOR="#A81514">@"same sites"</FONT>] count] > <FONT COLOR="#1C00FF">0</FONT> ) {<BR>
                            tempEnumerator = [[anEnzyme objectForKey: <FONT COLOR="#A81514">@"same sites"</FONT>] objectEnumerator];<BR>
                            <FONT COLOR="#921050">while</FONT> ( anEnzymeName = [tempEnumerator nextObject] ) {<BR>
                                [<FONT COLOR="#921050">self</FONT> _markSequenceWithName: anEnzymeName atRange: siteRange];<BR>
                            }<BR>
                        }<BR>
                        <BR>
                        siteRange = [theDNASequence rangeOfString:  aPossibleSiteString options: NSLiteralSearch range: NSMakeRange( siteRange.location + <FONT COLOR="#1C00FF">1</FONT>, [theDNASequence length] - siteRange.location - <FONT COLOR="#1C00FF">1</FONT> )];<BR>
                    }<BR>
                }<BR>
                <BR>
            }<BR>
            <FONT COLOR="#921050">else</FONT>  { <BR>
                <BR>
                    <FONT COLOR="#006E24">/////////////////////////////////////////////////////<BR>
</FONT>                    <FONT COLOR="#006E24">// we have N bases, how annoying<BR>
</FONT>                    <FONT COLOR="#006E24">// first case, it's regular bases with some N's<BR>
</FONT>                    <FONT COLOR="#006E24">/////////////////////////////////////////////////////<BR>
</FONT>                <FONT COLOR="#921050">if</FONT> ( [normalBasesAndNCharacters isSupersetOfSet: theSitesBases] ) {<BR>
                <FONT COLOR="#006E24">// thankfully, it's only N's    <BR>
</FONT>                    nonNComponentsArray = [siteString componentsSeparatedByString: <FONT COLOR="#A81514">@"N"</FONT>];<BR>
                    numberOfNs = [nonNComponentsArray count] - <FONT COLOR="#1C00FF">1</FONT>;<BR>
                    <BR>
                    <FONT COLOR="#006E24">// see if this is worth doing <BR>
</FONT>                    <FONT COLOR="#921050">if</FONT> ( [siteString length] - numberOfNs < minimumSiteSize ) <BR>
                        siteTooSmall = <FONT COLOR="#921050">YES</FONT>;<BR>
                    <FONT COLOR="#921050">else</FONT> {<BR>
                        <BR>
                        leftHalf = [nonNComponentsArray objectAtIndex: <FONT COLOR="#1C00FF">0</FONT>];<BR>
                        rightHalf = [nonNComponentsArray objectAtIndex: ([nonNComponentsArray count] - <FONT COLOR="#1C00FF">1</FONT>)];<BR>
                        locationOfNStart = [leftHalf length];<BR>
                        <BR>
                        siteRange = [theDNASequence rangeOfString: leftHalf];<BR>
                        <BR>
                        <FONT COLOR="#921050">while</FONT> ( siteRange.location != NSNotFound ) {<BR>
                            fullSiteRange = NSMakeRange( siteRange.location, [siteString length] );<BR>
                            <FONT COLOR="#921050">if</FONT> ( fullSiteRange.location + fullSiteRange.length <= [theDNASequence length] ) {<BR>
                                tempString = [theDNASequence substringWithRange: fullSiteRange];<BR>
                                <BR>
                                <FONT COLOR="#921050">if</FONT> ( [tempString hasSuffix: rightHalf] ) {<BR>
                                    anySitesFound = <FONT COLOR="#921050">YES</FONT>;<BR>
                                    numberOfCuts++;<BR>
                                    [<FONT COLOR="#921050">self</FONT> _markSequenceWithName: [anEnzyme objectForKey: <FONT COLOR="#A81514">@"enzyme name"</FONT>] atRange: siteRange];<BR>
                                    [cutterPositions addObject: [NSNumber numberWithInt: (siteRange.location + <FONT COLOR="#1C00FF">1</FONT>)]];<BR>
                                    <BR>
                                    <FONT COLOR="#921050">if</FONT> ( markAllSequences && [[anEnzyme objectForKey: <FONT COLOR="#A81514">@"same sites"</FONT>] count] > <FONT COLOR="#1C00FF">0</FONT> ) {<BR>
                                        tempEnumerator = [[anEnzyme objectForKey: <FONT COLOR="#A81514">@"same sites"</FONT>] objectEnumerator];<BR>
                                        <FONT COLOR="#921050">while</FONT> ( anEnzymeName = [tempEnumerator nextObject] ) {<BR>
                                            [<FONT COLOR="#921050">self</FONT> _markSequenceWithName: anEnzymeName atRange: siteRange];<BR>
                                        }<BR>
                                    }<BR>
                                    <BR>
                                }<BR>
                            }<BR>
                            siteRange = [theDNASequence rangeOfString:  leftHalf options: NSLiteralSearch range: NSMakeRange( siteRange.location + <FONT COLOR="#1C00FF">1</FONT>, [theDNASequence length] - siteRange.location - <FONT COLOR="#1C00FF">1</FONT> )];<BR>
                        }<BR>
                    }<BR>
                }<BR>
            <FONT COLOR="#006E24">// worst case - we've got N's and Y's and W's and such.<BR>
</FONT>                <FONT COLOR="#921050">else</FONT> { <BR>
                    nonNComponentsArray = [siteString componentsSeparatedByString: <FONT COLOR="#A81514">@"N"</FONT>];<BR>
                    numberOfNs = [nonNComponentsArray count] - <FONT COLOR="#1C00FF">1</FONT>;<BR>
                    <BR>
                    <FONT COLOR="#921050">if</FONT> ( [siteString length] - numberOfNs < minimumSiteSize ) <BR>
                        siteTooSmall = <FONT COLOR="#921050">YES</FONT>;<BR>
                    <FONT COLOR="#921050">else</FONT> {<BR>
                        <BR>
                        rightHalf = [nonNComponentsArray objectAtIndex: ([nonNComponentsArray count] - <FONT COLOR="#1C00FF">1</FONT>)];<BR>
                        leftHalf = [nonNComponentsArray objectAtIndex: <FONT COLOR="#1C00FF">0</FONT>];<BR>
                        <BR>
                        siteArray = [<FONT COLOR="#921050">self</FONT> _getAllSitesFromSequence: leftHalf];<BR>
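                        <FONT COLOR="#006E24">// note: nonNComponentsArray is reused here to hold every possible right-half sequence<BR>
</FONT>                        nonNComponentsArray = [<FONT COLOR="#921050">self</FONT> _getAllSitesFromSequence: rightHalf];<BR>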
                        <BR>
                        locationOfNStart = [leftHalf length];<BR>
                        <BR>
                        <BR>
                        siteEnumerator = [siteArray objectEnumerator];<BR>
                        <FONT COLOR="#921050">while</FONT> (  aPossibleSiteString = [siteEnumerator nextObject] ) {<BR>
                            <BR>
                            siteRange = [theDNASequence rangeOfString: aPossibleSiteString];<BR>
                            <FONT COLOR="#921050">while</FONT> ( siteRange.location != NSNotFound ) {<BR>
                                <BR>
                                fullSiteRange = NSMakeRange( siteRange.location, [siteString length] );<BR>
                                <FONT COLOR="#921050">if</FONT> ( fullSiteRange.location + fullSiteRange.length <= [theDNASequence length] ) {<BR>
                                    tempString = [theDNASequence substringWithRange: fullSiteRange];<BR>
                                    tempString = [tempString substringWithRange: NSMakeRange(locationOfNStart + numberOfNs, [rightHalf length])];<BR>
                                    <BR>
                                    <FONT COLOR="#921050">if</FONT> ( [nonNComponentsArray containsObject: tempString] ) {<BR>
                                        anySitesFound = <FONT COLOR="#921050">YES</FONT>;<BR>
                                        numberOfCuts++;<BR>
                                        [<FONT COLOR="#921050">self</FONT> _markSequenceWithName: [anEnzyme objectForKey: <FONT COLOR="#A81514">@"enzyme name"</FONT>] atRange: siteRange];<BR>
                                        [cutterPositions addObject: [NSNumber numberWithInt: (siteRange.location + <FONT COLOR="#1C00FF">1</FONT>)]];<BR>
                                        <BR>
                                        <FONT COLOR="#921050">if</FONT> ( markAllSequences && [[anEnzyme objectForKey: <FONT COLOR="#A81514">@"same sites"</FONT>] count] > <FONT COLOR="#1C00FF">0</FONT> ) {<BR>
                                            tempEnumerator = [[anEnzyme objectForKey: <FONT COLOR="#A81514">@"same sites"</FONT>] objectEnumerator];<BR>
                                            <FONT COLOR="#921050">while</FONT> ( anEnzymeName = [tempEnumerator nextObject] ) {<BR>
                                                [<FONT COLOR="#921050">self</FONT> _markSequenceWithName: anEnzymeName atRange: siteRange];<BR>
                                            }<BR>
                                        }<BR>
                                        <BR>
                                    }<BR>
                                }<BR>
                                siteRange = [theDNASequence rangeOfString:  aPossibleSiteString options: NSLiteralSearch range: NSMakeRange( siteRange.location + <FONT COLOR="#1C00FF">1</FONT>, [theDNASequence length] - siteRange.location - <FONT COLOR="#1C00FF">1</FONT> )];<BR>
                            }<BR>
                        }<BR>
                    }<BR>
                }<BR>
            }<BR>
        }<BR>
</SPAN></FONT></FONT><FONT FACE="Verdana, Helvetica, Arial"><SPAN STYLE='font-size:12.0px'><BR>
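The two N cases above share one idea: find the left half, skip over the N's, and compare what follows against the right half, after checking that the whole site still fits inside the sequence. A stripped-down sketch of that idea in plain C (which also compiles inside an Objective-C file) might look like the following. The function names are hypothetical, positions are 0-based rather than the 1-based values stored in cutterPositions, and the bounds test is written so that a site ending exactly at the last base still counts.<BR>
<PRE>
```c
/* Length of a NUL-terminated string, spelled out so the sketch needs no headers. */
static int text_length(const char *s)
{
    int n = 0;
    while (s[n] != '\0')
        n++;
    return n;
}

/* 1 if part (partLen chars) occurs in seq starting at pos; the caller
   guarantees that pos + partLen stays inside seq. */
static int matches_at(const char *seq, int pos, const char *part, int partLen)
{
    for (int i = 0; i < partLen; i++)
        if (seq[pos + i] != part[i])
            return 0;
    return 1;
}

/* Find sites of the form: left, then gap arbitrary bases (the N's), then right.
   Stores up to maxHits 0-based start positions in hits and returns the
   total number of sites found. */
static int find_gapped_sites(const char *seq, const char *left, int gap,
                             const char *right, int *hits, int maxHits)
{
    int seqLen = text_length(seq);
    int leftLen = text_length(left);
    int rightLen = text_length(right);
    int siteLen = leftLen + gap + rightLen;
    int found = 0;

    /* pos + siteLen <= seqLen: the full site, N's included, must fit. */
    for (int pos = 0; pos + siteLen <= seqLen; pos++) {
        if (matches_at(seq, pos, left, leftLen) &&
            matches_at(seq, pos + leftLen + gap, right, rightLen)) {
            if (found < maxHits)
                hits[found] = pos;
            found++;
        }
    }
    return found;
}
```
</PRE>
For a site like GCCNNNNNGGC the call would pass "GCC", a gap of 5 and "GGC". Handling ambiguous halves as well would mean swapping the exact comparison in <code>matches_at</code> for an ambiguity-aware one.<BR>
<BR>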
<BR>
<BR>
</SPAN></FONT><SPAN STYLE='font-size:12.0px'><FONT FACE="Georgia, Times New Roman"><BR>
</FONT><FONT FACE="Verdana, Helvetica, Arial">_______________________________________________<BR>
This mind intentionally left blank<BR>
</FONT></SPAN>
</BODY>
</HTML>