[Biococoa-dev] Digest tool

Koen van der Drift kvddrift at earthlink.net
Sun Feb 27 16:01:30 EST 2005


On Feb 27, 2005, at 10:51 AM, John Timmer wrote:

> Okay, as for digests, an NSScanner won’t work because of all the 
> ambiguity issues.

That might indeed be a problem. Maybe we should resurrect BCScanner, 
which could help us here?


>  Since Alex found that you can’t load frameworks out of a plugin 
> bundle, I had to roll my own solution rather than using a REGEX 
> library, which means this code could be fairly informative.  The 
> digest method itself is a bit complex, as it formats the output as an 
> attributed string, and labels sites, positions, etc.  I’ve cut out the 
> actual site finding code here.

I would suggest that for the output we create subsequences and copy 
all of the parent sequence's intrinsic properties, not only the 
symbolArray. That way you should get all annotations and features, 
which I assume is what you are referring to.

>
>  First, a rough overview:  I decided for performance reasons to split 
> things up so that the simplest cases could be handled quickly (and 
> then I went and used array enumerators because I didn’t know they had 
> awful performance – oh well).  The first case handles no ambiguity, 
> the second handles ambiguous bases, the third is when the site 
> includes a stretch of N’s, and the final case has ambiguity and N’s. 
>  In each case, several enzymes may recognize the same sequence, so 
> there’s a test and a call to mark everything from the “same sites” 
> array.

That sounds like a good approach. The first two can also be used for 
proteins.
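
To make the four cases concrete, here is a minimal sketch in Python 
(not the plugin's actual code). It assumes recognition sites are 
written with IUPAC nucleotide codes; the function name and the case 
labels are hypothetical and only illustrate the dispatch.

```python
# IUPAC codes that stand for more than one base, excluding N.
AMBIGUOUS = set("RYSWKMBDHV")


def classify_site(site):
    """Return which of the four cases a recognition site falls into.

    The labels are made up for this sketch:
      "exact"        no ambiguity at all
      "ambiguous"    ambiguous bases, but no N
      "N"            N's, but no other ambiguous bases
      "ambiguous+N"  both ambiguous bases and N's
    """
    site = site.upper()
    has_n = "N" in site
    has_ambiguous = any(base in AMBIGUOUS for base in site)
    if has_n and has_ambiguous:
        return "ambiguous+N"
    if has_n:
        return "N"
    if has_ambiguous:
        return "ambiguous"
    return "exact"
```

A digest routine could call something like this once per enzyme and 
send the "exact" sites down the fast path.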

> The trick I use for recognizing sites with ambiguous bases here comes 
> from the use of strings – I simply call a method that builds up an 
> array of all possible sequences, then search for each of them.  For 
> N’s, I count how many there are and just skip forward that many bases.
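
As a rough illustration of the two tricks described above (a Python 
sketch under stated assumptions, not the code from the plugin): the 
first part expands a site with ambiguous bases into every concrete 
sequence and searches for each one; the second handles a site with a 
single run of N's by matching the left half, skipping forward by the 
number of N's, and checking the right half. All function names are 
hypothetical, and the gapped version assumes exactly one run of N's 
with non-empty, unambiguous flanks.

```python
from itertools import product

# IUPAC nucleotide codes and the bases each one stands for (N is handled
# separately below, by skipping, rather than by expansion).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}


def expand_site(site):
    """Build the list of all concrete sequences an ambiguous site can match."""
    choices = (IUPAC[base] for base in site.upper())
    return ["".join(combo) for combo in product(*choices)]


def find_site(sequence, site):
    """Search for every expanded form of the site; return sorted start positions."""
    sequence = sequence.upper()
    hits = set()
    for concrete in expand_site(site):
        start = sequence.find(concrete)
        while start != -1:
            hits.add(start)
            start = sequence.find(concrete, start + 1)
    return sorted(hits)


def find_gapped_site(sequence, site):
    """Match a site of the form LEFT + N...N + RIGHT by skipping the N's."""
    sequence, site = sequence.upper(), site.upper()
    left = site[:site.index("N")]
    right = site[site.rindex("N") + 1:]
    gap = len(site) - len(left) - len(right)  # how many N's to skip
    hits = []
    start = sequence.find(left)
    while start != -1:
        # Jump over the N's and test the right half at the expected position.
        if sequence.startswith(right, start + len(left) + gap):
            hits.append(start)
        start = sequence.find(left, start + 1)
    return hits
```

The number of expanded sequences grows with every ambiguous position 
(two positions with two choices each give four strings), which is 
presumably why the N's are skipped rather than expanded.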

In my current code, I also use strings to scan for the characters 
representing the cleavage sites. Once we decide to use the BCScanner 
class, we should be able to scan a symbolArray directly. That removes 
the need for a sequence -> string -> sequence conversion, which should 
speed things up.
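
To show what scanning a symbol array directly could look like, here is 
a small Python sketch. It is not BCScanner's API (which is not 
described here); the Symbol class, the "represents" set, and the scan 
function are all assumptions made for illustration. Each symbol knows 
which bases it stands for, so ambiguous bases and N's are matched by a 
set comparison and no string search is involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Symbol:
    """Hypothetical stand-in for a sequence symbol object."""
    code: str
    represents: frozenset  # the concrete bases this symbol can stand for


# A small symbol table; only the codes needed for the example are listed.
SYMBOLS = {
    code: Symbol(code, frozenset(bases))
    for code, bases in {
        "A": "A", "C": "C", "G": "G", "T": "T",
        "R": "AG", "Y": "CT", "N": "ACGT",
    }.items()
}


def symbols_from(text):
    """Convenience for building example data only; real code would already
    hold a symbol array and never go through a string."""
    return [SYMBOLS[ch] for ch in text.upper()]


def scan(symbol_array, site):
    """Return the start positions where every sequence symbol is covered
    by the site symbol at the same offset."""
    hits = []
    for start in range(len(symbol_array) - len(site) + 1):
        if all(
            symbol_array[start + i].represents <= site[i].represents
            for i in range(len(site))
        ):
            hits.append(start)
    return hits
```

Because N simply represents all four bases, the same loop would cover 
all four of the cases discussed earlier, at the cost of checking every 
position.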

- Koen.


