[Biococoa-dev] Digest tool
Koen van der Drift
kvddrift at earthlink.net
Sun Feb 27 16:01:30 EST 2005
On Feb 27, 2005, at 10:51 AM, John Timmer wrote:
> Okay, as for digests, an NSScanner won’t work because of all the
> ambiguity issues.
That might indeed be a problem. Maybe we should resurrect BCScanner,
which could help us here?
> Since Alex found that you can’t load frameworks out of a plugin
> bundle, I had to roll my own solution rather than using a REGEX
> library, which means this code could be fairly informative. The
> digest method itself is a bit complex, as it formats the output as an
> attributed string, and labels sites, positions, etc. I’ve cut out the
> actual site finding code here.
For the output, I would suggest that we create subsequences and copy
all of the parent sequence's intrinsic properties, not only the
symbolArray. That way you should get all annotations and features,
which I assume is what you are referring to.
>
> First, a rough overview: I decided for performance reasons to split
> things up so that the simplest cases could be handled quickly (and
> then I went and used array enumerators because I didn’t know they had
> awful performance – oh well). The first case handles no ambiguity,
> the second handles ambiguous bases, the third is when the site
> includes a stretch of N’s, and the final case has ambiguity and N’s.
> In each case, several enzymes may recognize the same sequence, so
> there’s a test and a call to mark everything from the “same sites”
> array.
That sounds like a good approach. The first two can also be used for
proteins.
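
As a rough illustration of the four-way split described above, here is a minimal sketch in Python rather than the plugin's own code. The function name classify_site, the returned labels, and the last example site are made up for illustration; only the IUPAC ambiguity letters are standard.

```python
# Hypothetical sketch: decide which of the four search strategies a
# recognition site needs. Names and labels are illustrative only.

# IUPAC codes (other than N) that stand for more than one base.
AMBIGUOUS = set("RYSWKMBDHV")

def classify_site(site):
    """Return 'plain', 'ambiguous', 'N', or 'ambiguous+N' for a site string."""
    site = site.upper()
    has_n = "N" in site
    has_ambiguity = any(base in AMBIGUOUS for base in site)
    if has_n and has_ambiguity:
        return "ambiguous+N"   # slowest path: both kinds of wildcard
    if has_n:
        return "N"             # fixed flanks around a stretch of N's
    if has_ambiguity:
        return "ambiguous"     # expand into all concrete sequences
    return "plain"             # simple exact search

print(classify_site("GAATTC"))       # no ambiguity
print(classify_site("GTYRAC"))       # ambiguous bases only
print(classify_site("GCCNNNNNGGC"))  # stretch of N's only
print(classify_site("GAYNNNNRTC"))   # made-up site with both
```

Classifying each site once, up front, means the cheap exact search is used whenever possible and the expensive paths only when the site requires them.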
> The trick I use for recognizing sites with ambiguous bases here comes
> from the use of strings – I simply call a method that builds up an
> array of all possible sequences, then search for each of them. For
> N’s, I count how many there are and just skip forward that many bases.
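
The string-based trick quoted above could look roughly like the following. This is a sketch under assumptions, not John's actual code: expand_site, find_site and find_site_with_gap are hypothetical names, and the N case assumes a single run of N's with unambiguous flanks.

```python
from itertools import product

# Standard IUPAC nucleotide codes (N is handled separately below).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}

def expand_site(site):
    """Build every concrete sequence an ambiguous site (without N) stands for."""
    choices = (IUPAC[base] for base in site.upper())
    return ["".join(combo) for combo in product(*choices)]

def find_site(sequence, site):
    """Search for each expanded sequence; return sorted start positions."""
    positions = set()
    for candidate in expand_site(site):
        start = sequence.find(candidate)
        while start != -1:
            positions.add(start)
            start = sequence.find(candidate, start + 1)
    return sorted(positions)

def find_site_with_gap(sequence, site):
    """Match LEFT + N...N + RIGHT by skipping forward over the counted N's.

    Assumes one run of N's and unambiguous flanking bases.
    """
    left = site[:site.index("N")]
    right = site[site.rindex("N") + 1:]
    gap = len(site) - len(left) - len(right)   # how many N's to skip
    hits = []
    start = sequence.find(left)
    while start != -1:
        right_start = start + len(left) + gap
        fits = start + len(site) <= len(sequence)
        if fits and sequence[right_start:right_start + len(right)] == right:
            hits.append(start)
        start = sequence.find(left, start + 1)
    return hits

print(expand_site("GTYRAC"))
print(find_site("AAGTCGACTTGTTAACGG", "GTYRAC"))
print(find_site_with_gap("TTGCCAAAAAGGCTT", "GCCNNNNNGGC"))
```

Note that the number of expanded sequences grows multiplicatively with each ambiguous position, which is one reason to keep N's out of the expansion and skip over them instead.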
In my current code, I also use strings to scan for the characters
representing the cleavage sites. When we decide on using the BCScanner
class, we should be able to scan a symbolArray directly. Then there is
no need for a sequence -> string -> sequence conversion, which should
speed things up.
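
To make the idea of scanning symbols directly a bit more concrete, here is a minimal sketch in Python. It is not BCScanner's API: each symbol is modelled as a set of the bases it can represent, which is only a stand-in for our symbol objects, and scan_symbols is a hypothetical name.

```python
# Hypothetical stand-in for symbol objects: each symbol is the set of
# bases it can represent. No strings are built during the scan.
A, C, G, T = (frozenset(base) for base in "ACGT")
Y = frozenset("CT")     # pyrimidine
R = frozenset("AG")     # purine
N = frozenset("ACGT")   # any base

def scan_symbols(symbols, site):
    """Return start indices where each sequence symbol is covered by the
    corresponding site symbol (subset test instead of string comparison)."""
    hits = []
    for start in range(len(symbols) - len(site) + 1):
        if all(symbols[start + i] <= site[i] for i in range(len(site))):
            hits.append(start)
    return hits

sequence = [A, G, T, C, G, A, C, T]
print(scan_symbols(sequence, [G, T, Y, R, A, C]))  # ambiguous site
print(scan_symbols(sequence, [G, N, C]))           # site containing N
```

With this kind of comparison, ambiguous bases and N's are handled by the same test, so the four separate cases would collapse into one loop, at the cost of a per-symbol check rather than a fast string search.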
- Koen.