[Biococoa-dev] Digest tool

Sun Feb 27 17:07:15 EST 2005

> 
> On Feb 27, 2005, at 10:51 AM, John Timmer wrote:
> 
>> The first case handles no ambiguity, the second handles ambiguous
>> bases, the third is when the site includes a stretch of N's, and the
>> final case has ambiguity and N's.  In each case, several enzymes may
>> recognize the same sequence, so there's a test and a call to mark
>> everything from the "same sites" array.
> 
> Just thinking aloud here. Maybe if we design the appropriate
> BCSymbolSet for each case, we can just have one method (or while-loop
> in John's snippet). I haven't studied his code that closely, though, so
> I might be way off. However, in any case, the combination
> BCScanner/BCSymbolSet to replace NSScanner/NSCharacterSet is definitely
> something to keep in mind.

Well, the big dividing line between "easy" and "annoying" is handled by
BCAbstractSequence's "containsAmbiguousSymbols", which is fairly optimized.
The line between "just annoying" and "truly a PITA" is whether the sequence
contains Ns, which should be an easy test to write and optimize.  Since you
may be performing a restriction map with hundreds of enzymes, it's probably
worth making these tests very low overhead if we find they're needed.  If
symbolSets do that, then great.
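
To make the four cases concrete, here is a minimal sketch of the kind of cheap
up-front test described above. It is written in Python purely for illustration,
not against the BioCocoa API: classify_site and the category names are
hypothetical, and it assumes recognition sites are plain strings of IUPAC
nucleotide codes.

```python
# Hypothetical helper, not part of BioCocoa: sort a recognition site into
# one of the four cases so the cheapest search path can be chosen.

# IUPAC ambiguity codes other than N (assumption: sites use IUPAC letters).
AMBIGUOUS_CODES = frozenset("RYSWKMBDHV")


def classify_site(site: str) -> str:
    """Return 'plain', 'ambiguous', 'n_only', or 'ambiguous_and_n'."""
    site = site.upper()
    has_n = "N" in site
    has_ambiguous = any(code in AMBIGUOUS_CODES for code in site)
    if has_n and has_ambiguous:
        return "ambiguous_and_n"
    if has_n:
        return "n_only"
    if has_ambiguous:
        return "ambiguous"
    return "plain"
```

Each site only needs to be classified once, when the enzyme list is loaded, so
the per-enzyme cost during the actual mapping is a lookup.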

Again, I'd say that we try the find methods that we already have in place
and then profile the results before deciding what to do.  Not that I'm
saying BCScanner shouldn't be implemented, just that we may not need it for
this case.  

One thing I'm seeing with the "findSequence" method in the SequenceFinder
tool is that it handles ambiguity in both directions (in the target and
query sequences).  I'd think with a restriction site map, we wouldn't want
to mark a possible site that may or may not match because of an ambiguous
base in the target sequence.  This would cut down the matching we have to do
and potentially speed things up considerably, but would require a custom
method.
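
A rough sketch of what such a one-directional match could look like, again in
Python for illustration only. find_sites and the IUPAC table are hypothetical
names, not existing BioCocoa methods, and this naive scan only looks at the
forward strand. Ambiguity codes in the recognition site are expanded, but a
target base only matches if it is itself an unambiguous A, C, G or T.

```python
# Hypothetical sketch, not BioCocoa code: ambiguity is honored in the
# recognition site only. An ambiguous base in the target never matches.

# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def find_sites(target: str, site: str) -> list:
    """Return 0-based start positions where site matches target."""
    target = target.upper()
    # Expand each site code once, up front, into its set of allowed bases.
    allowed = [frozenset(IUPAC[code]) for code in site.upper()]
    hits = []
    for start in range(len(target) - len(allowed) + 1):
        # A target symbol such as R or N is in none of the sets, so a
        # window containing one is rejected rather than marked "possible".
        if all(target[start + i] in bases for i, bases in enumerate(allowed)):
            hits.append(start)
    return hits
```

For example, find_sites("GARTTC", "GAATTC") returns an empty list, because the
R in the target might or might not be an A and so the site is not marked.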

JT

_______________________________________________
This mind intentionally left blank