[Biodevelopers] Question on gaps vs substitution

Theodore H. Smith delete at elfdata.com
Sat Jul 7 11:59:28 EDT 2007


Hi people,

I have a question on gap penalties and substitution scoring.

The answer to this question could make a factor of around 1.5x  
difference the speed of a blast-like algorithm I'm working on. And  
double the complexity of one small component used in the algorithm.

So, I'm using a Smith-Waterman approach combined with some other  
approaches to speed up Smith-Waterman by a few order of magnitude. So  
my problem area is totally Smith-Waterman based and anyone who knows  
Smith-waterman algorithm might be able to help me.

Is there ever a case where two inserts, could be less costly than a  
substituion? In fact it's not even just "two inserts", it's an insert  
followed by a delete.

Basically, a case where going right, then down, is less costly than  
going diagonally?

So if we had a matrix like this:

... ,  5, 4,  ...
... ,  4, 3,  ...

Let's say in this case, the replacement from the "5" cell, to the "3"  
cell would have taken 7 points, leaving us with -2 (probably would be  
raised to zero as how S-W works). But going right from 5 to 4 only  
took 1 point away, and then going down from 4 to 3 took 1 point.

This would require that an insert followed by a deletion, is cheaper  
than a replacement.

Can this be true?

I've inspected the BLOSUM62 matrix (-4 is the worst penalty for  
replacement), and the default gap penalties (-11,-1), and it seems  
like using BLOSUM62, it's impossible to get "insert+delete" instead  
of one replacement.

So this would be a good case for me, as it would simplify my  
algorithm. I'm happier if insert+delete is always worse than  
replacement.

But does this pattern hold across all biological uses? Can there be a  
case where an insert followed by a delete, is better than a  
replacement? If so, would it be something so contrived and unlikely  
and not really useful, that I can just ignore it and tell my users  
that my software has a design restriction that insert+delete must  
always score worse than a substitution?

Sorry for this basic question! I know I am asking a basic question,  
but then it really does help to have your basics correct before  
forming some super-fast algorithm. I can't have it super-fast and  
wrong now can I :) Basics always come first.

Also, if an insert is followed by a delete, does it count as -11-11  
(-22) or -11-1 (-12)?

Thanks to all who can help! My algorithm is really close to  
finishing. I can see it almost in my hands. I will be very proud to  
finish it. Whether or not the algorithm fast enough to be is of use,  
is another matter, but it will function correctly at least!

--
http://elfdata.com/plugin/
"String processing, done right"




More information about the Biodevelopers mailing list