[BiO BB] Understanding Smith-Waterman scoring
Theodore H. Smith
delete at elfdata.com
Fri Feb 10 09:13:29 EST 2006
I'm trying to learn about Smith-Waterman. There is one thing I
haven't seen answered in explanations of the Smith-Waterman algorithm.
How does it score alignments that come in sections? Does it give a
penalty if a sequence must be split up?
For example, let's say I had the protein AAAABBBB, and I wanted to
see how this scored against the protein BBBBAAAA. Let's ignore the
fact that it can be reversed, for the moment, just so I can
understand how Smith-Waterman should work.
Now, what would the match score be? Let's assume that A to A has a
score of 1 and B to B also has a score of 1. It's a really simple
example. So matching AAAABBBB to itself would give a SW score of 8.
What would matching BBBBAAAA to AAAABBBB give?
I'd expect it to generate two "sections": AAAA matched to AAAA, and
BBBB matched to BBBB.
But what should the overall score be? Is it still 8? Or should we
give a penalty because we've had to split this up? Is it normal for
alignment tools to give penalties to segmented sequences? Also, is
there some kind of "minimum length" that a Smith-Waterman based
aligner would allow? Would it say that you can't have sections below
a certain length? Are there any tools which let you specify such a
minimum section length?
If you don't like that example above of AAAABBBB (as it can be
reversed), then try this example. Assume every residue gets a score
of 1 against itself. If I did a Smith-Waterman score comparison of
the protein ABCDEFGH against DCHABGEF, would the score still be 8?
After all, all the residues are there, just in a different order.
I would expect this to get a score of zero or below.
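
To make the question concrete, here is a minimal sketch of the standard
Smith-Waterman recurrence that can be run on the examples above. The
examples only give match scores, so the mismatch score of -1 and the
linear gap penalty of -1 are assumptions, and the function name
smith_waterman_score is hypothetical. It returns only the score of the
single best local alignment, not a sum over separate sections.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score of a vs b (linear gap penalty).

    mismatch and gap are assumed values; the post only specifies
    a score of 1 for identical residues.
    """
    prev = [0] * (len(b) + 1)  # previous row of the DP matrix
    best = 0
    for ca in a:
        curr = [0]  # first column is always 0
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            # Flooring at 0 is what makes the alignment local:
            # a bad prefix is dropped rather than carried forward.
            cell = max(0, diag, up, left)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


print(smith_waterman_score("AAAABBBB", "AAAABBBB"))
print(smith_waterman_score("AAAABBBB", "BBBBAAAA"))
print(smith_waterman_score("ABCDEFGH", "DCHABGEF"))
```

Under these assumed penalties the three calls should print 8, 4 and 2.
In the second case only one of the two sections (AAAA or BBBB) can be
part of a single alignment, because the order of the blocks is swapped.
The floor at 0 also means the score can never be negative.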
It's a really basic question, sorry about that!