[BiO BB] Testing a smith-waterman algorithm?
boris.steipe at utoronto.ca
Sat Mar 11 14:26:37 EST 2006
Cute example. I've changed it slightly to illustrate the main point
using the strings (not words!)
disestabishmentarianism (quoting you, with the deleted "l")
and
reestablishedfederalagressivism
Then I ran them through "needle" and "water" of EMBOSS (Google
for "EMBOSS GUI").
The local alignment answers the question: "What is the region of
highest similarity between two sequences?" We use that in a database
search to find evidence for homology. We don't require the sequences
to be similar over their whole length; in fact, if they share only
some related domains, forcing the unrelated regions to be part of
the comparison would cause problems.
  4 estab-ish     11
    ||||| |||
  3 establish     11
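
For testing your own code, here is a minimal Smith-Waterman sketch in
Python. It is not what EMBOSS "water" does internally: water uses a
substitution matrix and affine gap penalties, while this sketch assumes
a flat match/mismatch score and a linear gap penalty (the values 2, -1
and -3 are my own illustrative choices). The function name and return
layout are likewise just for this example.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-3):
    """Best local alignment of a and b under assumed, illustrative scores.

    Returns (score, aligned_a, aligned_b, start_a, start_b); the start
    positions are 1-based, like the coordinates in the EMBOSS output.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of any local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap      # a[i-1] aligned to a gap in b
            left = H[i][j - 1] + gap    # b[j-1] aligned to a gap in a
            # The 0 is what makes the alignment local: it can restart anywhere.
            H[i][j] = max(0, diag, up, left)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until the score reaches 0.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b)), i + 1, j + 1


result = smith_waterman("disestabishmentarianism",
                        "reestablishedfederalagressivism")
print(result)
```

With these assumed scores the traceback recovers the same aligned
region as above (estab-ish against establish, starting at positions 4
and 3). Other scoring choices can shift the boundaries, so a test
should pin down the scoring scheme along with the expected alignment.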
The global alignment answers the question: "What is the best alignment
of two sequences over their entire length?" We use it when we assume
(or would like to test if ...) the two sequences are related over
their whole length.
  1 disestab-ish--mentarian------ism     23
     ..||||| |||  .|:...:|.      |||
  1  reestablishedfederalagressivism     31
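
The global counterpart differs in only two places: the first row and
column are filled with gap penalties instead of zeros, and cell scores
are not clamped at zero. Here is a minimal Needleman-Wunsch sketch in
Python under the same assumptions as before (flat match/mismatch
scores and a linear gap penalty, values chosen by me for illustration).
It will not necessarily reproduce the "needle" output above, since
needle scores with a substitution matrix and affine gaps.

```python
def needleman_wunsch(a, b, match=2, mismatch=-1, gap=-3):
    """Global alignment of a and b under assumed, illustrative scores.

    Returns (score, aligned_a, aligned_b).
    """
    rows, cols = len(a) + 1, len(b) + 1
    F = [[0] * cols for _ in range(rows)]
    # Unlike the local case, leading gaps are paid for.
    for i in range(1, rows):
        F[i][0] = i * gap
    for j in range(1, cols):
        F[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = F[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # No max(0, ...) here: every residue must be accounted for.
            F[i][j] = max(diag, F[i - 1][j] + gap, F[i][j - 1] + gap)

    # Trace back from the bottom-right corner to the top-left corner.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1]
                + (match if a[i - 1] == b[j - 1] else mismatch)):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return F[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = needleman_wunsch("disestabishmentarianism",
                                      "reestablishedfederalagressivism")
print(score)
print(top)
print(bottom)
```

A useful sanity check for any global implementation: both output rows
must have the same length, and removing the "-" characters must give
back the two input strings unchanged.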
Importantly: the local alignment (Smith-Waterman) shows us only part
of what's actually there, but that part is highlighted more clearly.
So: database search -> local alignment, detailed analysis -> global
alignment (plus taking into account suboptimal alignments as well).
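
To make the "database search -> local alignment" half concrete, here
is a toy sketch in Python: a score-only Smith-Waterman (no traceback,
one row of memory) used to rank a handful of strings against a query.
The three-entry "database", the function name and the scores (2, -1,
-3, linear gaps) are all assumptions for illustration; real search
tools use substitution matrices, affine gaps and score statistics.

```python
def local_score(a, b, match=2, mismatch=-1, gap=-3):
    """Smith-Waterman score only, keeping just the previous matrix row."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
        best = max(best, max(curr))
        prev = curr
    return best


query = "establishment"
# Hypothetical toy database built from the strings in this thread.
database = [
    "extrapolate",
    "reestablishedfederalagressivism",
    "disestablishmentarianism",
]

# Rank entries by local score, best hit first.
hits = sorted(database, key=lambda entry: local_score(query, entry), reverse=True)
for entry in hits:
    print(local_score(query, entry), entry)
```

The non-matching ends of the long entries cost nothing here, which is
why the entry containing the whole query ranks first regardless of
what surrounds it.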
On 11 Mar 2006, at 09:35, Theodore H. Smith wrote:
> Hi people,
> I've successfully designed, written and compiled a program that
> uses the Smith-Waterman algorithm.
> Nothing new there, but it's for an interesting project, and before
> the project is complete, perhaps some questions asked to
> bioinformaticians can help bring me up to your level.
> The next stage after compiling is testing my algorithm. I now must
> write some tests for my code.
> This is where I am seeing that I'm unsure if I even understand
> Smith-Waterman properly! I understand Levenshtein OK (similar to
> Needleman-Wunsch), but Smith-Waterman I'm a bit unclear on.
> Mostly I'm wondering exactly how local matching helps us, compared
> with global matching. I got a lay person's description of why it
> helps, but I'm more interested in getting an exact feel for it.
> Does it make sense to use English words as an example here, instead
> of protein sequences? That would help me understand this a bit
> better, as I have a better feel for English than proteins (unlike
> many of you).
> Would then the main advantage be, for searching for short sequences
> within long ones, without being unfairly penalised by the non-
> matching ends of the long sequence?
> For example: "extrapolate" could match "extra", far better in Smith-
> Waterman than it could using Levenshtein, because we aren't being
> penalised so badly by the "polate" part.
> Or perhaps: "specialisation" would match "lisation" far better
> using local than global, because we aren't being penalised by the
> "specia" part so much.
> Or even: "disestablishmentarianism" would match "establishment" far
> better using local than global, because we aren't being penalised
> by "dis" or "arianism".
> Is that how local searches like Smith-Waterman benefit us?
> What about when we are searching for two long sequences of which
> only a small part will match?
> Let's say "disestabishmentarianism" against
> A local alignment should be able to figure out that "establishment"
> aligns well in this case.
> Is that basically how Smith-Waterman helps us?