[BiO BB] Testing a smith-waterman algorithm?

Sat Mar 11 14:26:37 EST 2006

Cute example. I've changed it slightly to illustrate the main point  
using the strings (not words!)

disestabishmentarianism (quoting you, with the deleted "l")
reestablishedfederalagressivism

Then I've run them through "needle" and "water" of EMBOSS. (Google  
for "EMBOSS GUI").

The local alignment answers the question: "What is the region of
highest similarity between two sequences?" We use that in a database
search to find evidence for homology. We don't require the sequences
to be similar over their whole length. In fact, if they share only
some related domains, forcing the unrelated regions to be part of
the comparison would cause problems.

  4 estab-ish     11
    ||||| |||
  3 establish     11
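
If you want something to test your own code against, here is a minimal
Python sketch of Smith-Waterman. The function name and the scores
(match +2, mismatch -1, linear gap -2) are my assumptions for
illustration only; "water" uses a substitution matrix and affine gap
penalties, so the scores themselves are not comparable to EMBOSS.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment sketch with a simple linear gap penalty.

    Returns (score, aligned_a, aligned_b, (start_a, end_a), (start_b, end_b))
    with 1-based, inclusive coordinates.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; the floor at 0 is what makes the alignment local.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,     # gap in b
                          H[i][j - 1] + gap)     # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to 0.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1

    return (best,
            "".join(reversed(top)),
            "".join(reversed(bottom)),
            (i + 1, best_pos[0]),
            (j + 1, best_pos[1]))


result = smith_waterman("disestabishmentarianism",
                        "reestablishedfederalagressivism")
print(result)
```

With these assumed scores the sketch should pick out the same
estab-ish / establish region and the same coordinates as shown above.
Everything outside that region simply does not enter the score.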

The global alignment answers the question: "What is the best alignment
of two sequences over their whole length?" We use it when we assume (or
would like to test whether) the two sequences are related over their
whole length.

  1 disestab-ish--mentarian------ism     23
     ..||||| |||  .|:...:|.      |||
  1  reestablishedfederalagressivism     31
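
For comparison, a minimal Python sketch of the global (Needleman-Wunsch)
case. Again the function name and the scores (match +1, mismatch -1,
linear gap -1) are assumptions for illustration; "needle" uses a
substitution matrix and affine gaps, so do not expect this to reproduce
the alignment above column for column.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment sketch with a simple linear gap penalty.

    Returns (score, aligned_a, aligned_b).
    """
    rows, cols = len(a) + 1, len(b) + 1
    F = [[0] * cols for _ in range(rows)]

    # Unlike the local case, leading gaps are charged: no free start.
    for i in range(1, rows):
        F[i][0] = i * gap
    for j in range(1, cols):
        F[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                          F[i - 1][j] + gap,     # gap in b
                          F[i][j - 1] + gap)     # gap in a

    # Trace back from the bottom-right corner to the top-left corner,
    # so every character of both strings ends up in the alignment.
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:
                top.append(a[i - 1]); bottom.append(b[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1

    return F[len(a)][len(b)], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = needleman_wunsch("disestabishmentarianism",
                                      "reestablishedfederalagressivism")
print(score)
print(top)
print(bottom)
```

Since you know Levenshtein: with match=0, mismatch=-1, gap=-1 the score
returned here is the negative of the edit distance, which makes a handy
sanity check for a test suite.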

Importantly: the local alignment (Smith-Waterman) shows us only part  
of what's actually there, but that part is highlighted more clearly.  
So: database search -> local alignment, detailed analysis -> global  
alignment (plus taking into account suboptimal alignments as well).

HTH,
B.

On 11 Mar 2006, at 09:35, Theodore H. Smith wrote:

>
> Hi people,
>
> I've successfully designed, written and compiled a program that  
> uses the smith-waterman algorithm.
>
> Nothing new there, but it's for an interesting project, and before  
> the project is complete, perhaps some questions asked to  
> bioinformaticians can help bring me up to your level.
>
> The next stage after compiling is testing my algorithm. I now must
> write some tests for my code.
>
> This is where I am seeing that I'm unsure if I even understand  
> Smith-Waterman properly! I understand Levenshtein OK (similar to  
> Needleman-Wunsch), but Smith-Waterman I'm a bit unclear on.
>
> Mostly I'm wondering exactly how local matching helps us over
> global matching. I got a lay person's description of why it helps,
> but I'm more interested in getting an exact feel for it.
>
> Does it make sense to use English words as an example here, instead  
> of protein sequences? That would help me understand this a bit  
> better, as I have a better feel for English than proteins (unlike  
> many of you).
>
> Would the main advantage then be in searching for short sequences
> within long ones, without being unfairly penalised by the
> non-matching ends of the long sequence?
>
> For example: "extrapolate" could match "extra" far better in
> Smith-Waterman than it could using Levenshtein, because we aren't
> being penalised so badly by the "polate" part.
>
> Or perhaps: "specialisation" would match "lisation" far better  
> using local than global, because we aren't being penalised by the  
> "specia" part so much.
>
> Or even: "disestablishmentarianism" would match "establishment" far  
> better using local than global, because we aren't being penalised  
> by "dis" or "arianism".
>
> Is that how local searches like Smith-Waterman benefit us?
>
>
> What about when we are searching for two long sequences of which  
> only a small part will match?
>
> Let's say "disestabishmentarianism" against  
> "reestablishmentSomeNonMatchingPart".
>
> A local alignment should be able to figure out that "establishment"  
> aligns well in this case.
>
> Is that basically how Smith-Waterman helps us?
>
> --
> http://elfdata.com/plugin/