JaMBW Chapter 3.2.1

Word Comparison


Aim

Given one or two nucleic acid or amino acid sequences, this program lets you visually compare their degree of similarity, estimated by exact matching of short segments (words).

Mode of operation

This program uses two windows for entering the sequence(s) to analyze and two small fields for specifying the word size and the step size. It returns the results as a 2D plot and a short line of text. The following steps must be performed:
  1. Sequence input
  2. Word size
    It indicates the length of the identical polynucleotide or polypeptide segment that must be found in both sequences in order to generate a dot on the chart. The bigger the word size, the lower the probability that the same segment is present in both sequences by chance. A word size of 2 can be used to identify di-nucleotide/peptide repeats, a word size of 3 tri-nucleotide/peptide repeats, and so on. A word size of 1 produces the highest number of dots, which makes long stretches of similarity harder to identify. Assuming equally frequent, independent residues, a word size of 6 applied to nucleic acids gives a random match probability of about 0.024% (0.25^6), while applied to proteins it gives a random match probability of 0.0000015625% (0.05^6).

  3. Sliding step
    It indicates how far to advance along the sequence between successive comparisons. A sliding step of 1 computes a comparison at every position along the sequence, while a value greater than 1 skips positions, introducing "jumps" along the sequence.

  4. Compute
    Once the sequence(s) have been placed in the input windows, pressing the "COMPUTE" button computes the word comparison and displays the resulting plot.
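
The steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the applet's actual code: the function name `word_matches` and its parameters are hypothetical, and it assumes that the sliding step is applied to both sequences, which the applet may handle differently. A dictionary lookup is used for brevity.

```python
def word_matches(seq_a, seq_b, word_size=2, step=1):
    """Return (i, j) pairs where the word starting at seq_a[i]
    is identical to the word starting at seq_b[j].

    Assumption: the sliding step is applied to both sequences.
    """
    if word_size < 1 or step < 1:
        raise ValueError("word_size and step must be positive integers")

    # Index every word of seq_b by its start position.
    positions = {}
    for j in range(0, len(seq_b) - word_size + 1, step):
        positions.setdefault(seq_b[j:j + word_size], []).append(j)

    # Look up each word of seq_a in that index; each hit is one dot.
    dots = []
    for i in range(0, len(seq_a) - word_size + 1, step):
        for j in positions.get(seq_a[i:i + word_size], []):
            dots.append((i, j))
    return dots


# Two consecutive dots on a diagonal: "TTA" and "TAC" occur in both sequences.
print(word_matches("GATTACA", "TTAC", word_size=3, step=1))  # [(2, 0), (3, 1)]
```

With `step=2` the same input yields only `[(2, 0)]`, because positions 1 and 3 are skipped. With `word_size=1` it yields 8 dots, which illustrates how short words crowd the plot.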

A Java-enabled browser would have in this place a window similar to this picture:

How to understand its output

A set of dots will appear, identifying identical elements in the two sequences.
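
When reading the plot, it can help to know roughly how many dots would appear by chance alone. The sketch below reproduces the word-size probabilities quoted under "Word size" and scales them to a whole plot. It assumes equally frequent, independent residues and a sliding step of 1. The function names are hypothetical and are not part of the applet.

```python
def random_match_probability(alphabet_size, word_size):
    """Chance that two random words of this length are identical,
    assuming uniform, independent residues."""
    return (1.0 / alphabet_size) ** word_size


def expected_chance_dots(len_a, len_b, alphabet_size, word_size):
    """Expected number of chance dots when every pair of word
    positions is compared (sliding step of 1)."""
    words_a = max(len_a - word_size + 1, 0)
    words_b = max(len_b - word_size + 1, 0)
    return words_a * words_b * random_match_probability(alphabet_size, word_size)


p_dna = random_match_probability(4, 6)       # 0.25 ** 6, about 0.024%
p_protein = random_match_probability(20, 6)  # 0.05 ** 6, about 0.0000016%
print(p_dna * 100, p_protein * 100)
```

Under these assumptions, two unrelated nucleic acid sequences of 1005 residues each, compared with a word size of 6, would be expected to show about 244 chance dots. Real sequences have biased composition, so this is only a rough background estimate.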

If instead you analyze only a single sequence, the dots show the repeats of the given word size that are present in it.
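
For the single-sequence case, every word trivially matches itself, so the main diagonal is always filled and the informative dots are those off the diagonal. A minimal sketch of finding those repeats follows. The function name `repeated_words` is hypothetical, a sliding step of 1 is assumed, and it is not the applet's own code.

```python
def repeated_words(seq, word_size=2):
    """Map each word that occurs more than once in seq
    to the list of its start positions."""
    positions = {}
    for i in range(len(seq) - word_size + 1):
        positions.setdefault(seq[i:i + word_size], []).append(i)
    # Keep only words seen at two or more positions; these are the repeats.
    return {word: starts for word, starts in positions.items() if len(starts) > 1}


print(repeated_words("ACGTACGTTT", word_size=3))
# {'ACG': [0, 4], 'CGT': [1, 5]}
```

Each pair of positions for a repeated word corresponds to two off-diagonal dots, one on each side of the main diagonal, for example (0, 4) and (4, 0) for "ACG".
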
The "word comparison" is a conceptually very simple analysis that can nevertheless produce useful and deep insights. It can be used to analyse both single sequences and pairs of sequences.


How to appreciate the "word comparison"


References

  1. Wilbur, W.J. and Lipman, D.J. (1983) Rapid similarity searches of nucleic acid and protein data banks. Proc. Natl. Acad. Sci. USA 80, 726-730
  2. Doelz, R. (1990) BioCompanion, Biocomputing Essentials series, ISBN 3-905 434-00-8

Author: Luca I.G. TOLDO, Edition date: 28 February 1997