JaMBW Chapter 3.2.1

Word Comparison

Aim

Given one or two sequences of nucleic acids or amino acids, this program lets you visually compare their similarity, as estimated by perfect matching of short segments.

Mode of operation

This program uses two windows for reading the sequence(s) to analyze and two small fields for specifying the requested word and step sizes, and it returns the results as a 2D plot and a short line of text. The following steps must be performed:

Sequence input
- Symbols used
  Either paste or type the sequence of interest into the top area. Any character or symbol that does not belong to the [a-zA-Z] set is ignored.
- Removal of header information
  Only the sequence itself must be placed in the top window: heading comments must be removed.
- Long sequences and small window
  So that users with small screens can still use this program, each window has been made rather small. Use the scroll bars to move around in the input and output windows. The suggested strategy is to double-click in the relevant area and then copy and paste from or to your text editor of choice, or across different applications.
- Single or pair analysis
  If the aim of the analysis is to identify repeats in a single sequence, there is no need to paste it into both windows: the program does that automatically when only one window is used.
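The input rules above can be sketched in Python as follows. This is only an illustration, not the applet's own code: the function names `clean_sequence` and `prepare_inputs` are hypothetical, and the sketch assumes that letter case is left untouched.

```python
import re


def clean_sequence(raw):
    """Drop every character outside [a-zA-Z], as the input rule describes."""
    return re.sub(r"[^a-zA-Z]", "", raw)


def prepare_inputs(top, bottom=""):
    """Return the (horizontal, vertical) sequences to compare.

    Assumption: when only one window holds a sequence, that sequence
    is compared against itself.
    """
    first = clean_sequence(top)
    second = clean_sequence(bottom)
    if not first:
        # Only the second window was filled in: use it for both axes.
        first = second
    if not second:
        # Only the first window was filled in: use it for both axes.
        second = first
    return first, second
```

Under these assumptions, `clean_sequence(">seq1\nACGT")` returns `"seqACGT"`: the letters of a header line survive the filter, which is why heading comments have to be removed by hand.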
Word size
The word size is the length of the identical polynucleotide or polypeptide segment that must be found in both sequences in order to generate a dot on the chart. The larger the word size, the lower the probability that the same segment is present in both sequences. A word size of 2 can be used to identify di-nucleotide/peptide repeats, a word size of 3 tri-nucleotide/peptide repeats, and so on. A word size of 1 produces the highest number of dots, which makes long stretches of similarity harder to identify. Assuming equally likely symbols, a word size of 6 applied to nucleic acids gives a random match probability of about 0.024% (0.25⁶), while applied to proteins it gives a random match probability of 0.0000015625% (0.05⁶).
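These figures can be checked with a few lines of Python. The sketch assumes that all symbols are equally likely and independent, so that a word of length w over an alphabet of k symbols matches at a given position with probability (1/k)^w; the function name is hypothetical.

```python
def random_match_probability(alphabet_size, word_size):
    """Chance that two random words of the given length are identical,
    assuming equally likely, independent symbols."""
    return (1.0 / alphabet_size) ** word_size


for label, alphabet_size in (("nucleic acids", 4), ("proteins", 20)):
    p = random_match_probability(alphabet_size, 6)
    # Print the probability both as a fraction and as a percentage.
    print(f"{label}: p = {p:.4g} ({p * 100:.10f} %)")
```

For a word size of 6 this evaluates to roughly 0.0244 % for nucleic acids (4 symbols) and roughly 0.0000016 % for proteins (20 symbols). Real sequences have biased composition, so these values are only a rough guide.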
Sliding step
The sliding step indicates how the computation proceeds along the sequence. A step of 1 computes a word at every position along the sequence, while a value greater than 1 introduces "jumps" across the sequence.
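The effect of the step can be illustrated by listing the positions at which words are read. This is a minimal sketch with a hypothetical function name; it assumes that positions are counted from 0 and that the first word always starts at the first symbol, neither of which is stated for the applet itself.

```python
def word_starts(sequence_length, word_size, step=1):
    """0-based start positions of the words examined along one sequence."""
    if step < 1 or word_size < 1:
        raise ValueError("word_size and step must be at least 1")
    last_start = sequence_length - word_size
    # An empty list results when the sequence is shorter than the word.
    return list(range(0, last_start + 1, step))
```

For a sequence of 10 symbols and a word size of 3, a step of 1 gives the starts 0 to 7, while a step of 3 gives only 0, 3 and 6, so words beginning at the skipped positions are never compared.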
Compute
Once the sequence is placed in the top window, pressing the "COMPUTE" button computes the word comparison and displays the result.

In a Java-enabled browser, a window similar to this picture would appear here:

How to understand its output

A set of dots will appear, identifying identical elements in the two sequences.

If instead you analyze only a single sequence, the dots show the repeats of the given size that are present in it.
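How such a set of dots can be obtained is sketched below. This is one common way of finding identical words, using a lookup table of word positions; it is not necessarily how the applet works. The function name, the 0-based coordinates, the case-sensitive comparison and the use of the same step on both sequences are all assumptions.

```python
from collections import defaultdict


def word_matches(horizontal, vertical, word_size, step=1):
    """Return sorted (x, y) pairs such that the word starting at x in
    `horizontal` is identical to the word starting at y in `vertical`."""
    # Index every word of the vertical sequence by its start position.
    index = defaultdict(list)
    for y in range(0, len(vertical) - word_size + 1, step):
        index[vertical[y:y + word_size]].append(y)

    # Look up each word of the horizontal sequence in that index.
    dots = []
    for x in range(0, len(horizontal) - word_size + 1, step):
        for y in index.get(horizontal[x:x + word_size], ()):
            dots.append((x, y))
    return sorted(dots)
```

With these assumptions, `word_matches("GATTACA", "TTAC", 3)` returns `[(2, 0), (3, 1)]`: two neighbouring dots on a diagonal, marking the shared stretch TTAC. Passing the same sequence twice gives the single-sequence case.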
The "word comparison" is a conceptually very simple analysis that can nevertheless produce useful and deep insights. It can be used to analyze both single sequences and pairs of sequences:

Single sequence
Analyzing a single sequence with this program identifies repeats in a very straightforward manner: identical elements located in different parts of the sequence show up as dots that "join" the same word present at different locations along the sequence. Clicking on a dot reveals (in the upper window) the matching word and its location on the sequence. The terms "horizontal" and "vertical sequence" remind the user that the drawing represents word identity along two sequences, one placed across the page and one down it.
Pair of sequences
Analyzing a pair of sequences with the "word comparison" identifies their common elements (defined as short identical stretches). This makes comparing pairs of sequences easy and the results straightforward to interpret.
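For the single-sequence case, the information revealed by clicking on a dot (the matching word and its locations) can also be listed directly. The sketch below is an illustration under assumptions: the function name is hypothetical, positions are 0-based, the step is fixed at 1, and overlapping occurrences are counted.

```python
from collections import defaultdict


def repeated_words(sequence, word_size):
    """Map each word that occurs more than once to its start positions."""
    positions = defaultdict(list)
    for start in range(len(sequence) - word_size + 1):
        positions[sequence[start:start + word_size]].append(start)
    # Keep only the words found at two or more locations.
    return {word: starts for word, starts in positions.items() if len(starts) > 1}
```

For example, `repeated_words("ACGACG", 3)` returns `{"ACG": [0, 3]}`. In the plot, a word repeated at positions i and j appears as the two off-diagonal dots (i, j) and (j, i).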

How to appreciate the "word comparison"

Word Size
Try varying the word size and see how the pattern changes. Is there any relation between the word size and the "similarity" of the sequences? What happens when the word size is very small? How useful does the pattern become? Could the word size be considered a "filter" that removes noisy information? When the word size is 1 and there is a single sequence in the input, what does each dot represent? What is the meaning of the symmetry in the resulting pattern?
Step size
What happens when you vary the step size? Try changing the step size and the word size and observe how the pattern changes. Can you draw any conclusions? Should the step size also be considered a "filter"? How efficient is the filtering when performed by the step size, by the word size, and by their combination?
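These experiments can also be tried outside the applet by counting the dots obtained for different settings. The sketch below is a plain brute-force count, not the applet's code; the function name and the example sequence are made up for illustration, and the same step is assumed to apply to both sequences.

```python
def count_dots(horizontal, vertical, word_size, step=1):
    """Count the (x, y) position pairs whose words are identical."""
    count = 0
    for x in range(0, len(horizontal) - word_size + 1, step):
        word = horizontal[x:x + word_size]
        for y in range(0, len(vertical) - word_size + 1, step):
            if vertical[y:y + word_size] == word:
                count += 1
    return count


# Hypothetical example sequence, compared against itself.
sequence = "ATATATGCGC"
for word_size in (1, 2, 3):
    for step in (1, 2):
        dots = count_dots(sequence, sequence, word_size, step)
        print(f"word size {word_size}, step {step}: {dots} dots")
```

Comparing the counts across settings gives a first, numerical impression of how each parameter thins out the pattern before looking at the plot itself.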

References

Wilbur, W.J. and Lipman, D.J. (1983) Rapid similarity searches of nucleic acid and protein data banks. Proc Natl Acad Sci USA 80, 726-730
Doelz, R. (1990) BioCompanion, Biocomputing Essentials series, ISBN 3-905 434-00-8

Author: Luca I.G. TOLDO, Edition date: 28 February 1997