JaMBW Chapter 3.2.1
Word Comparison
Aim
Given one or two sequences of nucleic acids or amino acids, this program
lets you visually compare their degree of similarity, as estimated
by perfect matching of short segments.
Mode of operation
This program uses two windows for reading the sequence(s) to analyze,
two small fields to specify the requested word and step sizes, and returns
the results as a 2D plot and a small text line. The following steps must
be performed:
- Sequence input
- Symbols used
Either paste or type the sequence of interest into the top area. Any
character or symbol that does not belong to the [a-zA-Z] set is
ignored.
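The filtering rule can be sketched in a few lines of Python. This is only an illustration of the stated rule, not the program's own code; the function name `clean_sequence` is hypothetical.

```python
import re

def clean_sequence(raw_text):
    """Keep only characters in [a-zA-Z]; digits, spaces, line breaks
    and punctuation are dropped."""
    return re.sub(r"[^a-zA-Z]", "", raw_text)

# Position numbers, blanks and dashes disappear; letters are kept as typed.
print(clean_sequence("1 ACGT acgt-NN\n61 TTGA"))  # prints ACGTacgtNNTTGA
```

Note that a rule of this kind cannot tell a header from a sequence: any letters in a comment line would pass the filter and be treated as residues.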
- Removal of header information
Only the sequence itself must be placed in the top window;
heading comments must be removed.
- Long sequences and small window
So that users with small screens can still use this
program, each window has been made rather small. Therefore,
use the scroll bars to move around in the input and output windows.
The suggested strategy is to double-click in the relevant area and then
copy and paste from or to the text editor of your choice, or across different
applications.
- Single or pair analysis
If the aim of the analysis is to identify repeats within a single sequence,
there is no need to paste it into both windows: the program does
that automatically when only one window is used.
- Word size
It indicates the length of the identical polynucleotide or polypeptide
that must be found in both sequences in order to generate a dot on the
chart. The larger the word size, the lower the probability that the
same segment is present in both sequences by chance. A word size of 2 can be used
to identify dinucleotide/dipeptide repeats, a word size of 3 trinucleotide/tripeptide
repeats, and so on. A word size of 1 will produce the highest number of
dots, which makes long stretches of similarity harder to identify.
Assuming all letters are equally likely, a word size of 6 applied to nucleic
acids gives a chance-match probability of about 0.024% (0.25^6), while applied
to proteins it gives a chance-match probability of 0.0000015625% (0.05^6).
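The chance-match figures follow from a simple formula: with an alphabet of N equally likely letters, two independent words of length w are identical with probability (1/N)^w. The Python sketch below reproduces the arithmetic under that equal-frequency assumption; the function name is hypothetical, and real sequences with biased composition will deviate from these values.

```python
def random_match_probability(word_size, alphabet_size):
    """Probability that two random words of length word_size are identical,
    assuming every letter of the alphabet is equally likely."""
    return (1.0 / alphabet_size) ** word_size

# Word size 6: 4-letter nucleotide alphabet vs. 20-letter amino acid alphabet.
for name, size in (("nucleic acid", 4), ("protein", 20)):
    p = random_match_probability(6, size)
    print(f"{name}: {p:.3e} ({p * 100:.10f} %)")
```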
- Sliding step
It indicates how the computation advances along the sequence. A sliding
step of 1 takes a word at every
position along the sequence, while a value greater than 1 introduces "jumps"
across the sequence.
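As an illustration of the sliding step, the following Python sketch lists the words that would be examined for a given word size and step. The function name `word_positions` and the 0-based positions are assumptions made for this example; the program itself may number positions differently.

```python
def word_positions(sequence, word_size, step):
    """Return (start, word) pairs, advancing the start position by `step`."""
    last_start = len(sequence) - word_size
    return [(i, sequence[i:i + word_size])
            for i in range(0, last_start + 1, step)]

print(word_positions("ACGTAC", 2, 1))  # every position: 5 words
print(word_positions("ACGTAC", 2, 2))  # jumps of 2: 3 words
```

With a step of 2 the words starting at positions 1 and 3 are never examined, so any match that involves only those words is missed.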
- Compute
Once the sequence is placed in the top window, pressing the "COMPUTE"
button computes all the word matches and displays them as a dot plot.
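The steps above can be put together in a short Python sketch. It is a minimal illustration under stated assumptions, not the program's actual implementation: the name `word_matches` is hypothetical, positions are 0-based, matching is exact and case-sensitive, and the sliding step is assumed to apply to both sequences.

```python
from collections import defaultdict

def word_matches(seq_a, seq_b="", word_size=2, step=1):
    """Return (i, j) pairs where the word starting at i in seq_a is
    identical to the word starting at j in seq_b. If seq_b is empty,
    seq_a is compared with itself."""
    if not seq_b:
        seq_b = seq_a
    # Index every word of seq_b by its start positions.
    index = defaultdict(list)
    for j in range(0, len(seq_b) - word_size + 1, step):
        index[seq_b[j:j + word_size]].append(j)
    # Look up each word of seq_a; every hit is one dot on the plot.
    dots = []
    for i in range(0, len(seq_a) - word_size + 1, step):
        for j in index.get(seq_a[i:i + word_size], []):
            dots.append((i, j))
    return dots

print(word_matches("ACGT", "TACG", word_size=3))  # [(0, 1)]
print(word_matches("ATAT", word_size=2))
```

Each returned pair corresponds to one dot, with i along one axis and j along the other.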
How to understand its output
A set of dots will appear, identifying identical elements in the two sequences.
If instead you analyze only a single sequence, the dots show the repeats of the given size that are present in it.
The "word comparison" is a conceptually very simple analysis that can produce very
useful and deep insights. It can be used to analyse both single sequences and pairs of
sequences:
- Single sequence
Analyzing a single sequence with this program allows repeats to be identified in a very straightforward manner.
Identical elements located in different parts of the sequence show up as dots
that "join" the same word present at different locations along the sequence. Clicking on a dot
reveals (in the upper window) the matching word and its location in the sequence. The terms "horizontal sequence" and "vertical sequence"
are used to remind the user that the plot represents word identity between two sequences, one placed
across the page and one down it.
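For a single sequence, the same information can be expressed as a table of repeated words and their locations. The Python sketch below is an illustration only; the function name `find_repeats` and the 0-based positions are assumptions, not part of the program.

```python
from collections import defaultdict

def find_repeats(sequence, word_size):
    """Map each word that occurs at least twice to the list of its
    start positions in the sequence."""
    positions = defaultdict(list)
    for i in range(len(sequence) - word_size + 1):
        positions[sequence[i:i + word_size]].append(i)
    return {word: starts for word, starts in positions.items()
            if len(starts) > 1}

print(find_repeats("ATGCATGC", 4))  # {'ATGC': [0, 4]}
```

In the plot, the word ATGC found at positions 0 and 4 gives two off-diagonal dots, one on each side of the main diagonal, in addition to the dots where each occurrence matches itself.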
- Pair of sequences
Pairs of sequences analyzed by the "word comparison" allow identification of common elements (defined as short identical
stretches). This makes comparing pairs of sequences easy and the interpretation of the results
straightforward.
How to appreciate the "word comparison"
- Word Size
Try varying the word size and see how the pattern changes. Is there any relation between the word size and the "similarity"
of the sequences? What happens when the word size is very small? How useful does the pattern become? Would you
consider the word size a "filter" that removes noisy information? When the word size is 1 and there is a single sequence
in the input, what does each dot represent? What is the meaning of the symmetry seen in the pattern?
- Step size
What happens when you vary the step size? Try changing the step size and the word size and observe how the pattern changes.
Can you draw any conclusions? Should the step size also be considered a "filter"? How efficient is the filtering when performed by
the step size, by the word size, and by their combination?
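One way to explore these questions away from the plot is to count the dots for different settings. The Python sketch below is a hypothetical helper, not part of the program; it assumes exact, case-sensitive matching and a sliding step applied to both sequences. The two example sequences are made up for illustration.

```python
from collections import Counter

def count_dots(seq_a, seq_b, word_size, step=1):
    """Count the dots a word comparison would draw for these settings."""
    def words(seq):
        return Counter(seq[i:i + word_size]
                       for i in range(0, len(seq) - word_size + 1, step))
    words_a, words_b = words(seq_a), words(seq_b)
    # A word seen n times in seq_a and m times in seq_b gives n * m dots.
    return sum(n * words_b[word] for word, n in words_a.items())

seq_a = "GATTACAGATTACA"   # made-up example sequences
seq_b = "CATTAGATTACC"
for word_size in range(1, 7):
    print(word_size, count_dots(seq_a, seq_b, word_size))
```

With a step of 1 the count can only stay the same or fall as the word size grows, since every match of a longer word contains a match of the shorter word at the same position. Increasing the step also removes dots, but it does so by skipping positions rather than by requiring longer matches.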
Author: Luca I.G. TOLDO
Edition date: 28 February 1997