Sequence alignment

From Bioinformatics.Org Wiki

Latest revision as of 03:01, 24 November 2010

When two symbolic representations of DNA or protein sequences are arranged next to one another so that their most similar elements are juxtaposed, they are said to be aligned. Many bioinformatics tasks depend upon successful alignments. Alignments are conventionally shown as traces.

In a symbolic sequence, each base or residue monomer is represented by a letter. The convention is to print the single-letter codes for the constituent monomers in order in a fixed-width font (from the N-terminal to the C-terminal end of a protein sequence, or from 5' to 3' of a nucleic acid molecule). This is based on the assumption that the combined monomers are evenly spaced along the single dimension of the molecule's primary structure. From now on we will refer to an alignment of two protein sequences.

Every element in a trace is either a match or a gap. Where a residue in one of two aligned sequences is identical to its counterpart in the other, the corresponding amino-acid letter codes are vertically aligned in the trace: a match. When a residue in one sequence seems to have been deleted since the assumed divergence of the sequence from its counterpart, its "absence" is marked by a dash in the derived sequence. When a residue appears to have been inserted to produce a longer sequence, a dash appears opposite it in the unaugmented sequence. Since these dashes represent "gaps" in one or other sequence, the action of inserting such spacers is known as gapping.

A deletion in one sequence is symmetric with an insertion in the other: when one sequence is gapped relative to another, a deletion in sequence a can be seen as an insertion in sequence b. Indeed, the two types of mutation are referred to together as indels. If we imagine that at some point one of the sequences was identical to its primitive homologue, then a trace can represent the three ways in which divergence could have occurred at that point.

Biological interpretation of an alignment

A trace can represent a substitution:

 AKVAIL
 AKIAIL

A trace can represent a deletion:

 VCGMD
 VCG-D

A trace can represent an insertion:

 GS-K
 GSGK
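
The three cases above can be read off a trace column by column. The following Python sketch is one way to do that, assuming the first row is the ancestral sequence and the second the derived one, as in the examples. The function name classify_columns and the label strings are illustrative choices, not part of any standard tool.

```python
GAP = "-"


def classify_columns(ancestral, derived):
    """Label each column of a two-row trace.

    Assumption: the first row is the ancestral sequence and the second
    row is the derived one, so a dash in the second row is read as a
    deletion and a dash in the first row as an insertion.
    """
    if len(ancestral) != len(derived):
        raise ValueError("both rows of a trace must have the same length")

    labels = []
    for a, d in zip(ancestral, derived):
        if a == GAP and d == GAP:
            raise ValueError("a column cannot be a gap in both rows")
        if d == GAP:
            labels.append("deletion")
        elif a == GAP:
            labels.append("insertion")
        elif a == d:
            labels.append("match")
        else:
            labels.append("substitution")
    return labels


# The three example traces from the text.
print(classify_columns("AKVAIL", "AKIAIL"))
print(classify_columns("VCGMD", "VCG-D"))
print(classify_columns("GS-K", "GSGK"))
```

Swapping the two arguments turns every deletion into an insertion and vice versa, which is the symmetry between the two kinds of indel.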

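We do not represent a silent mutation: it changes the nucleotide sequence without changing the encoded amino acid, so the two protein sequences remain identical at that position.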

Traces may represent recent genetic changes that obscure older ones. Here we have represented only point mutations, for simplicity; actual mutations often insert or delete several residues.
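
Indels that span several residues show up in a trace as runs of consecutive dashes. The Python sketch below groups such runs into single events. It assumes that the first row is the ancestral sequence and that each uninterrupted run of gaps arose from one mutation, which the trace alone cannot confirm. The function name indel_events and the sequences in the usage lines are hypothetical.

```python
from itertools import groupby

GAP = "-"


def indel_events(ancestral, derived):
    """Return (kind, start_column, length) for each run of gap columns.

    Assumption: the first row is ancestral, so a dash in the derived row
    is a deletion and a dash in the ancestral row is an insertion.
    Columns are numbered from 0.
    """
    if len(ancestral) != len(derived):
        raise ValueError("both rows of a trace must have the same length")

    def kind(column):
        a, d = column
        if a == GAP and d == GAP:
            raise ValueError("a column cannot be a gap in both rows")
        if d == GAP:
            return "deletion"
        if a == GAP:
            return "insertion"
        return None  # an aligned pair of residues, not part of an indel

    events = []
    start = 0
    for label, run in groupby(zip(ancestral, derived), key=kind):
        length = sum(1 for _ in run)
        if label is not None:
            events.append((label, start, length))
        start += length
    return events


# Hypothetical traces with multi-residue indels.
print(indel_events("VCGMDEK", "VC---EK"))  # one deletion of three residues
print(indel_events("GS--K", "GSGAK"))      # one insertion of two residues
```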

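Software

* Chimera - excellent molecular graphics package with support for a wide range of operations
* Clustal-W - the famous Clustal-W multiple alignment program
* Clustal-X - provides a window-based user interface to the Clustal-W multiple alignment program
* JAligner - a Java implementation of biological sequence alignment algorithms
* ModView - a program to visualize and analyze multiple biomolecule structures and/or sequence alignments
* Musca - alignment of amino acid or nucleotide sequences; uses pattern discovery
* MUSCLE - more accurate than T-Coffee, faster than Clustal-W
* PhyloDraw - a drawing tool for creating phylogenetic trees
* SAM - a collection of flexible software tools for creating, refining, and using linear Hidden Markov Models for biological sequence analysis
* SeaView - a graphical multiple sequence alignment editor
* ShadyBox - the first GUI-based WYSIWYG multiple sequence alignment drawing program for major Unix platforms
* UGENE - a graphical interface for the Muscle3, Muscle4, KAlign and Phylip packages; integrates both multiple alignment and phylogenetic tree editors

See also

* Multiple sequence alignment
* Sequence alignment tips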
