Alignment Annotator - Browser-based sequence alignment visualization with JavaScript

Scripting interface

This page describes the scripting interface and explains the scripting language of Alignment Annotator through basic examples. Scripts are used in two situations:

Script line syntax

Each line starts with a script command and may contain parameters. The set of commands is a subset of the commands supported in Strap. Each script line takes one of the following four forms:

  1. Command
  2. Command parameters
  3. Command list-of-sequences-residue-selections
  4. Command parameters, list-of-sequences-residue-selections
    The last comma (,) in the line marks the end of the parameters and the beginning of the list of sequences and residue selections.
    Consequently, the whitespace-separated list of sequences and residue selections must not contain a comma.
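
The comma rule above can be illustrated with a minimal Python sketch. It is not the parser used by Alignment Annotator; the function name parse_script_line and the placeholder command some_command are hypothetical, and the sketch assumes that the command is the first whitespace-delimited token.

```python
def parse_script_line(line):
    """Split a script line into (command, parameters, targets).

    Assumptions for illustration: the command is the first token, and the
    LAST comma (if any) separates the parameters from the whitespace-
    separated list of sequences and residue selections.
    """
    parts = line.strip().split(None, 1)
    if not parts:
        return "", "", []
    command = parts[0]
    rest = parts[1].strip() if len(parts) > 1 else ""
    if "," in rest:
        # rpartition splits at the last comma, so the parameters
        # themselves may contain commas.
        parameters, _, target_text = rest.rpartition(",")
        return command, parameters.strip(), target_text.split()
    # Without a comma, forms 2 and 3 cannot be told apart generically;
    # the remainder is returned unchanged for the command to interpret.
    return command, rest, []


print(parse_script_line("some_command Hello, world, seq1 seq2"))
```

Because the split happens at the last comma, the parameter text may itself contain commas, whereas the list of sequences and residue selections may not.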

Server side program flow

The input data is processed in the following order:

  1. The sequences in the text-pane "sequences" are loaded (if any).
  2. The script "beforeAC" is interpreted (if any).
  3. The alignment is computed, unless there are already gaps in the input. ClustalW is currently used as standard method. If at this time 3D structures are loaded, mixed sequence/structure alignment is performed. TM-align is currently the standard 3D alignment method.
  4. Residue annotations are loaded from UniProt, CSA or BioDAS-servers if the respective check-boxes are activated.
  5. The script "afterAC" is run (if any).
  6. Homologous 3D-structures are identified if the check-box "3D" is activated. Since alignment computation is performed earlier, structures inferred here have no impact on alignment computation. Conversely, structures loaded in the text-field "sequences" or inferred by the command project_coordinates in "beforeAC" are considered for alignment computation.

Script examples

1. Creating sequences

1.1. Amino acid sequences

beforeAC
Description The command aa_sequence creates amino acid sequences from the one-letter-code.
Display alignment

1.2. Aligned amino acid sequences

beforeAC
Description Dash characters in the sequences are interpreted as alignment gaps. Alignment computation is skipped.
Display alignment

1.3. Nucleotide sequences

beforeAC
Description The nucleotide sequences are defined with the command nt_sequence. The default sequence type is peptide. The command set_alignment_type_CN prevents the nucleotide sequences from being translated.
Display alignment

2. Loading sequences from files

2.1. Sequence files from URLs

beforeAC
Description The command load loads sequences from text documents in any format. It takes URLs, absolute file-paths or database references. URLs or file-paths of gzip-compressed files must end with ".gz".
Display alignment

2.2. Sequence name

beforeAC
Description The sequence name can be specified by adding a suffix with a vertical bar.
Display alignment

2.3. Sequence files from databases

beforeAC
Description The command load loads sequences from text documents in any format. It takes URLs, absolute file-paths or database references. URLs or file-paths of gzip-compressed files must end with ".gz".
Display alignment

2.4. Alignment files

beforeAC
Description The command load can load alignments in Clustal, MSF, Multiple Fasta and Stockholm format given as URL or PFAM-ID.
Display alignment

3. Translating nucleotide sequences

3.1. Translating coding sequences

beforeAC
Description The command nt_sequence takes the coding sequence. If the alignment type is set to peptide with set_alignment_type_P at the beginning of the script or if any sequence has other letters than A, C, T, G or N then the sequence is translated to amino acids. The first three letters are the triplet for the first amino acid.
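
The translation rule described above can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, the standard genetic code is assumed, and codons that cannot be resolved are written as "X", which is a choice of this sketch rather than documented behavior.

```python
from itertools import product

# Standard genetic code in the conventional TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def looks_like_nucleotides(sequence):
    """True if the sequence contains only the letters A, C, T, G or N."""
    return set(sequence.upper()) <= set("ACTGN")


def should_translate(sequences, alignment_type_is_peptide):
    """Translate if the type is peptide or any sequence has other letters."""
    return alignment_type_is_peptide or not all(
        looks_like_nucleotides(s) for s in sequences
    )


def translate(coding_sequence):
    """Translate triplets, starting with the first three letters."""
    cds = coding_sequence.upper()
    return "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, len(cds) - 2, 3)
    )


print(translate("ATGGCCTGA"))
```

Incomplete trailing triplets are ignored in this sketch.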
Display alignment

3.2. Genomic sequence positions

beforeAC
Description Optionally, the corresponding genomic positions can be given in the second parameter. They will be reported for the triplet at the mouse pointer and allow for residue selections referring to genomic positions. The command new_nucleotide_selection creates amino acid selections based on nucleotide positions. Please note that oxytocin and vasopressin are transcribed in opposite directions. Move the mouse pointer slowly from the N-terminus towards the C-terminus and observe the genomic DNA positions. Note that the genomic positions are declining in vasopressin while they are rising in oxytocin. Exons 2 and 3 overlap. For the triplet at the exon boundary, the first codon position belongs to exon 2 while codon positions 2 and 3 belong to exon 3.
Display alignment

3.3. Genomic sequences from EMBL or Genbank files

beforeAC
Description The genomic sequence is loaded with the command load. The amino acid sequence is predicted from the exon positions given with the command translate_cds. There is a potential performance problem: the loaded files can be huge because they contain the entire gene sequence including UTRs and introns. Therefore, the approach of the previous example should be preferred.
Display alignment

4. Pruning

A subrange of the original sequence can be displayed. Residue numbering still refers to the original full length sequence.

4.1. Displaying a range of residues

beforeAC
Description The residue range can be specified by adding a suffix with an exclamation character. Please observe the residue positions which start with 10.
Display alignment

4.2. Pruning alignments left or right

beforeAC
afterAC
Description The command clip_N_term takes the residue position of one specific reference sequence. If the residue position is omitted, the alignment is truncated at the first residue of that sequence. Nevertheless, it acts on all aligned sequences, clipping off all residues to the left of this position. The command clip_C_term works analogously and removes residues at the C-terminus.
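
The clipping behavior can be pictured with the following Python sketch. It assumes an alignment stored as a mapping from sequence name to gapped string and 1-based residue positions; the function names are hypothetical and the code is not taken from Alignment Annotator.

```python
def column_of_residue(gapped_sequence, position):
    """Return the 0-based alignment column of a 1-based residue position."""
    count = 0
    for column, char in enumerate(gapped_sequence):
        if char != "-":
            count += 1
            if count == position:
                return column
    raise ValueError("residue position lies beyond the end of the sequence")


def clip_n_term(alignment, reference, position=1):
    """Drop all columns left of the given residue of the reference sequence."""
    start = column_of_residue(alignment[reference], position)
    return {name: row[start:] for name, row in alignment.items()}


def clip_c_term(alignment, reference, position):
    """Drop all columns right of the given residue of the reference sequence."""
    end = column_of_residue(alignment[reference], position)
    return {name: row[:end + 1] for name, row in alignment.items()}


alignment = {"a": "--MKVL", "b": "GSMRVL"}
print(clip_n_term(alignment, "a"))
```

The position is looked up in the reference sequence only, but the resulting column range is applied to every row, which mirrors the statement that the command acts on all aligned sequences.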
Display alignment

5. Sequence Groups

Sequence groups are named subsets of all loaded sequences. They can be activated by buttons in the alignment GUI. Open the menu "Sequence Groups" in the alignment frame.

5.1. Sequence groups

beforeAC
afterAC
Description The command sequence_group takes a group name and a list of sequences.
Display alignment

5.2. Sequence groups by taxonomy

beforeAC
afterAC
Description The command taxonomy_group acts on sequences with known taxonomy data. The taxonomy data is obtained in different ways:
  • Set by the command taxonomy
  • Loaded from UniProt formatted sequence files.
  • Obtained from UniProt data if the UniProt ID is known.
For each word like "Eukaryota" or "Metazoa", a sequence group is created. All sequences are included where the respective text string is contained in the taxonomy data. Open the menu "Sequence Groups" and find "Vertebrata #2", "Eukaryota #4" and "Mammalia #1".
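
The grouping rule (a sequence joins a group if the word occurs in its taxonomy data) can be sketched in Python. The data layout, the function name taxonomy_groups and the sample taxonomy strings are assumptions made for illustration; how the group labels shown in the menu are composed is not covered here.

```python
def taxonomy_groups(taxonomy_by_sequence, words):
    """Map each word to the sequences whose taxonomy text contains it."""
    groups = {}
    for word in words:
        members = [
            name
            for name, taxonomy in taxonomy_by_sequence.items()
            if word in taxonomy  # plain substring test, as described above
        ]
        if members:
            groups[word] = members
    return groups


# Hypothetical taxonomy strings for three sequences.
taxonomy = {
    "s1": "Eukaryota; Metazoa; Vertebrata; Mammalia",
    "s2": "Eukaryota; Metazoa; Vertebrata",
    "s3": "Eukaryota; Fungi",
}
print(taxonomy_groups(taxonomy, ["Eukaryota", "Mammalia", "Bacteria"]))
```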
Display alignment

6. Shorter scripts - compactness

Using variables, brace expansion and regular expressions, readability of the script can be improved and its size reduced.

6.1. Variables

beforeAC
Description A variable is defined with the command let. Subsequent script lines can contain references to that variable. When a script line is run, all variable references are replaced by their assigned values. As in UNIX shells, there are two notations for variable references:
  • dollar_name like $msg
  • dollar_curly-brace_name_curly-brace like ${msg}.
    This notation is required for variable references followed directly by a word character (letter, digit or underscore).
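
The two reference notations can be illustrated with a short Python sketch. It is a simplified model, not the substitution code of Alignment Annotator; the function name expand_variables is hypothetical, and leaving undefined variables untouched is an assumption of this sketch.

```python
import re

# Matches ${name} (group 1) or $name (group 2).
_VARIABLE = re.compile(r"\$\{(\w+)\}|\$(\w+)")


def expand_variables(line, variables):
    """Replace $name and ${name} references by their assigned values."""
    def substitute(match):
        name = match.group(1) or match.group(2)
        # Assumption: references to undefined variables stay unchanged.
        return variables.get(name, match.group(0))
    return _VARIABLE.sub(substitute, line)


variables = {"msg": "hello"}
print(expand_variables("$msg world", variables))
print(expand_variables("${msg}_1", variables))
print(expand_variables("$msg_1", variables))
```

The last call shows why the curly-brace form exists: in "$msg_1" the name is read as msg_1, which is not defined, whereas "${msg}_1" clearly delimits the name msg.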
Display alignment

6.2. Brace expansion

beforeAC
Description Brace expansion is described in Wikipedia. Here, the elements surrounded by curly braces are separated by white space.
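
As a rough illustration of brace expansion with whitespace-separated elements, the following Python sketch expands a single term of the form prefix{a b}suffix. The function name expand_braces and the sample term are hypothetical, and details such as nesting or the interplay with variable references may differ in Alignment Annotator.

```python
import re

# Innermost brace group: no further braces inside.
_BRACES = re.compile(r"\{([^{}]*)\}")


def expand_braces(term):
    """Expand every {a b c} group in a term into all combinations."""
    match = _BRACES.search(term)
    if match is None:
        return [term]
    prefix, suffix = term[:match.start()], term[match.end():]
    results = []
    for element in match.group(1).split():
        # Recurse to handle further brace groups in the same term.
        results.extend(expand_braces(prefix + element + suffix))
    return results


print(expand_braces("HBA_{HUMAN MOUSE}"))
```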
Display alignment

6.3. Regular expressions and asterisks

beforeAC
Description Asterisk: An asterisk stands for all sequences. A sequence name followed by slash and asterisk represents all residue selections in that sequence. An asterisk followed by slash and a selection name stands for all residue selections of that name in any sequence. Note that asterisks within regular expressions have a different meaning.
Regular expressions: Regular expressions can be used for lists of sequences or residue selections. The expression HBA_.* denotes all sequences with a name starting with HBA_. The expression */Histidine.* means all residue selections in all proteins with a name starting with Histidine.
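
The matching rules can be sketched in Python. The sketch assumes that a pattern must match the whole name and uses Python's regular expression dialect, which may differ from the one used by Alignment Annotator; the function names and sample data are hypothetical.

```python
import re


def name_matches(pattern, name):
    """A lone asterisk matches everything; otherwise use a full regex match."""
    return pattern == "*" or re.fullmatch(pattern, name) is not None


def matching_sequences(pattern, sequence_names):
    return [name for name in sequence_names if name_matches(pattern, name)]


def matching_selections(pattern, selections):
    """Pattern is 'sequence/selection'; selections are (sequence, name) pairs."""
    sequence_pattern, selection_pattern = pattern.split("/", 1)
    return [
        (sequence, selection)
        for sequence, selection in selections
        if name_matches(sequence_pattern, sequence)
        and name_matches(selection_pattern, selection)
    ]


names = ["HBA_HUMAN", "HBB_HUMAN", "HBA_MOUSE"]
print(matching_sequences("HBA_.*", names))
```

The lone asterisk is handled separately because "*" on its own is not a valid regular expression, which reflects the remark that the asterisk has a different meaning inside regular expressions.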
Display alignment

6.4. Aliases for web addresses

beforeAC
Description Without aliases, a web link within the balloon text of sequences and residue selections would require the complicated HTML syntax <A target="_blank" href="...">...</A>. Aliases make this easier and also allow the same URL, or its constant part, to be reused. The alias for the entire web address or for its constant part is defined with alias_for_url. In this example the alias "WIKT:" is defined. It stands for "http://en.wiktionary.org/wiki/". This URL is not complete. Therefore "WIKT:" is followed by a word, here "happy". Thus "WIKT:happy" stands for "http://en.wiktionary.org/wiki/happy". The alias can be used with the commands balloon_text, add_annotation Balloon and GFF. Open the example alignment and bring up the balloon text by moving the mouse over the sequence name "seq1". Then click with the right mouse button, which will bring up a web link.
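
The effect of an alias can be pictured with the following Python sketch, which is an illustration rather than the implementation. The function names are hypothetical, and using the alias reference itself as the link text is an assumption.

```python
import html


def expand_alias(reference, aliases):
    """Replace a leading alias such as 'WIKT:' by the URL it stands for."""
    for alias, url in aliases.items():
        if reference.startswith(alias):
            return url + reference[len(alias):]
    return reference


def html_link(reference, aliases):
    """Build the HTML link that the alias saves the script author from typing."""
    url = html.escape(expand_alias(reference, aliases), quote=True)
    return f'<A target="_blank" href="{url}">{html.escape(reference)}</A>'


aliases = {"WIKT:": "http://en.wiktionary.org/wiki/"}
print(expand_alias("WIKT:happy", aliases))
print(html_link("WIKT:happy", aliases))
```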
Display alignment

7. Sequence Attributes

7.1. Balloon messages

beforeAC
Description The command balloon_text takes plain text or HTML code. It can contain web links and database references. These references act as web-links. On desktop computers, the user needs to perform a right-click in order to activate the web-links. For mobile devices this is not necessary. Also see alias_for_url.
Display alignment

7.2. Accession IDs and cross references

beforeAC
Description Database IDs are either set explicitly with the commands accession_id, add_xref and balloon_text or predicted by sequence search with the command find_uniprot_id.
In Alignment Annotator, the UniProt ID is very important and is used for uniprot_features, DAS_features and taxonomy_group.
Display alignment

7.3. Icons

beforeAC
Description The command icon takes gif, jpg and png images as URL or base64 data. File paths of files on the server can also be used.
Display alignment

7.4. Residue index offsets

beforeAC
Description The command set_residue_index_offset is used when residue numbering does not start at one. In this example the displayed residue range starts at 10, so that the first index is 10 plus 30 = 40. The offset affects the position of residue selections.
Display alignment

7.5. Secondary structure

beforeAC
Description If the secondary structure elements are not recorded in the 3D structure file, the secondary structure is computed by dssp. Helices are drawn red and beta sheets yellow.
Display alignment

7.6. Secondary structure

beforeAC
Description Residue structure elements can be assigned with the command secondary_structure. E=extended sheet, H=helix. To see the secondary structure for each individual sequence, the check-box "Helices & Sheets" in the tool-bar of the alignment needs to be activated. If more than one sequence has secondary structure information, the one that spans most alignment positions is taken for the secondary structure cartoon. This can be changed with set_ruler_secondary_structure.
Display alignment

8. Residue selections and annotations

Residue selections are displayed by underline or filled background. They can be defined explicitly, referring to the sequence index, the PDB resnum and insertion code, or the nucleotide index of the DNA sequence an amino acid was predicted from. They can also be obtained from annotation databases. Residue selections can have attributes like color, balloon messages and 3D-commands.

8.1. GFF notation

beforeAC
Description The command GFF takes GFF-formatted annotations. Fields are separated by a tab character or a vertical bar. For more details open Change > Annotations > Own in Alignment Annotator.
Display alignment

8.2. GFF: with attributes

beforeAC
Description The 9th field can contain attributes such as Balloon. The attribute text ends at the next semicolon. If the attribute text itself contains a semicolon, then the text needs to be surrounded by double quotes.
Display alignment

8.3. GFF: non-consecutive positions

beforeAC
Description When the 5th field (End-position) is omitted, the 4th field is interpreted as an expression of positions. It can contain comma-separated ranges and single positions such as 10-20,40-100,102. This is not standard GFF.
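
A position expression of this kind can be expanded as in the following Python sketch. The function name parse_positions is hypothetical, and the sketch handles only non-negative sequence positions, not PDB residue numbers.

```python
def parse_positions(expression):
    """Expand an expression like '10-20,40-100,102' into single positions."""
    positions = []
    for part in expression.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-", 1)
            # Ranges are inclusive on both ends.
            positions.extend(range(int(start), int(end) + 1))
        else:
            positions.append(int(part))
    return positions


print(parse_positions("10-12,40,42-43"))
```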
Display alignment

8.4. GFF: referring to PDB-Resnum

beforeAC
Description When the 5th field (End-position) is empty, the 4th field can contain a complex specification of residue positions consisting of single positions and intervals separated by space. For the PDB residue positions of 3D structure files, the Rasmol notation is used: PDB residue number - colon - (optional) chain letter. Thus the colon indicates that the number refers to the PDB numbering rather than the natural numbering. Rarely, adjacent residues share the same residue number and are distinguished by the so-called insertion code. This is a single upper case letter between the residue number and the colon. There is another deviation from the natural numbering: Zero and negative positions (not PDB residue numbers) are displayed as one minus the number. This is because the number zero is usually omitted.
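
The structure of a single PDB position token (residue number, optional insertion code, colon, optional chain letter) can be made explicit with a small Python sketch. The names PdbPosition and parse_pdb_position are hypothetical, and the regular expression is an assumption derived from the description above, not from the Alignment Annotator source.

```python
import re
from typing import NamedTuple, Optional


class PdbPosition(NamedTuple):
    number: int
    insertion_code: str  # empty string if absent
    chain: str           # empty string if absent


# number, optional upper-case insertion code, colon, optional chain letter
_PDB_POSITION = re.compile(r"(-?\d+)([A-Z]?):(\w?)")


def parse_pdb_position(token) -> Optional[PdbPosition]:
    """Return the parsed position, or None if the token has no colon form."""
    match = _PDB_POSITION.fullmatch(token)
    if match is None:
        return None
    return PdbPosition(int(match.group(1)), match.group(2), match.group(3))


print(parse_pdb_position("52A:B"))
print(parse_pdb_position("52"))
```

A token without a colon yields None here, which corresponds to a position in the natural numbering.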
Display alignment

8.5. Adding attributes to residue selections

beforeAC
Description Residue selections can also be created with new_selection. To address positions of the underlying nucleotide sequence of an amino acid sequence, new_nucleotide_selection is used instead. Information can be attached with add_annotation or set_annotation. Compared to the GFF command, this notation is much more verbose. Variables are beneficial, see let.
Display alignment

8.6. Residues in proximity to a ligand

beforeAC
Description With the attribute "AROUND" for the command new_selection, residues in proximity to a ligand (Hetero or DNA/RNA) are selected. Allowed constructs are "AROUND=DNA", "AROUND=RNA" and "AROUND=NucleotideChainLetter"
Display alignment

8.7. Solvent Accessibility

beforeAC
Description With the attribute "MIN_ACCESSIBILITY=..." for the command new_selection, residues are highlighted that have a solvent accessible surface area greater than the given value in square Angstrom. Computation is performed with the program mkdssp by Kabsch and Sander which must be in the executable path. Only sequences with known 3D-structure (see project_coordinates) are considered. Alignment Annotator expects mkdssp in /usr/bin/ or bin/.
  • example_1: Only amino acids are considered; hetero atoms and nucleic acids are ignored. In a multimeric protein (here the proteasome), the amino acids at the interfaces between the subunits are also highlighted even though they are not solvent exposed in the multimer.
  • To exclude those residues at the inter-subunit interfaces, the parameter SUBUNITS can hold a reference to other structure files to be considered during computation.
    • example_2: For proteins loaded from the PDB, the attribute SUBUNITS=ALL denotes the original structure containing all subunits.
    • example_3: A file with all other subunits can be provided.
    • example_4: The two neighbouring subunits are given as a PDB reference. Lists of space separated entries must be enclosed in double quotes.
For structures from the PDB, subunits=ALL can be used.
Display alignment

8.8. Residue selections from UniProt

beforeAC
Description The UniProt ID is obtained by sequence search. Alternatively, it can be set explicitly with add_xref or accession_id. The command uniprot_features highlights all sequence features stored in UniProt. Since the data is available on the Alignment Annotator server, UniProt features are loaded instantly.

BioDAS annotations used to be loaded with the command DAS_features. Unfortunately, most BioDAS servers and the registry are no longer available. Additional BioDAS registries can be added by the administrator.
Display alignment

9. Display

9.1. Residue color

beforeAC
Description The color mode is specified with the command set_color_mode. A positive percentage value for set_conservation_threshold highlights conserved positions and a negative value highlights diverse positions.
Display alignment

9.2. Residue background color

beforeAC
Description The command bg_color specifies the background color of individual residues. PDB residue numbers can be referred to by appending a colon to the position number.
Display alignment

9.3. Alignment title

beforeAC
Description The command title sets the document title. The window title "Frog and pig" can only be observed if the alignment view is opened in a tab or window of its own. In most browsers you can open the link either with Ctrl-left-click or right-click.
Display alignment

9.4. Characters per line

beforeAC
Description With the command set_characters_per_line the number of characters (gaps plus residues) per line can be specified.
Display alignment

10. 3D-Visualization

Currently, 3D-visualization is based on Java, but a JavaScript based 3D-visualization (probably JSmol) and a desktop application will be included soon. The type of 3D viewer does not affect the scripting language, which is independent of the specific implementation.

First install Java. Use a web browser that still supports Java applets: Firefox, Iceweasel, Opera and IE. On the other hand, MS-Edge, Chromium and Chrome do not support Java applets.

The 3D views are not shown automatically when the alignment document is displayed in the browser, because there might be several 3D-views. Each is represented by a button, and a 3D view is displayed by pushing the respective button. There are two different locations for these buttons:

3D-views are created with the command open_3D, which takes the unique ID and a list of loaded proteins or of structures which are not part of the sequence alignment. The latter are given as file paths, URLs or database references. To be recognized as file paths, they must start with slash, dot-slash or dot-dot-slash.

The command select_3D refers to a 3D-view by its ID and to one or several of the loaded proteins or structures. Proteins are best specified by their sequence name; PDB files must be referred to by exactly the same file path, URL or database reference used in the open_3D command. Once selected with select_3D, 3D script commands can be applied. These commands start with "3D_". Usually, the next command is 3D_select.

10.1. Superimposing protein structures

beforeAC
Description The command superimpose superimposes protein structures. Sequences without attached 3D-coordinates are ignored. The program determines the optimal reference structure. All structures are superimposed upon the reference structure. Start the 3D-applet from the context-menu of the sequence name (right-click). Alternatively, go to menu "3D" and open the 3D-view "View_of_three_chains".
Display alignment

10.2. Style of single atoms

beforeAC
Description Atoms are selected with the command 3D_select.
The following style commands are available: 3D_cartoon, 3D_dots, 3D_label, 3D_lines, 3D_mesh, 3D_sa_surface, 3D_spheres, 3D_sticks, 3D_surface, 3D_color, 3D_ribbons.
Further 3D commands are: 3D_render, 3D_center, 3D_center_amino, 3D_object_delete, 3D_rotate, 3D_script_panel, 3D_select, 3D_selection_name, 3D_zoom.
Display alignment

10.3. Attaching 3D styles to residue selections

beforeAC
Description Another method for changing 3D-styles is to create a residue selection and to attach annotations of the type "3D_view".
Display alignment

10.4. Residue annotations from UniProt

beforeAC
Description With the command add_annotation, 3D-styles are attached to residue annotations with the name "Active_site" loaded from UniProt. All entries of type "3D_view" are evaluated one after the other. Initially all atoms of the amino acids are considered. The current set of atoms is altered with a command like "add_annotation Atoms=.CA, residue-selection".

This example exhibits a rare problem: Why are the UniProt annotations not shown on PDB:1SBC_A? Because the PDB sequence differs from the UniProt sequence, as indicated by entries in the PDB file like
SEQADV 1SBC SER A  103  UNP  P00780    THR   207 CONFLICT
Perhaps in later releases of Alignment Annotator, mismatches indicated in this way may be tolerated and the UniProt annotations loaded.
Display alignment

10.5. Multimeric proteins

beforeAC
Description Users might want to answer the question whether a specific amino acid is in proximity to another subunit of a multimeric protein. The 3D view can contain molecules or subunits that are not in the sequence alignment. For this purpose the command open_3D does not only take references to sequences in the alignment but also:
  • PDB reference as demonstrated in this case
  • Paths of structure files. They must start with slash, dot-slash or dot-dot-slash.
  • URLs
In the current example PDB:1RYP_B is part of the alignment while PDB:1RYP_A and PDB:1RYP_C are references to PDB-chains. Nevertheless, the molecules that are not part of the alignment can also be displayed three-dimensionally, and the display style of single residues or atoms can be altered.
Display alignment

10.6. DNA

beforeAC
Description This is a leucine zipper transcription factor. Chains E to J are peptide chains. Chains A, B, C and D are DNA chains. Limitation: currently it is not possible to select specific nucleotides.
Display alignment

11. Structure alignment

11.1. Sequence or structure based alignment

beforeAC
afterAC
Description Here, the command project_coordinates is run before alignment computation. Therefore the 3D coordinates of Cα atoms are used for alignment computation. For comparison, move the command line "project_coordinates..." to the second script text box and observe the alignment of the active site residue Ser129. Since the sequence similarity of these remote homologs is low, the alignment quality obtained by ClustalW is poor. This can be seen from the active site triad, which is aligned only if the 3D structure is used. Another indicator is the set of secondary structure elements, which are displayed with a check-box in the tool-bar. Since insertions and deletions hardly occur in helices and beta sheets, they should be almost devoid of gaps.
Display alignment

12. Abnormal computation

12.1. Abnormal program termination

beforeAC
Description There are two reasons why computation can be terminated prematurely:
  • Another job has been queued while computation time exceeded a maximum. In this case the current job gets killed to allow the new job to be executed.
  • Technical problems, programming errors, server failure.
This example simulates abnormal program termination due to technical problems or programming errors in Alignment Annotator. The user can modify the program parameters and script lines in the hope that the error does not occur on re-submission. In this example, you can open the section Change - Script and remove the command "die" from the script. Then re-start computation to obtain the alignment. For administrators: Ctrl-left-click into the parent page of the alignment opens the debug panel.
Display alignment

12.2. Very long computation

beforeAC
Description Under certain conditions (here a sleep command), Alignment Annotator may run for a very long time. Besides technical server problems, possible reasons are:
  • Time consuming alignment computation and 3D-superposition
  • Server is very busy
  • Large data is loaded from remote computers
  • Remote computers send neither a result nor an HTTP error code, or answer with a delay.
In this case the user can stop the computation and modify the script before resubmitting the job. For testing, edit the section Scripts of the alignment view and remove the sleep command. After clicking submit the alignment appears.
Display alignment

12.3. Exceptions

beforeAC
Description In case of programming errors, so-called exceptions such as NullPointerException or IndexOutOfBoundsException occur and the program may not be able to produce a result. Adding the parameter "true" to the command simulates an exception that is caught.
Display alignment