Visualization does not require client side Java

Alignments in HTML from the command line


Interactive example of overlapping residue annotations. The first sequence has two residue selections indicated by cyan and red background. The second sequence exhibits two residue selections which are shown as red and green underlining. The text information pops up when the mouse is moved.

This page describes generation of alignment documents with commands from the UNIX shell. Please note that there is an interactive web service with an API and that the graphical Java program Strap also provides HTML export.

Reproducing the example

Download:
      BASE=http://www.bioinformatics.org/strap
      FILE=$BASE/strap.jar
      wget -N $FILE || curl -O $FILE
      FILE=$BASE/scripts/toHTML1.txt
      wget -N $FILE || curl -O $FILE
      FILE=$BASE/toHTML/data/fly_temp.gif
      wget -N $FILE || curl -O $FILE
      FILE=$BASE/aa/alignment2html.jar
      wget -N $FILE || curl -O $FILE
    
If your network requires a web proxy, write the Java proxy options to the variable JavaProxy:
        export JavaProxy=' -DproxyHost=proxy.institution.org -DproxyPort=8080 -Dhttp.nonProxyHosts="" '
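If the proxy is already known as a URL, the host and port options can be assembled from it. This is only a minimal sketch: the helper name proxy_to_java_opts is hypothetical, and it assumes the URL has the form scheme://host:port with both host and port present. The -Dhttp.nonProxyHosts option is not produced by it.

```shell
# Hypothetical helper (not part of Strap): turn a proxy URL such as
# http://proxy.institution.org:8080 into the two Java options used above.
# Assumes the URL contains both a host and a port.
proxy_to_java_opts() {
    hostport=${1#*://}        # drop the scheme, e.g. "http://"
    hostport=${hostport%%/*}  # drop any trailing path
    host=${hostport%%:*}      # part before the colon
    port=${hostport##*:}      # part after the colon
    printf ' -DproxyHost=%s -DproxyPort=%s ' "$host" "$port"
}
```

It could be used as export JavaProxy="$(proxy_to_java_opts http://proxy.institution.org:8080)".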
      
Test the proxy settings. You should see Google's HTML code.
        java $JavaProxy -jar strap.jar -testWeb http://www.google.com
      

Create the HTML alignment in the figure:
 java  $JavaProxy  -jar strap.jar -script=toHTML1.txt  -toHTML=myOutput.html
The output file myOutput.html is ready to be displayed in a web browser.

alignment2html.jar is faster than strap.jar

The program strap.jar is both a command line tool and a desktop application. In contrast, alignment2html.jar, which is built from the same source, is lighter and faster because it does not include the GUI classes. Furthermore, MS-Windows support and software installation at run-time are deactivated. Another difference is that it uses local Blat for similarity search, which is much faster than Blast; as a consequence, the databases need to be installed locally. The manual is displayed with the option -help:
      java -jar alignment2html.jar -help 
    

The script file

The lines in the script file toHTML1.txt are executed sequentially. Lines that start with a hash character are ignored. An alphabetical list of all commands is printed with the command line option -help=script. Also see Scripting language. The following three lines specify the number of characters per line, the minimal conservation of a residue position to be emphasized in bold face, and the residue color mode.
      set_characters_per_line 24
      set_conservation_threshold 70
      set_color_mode chemical 
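Because lines starting with a hash character are ignored, it can be handy to list only the lines of a script file that will actually be executed. The following is a small sketch; the function name effective_commands is hypothetical, and dropping empty lines as well is an assumption made for readability.

```shell
# Hypothetical helper: print the lines of a script file that are not
# comments. Lines starting with '#' are dropped as described in the text;
# dropping blank lines too is an assumption made for tidier output.
effective_commands() {
    grep -v -e '^#' -e '^[[:space:]]*$' "$1"
}
```

For the example above it would be called as effective_commands toHTML1.txt.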
Three sequences are created: Canus, Xenopus and Drosophila. Dashes denote alignment gaps. Dashes would not be required if the alignment were computed with the command align *.
      aa_sequence MVLSAADKGNVKAAWGKVGGHAAEYGAEALERMFLSFPTTKTYFP, Canus
      aa_sequence -VLSAAERAQVKAAWGKI--QAGAHGAEALERMFLGFPTTKTYPF, Xenopus
      aa_sequence MILSAAERAQIKAAWGKVG-NAGAHGAEALD--FLGYPTTKSYPY, Drosophila 
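When a precomputed alignment is typed in by hand like this, a quick sanity check is that all rows, gaps included, have the same length. The sketch below assumes that this is what a consistent gapped alignment should look like; the function name check_row_lengths is hypothetical and it only looks at lines of the form "aa_sequence SEQUENCE, NAME".

```shell
# Hypothetical helper: print "LENGTH NAME" for every aa_sequence line of a
# script file and return a non-zero status if the lengths differ.
check_row_lengths() {
    awk '$1 == "aa_sequence" {
             seq = $2
             sub(/,$/, "", seq)        # strip the comma after the sequence
             len = length(seq)
             print len, $3
             if (rows == 0) first = len
             else if (len != first) mismatch = 1
             rows++
         }
         END { exit (mismatch ? 1 : 0) }' "$1"
}
```

A call such as check_row_lengths toHTML1.txt || echo "rows differ in length" would flag a row with a missing or surplus dash.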
Assigning index 1 to the cleaved initial methionine, the Xenopus sequence starts with amino acid number 2:
 set_residue_index_offset 1, Xenopus
The protein image icons are shown in the alignment row header. For the first two icons the URL is given. The last icon is loaded from a local file, which is not accessible from other computers; its image data is therefore embedded in the HTML file:
      icon  http://www.goldenweb.it/software/immagini/icone/animals/water_animals/Frog.gif, Xenopus
      icon  http://www.goldenweb.it/software/immagini/icone/animals/misc_animals/dog1.gif,  Canus
      icon fly_temp.gif, Drosophila
If the database accession ID is given, a blue asterisk after the sequence name acts as a hyper-link:
      accession_id  UNIPROT:P0A7B8 , Canus
Residue selections are created with the command new_selection. Two display styles are supported: STYLE_BACKGROUND and STYLE_UNDERLINE. Color, display style, balloon-text and web-links are specified with add_annotation or set_annotation.
      new_selection  1-4,                                               Canus/N-terminus 
      set_annotation Hyperrefs=http://en.wikipedia.org/wiki/N-terminus, Canus/N-terminus 
      set_annotation Style=STYLE_BACKGROUND,                            Canus/N-terminus 
      set_annotation Color=#00ffFF,                                     Canus/N-terminus 
      add_annotation Balloon=Balloon text blablabla,                    Canus/N-terminus 
    
The command set_annotation overrides any previous value, whereas add_annotation keeps already existing lines.

A description of all commands is obtained with the program parameter -help=script.

Splice variants of Hexokinase. The size of the alignment exceeds the window size and therefore it can be scrolled. These sequences are loaded from nucleotide sequence files. Therefore the coding triplet and exon number of the amino acid under the mouse pointer are shown.

3D-Visualization

Java-applets (OpenAstex) for 3D visualization are included automatically if 3D-coordinates are provided. Both views are linked: clicking an amino acid in the alignment or the 3D-view highlights the respective residue in the other view. If proteins are loaded from files in PDB format, the 3D-coordinates are taken directly from that file. Otherwise, a PDB model of an identical or at least homologous protein can be associated manually or automatically. The command project_coordinates takes either the PDB identifier in the form PDB:1sbc or PDB:1sbc_A (chain A), or a URL of a (compressed or uncompressed) protein file, or the keyword AUTO. Residue mismatches between the sequence and the 3D-model are optionally shown. The following command will automatically identify homologous structures for all proteins (asterisk) using BLAST.
project_coordinates AUTO, *


3D-styles

Rendering styles of atoms in the 3D-visualization can be changed in two alternative ways. These 3D-commands are independent of the 3D software, currently OpenAstex. They will remain valid even if another 3D-view is supported in the future.

Specifying sequences

In the example, sequences are defined with the command aa_sequence, which accepts amino acid sequences with or without gaps. Alternatively, a local or remote file or database entry can be loaded.
Example with database references:
load UNIPROT:P49722 UNIPROT:P0A272 PDB:1ryp_C
Example with URLs:
load http://www.bioinformatics.org/strap/dataFiles/hs_HelicobacterPylori.swiss http://www.bioinformatics.org/strap/dataFiles/hs_SalmonellaTyphi.swiss
A subsequence rather than the entire amino acid sequence can be loaded. The residue index interval is appended after an exclamation mark. One of the two interval boundaries can be omitted. Example:
load UNIPROT:P49722!30-60
Optionally, a protein name can be given after a vertical bar. Example:
load PDB:1ryp_C|My_name
The name can contain the following variables: $ORGANISM, $ORGANISM_SCIENTIFIC, $ORGANISM5 (E.g. "DroMe" for Drosophila melanogaster), $NAME (The original name), $PDB, $SP (Swissprot name like "hslv_ecoli") and $SP1 (First part of Swissprot name like "hslv").
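The structure of such a load argument can be illustrated by splitting it into its parts. This is only a sketch of the syntax described above, not Strap's own parser: the function name parse_load_spec is hypothetical, and it assumes that an interval, if present, contains a dash and that an interval and a name may be combined in the order ID!FROM-TO|NAME.

```shell
# Hypothetical helper: split a load argument of the assumed form
#   ID[!FROM-TO][|NAME]
# into database reference, interval boundaries and protein name.
parse_load_spec() {
    spec=$1
    name=''
    range=''
    case $spec in
        *'|'*) name=${spec#*'|'}; spec=${spec%%'|'*} ;;   # text after the bar
    esac
    case $spec in
        *'!'*) range=${spec#*'!'}; spec=${spec%%'!'*} ;;  # text after the "!"
    esac
    # An omitted boundary simply yields an empty from= or to= value.
    printf 'id=%s from=%s to=%s name=%s\n' \
        "$spec" "${range%%-*}" "${range#*-}" "$name"
}

parse_load_spec 'UNIPROT:P49722!30-60'
parse_load_spec 'PDB:1ryp_C|My_name'
```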

Nucleotide sequences are translated to amino acid sequences knowing the strand orientation and exon boundaries. This information is either contained in EMBL or Genbank formatted nucleotide files or is given with the command cds. This is demonstrated below and explained in Scripting language.

Alignment computation

In the above example, a precomputed multiple sequence alignment is directly defined with aa_sequence. Alternatively, the alignment can be computed with the align command:
align *
The wildcard "*" or ".*" means all sequences. Alternatively, a space separated list of protein names, database IDs and regular expressions matching protein names is accepted. By default ClustalW (precompiled binary for Intel) and CE/CL (Java) are used. The 3D-alignment program TM-align (Fortran) is faster than CE/CL. You can install TM-align with the software manager of your computer. Under Debian:
 apt-get install clustalw tm-align
Alternatively install a Fortran compiler. Then add the program option -a3d=tm_align. There are a few alternatives to ClustalW, some of which produce more accurate results but require more time. They are expected in the /usr/bin/ directory, for example /usr/bin/t_coffee. They can also be loaded and installed automatically. The unattended software installation from source code requires make and a C++ compiler.
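Before relying on an alternative aligner, one can check whether its executable is present in /usr/bin. The snippet below is a minimal sketch; the function name aligner_installed is hypothetical, and t_coffee is used only because it is the example path named above.

```shell
# Hypothetical helper: succeed if an executable with the given name
# exists in /usr/bin, where alternative aligners are expected.
aligner_installed() {
    [ -x "/usr/bin/$1" ]
}

for tool in t_coffee; do
    if aligner_installed "$tool"; then
        echo "$tool: found in /usr/bin"
    else
        echo "$tool: not found in /usr/bin"
    fi
done
```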

BioDAS annotations

Annotations are loaded for all sequences (Asterisk) or for a list of sequences with a command like
DAS_features CSA%20-%20extended uniprot cbs_total netphos netoglyc , *
and the GFF features from the Expasy server are loaded with
GFF_expasy_features *
The "%20" in the feature name is the URL-encoded form of a space character; 20 is its hexadecimal character code. After loading the data from the remote servers, the sequence positions are underlined in the alignment. The DAS-annotation providers are listed in the standard BioDAS registry file or in supplementary registry files given at the command line. Underlining these sequence annotations is time consuming. At least the identification of the UNIPROT identifier can be accelerated by a local BLAST database and a local Uniprot as described below.
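Feature names containing spaces can be converted to this form on the command line. The helper below is a minimal sketch with a hypothetical name, encode_spaces; it replaces spaces only and leaves all other characters untouched.

```shell
# Hypothetical helper: replace every space in a feature name by %20.
# Only spaces are handled; other characters are passed through unchanged.
encode_spaces() {
    printf '%s\n' "$1" | sed 's/ /%20/g'
}

encode_spaces 'CSA - extended'   # prints CSA%20-%20extended
```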

Program features by examples

Loading / creating sequences

3D

3D-views are automatically included in the HTML output for all sequences with 3D-coordinates. Additional 3D views can be defined with the command open_3D.

Annotated residue selections

Nucleotide sequences

Unless the amino acid sequence is explicitly provided either with the command aa_sequence or in the field "/translation=" of a Genbank or EMBL formatted file, the amino acid sequence is predicted using the default genetic code. In rare cases the prediction will be wrong due to a different genetic code (stop codon instead of tryptophan) or mRNA editing.

Annotation services

Sequence features are a certain type of residue selection. In the HTML output the respective sequence positions are underlined with a color specific to the feature name. They can be shown and hidden with check-boxes. Sequence features are loaded from external services or created explicitly in the script file.

Sequence groups

Sequence groups are named sets of sequences. Each sequence group has a button to select or deselect the respective sequences.

Generating all examples

If strap.jar is downloaded and the web proxy is written to the variable JavaProxy then all examples in this page can be generated. The program keeps data in $HOME/.StrapAlign and will therefore run much faster next time.
 for i in ; do
     FILE=http://www.bioinformatics.org/strap/toHTML/scripts/$i.txt
     wget -N "$FILE" || curl -O "$FILE"
     java $JavaProxy -jar strap.jar -script=$i.txt -toHTML=$i.html || break
 done

Contact

christophgil  ät goog lemail .com