Forming Web-links for protein files and alignments

Introduction

This page describes how Web-links for the protein and alignment viewer STRAP are formed. Clicking such a Web-link in a Web-browser opens amino acid sequences, protein structures or multiple sequence alignment files in STRAP.

Client side requirements: Java must be installed on the client PC. On Macintosh Java is usually already installed, but on Windows or Linux the user needs to install it.

Server side requirements: There are no particular requirements for the Web-server because the links are static. The links can therefore be included in any Web-page.

Application: This technique provides an efficient way to present scientific results. Amino acid sequences, protein alignments and protein structures which are relevant for a specific research project can be exhibited on a Web page. Certain residues, such as mutations and site-directed mutagenesis sites, can be highlighted in the alignment and in 3D. These Web-links can also be included in computer programs for Systems Biology, office documents and PDF-documents to document and communicate alignments and 3D-structures. Scientists can send e-mails containing these Web-links to colleagues and collaborators, and the recipient can open the same alignment with all residue highlightings. Finally, these Web-links provide a way for Bioinformatics databases to use STRAP as a viewer for proteins and alignments.

Advantages

In Web pages, alignments are often represented as static documents (.html, .pdf, .doc, .rtf) or shown dynamically with Browser embedded Java applets. All information needed to display the alignment resides on the server. Other servers are not involved when the alignment is displayed on the client computer.

Here, a different approach is suggested: Only the protein references are encoded in a Web-address, so that up-to-date versions of the protein files are loaded from the original protein databases. This has the following advantages:

Limitations

A Web-address must not exceed a critical size. Therefore, an alignment with hundreds of entries or extensive scripting commands cannot be encoded in a Web-address. For Web-pages there is a convenient workaround: Large amounts of information can be included in so-called forms (See Using script commands). But for e-mails, office documents etc. the size limitation of a URL can be a problem.

Automated generation of the Web-address

STRAP assists the generation of the Web links. With the STRAP dialog Publish alignment in Web pages in the menu "File" users can automatically generate the Web-address (URL). Press F10 if the menu-bar is hidden. Knowledge of the syntax described in this page is not required.

Web-variables in the URL

The variables load=, align= or alignAndRearange= contain one or several protein references, either in the form of URLs or in the form Database-colon-ID. Unless loaded with the variable "load=", the proteins will be aligned after loading. For this purpose the 3D-superposition program TM-align and the sequence alignment program ClustalW are combined. With "alignAndRearange=" the proteins are additionally reordered according to sequence similarity.

Additional web variables provide further options:

Additional information for the protein entries

The database reference or URL of the protein can optionally be followed by fields separated by "|" (vertical bar). Note: The percent encoding of "|" is %7C. Strictly speaking, "|" should be written as %7C in URLs, but Web browsers apparently tolerate an unencoded "|".

Uniprot Examples

Complex Uniprot Expressions

GenomeNet (Kegg) Examples

Entrez Examples

EMBL or Genbank nucleotide example

EMBL and Genbank files have a nucleotide sequence block. Coding sequences (CDS) are defined by an enumeration of nucleotide positions of the form
FT   CDS             join(25240..25717,29079..29174,31348..31417,39382..39809,
or in case of reverse complement
FT   CDS             complement(5226515..5227132)
This expression is used to compute the amino acid sequence. The following exemplifies how this expression can be changed or how the n-th CDS can be selected.
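To illustrate how a CDS expression of this kind maps to nucleotide ranges, here is a minimal Python sketch. It is not STRAP's own parser: the function name parse_cds_location is hypothetical, and only the simple join(...) and complement(...) forms shown above are handled. The join example is a shortened, complete expression built from the first two segments of the truncated line above.

```python
import re

def parse_cds_location(location):
    """Parse a simple EMBL/GenBank CDS location string.

    Returns (ranges, is_complement), where ranges is a list of
    1-based inclusive (start, end) nucleotide positions.
    Hypothetical helper for illustration; not part of STRAP.
    """
    text = location.strip()
    # complement(...) marks a CDS on the reverse strand.
    is_complement = text.startswith("complement(") and text.endswith(")")
    if is_complement:
        text = text[len("complement("):-1]
    # join(...) enumerates several segments (for example exons).
    if text.startswith("join(") and text.endswith(")"):
        text = text[len("join("):-1]
    ranges = []
    for part in text.split(","):
        match = re.fullmatch(r"\s*(\d+)\.\.(\d+)\s*", part)
        if match is None:
            raise ValueError("unsupported location segment: %r" % part)
        ranges.append((int(match.group(1)), int(match.group(2))))
    return ranges, is_complement

print(parse_cds_location("complement(5226515..5227132)"))
print(parse_cds_location("join(25240..25717,29079..29174)"))
```

The coding nucleotide sequence would then be the concatenation of these ranges, reverse-complemented when the flag is set, before translation into amino acids.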

Ensembl (under reconstruction)

PDB-Examples

Proteins with nucleic acid:

Setting the biological unit:

The matrices to be applied can be specified in the 5th field as a bit-mask given as a hexadecimal number. For example, 8 means the 4th matrix, because the binary representation of 0x8 is 00000001000. Minus 1 denotes the asymmetric unit.
Examples: -1, all matrices, 1 (wrong, not existing), 2, 3, 4, 8, 10, 20, 40, 10000 (wrong, not existing)
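The bit-mask convention can be checked with a few lines of Python. This is only a sketch of the decoding rule stated above (bit n set means matrix n+1 is selected, and -1 means the asymmetric unit); the function name matrices_from_mask is hypothetical and not part of STRAP.

```python
def matrices_from_mask(field):
    """Decode a hexadecimal bit-mask into 1-based matrix numbers.

    Returns None for -1, which denotes the asymmetric unit.
    Hypothetical helper for illustration; not part of STRAP.
    """
    mask = int(field, 16)
    if mask == -1:
        return None
    if mask < 0:
        raise ValueError("only -1 is allowed as a negative value")
    # Bit 0 corresponds to the 1st matrix, bit 3 to the 4th, and so on.
    return [bit + 1 for bit in range(mask.bit_length()) if (mask >> bit) & 1]

print(matrices_from_mask("8"))   # [4]
print(matrices_from_mask("10"))  # [5]
print(matrices_from_mask("3"))   # [1, 2]
print(matrices_from_mask("-1"))  # None
```

Note that the values are hexadecimal, so 10 selects the 5th matrix and not the 2nd and 4th.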

Hetero-Compounds, DNA, RNA:

PDB files often contain non-peptide structures such as flavin or NADH as well as DNA/RNA structures. These are treated in the following way: Hetero compounds that share their chain identifier with a peptide are added to the respective peptide object. This is indicated by a vertical green (nucleotide) or red (heteros) bar at the protein labels in the alignment row headers. But if the hetero compound has a chain of its own, then things are more complicated:

SCOP- Examples

PFAM Examples

Prodom Examples

HSSP Examples


Example with direct Web address

Instead of referring to a protein by database-colon-accession-ID, the plain Web address of the protein file can be used. Special characters of the URL, like the colon and the two slashes in "http://", must be percent encoded.
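Percent encoding does not need to be done by hand. As a minimal sketch, Python's standard library function urllib.parse.quote encodes all reserved characters when called with safe="". The host www.example.org and the file name are placeholders, and encode_protein_url is a hypothetical helper name.

```python
from urllib.parse import quote

def encode_protein_url(url):
    """Percent-encode a protein file URL for use as a parameter value.

    safe="" makes quote() also encode "/" and ":", which it would
    otherwise leave or treat as path characters.
    """
    return quote(url, safe="")

print(encode_protein_url("http://www.example.org/proteins/my_protein.pdb"))
# http%3A%2F%2Fwww.example.org%2Fproteins%2Fmy_protein.pdb
```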

Automatically inferring database knowledge

The standard alignment file formats MSF and ClustalW contain the protein names and the aligned sequences. Further information for each protein, such as author, journal, cross-references and sequence features, is not provided by these files. STRAP can download the original protein files from the databases to infer this information. It can also download and visualize sequence features which are listed in the DAS registry. For this to work STRAP needs to know the name of the database and the ID for each protein. The alignment in ClustalW format must therefore follow a certain convention: Each row header consists of the database identifier, such as PDB, UNIPROT or NCBI, followed by a colon and the protein ID. In the following example the alignment file proteases.aln will be referenced in the STRAP URL.
CLUSTAL W 2.1 multiple sequence alignment														
																																																							
UNIPROT:P08490 PKVPTLRQAKVQGPAFEFAVAMMKRNASTVKTEY---GEF
UNIPROT:P03313 PRVPTLRQAKVQGPAFEFAVAMMKRNSSTVKTEY---GEF
PDB:1cqq       ------------GPEEEFGMSLIKHNSCVITTEN---GKF
PDB:1hj9       ---IVGGYTCGA--------NTVP-YQVSLNSGY---HFC
UNIPROT:P35030 DDKIVGGYTCEE--------NSLP-YQVSLNSGS---HFC
UNIPROT:O60259 EDKVLGGHECQP--------HSQPWQAALFQGQQ---LLC
PDB:1wyk_A     ------------------------MRLFDVKNED-GDVIG     
UNIPROT:Q1VSD2 NSSGPCNQDVDCPIGSDFDNLKEELKKSVAMTIVGSSGFC
PDB:1arb       GVSGSCNIDVVC-PEGDGRRDIIR-AVGAYSKSG--TLAC      
UNIPROT:A9AV62 DKSGSCNVDVVCPEGDDWRAEINS--VAAYTRNG--LDMC
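The row header convention can be illustrated with a short Python sketch that splits one alignment row into database identifier, protein ID and aligned sequence. The function name parse_alignment_row is hypothetical; this is not how STRAP itself reads the file, only a demonstration of the Database-colon-ID layout.

```python
def parse_alignment_row(line):
    """Split 'DATABASE:ID  SEQUENCE' into its three parts.

    Hypothetical helper for illustration; not part of STRAP.
    """
    # The header and the aligned sequence are separated by whitespace.
    header, sequence = line.split(None, 1)
    # Split only at the first colon so the ID is kept unchanged.
    database, protein_id = header.split(":", 1)
    return database, protein_id, sequence.strip()

print(parse_alignment_row("UNIPROT:P08490 PKVPTLRQAKVQGPAFEFAVAMMKRNASTVKTEY---GEF"))
print(parse_alignment_row("PDB:1wyk_A     ------------------------MRLFDVKNED-GDVIG     "))
```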
      

View alignment in STRAP
The Web-address is
 strap.php?rename=sp&downloadOriginalProteins=t&separateInstance=t&dasFeatures=CSA+-+extended%7Cuniprot&load=http%3A%2F%2Fwww....proteases.aln 
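A query string of this shape can be assembled programmatically. The following Python sketch uses urllib.parse.urlencode from the standard library, which encodes spaces as "+" and "|" as %7C. The parameter names and values are taken from the address above, and the decoded value "CSA - extended|uniprot" is inferred from its encoded form. The alignment location www.example.org is a placeholder because the real host is elided in this document, and build_strap_query is a hypothetical helper name.

```python
from urllib.parse import urlencode

def build_strap_query(alignment_url):
    """Assemble a STRAP link with properly encoded parameter values.

    Hypothetical helper for illustration; not part of STRAP.
    """
    params = [
        ("rename", "sp"),
        ("downloadOriginalProteins", "t"),
        ("separateInstance", "t"),
        ("dasFeatures", "CSA - extended|uniprot"),
        ("load", alignment_url),
    ]
    return "strap.php?" + urlencode(params)

# Placeholder URL; replace with the real location of the alignment file.
print(build_strap_query("http://www.example.org/proteases.aln"))
```

Leaving the encoding to a library function avoids errors with characters such as ":", "/" and "|" in the parameter values.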

The address contains several parameters:

Technical details

The client computer needs Java version 1.5 or higher. The links in this document point to a jnlp file. The jnlp file must be opened by the browser with the program bin/javaws which is part of the Java system. Occasionally, browsers fail to locate this program. In these cases the user needs to find the location of javaws on the hard-disk. See Browser settings.

External applications

Sequista cannot yet be used in the Strap-Lite version.

What happens in case of download errors:

Occasionally, protein entries are removed from databases and are no longer available. What happens if STRAP tries to download a non-existing file or if the server does not respond? The result depends on the server response. Here are some examples of non-existing entries:

Time consuming alignments:

Frame size and location

The size and location of the application frame are specified with the option geometry=WIDTHxHEIGHT+offsetX+offsetY, as in the following examples:
  1. geometry=400x300+100+100 This means width=400 height=300, Position at pixel 100,100.
  2. geometry=400x300+200+100
  3. geometry=500x300+100+100
  4. geometry=500x300-100+100 Negative offsetX refers to the right screen margin.
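The geometry values listed above can be taken apart with a small regular expression. This Python sketch only demonstrates the format; the function name parse_geometry is hypothetical and the code is not taken from STRAP.

```python
import re

def parse_geometry(value):
    """Parse 'WIDTHxHEIGHT+X+Y' into (width, height, offset_x, offset_y).

    A negative offset_x refers to the right screen margin.
    Hypothetical helper for illustration; not part of STRAP.
    """
    match = re.fullmatch(r"(\d+)x(\d+)([+-]\d+)([+-]\d+)", value)
    if match is None:
        raise ValueError("not a valid geometry value: %r" % value)
    width, height, offset_x, offset_y = (int(group) for group in match.groups())
    return width, height, offset_x, offset_y

print(parse_geometry("400x300+100+100"))  # (400, 300, 100, 100)
print(parse_geometry("500x300-100+100"))  # (500, 300, -100, 100)
```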

Related resources: