STRAP-toolbox for Proteins and Alignments

Using the building blocks of STRAP for self-designed programs and scripts

Change in the API: The variable type for sequences changed from java.lang.String to byte[] in SequenceAligner.

Introduction

STRAP-toolbox is a Java toolkit for the design of Bioinformatics programs and scripts. It is derived from the multiple sequence and structure alignment program STRAP. STRAP uses Java, a fast compiled computer language which is widely used in Bioinformatics. Therefore STRAP works together with other Bioinformatics libraries written in Java. Nevertheless, programs written in C/C++ such as PyMOL and ClustalW can also be used within STRAP.

To provide an overview, all scripting examples are concatenated into a single file.

Installation

Install the JDK from Sun (60 MBytes; http://java.sun.com/javase/downloads/) or IBM. You need the plain JDK and not "JDK with ... ". The JDK contains four important commands, among them javac, java and jar, which are used below.

All following installation steps may be comfortably performed by copying the green on black installation script into the command shell. On some UNIX systems the default shell is the C-shell rather than bash or ksh. In this case you can change to bash (or the Korn shell) by typing "bash" or "ksh" at the command prompt.

Installation of STRAP-toolbox on Windows

After installing the JDK, check whether the command javac.exe is available in the command shell. If it is not, add C:\Programme\Java\jdk...\bin to the environment variable PATH. German: "Systemsteuerung" ==> "System" ==> tab "Erweitert" ==> button "Umgebungsvariablen". English: "My Computer" ==> "Properties" ==> "System" ==> tab "Advanced" ==> button "Environment variables".

You can install Cygwin, which contains bash and wget, using setup.exe for Cygwin. If you prefer the MS-DOS command interpreter, you need to adapt the installation script since that interpreter has a different syntax. During the installation of Cygwin, additional software packages can be specified. It is necessary to activate wget, which is found in the Cygwin group "Web". setup.exe can be run at any time to add or remove software packages.

After installation, the Cygwin icon appears on the desktop. Double-click it, wait 2 s and press Enter to get into the command shell. To copy and paste the installation script lines (green on black) you need the context menu of the shell window (right-click).

The first script line sets the variable DOWNLOAD to the program used for downloading files, usually wget or curl.
The variable PATHSEP is a semicolon on Windows and a colon otherwise. If you are on an intranet, it might be necessary to set the Web proxy: through the environment variable http_proxy for the download program and with the -D command line option for Java.
export http_proxy=proxy.mySite.com:888
alias java="java -Dhttp.proxyHost=proxy.mySite.com -Dhttp.proxyPort=888 "
      


To install the library strap.jar and all examples you need to open a command shell. On Windows open a Cygwin shell and on Macintosh open Utilities/Terminal.

Then copy and paste the following green on black text.
 # Specify the program you use to fetch files from Internet
 DOWNLOAD="wget -N" 
 # Fall back to curl if wget is not installed (the version output is discarded)
 wget --version > /dev/null 2>&1 || DOWNLOAD="curl -O"

 # The path separator character is usually a colon
 PATHSEP=":"
 if [[ "$OSTYPE" == cygwin ]]; then PATHSEP=";"; fi

 # Create a test directory (-p: no error if it already exists)
 mkdir -p "$HOME/testSTRAP_Scripting"
 cd "$HOME/testSTRAP_Scripting"

 # Fetch the examples and strap.jar. Alternatively, download both files with a Web browser
 $DOWNLOAD http://www.bioinformatics.org/strap/strapScript/allExamples.jar
 $DOWNLOAD http://www.bioinformatics.org/strap/strap/strap.jar 

 # Extract the zipped archive, which contains the examples
 jar -xf allExamples.jar

 # Adjust classpath
 export CLASSPATH=.$PATHSEP"strap.jar"$PATHSEP$(echo  biojava-live/*.jar | tr ' ' "$PATHSEP")$PATHSEP

      

Starting the examples

In the following, all demos are listed. You can copy each line to the command shell to start the corresponding example. Each line consists of two commands: javac, which compiles the example, and java, which runs it.

Editing the examples

Windows does not come with a full-featured text editor. A good choice for Windows is Notepad++. You may also try J_Java_Editor.
java charite.christo.J_JavaEditor    DemoViewProteinBackbone.java &
      

The building blocks

The following sections list the important elements that can be used in scripts or programs.

SequenceAligner

A SequenceAligner aligns two or more sequences automatically.

DemoSequenceAligner1.java aligns two sequences. The two sequences are given as java.lang.String objects. The computed alignment is provided as a String[] array containing the gapped sequences.
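Because of the API change noted at the top of this document (sequences in SequenceAligner are now byte[] rather than java.lang.String), a conversion between the two representations may be needed. The following is a minimal sketch using only the Java standard library. The class SequenceBytes and its methods are hypothetical helpers for illustration and are not part of STRAP.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of STRAP: converts sequences between String and byte[].
final class SequenceBytes {

    // One-letter residue codes are plain ASCII, so each residue fits into one byte.
    static byte[] toBytes(String sequence) {
        return sequence.getBytes(StandardCharsets.US_ASCII);
    }

    // Converts a byte[] sequence back into a String, e.g. for printing.
    static String toText(byte[] sequence) {
        return new String(sequence, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] seq = toBytes("MKTAYIAKQR");
        System.out.println(seq.length);   // number of residues
        System.out.println(toText(seq));  // round trip back to a String
    }
}
```

Naming the character set explicitly avoids surprises on platforms whose default encoding is not ASCII-compatible.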

SequenceBlaster

A SequenceBlaster runs the program BLAST, which is an algorithm for comparing biological sequences. Given a collection of sequences, a BLAST search enables a researcher to look for sequences that resemble a given sequence of interest. The query sequence is provided as a String object.

DemoSequenceBlaster.java identifies all sequences in the PDB-Seq database that are similar to a given amino acid sequence. The result is an XML text. The XML text is transformed into a structure by BlastParser. This structure is then written in human-readable form. Web-BLAST takes some time to compute. To avoid repeated BLAST runs with the same query, a cache is maintained.
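The idea of caching results per query can be sketched as follows. This is an illustration under assumptions, not STRAP's implementation: the class CachedSearch is hypothetical, the slow search is replaced by a stand-in function, and the cache is a plain in-memory map.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch, not STRAP code: remembers results so that each query runs only once.
final class CachedSearch {
    private final Map<String, String> cache = new HashMap<>();
    private final Function<String, String> search; // stand-in for the slow BLAST call
    private int searchesRun = 0;

    CachedSearch(Function<String, String> search) {
        this.search = search;
    }

    // Returns the cached result if the query was seen before, otherwise runs the search.
    String run(String query) {
        String result = cache.get(query);
        if (result == null) {
            result = search.apply(query);
            searchesRun++;
            cache.put(query, result);
        }
        return result;
    }

    // Number of times the underlying search was actually executed.
    int countSearches() {
        return searchesRun;
    }

    public static void main(String[] args) {
        CachedSearch blast = new CachedSearch(q -> "<result for " + q + ">");
        blast.run("MKTAYIAKQR");
        blast.run("MKTAYIAKQR"); // answered from the cache
        System.out.println(blast.countSearches()); // prints 1
    }
}
```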

PredictorSecondaryStructure

A SecondaryStructure_Predictor predicts the secondary structure from amino acid sequences. Each query sequence is a String. Some Web services process many amino acid sequences at once: rather than contacting the server several times, one server job is submitted with many sequences. Therefore the class does not take a single sequence as a String but several sequences as a String array, using the method SecondaryStructure_Predictor#setGappedSequences(java.lang.String[]). Two other interfaces are used in the same way as SecondaryStructure_Predictor: TransmembraneHelix_Predictor and CoiledCoil_Predictor.

DemoPrediction.java shows how the secondary structure can be predicted from amino acid sequences. The secondary structure is obtained with the method getPrediction(), which returns a character array for each input sequence. The character 'H' denotes helical residues whereas 'E' stands for extended strands (sheets).

StrapProtein

StrapProtein is the basic class for proteins. The contained sequence might contain gaps. 3D-information of the protein structure may be contained as well.

DemoProtein_aminoAcids.java shows how a protein object is created with StrapProtein#newInstance(File). The amino acid sequence is set by StrapProtein#setResidueType(String) and retrieved with StrapProtein#getResidueTypeAsString().

DemoProtein_nucleotides.java shows how an amino acid sequence can also be defined indirectly by its coding nucleotide sequence using StrapProtein#setNucleotides(String,String).

DemoProtein_gaps.java shows how gapped sequences can be created by inserting white space into the amino acid sequence. Several methods exist to define the gaps.
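What a sequence with white space as gaps encodes can be illustrated without STRAP. In the following sketch, the class GappedSequenceSketch and its methods are hypothetical: one method strips the gaps, the other reports the zero-based alignment column of each residue.

```java
import java.util.Arrays;

// Hypothetical helper, not part of STRAP: interprets white space in a sequence as gaps.
final class GappedSequenceSketch {

    // Returns the sequence without gaps.
    static String ungapped(String gapped) {
        StringBuilder sb = new StringBuilder();
        for (char c : gapped.toCharArray()) {
            if (!Character.isWhitespace(c)) sb.append(c);
        }
        return sb.toString();
    }

    // Returns, for each residue, the zero-based column it occupies in the gapped sequence.
    static int[] residueColumns(String gapped) {
        int[] columns = new int[ungapped(gapped).length()];
        int residue = 0;
        for (int col = 0; col < gapped.length(); col++) {
            if (!Character.isWhitespace(gapped.charAt(col))) columns[residue++] = col;
        }
        return columns;
    }

    public static void main(String[] args) {
        String gapped = "AC  DE F";
        System.out.println(ungapped(gapped));                        // prints ACDEF
        System.out.println(Arrays.toString(residueColumns(gapped))); // prints [0, 1, 4, 5, 7]
    }
}
```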

DemoProtein_userObjects.java: If your application requires data fields not yet existing in StrapProtein, you can associate data with protein objects. This resembles JComponent#putClientProperty(Object,Object) and JComponent#getClientProperty(Object).

DemoResidueSubset.java shows how a subset of residues (range of amino acid positions) can be extracted from a protein. As an example, you might be interested in the protein without the leading signal sequence.

DemoProtein_gaps_advanced.java shows some advanced methods concerned with gaps in sequences.
DemoProtein_nucleotides_advanced.java shows some advanced methods concerned with nucleotides and translation into amino acids.
DemoResidueSubset2.java shows how an expression denoting a (contiguous or non-contiguous) set of residues is used to define a subset of the protein. The residues are given either by their index (starting at 1!) or by the PDB number and chain. It follows the convention of residue subsets in RasMol.

ProteinParser

A ProteinParser collects information from a text (or text-file) and sets the amino acid sequence and other information in a protein object.

DemoProteinParser1.java shows how the default protein parsers work. A file name is given at the command line and the protein file is parsed. For each protein file type a different ProteinParser exists. All parsers are tried one after the other until one parser recognizes the format. Using this approach file types like SWISSPROT, EMBL and PDB are recognized automatically.
DemoProteinParser2.java shows how a self-written ProteinParser is designed and used. This is necessary only in rare cases if protein file formats need to be imported that are not yet supported by STRAP.
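The strategy of trying all parsers one after the other until one recognizes the format can be sketched as below. The interface SequenceParser and the two toy parsers are hypothetical and much simpler than STRAP's ProteinParser, whose actual signature is not shown in this document.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of format detection by trial, not STRAP code.
final class ParserChainSketch {

    // A parser returns the sequence if it recognizes the text, otherwise an empty Optional.
    interface SequenceParser {
        Optional<String> parse(String text);
    }

    // Recognizes FASTA: a header line starting with '>' followed by sequence lines.
    static final SequenceParser FASTA = text -> {
        int newline = text.indexOf('\n');
        if (!text.startsWith(">") || newline < 0) return Optional.empty();
        return Optional.of(text.substring(newline + 1).replaceAll("\\s+", ""));
    };

    // Fallback: accepts text that consists only of letters and white space.
    static final SequenceParser PLAIN = text ->
        text.matches("[A-Za-z\\s]+")
            ? Optional.of(text.replaceAll("\\s+", ""))
            : Optional.empty();

    // Tries the parsers in order and returns the result of the first one that succeeds.
    static Optional<String> parseWithFirstMatch(List<SequenceParser> parsers, String text) {
        for (SequenceParser parser : parsers) {
            Optional<String> result = parser.parse(text);
            if (result.isPresent()) return result;
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<SequenceParser> parsers = Arrays.asList(FASTA, PLAIN);
        System.out.println(parseWithFirstMatch(parsers, ">p1\nACD\nEF").orElse("unrecognized"));
        System.out.println(parseWithFirstMatch(parsers, "12345").orElse("unrecognized"));
    }
}
```

The order of the list matters: specific formats should come before permissive fallbacks, otherwise the fallback would claim files it only partially understands.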

AlignmentWriter

An AlignmentWriter creates a multiple sequence alignment text from a number of StrapProtein objects.

DemoAlignmentWriter.java: In the example, two StrapProtein objects are created and written as an alignment in MSF format.

ProteinWriter

A ProteinWriter creates a text-file for a protein object.

DemoProteinWriter.java shows how proteins are written using the class ProteinWriterFasta.

ProteinViewer

A ProteinViewer is used to display proteins three-dimensionally. The 3D-information is usually given in a PDB-file.

DemoViewProteinBackbone.java creates a StrapProtein from a PDB-file and displays the 3D-structure.

DrawGappedSequence

The class DrawGappedSequence has methods to draw a gapped protein sequence graphically. If there are annotated residues, they will be highlighted.

DemoDrawGappedSequence.java creates a StrapProtein from an amino acid sequence, subsequently inserts a gap and defines a residue annotation. The protein sequence is then displayed in a graphical window.

Superimpose3D

A Superimpose3D aligns two protein backbones three-dimensionally.

DemoSuperimpose1.java superimposes two proteins three-dimensionally. Subsequently the new coordinates are written.

ResidueAnnotation

A ResidueAnnotation selects a subset of amino-acids (or nucleotides).

DemoResidueAnnotation.java defines a StrapProtein and adds a ResidueAnnotation to this protein.
DemoResidueAnnotation_NT.java defines a StrapProtein from a nucleotide sequence and adds a ResidueAnnotation object to this protein. The example shows that the residue selection can also be defined by nucleotide positions of the coding sequence. An amino-acid is selected if at least one of the three nucleotides of the triplet is selected.
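The rule that an amino acid is selected if at least one nucleotide of its triplet is selected can be written down directly. The sketch below is not STRAP code: the class CodonSelectionSketch is hypothetical, and it assumes a contiguous coding sequence that starts at the first nucleotide.

```java
import java.util.Arrays;

// Hypothetical sketch, not STRAP code: maps a nucleotide selection to an amino acid selection.
final class CodonSelectionSketch {

    // Assumes the coding sequence starts at index 0 and has no introns.
    static boolean[] selectedAminoAcids(boolean[] selectedNucleotides) {
        int aminoAcids = selectedNucleotides.length / 3; // an incomplete trailing codon is ignored
        boolean[] selected = new boolean[aminoAcids];
        for (int aa = 0; aa < aminoAcids; aa++) {
            int first = 3 * aa; // index of the first nucleotide of the triplet
            selected[aa] = selectedNucleotides[first]
                        || selectedNucleotides[first + 1]
                        || selectedNucleotides[first + 2];
        }
        return selected;
    }

    public static void main(String[] args) {
        boolean[] nucleotides = new boolean[9]; // three codons, nothing selected yet
        nucleotides[4] = true;                  // middle nucleotide of the second codon
        System.out.println(Arrays.toString(selectedAminoAcids(nucleotides)));
        // prints [false, true, false]
    }
}
```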

ProteinProteinValue

The class ProteinProteinValue returns a value for pairs of proteins. ProteinProteinDistance and ProteinProteinSimilarity are subtypes. For ProteinProteinDistance, pairs of similar proteins get small values and dissimilar proteins get large values. This can be used for distance matrices. For ProteinProteinSimilarity, the more similar the two proteins are, the larger the returned value.
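The relation between a similarity and a distance can be illustrated with a simple measure. The sketch below is an assumption for illustration only and does not use the STRAP classes: similarity is taken as the fraction of identical residues in two aligned sequences with '-' as gap character, and distance as one minus that fraction.

```java
import java.util.Arrays;

// Hypothetical sketch, not STRAP code: pairwise identity as similarity, 1 - identity as distance.
final class PairValueSketch {

    // Fraction of identical residues over all columns in which neither sequence has a gap.
    static double identity(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("sequences must be aligned to equal length");
        }
        int same = 0, compared = 0;
        for (int i = 0; i < a.length(); i++) {
            char x = a.charAt(i), y = b.charAt(i);
            if (x == '-' || y == '-') continue; // skip columns containing a gap
            compared++;
            if (x == y) same++;
        }
        return compared == 0 ? 0.0 : (double) same / compared;
    }

    // Similar proteins get small values, dissimilar proteins get large values.
    static double distance(String a, String b) {
        return 1.0 - identity(a, b);
    }

    // Symmetric matrix of pairwise distances with zeros on the diagonal.
    static double[][] distanceMatrix(String[] aligned) {
        int n = aligned.length;
        double[][] d = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                d[i][j] = d[j][i] = distance(aligned[i], aligned[j]);
            }
        }
        return d;
    }

    public static void main(String[] args) {
        String[] aligned = {"ACDE", "ACDF", "AC-E"};
        System.out.println(Arrays.deepToString(distanceMatrix(aligned)));
    }
}
```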

Example is under construction.

Biojava

To use the BioJava examples you also need to download the jar files of BioJava (23 Megabytes).
 $DOWNLOAD http://www.biojava.org/download/bj16/rc1/all/biojava-all.jar 
 jar -xf biojava-all.jar 
 export CLASSPATH=.$PATHSEP"strap.jar"$PATHSEP$(echo  biojava-live/*.jar | tr ' ' "$PATHSEP")$PATHSEP
      
To run the BioJava-examples copy the following lines into the command shell:

  javac DemoBiojavaSequence2StrapProtein.java;      java DemoBiojavaSequence2StrapProtein
  javac DemoStrapProtein2BiojavaSequence.java;      java DemoStrapProtein2BiojavaSequence

      
BioJava is a set of modules and packages for biology, including sequence analysis, database access, and parsers for sequence files. An interface to use STRAP and BioJava together is provided. This is useful because BioJava has methods which are not contained in STRAP and vice versa. More information is found in biojava.html.

BiojavaSequence2StrapProtein

The class BiojavaSequence2StrapProtein is used for converting BioJava GappedSequence objects, including annotations and features, into StrapProtein objects.

DemoBiojavaSequence2StrapProtein.java: In this example a BioJava-sequence GappedSequence is created. This BioJava object is then converted into a StrapProtein.

StrapProtein2BiojavaSequence

The class StrapProtein2BiojavaSequence is used to convert a StrapProtein object into a BioJava GappedSequence object, including annotations and features.

DemoStrapProtein2BiojavaSequence.java creates a StrapProtein object and converts it to a gapped BioJava sequence GappedSequence. The latter is displayed with renderers provided by BioJava.

Execution speed

The API of STRAP is optimized for speed and supports the analysis of thousands of proteins. Java is very fast (similar to C++), unless objects such as String objects are created frequently. If performance matters, use those methods of the STRAP API that return byte arrays rather than String objects. Never change elements within the returned arrays directly. Always use the set-methods! For some methods the returned byte array might be longer than the valid data, and the current length must be obtained with special methods like countResidues(). For some methods that return byte arrays, a version with the suffix *ExactLength() is available that returns the data with exactly the valid length.
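Why a returned array can be longer than the valid data is easiest to see in a growing buffer. The class ResidueBufferSketch below is hypothetical and only mimics the naming pattern mentioned above (countResidues() and the *ExactLength() suffix); it is not the STRAP implementation.

```java
import java.util.Arrays;

// Hypothetical sketch, not STRAP code: a buffer whose array may be longer than the valid data.
final class ResidueBufferSketch {
    private byte[] residues = new byte[16]; // capacity, usually larger than the valid data
    private int count = 0;                  // number of valid bytes

    // Set-method: the only place where the internal array is modified.
    void append(byte residue) {
        if (count == residues.length) residues = Arrays.copyOf(residues, 2 * count);
        residues[count++] = residue;
    }

    // Fast: returns the internal array without copying. Treat it as read-only
    // and use only the first countResidues() elements.
    byte[] getResidues() {
        return residues;
    }

    int countResidues() {
        return count;
    }

    // Slower: allocates a copy that has exactly the valid length.
    byte[] getResiduesExactLength() {
        return Arrays.copyOf(residues, count);
    }

    public static void main(String[] args) {
        ResidueBufferSketch buffer = new ResidueBufferSketch();
        for (char c : "ACD".toCharArray()) buffer.append((byte) c);
        System.out.println(buffer.getResidues().length);            // prints 16
        System.out.println(buffer.countResidues());                 // prints 3
        System.out.println(buffer.getResiduesExactLength().length); // prints 3
    }
}
```

Returning the internal array avoids an allocation per call, which is the speed advantage described above; the price is that callers must respect the valid length and must not write into the array.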

Related Bio* projects

Toolboxes for processing biological data exist for various computer languages.

People

Christoph Gille is a medical doctor working as a scientist at the Institute of Biochemistry of the Charite University Hospital in Berlin, Germany. He developed STRAP because he needed an alignment program to align a large number of sequences of the proteasome. He is the initiator of this project.

Peter Robinson is a Medical Geneticist and Bioinformatician at the Institute of Medical Genetics of the Charite University Hospital in Berlin, Germany. He initially learnt STRAP while trying to get a handle on protein structure Bioinformatics and has since become an enthusiastic user and developer. Other interests in Bioinformatics include Gene Ontology, string algorithms, and micro-array analysis; source code, links, and other half-baked ideas can be found at his website www.charite.de/ch/medgen/robinson.

Kristian Rother is a Biochemist at the Institute of Biochemistry of the Charite University Hospital in Berlin, Germany. He initiated the database project COLUMBA - A database of annotated protein structures. His favorite research project is the packing density of proteins and the occurrence of cavities in protein structures.

Contact

Please send bug reports, recommendations and feedback to christoph.gille a t charite.de