Interpreting protein files
Proteins are stored in plain text files. A typical error is to save a protein file as an MS Word document:
STRAP does not recognize the proprietary Word format.
STRAP recognizes file formats by trial and error, regardless of the file suffix.
First, the DSSP format is assumed. In addition to the
amino acid sequence, it contains the C-alpha coordinates and the
secondary structure definition.
DSSP files are calculated from PDB files with the program DSSP
by Kabsch and Sander, which is freely available on the Web (see ).
**** SECONDARY STRUCTURE DEFINITION BY THE PROGRAM DSSP, VERSION OCT. 1985 **** MONTH=12 DAY=24 YEAR=2001
# RESIDUE AA STRUCTURE BP1 BP2 ACC N-H-->O O-->H-N N-H-->O O-->H-N TCO KAPPA ALPHA PHI PSI X-CA Y-CA Z-CA
1 -9 H L 0 0 117 0, 0.0 2,-0.1 0, 0.0 101,-0.1 0.000 360.0 360.0 360.0 125.0 40.8 -139.0 51.3
2 -8 H K > - 0 0 118 1,-0.1 3,-1.5 2, 0.0 2,-0.2 -0.349 360.0 -86.7 -69.0 147.8 43.1 -138.1 48.4
3 -7 H K T 3 S- 0 0 205 1,-0.2 -1,-0.1 -2,-0.1 0, 0.0 -0.321 106.3 -4.6 -59.3 118.2 45.4 -140.7 46.9
4 -6 H G T 3 S+ 0 0 68 1,-0.2 -1,-0.2 -2,-0.2 -2, 0.0 0.372 92.0 135.9 84.1 -4.0 48.7 -140.7 48.9
If the protein file does not comply with the DSSP format, the PDB format (nk_(file_format)) is tried.
Lines containing CA in the third column describe C-alpha atoms.
Initially, STRAP loads only the C-alpha atoms,
but side chain atoms are read on demand.
For a short explanation of the PDB-format see nk_(file_format).
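The C-alpha selection described above can be sketched as follows. This is only an illustration, not STRAP's own code: the names CAlpha and read_c_alpha are hypothetical, and the sketch reads the fixed character columns of the PDB ATOM record (atom name in columns 13-16, coordinates in columns 31-54) rather than splitting the line at white space.

```python
from typing import List, NamedTuple


class CAlpha(NamedTuple):
    """One C-alpha atom (hypothetical container, not a STRAP class)."""
    res_name: str
    chain: str
    res_num: int
    x: float
    y: float
    z: float


def read_c_alpha(pdb_lines: List[str]) -> List[CAlpha]:
    """Collect the C-alpha atoms from the ATOM records of a PDB file."""
    atoms = []
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue                      # skips HETATM, SEQRES, REMARK, ...
        if line[12:16].strip() != "CA":   # atom name, columns 13-16
            continue
        atoms.append(CAlpha(
            res_name=line[17:20].strip(),  # residue name, columns 18-20
            chain=line[21],                # chain identifier, column 22
            res_num=int(line[22:26]),      # residue number, columns 23-26
            x=float(line[30:38]),          # coordinates, columns 31-54
            y=float(line[38:46]),
            z=float(line[46:54]),
        ))
    return atoms
```

Side chain atoms are skipped here; loading them on demand would mean a second pass over the same lines without the CA test.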
A protein model may contain residues that are recorded in the SEQRES lines
of the PDB file but have no corresponding ATOM lines.
These residues are written in lower case in the alignment panel to indicate that they have no
3D-coordinates.
The interpretation of SEQRES lines can be deactivated by unselecting
or with the command line option "-noSeqres".
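The lower-case marking can be illustrated with a short sketch. It is not taken from STRAP: the function names are hypothetical, unknown residue names are mapped to X, and the set of sequence positions that have ATOM lines is assumed to be known already (matching SEQRES residues to ATOM residues is the hard part and is not shown).

```python
from typing import Iterable, Set

THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def seqres_sequence(pdb_lines: Iterable[str], chain: str) -> str:
    """One-letter sequence of a chain as recorded in the SEQRES lines."""
    residues = []
    for line in pdb_lines:
        # chain identifier in column 12, residue names in columns 20-70
        if line.startswith("SEQRES") and line[11:12] == chain:
            residues.extend(line[19:70].split())
    return "".join(THREE_TO_ONE.get(name, "X") for name in residues)


def mark_missing(sequence: str, with_coords: Set[int]) -> str:
    """Lower-case every residue whose 0-based index has no coordinates."""
    return "".join(
        letter if index in with_coords else letter.lower()
        for index, letter in enumerate(sequence)
    )
```

For example, a chain THR GLU PHE GLY in which only the second and third residue have ATOM lines would be displayed as tEFg.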
If the file is not in PDB format, the FASTA format is tested.
The FASTA format is characterized by a greater-than character ">" followed by the header ( ).
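A minimal reader for this layout might look like the sketch below. The function name parse_fasta is hypothetical and the code is not STRAP's parser; it only shows the rule that a line starting with ">" opens a record and all following lines belong to its sequence.

```python
from typing import List, Tuple


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Return (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:          # close the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:            # sequence line of current record
            chunks.append(line.replace(" ", ""))
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

Text that appears before the first ">" line is ignored in this sketch.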
The first series of non-blank characters in a line should consist exclusively of digits, followed by white space of any length and an amino acid sequence.
EMBL-, - and -files follow this scheme and should be parsed correctly.
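The numbered-line rule stated above can be written as a regular expression. The sketch below implements only that rule as worded here; the name parse_numbered_sequence is hypothetical, and real database files contain further line types that this sketch simply skips.

```python
import re

# leading digits, white space of any length, then the sequence text
NUMBERED_LINE = re.compile(r"^\s*\d+\s+(\S.*)$")


def parse_numbered_sequence(text: str) -> str:
    """Concatenate the letters of all lines that start with a number."""
    parts = []
    for line in text.splitlines():
        match = NUMBERED_LINE.match(line)
        if match:
            # keep letters only, dropping the blanks between blocks
            parts.append("".join(ch for ch in match.group(1) if ch.isalpha()))
    return "".join(parts)
```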
The header is largely ignored: only the name of the compound and the organism are extracted to create some information texts.
files usually contain nucleotide sequences rather than amino
acids, so nucleotides will be seen in the protein alignment.
Genbank files can be interpreted with the dialog "Translate Genbank nucleotide files ..." [Menu-bar>Protein>Nucleotide sequence].
For other nucleotide sequences the reading frame and the translated regions can be set manually
("Reading frame of nucleotide sequence ..." [Menu-bar>Protein>Nucleotide sequence]).
Three nucleotide bases yield one amino acid.
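The effect of the reading frame can be seen in a small translation sketch. It assumes the standard genetic code and a frame given as an offset of 0, 1 or 2; the name translate is hypothetical, and the way STRAP handles translated regions is not shown.

```python
BASES = "TCAG"
# standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(nucleotides: str, frame: int = 0) -> str:
    """Translate a nucleotide string, starting at offset `frame` (0-2)."""
    seq = nucleotides.upper().replace("U", "T")
    protein = []
    for start in range(frame, len(seq) - 2, 3):
        # unknown codons (for example with N) become X, stop codons *
        protein.append(CODON_TABLE.get(seq[start:start + 3], "X"))
    return "".join(protein)
```

With frame 0 the sequence ATGGCCTAA reads as the codons ATG GCC TAA; with frame 1 it reads as TGG CCT and the two trailing bases are dropped.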
Finally, if no specific format is recognized, all letters in the file are interpreted as one-letter codes of amino acids.
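The order in which the formats are tried can be summarized in a sketch. STRAP tries the parsers themselves; the cheap content checks below are stand-ins chosen for illustration, and all function names are hypothetical. Note that the file suffix plays no role.

```python
import re


def looks_like_dssp(text: str) -> bool:
    # the DSSP residue table starts with a "#  RESIDUE AA STRUCTURE" line
    return any(line.lstrip().startswith("#") and "RESIDUE" in line
               for line in text.splitlines())


def looks_like_pdb(text: str) -> bool:
    return any(line.startswith(("ATOM  ", "SEQRES"))
               for line in text.splitlines())


def looks_like_fasta(text: str) -> bool:
    return text.lstrip().startswith(">")


def looks_like_numbered(text: str) -> bool:
    return any(re.match(r"^\s*\d+\s+[A-Za-z]", line)
               for line in text.splitlines())


# tried from top to bottom, the first match wins
DETECTORS = [
    ("dssp", looks_like_dssp),
    ("pdb", looks_like_pdb),
    ("fasta", looks_like_fasta),
    ("numbered", looks_like_numbered),
]


def guess_format(text: str) -> str:
    for name, matches in DETECTORS:
        if matches(text):
            return name
    return "raw"


def raw_letters(text: str) -> str:
    """Fallback: every letter in the file is taken as a residue code."""
    return "".join(ch for ch in text if ch.isalpha())
```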
File compression:
Files ending with .gz, .bz2, .Z or .zip will be decompressed automatically.
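Suffix-based decompression could be sketched as below. This is an illustration with Python's standard library, not STRAP's implementation: the name read_text is hypothetical, a .zip archive is assumed to hold the protein file as its first member, and .Z files are left out because the standard library has no reader for the Unix compress format.

```python
import bz2
import gzip
import zipfile


def read_text(path: str) -> str:
    """Read a possibly compressed text file, chosen by its suffix."""
    if path.endswith(".gz"):
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            return handle.read()
    if path.endswith(".bz2"):
        with bz2.open(path, "rt", encoding="utf-8") as handle:
            return handle.read()
    if path.endswith(".zip"):
        with zipfile.ZipFile(path) as archive:
            first_member = archive.namelist()[0]   # assumption: one file
            return archive.read(first_member).decode("utf-8")
    if path.endswith(".Z"):
        # Unix "compress" format: would need an external tool
        raise NotImplementedError(".Z is not handled in this sketch")
    with open(path, encoding="utf-8") as handle:
        return handle.read()
```

The decompressed text can then be passed to the format recognition, which itself ignores the suffix.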
Problems:
- pdb1a6y.ent.Z: last residue in SEQRES is a MET and is not in ATOMS
- pdb1acb: SEQRES 1 I 70 THR GLU PHE GLY where is it ?