
MRes Biomolecular Sciences Lecture Notes: 2. Bioinformatics of Protein Evolution part I


Table of Contents
Introduction
Structure of proteins revisited
Classification of proteins
Some classification systems
Summary

This is the third of seven sets of notes (numbered 0-6) designed to summarize and supplement the content of a series of lectures given by Damian Counsell at Imperial College London as part of the MRes course in Biomolecular Sciences.

These notes and associated materials will be made available under a modified form of the Open Content Licence.

Damian Counsell is at the Medical Research Council's Human Genome Mapping Project Resource Centre, Cambridge, UK.


Introduction

Proteins are the substrates of selection, that is, genes are selected for or against because the proteins they specify are more or less fitted to their environments. The study of the evolution of life is therefore the study of protein evolution.

In biomedical science and pure biology we want to study the functions of living systems. Function is directly dependent upon structure. Sequence is cheap to obtain (or already has been obtained) for many proteins. Structure, in contrast, is expensive to obtain. Sequence specifies structure, however. One of the principal aims of bioinformatics is to estimate the functions of gene products from the analysis of gene sequences.


Structure of proteins revisited

Protein structure

I reviewed, in greater detail, the aspects of protein structure described in the first lecture.


The Central Dogma

Slide 6 is a recap of the Central Dogma.


Gene expression and disease

Slide 7 shows how errors in genes (DNA) result in errors in gene products (proteins).
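As a minimal illustration of this point (not taken from the slides), the Python sketch below translates a short coding sequence before and after a single-base substitution. The codon table is deliberately partial, the function name `translate` is illustrative, and the sequence is a toy example rather than an exact gene sequence. The substitution itself, GAG to GTG (glutamate to valine), is of the kind seen in sickle-cell haemoglobin.

```python
# Partial standard codon table: only the codons used in this toy example.
CODON_TABLE = {
    "ATG": "M", "GTG": "V", "CAC": "H", "CTG": "L",
    "ACT": "T", "CCT": "P", "GAG": "E",
}


def translate(dna):
    """Translate a coding DNA sequence into one-letter amino acid codes.

    Assumes the length is a multiple of three and that every codon
    is present in the partial table above.
    """
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    return "".join(CODON_TABLE[codon] for codon in codons)


# Toy coding sequence (an assumption for illustration only).
normal_gene = "ATGGTGCACCTGACTCCTGAGGAG"

# Substitute one base (A -> T) in the middle of the seventh codon: GAG -> GTG.
mutant_gene = normal_gene[:19] + "T" + normal_gene[20:]

print(translate(normal_gene))
print(translate(mutant_gene))
```

A single changed base in the DNA changes exactly one residue in the protein here; other substitutions may be silent or may introduce a stop codon, which this partial table does not model.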


Amino acids

  • A generic amino acid

    • common structure template each with specific sidechain properties

    • rotation in main axis possible about alpha-carbon

      • hence Ramachandran checking method for structure verification

    • Levinthal paradox

      • not enough time for all the amino acid residues in polypeptide to explore all rotations
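The arithmetic behind the Levinthal paradox can be sketched in a few lines of Python. The numbers used here are assumptions chosen only for illustration: three backbone states per residue, a chain of 100 residues, and 10^13 conformations sampled per second. The function name is also illustrative.

```python
SECONDS_PER_YEAR = 60 * 60 * 24 * 365.25


def levinthal_search_time_years(n_residues, states_per_residue=3,
                                samples_per_second=1e13):
    """Years needed to visit every conformation by exhaustive search.

    All three parameters are illustrative assumptions, not measured values.
    """
    conformations = states_per_residue ** n_residues
    seconds = conformations / samples_per_second
    return seconds / SECONDS_PER_YEAR


years = levinthal_search_time_years(100)
print(f"{years:.1e} years")
```

Under these assumptions the search would take on the order of 10^27 years, vastly longer than the age of the universe, which is why folding cannot proceed by random exhaustive search.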




Primary structure

  • 20 naturally occurring amino acids with differing side-chains once incorporated into proteins

  • 4 main broad non-exclusive classes




Secondary structure

  • common polar amino acid template requires balancing of polar atoms in solution

  • 3 main classes of repeating arrangement to achieve this

    • alpha helix: charge balancing with immediate neighbours

    • beta sheet: charge balancing with distant neighbours

      • anti-parallel

      • parallel



Of course there are regions of proteins, mostly at the outside of the hydrophobic core where there is no regular repeating secondary structure, but instead the primary structure is arranged into "random" loops. They are not random—just very difficult to predict.


Tertiary structure

  • arrangement of secondary structural elements to achieve stable domains

  • secondary structural elements are regular, but not rigid




Quaternary structure

  • ordered arrangement of proteins in hetero- or homomeric complexes

  • most proteins in vivo are associated with other proteins




Information and structure hierarchies

See slide 13 for a diagram of how the hierarchy of protein structure relates to the Central Dogma.


Classification of proteins

The Growth of the Protein DataBank (PDB)

The Protein DataBank is no longer a book. It is now an online computer depository.

  • It is maintained at the RCSB (amongst other places).

  • It currently contains over 17 000 entries—my slide number 15 is incorrect.

  • On average 4.5 new entries are added per day.

  • However, there are fewer folds than structures.




Gross protein groupings

When discussing the possible arrangements of secondary structural units into higher-level configurations, the terminology can be a little confusing, not least because different groups classifying proteins attach slightly different meanings to the same words. Here I attempted to give a guide to the terms used in describing super-secondary structural classes, with examples of real protein structural elements corresponding to each term.


Domains…

…are "self-contained" folding units. There are, in fact, a number of definitions of the term domain, but what they usually have in common is that the thing they refer to is, to some extent at least, independent or self-contained.

  • The biological/biochemical definition of a protein domain is a unit of protein structure which can fold autonomously, that is, without the help of other parts of the "parent" molecule or other molecules.

  • The evolutionary definition of a protein domain is a structural motif which is found in more than one protein structure, that is, a conserved element of higher order than secondary structure.

  • The functional/biochemical definition of a protein domain is a subset of a whole protein sufficient to perform one of the whole protein's activities without the rest of the molecule needing to be present.




Folds

Folds again have a fuzzy definition. They are often defined in a similar way to domains, but the term tends to be used for larger-scale structural forms than domains. They are more like motifs (recurring patterns) of domain and superdomain structures. (Again I use the "super-" prefix here to refer to "larger-scale", rather than meaning just "larger".)


Topologies

"Topology" is used in mathematics to describe the study of geometrical relationships for which distance is not significant, but connectivity is. In bioinformatics we use the term to refer to "maps" of folds. By that we mean the patterns of connectedness of higher order structural elements: for example, whether a helix is connected to the amino or carboxy terminus of a sheet before the sheet itself attaches to one end of another helix. Topological diagrams are a bit like what mathematicians call "graphs": diagrams of points of various types linked by lines called "edges".

Read the introductory section of this more detailed explanation. Look at the documentation of the TOPS database for a practical demonstration.
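To make the graph analogy concrete, here is a minimal Python sketch in which secondary structure elements are nodes and two kinds of edge record connectivity: neighbours along the chain, and strands lying side by side in a sheet. The representation, the function name `topology_edges` and both example folds are hypothetical. This is not the TOPS data format.

```python
def topology_edges(sse_order, strand_pairs):
    """Return the edge set of a simple topology graph.

    sse_order: element labels in chain order, amino to carboxy terminus,
               e.g. ["E1", "E2", "E3"] for three strands.
    strand_pairs: pairs of strands that are adjacent in a sheet.
    Only connectivity is stored; no distances or coordinates are used.
    """
    edges = set()
    # Consecutive elements are linked along the polypeptide chain.
    for first, second in zip(sse_order, sse_order[1:]):
        edges.add(("chain", first, second))
    # Sheet adjacency is undirected, so store each pair in sorted order.
    for a, b in strand_pairs:
        edges.add(("sheet", *sorted((a, b))))
    return edges


# Two hypothetical three-stranded sheets built from the same elements.
meander = topology_edges(["E1", "E2", "E3"], [("E1", "E2"), ("E2", "E3")])
crossover = topology_edges(["E1", "E2", "E3"], [("E1", "E3"), ("E3", "E2")])

print(meander == crossover)  # same elements, different connectivity
```

The two examples contain identical elements in the same chain order, yet their edge sets differ because the strands pair up differently in the sheet. That difference in connectedness is what a topology diagram is meant to capture.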


Superfamilies

The term Superfamilies was coined by one of the pioneers of what we now call bioinformatics, Margaret Dayhoff (1974).

  • proteins with a structural and evolutionary relationship are grouped together in superfamilies

  • this definition is now blurred

  • PIR (protein database) definition

    • George (1995)

    • transitively closed

      • if A*B and B*C, then A*C

    • homeomorphic: same domains in same order, can be aligned over entire length (except ragged ends)
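One way to picture transitive closure is as a grouping procedure: given pairwise relationships written A*B, every protein reachable through a chain of relationships ends up in the same group. The Python sketch below uses a simple union-find structure to do this. It is an illustration of the idea only, not the procedure PIR uses, and the function name and protein labels are hypothetical.

```python
def superfamilies(proteins, related_pairs):
    """Group proteins so that the 'related' relation is transitively closed."""
    parent = {protein: protein for protein in proteins}

    def find(protein):
        # Follow parent links to the representative of the group.
        while parent[protein] != protein:
            parent[protein] = parent[parent[protein]]  # shorten the path
            protein = parent[protein]
        return protein

    # Merge the groups of every related pair.
    for a, b in related_pairs:
        parent[find(a)] = find(b)

    groups = {}
    for protein in proteins:
        groups.setdefault(find(protein), set()).add(protein)
    return sorted(sorted(group) for group in groups.values())


# A*B and B*C are observed; A*C is never stated directly.
print(superfamilies(["A", "B", "C", "D"], [("A", "B"), ("B", "C")]))
```

A and C land in the same group even though only A*B and B*C were supplied, while D, which is related to nothing, stays in a group of its own.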


There are fewer folds than structures

From X-ray crystallography, NMR and electron microscopy it seems only a limited number of globular protein structure motifs (folds) account for the frameworks of most protein structures. By 1996, the European Bioinformatics Institute's Dali fold library had classified over 4000 known protein structures into non-redundant folds and found that, on average, only one new fold class was added to the library for every 15 new entries. The PDB is highly redundant. The same authors state in their recent overview of the latest versions of Dali and HSSP, the related fold-to-sequence alignment database:

"In September 1998, all known protein structures were completely described in terms of 771 fold types..."

Further, these folds are likely to be well represented in the as-yet-unsolved proteins for which we have reliable, curated sequence data. The same overview states:

"The HSSP database associates 1D sequences with known 3D structures using a position-weighted dynamic programming method for sequence profile alignment (MaxHom). As a result, the HSSP database not only provides aligned sequence families, but also implies secondary and tertiary structures covering 36% of all sequences in Swiss-Prot."



This is a central idea in the whole field of structural protein bioinformatics. Fold space (the range of all possible protein folds) can be "covered", that is, approximately represented, by a subset of well-chosen structures. In the same way, perhaps, "carspace" can be approximately represented by a collection of well-chosen cars, e.g. a Golf to represent a large hatchback, a Mercedes saloon to represent a luxury car, a Land Rover to represent a 4x4 vehicle.

The importance of this will become clearer when we discuss the protein folding problem and protein structure prediction.


Some classification systems

How can we assign proteins to these folds? A whole field of bioinformatics and several large and important database projects have been built up around this question of the useful categorization of protein structure templates. Here I summarize the characteristics of three such systems, SCOP, CATH, and FSSP (as generated by the Dali server), and give links to their Web sites so that you can read about them in more detail.


SCOP

Description of SCOP

"created by manual inspection and abetted by a variety of automated methods, [SCOP] aims to provide a detailed and comprehensive description of the structural and evolutionary relationships between all proteins whose structure is known"




Construction of SCOP

  • Structural Classification of Proteins

  • Alexei Murzin et al., Laboratory of Molecular Biology, Cambridge

  • mainly manual

  • structural, evolutionary, functional




CATH

Description of CATH

  • Class, Architecture, Topology and Homologous superfamily

  • Janet Thornton, European Bioinformatics Institute, Cambridge

  • partly automated hierarchy of organization




The CATH hierarchy

  • Class

    • secondary structure content

      • all alpha

      • alpha and beta

      • all beta

  • Architecture

    • secondary structure similarly arranged, e.g.

      • TIM barrel

      • alpha-beta sandwich

      • jelly roll

  • Topology

    • secondary structure elements similarly connected

  • Homologous superfamily

    • clear evolutionary relationship




Statistics of CATH pyramid

Refer to the CATH Web site for full, up-to-date figures on the latest release.
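The "pyramid" shape arises because each level of the hierarchy has at least as many groups as the level above it. The Python sketch below counts distinct groups per level for a handful of domains. Every label in it is a hypothetical placeholder rather than a real CATH identifier, and the field names are assumptions made for illustration (`klass` is used because `class` is a reserved word in Python).

```python
from collections import namedtuple

# One record per domain; all labels are hypothetical placeholders.
Domain = namedtuple(
    "Domain", ["name", "klass", "architecture", "topology", "superfamily"]
)

LEVELS = ["klass", "architecture", "topology", "superfamily"]

example_domains = [
    Domain("d1", "alpha and beta", "barrel", "topology-1", "superfamily-1"),
    Domain("d2", "alpha and beta", "barrel", "topology-1", "superfamily-2"),
    Domain("d3", "alpha and beta", "sandwich", "topology-2", "superfamily-3"),
    Domain("d4", "all alpha", "bundle", "topology-3", "superfamily-4"),
]


def pyramid_counts(domains):
    """Count distinct groups at each level, from Class down to domains."""
    counts = {}
    for depth, level in enumerate(LEVELS, start=1):
        # A group is identified by its full path from the top of the hierarchy.
        paths = {
            tuple(getattr(domain, name) for name in LEVELS[:depth])
            for domain in domains
        }
        counts[level] = len(paths)
    counts["domains"] = len(domains)
    return counts


print(pyramid_counts(example_domains))
```

Identifying a group by its full path, rather than by its own label alone, keeps two groups with the same name under different parents from being counted as one.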


FSSP

Description of FSSP

  • Fold classification based on Structure-Structure alignment of Proteins

  • derived with computer program: Dali




Construction of FSSP

  • all of PDB

    • representative set

    • sequence homologues (greater than 25% identity)

  • No homologues in representative set

  • continuously and automatically updated

    • smart structural supervision algorithm
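The idea of a representative set with no homologues can be sketched as a greedy filter: walk through the chains and keep one only if it shares no more than 25% sequence identity with every chain already kept. The Python below is a simplified illustration under stated assumptions, not the actual FSSP or Dali procedure. It assumes the sequences are already aligned to equal length, and the function names and toy sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over columns where neither aligned sequence has a gap."""
    columns = [
        (a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"
    ]
    if not columns:
        return 0.0
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)


def representative_set(names, identity, threshold=25.0):
    """Greedily keep chains that are not homologues of any chain already kept.

    identity: function taking two names and returning percent identity.
    The result depends on the order of `names`.
    """
    representatives = []
    for name in names:
        if all(identity(name, kept) <= threshold for kept in representatives):
            representatives.append(name)
    return representatives


# Toy pre-aligned sequences (hypothetical, equal length, no gaps).
aligned = {
    "p1": "ACDEFGHIKL",
    "p2": "ACDEFGHIKV",  # differs from p1 at one position
    "p3": "WYWYWYWYWY",  # unrelated to p1
}


def toy_identity(a, b):
    return percent_identity(aligned[a], aligned[b])


print(representative_set(["p1", "p2", "p3"], toy_identity))
```

Here p2 is dropped as a homologue of p1, while p3 is kept. Because the selection is greedy, changing the input order can change which member of a homologous group is chosen as the representative.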




Summary



