This is the third of seven sets of notes (numbered 0-6) designed to summarize and supplement the content of a series of lectures given by Damian Counsell at Imperial College London as part of the MRes course in Biomolecular Sciences.
These notes and associated materials will be made available under a modified form of the Open Content Licence.
Proteins are the substrates of selection, that is, genes are selected for or against because the proteins they specify are more or less fitted to their environments. The study of the evolution of life is therefore the study of protein evolution.
In biomedical science and pure biology we want to study the functions of living systems. Function is directly dependent upon structure. Sequence is cheap to obtain (or already has been obtained) for many proteins. Structure, in contrast, is expensive to obtain. Sequence specifies structure, however. One of the principal aims of bioinformatics is to estimate the functions of gene products from the analysis of gene sequences.
I reviewed in greater detail the aspects of protein structure described in the first lecture.
Slide 7 shows how errors in genes (DNA) result in errors in gene products (proteins).
A generic amino acid
common structure template each with specific sidechain properties
rotation in main axis possible about alpha-carbon
hence Ramachandran checking method for structure verification
not enough time for all the amino acid residues in polypeptide to explore all rotations
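The backbone rotations that a Ramachandran check examines are the torsion angles phi (C of the previous residue, N, C-alpha, C) and psi (N, C-alpha, C, N of the next residue). The following is a minimal sketch of how such angles could be computed from atomic coordinates. The function names and the tuple-based coordinate format are assumptions made for illustration; real structure-checking software is considerably more elaborate.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed torsion angle, in degrees, about the p1-p2 bond."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = (b1[0] / norm, b1[1] / norm, b1[2] / norm)
    # Keep only the components of b0 and b2 perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * c for c in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * c for c in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


def phi_psi(prev_c, n, ca, c, next_n):
    """Backbone torsions of one residue from five (x, y, z) atom positions."""
    phi = dihedral(prev_c, n, ca, c)
    psi = dihedral(n, ca, c, next_n)
    return phi, psi
```

Plotting psi against phi for every residue of a model gives its Ramachandran plot; residues falling far from the populated regions are candidates for closer inspection.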
20 naturally occurring amino acids with differing side-chains once incorporated into proteins
4 main broad non-exclusive classes
common polar amino acid template requires balancing of polar atoms in solution
3 main classes of repeating arrangement to achieve this
alpha helix: charge balancing with immediate neighbours
beta sheet: charge balancing with distant neighbours
anti-parallel
parallel
Of course, there are regions of proteins, mostly outside the hydrophobic core, where there is no regular repeating secondary structure; instead, the primary structure is arranged into "random" loops. They are not random, just very difficult to predict.
arrangement of secondary structural elements to achieve stable domains
secondary structural elements are regular, but not rigid
ordered arrangement of proteins in hetero or homomeric complexes
most proteins in vivo are associated with other proteins
See slide 13 for a diagram of how the hierarchy of protein structure relates to the Central Dogma.
The Protein Data Bank is no longer a book. It is now an online computer depository.
It is maintained at the RCSB (amongst other places).
It currently contains over 17 000 entries—my slide number 15 is incorrect.
On average 4.5 new entries are added per day.
However, there are fewer folds than structures.
When discussing the possible arrangements of secondary structural units into higher-level configurations, the terminology can be a little confusing, not least because different groups classifying proteins attach slightly different meanings to the same words. Here I attempted to give a guide to the terms used in describing super-secondary structural classes and examples of real protein structural elements corresponding to each term.
…are "self-contained" folding units. There are, in fact, a number of definitions of the term domain, but what they usually have in common is that the thing they refer to is, to some extent at least, independent or self-contained.
The biological/biochemical definition of a protein domain is a unit of protein structure which can fold autonomously, that is, without the help of other parts of the "parent" molecule or other molecules.
The evolutionary definition of a protein domain is a structural motif which is found in more than one protein structure, that is a conserved element of higher order than secondary structure.
The functional/biochemical definition of a protein domain is a subset of a whole protein sufficient to perform one of the whole protein's activities without the rest of the molecule needing to be present.
Folds, again, have a fuzzy definition. They are often defined in a similar way to domains, but the term tends to be used for larger-scale structural forms than domains. They are more like motifs (recurring patterns) of domain and superdomain structures. (Again I use the "super-" prefix here to refer to "larger-scale", rather than meaning just "larger".)
"Topology" is used in mathematics to describe the study of geometrical relationships for which distance is not significant, but connectivity is. In bioinformatics we use the term to refer to "maps" of folds. By that we mean the patterns of connectedness of higher-order structural elements: for example, whether a helix is connected to the amino- or carboxy-terminal end of a sheet before the sheet itself attaches to one end of another helix. Topological diagrams are a bit like what mathematicians call "graphs": diagrams of points of various types linked by lines called "edges".
Read the introductory section of this more detailed explanation. Look at the documentation of the TOPS database for a practical demonstration.
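To make the graph view of topology concrete, here is a small sketch in which a chain is reduced to the kinds of its secondary structural elements in chain order, plus a set of edges recording which elements are paired (for example, two strands lying side by side in the same sheet). Element lengths are stored but deliberately ignored in the comparison, mirroring the idea that distance is not significant but connectivity is. The `Element` class, the `topology` function and the two example units are hypothetical; this is not the representation used by TOPS or any other database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    kind: str    # "helix" or "strand"
    length: int  # number of residues; ignored when comparing topologies


def topology(elements, pairings=()):
    """Reduce a chain to element kinds in chain order plus paired elements.

    `pairings` holds index pairs such as (0, 2), meaning that elements
    0 and 2 are in contact (e.g. neighbouring strands in a sheet).
    """
    kinds = tuple(e.kind for e in elements)
    edges = frozenset(frozenset(p) for p in pairings)
    return kinds, edges


# Two hypothetical beta-alpha-beta units whose elements differ in length.
unit_a = [Element("strand", 5), Element("helix", 12), Element("strand", 6)]
unit_b = [Element("strand", 7), Element("helix", 20), Element("strand", 4)]

# Same connectivity, different sizes: the topologies compare equal.
same = topology(unit_a, [(0, 2)]) == topology(unit_b, [(0, 2)])
```

The comparison is made in chain order only. A fuller treatment would also record strand direction and would need a proper graph-matching step.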
The term Superfamilies was coined by one of the pioneers of what we now call bioinformatics, Margaret Dayhoff (1974).
proteins with a structural and evolutionary relationship are grouped together in superfamilies
PIR (protein database) definition
George (1995)
transitively closed
if A*B and B*C, then A*C
homeomorphic—same domains in same order, can be aligned over entire length (except ragged ends)
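Transitive closure means that pairwise relationships can be chained: if A is related to B and B to C, then all three belong to the same group. One way to sketch this is with a union-find structure that merges proteins into groups from a list of related pairs. The function name and the protein labels are hypothetical, and this is not a description of how PIR actually builds its superfamilies.

```python
def superfamily_groups(proteins, related_pairs):
    """Group proteins so that relatedness is closed under transitivity."""
    parent = {p: p for p in proteins}

    def find(p):
        # Follow parent links to the group's root, shortening the path as we go.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in related_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = {}
    for p in proteins:
        groups.setdefault(find(p), set()).add(p)
    return sorted(sorted(g) for g in groups.values())


# A*B and B*C place A, B and C in one group, even though A*C was never listed.
groups = superfamily_groups(
    ["A", "B", "C", "D", "E"],
    [("A", "B"), ("B", "C"), ("D", "E")],
)
```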
superfamilies
superfamilies (Dayhoff 1974)
groups of proteins with common structural and evolutionary characteristics
now blurred
closed under transitivity (George 1974)
if A*B and B*C, then A*C
From X-ray crystallography, NMR and electron microscopy it seems that only a limited number of globular protein structure motifs (folds) account for the frameworks of most protein structures. By 1996, the European Bioinformatics Institute's Dali fold library had classified over 4000 known protein structures into non-redundant folds and found that, on average, only one new fold class was added to the library for every 15 new entries. The PDB is highly redundant. The same authors state in their recent overview of the latest versions of Dali and HSSP, the related fold-to-sequence alignment database:
"In September 1998, all known protein structures were completely described in terms of 771 fold types..."
"The HSSP database associates 1D sequences with known 3D structures using a position-weighted dynamic programming method for sequence profile alignment (MaxHom). As a result, the HSSP database not only provides aligned sequence families, but also implies secondary and tertiary structures covering 36% of all sequences in Swiss-Prot."
Further, these folds are likely to be well represented in the as-yet-unsolved proteins for which we have reliable, curated sequence data.
This is a central idea in the whole field of structural protein bioinformatics. Fold space (the range of all possible protein folds) can be "covered", that is, approximately represented, by a subset of well-chosen structures. In the same way, perhaps, "carspace" can be approximately represented by a collection of well-chosen cars, e.g. a Golf to represent a large hatchback, a Mercedes saloon to represent a luxury car, a Land Rover to represent a 4x4 vehicle.
The importance of this will become clearer when we discuss the protein folding problem and protein structure prediction.
How can we assign proteins to these folds? A whole field of bioinformatics and several large and important database projects have been built up around this question of the useful categorization of protein structure templates. Here I summarize the characteristics of three such systems, SCOP, CATH and FSSP (as generated by the Dali server), and give links to their Web sites so that you can read about them in more detail.
"created by manual inspection and abetted by a variety of automated methods, [SCOP] aims to provide a detailed and comprehensive description of the structural and evolutionary relationships between all proteins whose structure is known"
SCOP: Structural Classification of Proteins
Alexei Murzin et al. Laboratory of Molecular Biology, Cambridge
mainly manual
structural, evolutionary, functional
CATH: Class, Architecture, Topology and Homologous superfamily
Janet Thornton, European Bioinformatics Institute, Cambridge
partly automated hierarchy of organization
Class
secondary structure content
all alpha
alpha and beta
all beta
Architecture
secondary structure similarly arranged, e.g.
TIM barrel
alpha-beta sandwich
jelly roll
Homologous family
clear evolutionary relationship
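CATH identifies each level of its hierarchy by a number, so a domain's classification can be written as a dotted code with one number per level. The sketch below splits such a code into the four named levels and reports the deepest level that two domains share. The function names are hypothetical and the example codes are used purely for illustration; no claim is made here about which proteins they describe.

```python
LEVELS = ("Class", "Architecture", "Topology", "Homologous superfamily")


def parse_code(code):
    """Split a dotted four-number code into named hierarchy levels."""
    parts = code.split(".")
    if len(parts) != len(LEVELS):
        raise ValueError(f"expected {len(LEVELS)} dot-separated numbers: {code!r}")
    return dict(zip(LEVELS, (int(p) for p in parts)))


def deepest_shared_level(code_a, code_b):
    """Name of the deepest level at which two codes still agree, or None."""
    shared = None
    for level, x, y in zip(LEVELS, code_a.split("."), code_b.split(".")):
        if x != y:
            break
        shared = level
    return shared


# Illustrative codes only: these two agree down to the Topology level.
level = deepest_shared_level("3.20.20.70", "3.20.20.140")
```

Because the levels are nested, agreement at a deeper level always implies agreement at every level above it, which is why the comparison can stop at the first mismatch.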
FSSP: Fold classification based on Structure-Structure alignment of Proteins
derived with computer program: Dali
all of PDB
representative set
sequence homologues (greater than 25% identity)
No homologues in representative set
continuously and automatically updated
smart structural supervision algorithm
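The idea of a representative set containing no sequence homologues can be sketched as a greedy filter: walk through the entries and keep one only if its identity to every entry already kept is at or below the 25% threshold. This is a simplified illustration, not the procedure Dali uses. The function names, the entry labels and the identity values in the lookup table are all hypothetical, and the result depends on the order in which entries are visited.

```python
def representative_set(entries, identity, threshold=25.0):
    """Greedily keep entries that are not homologues of any kept entry.

    `identity(a, b)` must return percent sequence identity between a and b.
    """
    representatives = []
    for entry in entries:
        if all(identity(entry, kept) <= threshold for kept in representatives):
            representatives.append(entry)
    return representatives


# Hypothetical pairwise identities (percent) between three made-up entries.
IDENTITY = {
    frozenset(("p1", "p2")): 90.0,
    frozenset(("p1", "p3")): 12.0,
    frozenset(("p2", "p3")): 15.0,
}


def lookup_identity(a, b):
    return IDENTITY.get(frozenset((a, b)), 0.0)


# p2 is dropped as a homologue of p1; p3 is unrelated to p1 and is kept.
reps = representative_set(["p1", "p2", "p3"], lookup_identity)
```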
introduction
protein structure revisited
terminology of protein structure classification
some classification systems
SCOP
CATH
FSSP