This is the fourth of seven sets of notes (numbered 0 to 6) designed to summarize and supplement the content of a series of lectures given by Damian Counsell at Imperial College London as part of the MRes course in Biomolecular Sciences.
These notes and associated materials will be made available under a modified form of the Open Content Licence.
Structure determination is difficult, sequencing is not. When we wish to investigate a gene product, knowing its structure can give us clues about its function, and more importantly help us to make an intelligent choice of the next experiment to perform in order to obtain the most information about the molecule. Site-directed mutagenesis, for example, can be highly informative, but it saves a great deal of time and effort if we can make a rational choice of which residues to mutate. Computational protein structure prediction can help us to do this.
Protein structure prediction is a relatively young area of bioinformatics research, but a fast growing one. Despite this, modelling and prediction tools have already been built specifically for naïve users to produce approximate models of the three-dimensional structures of protein gene products, solely from their newly acquired nucleotide sequence. Generating such models by bioinformatic means is, of course, a great deal easier and less expensive than cloning, expressing and crystallizing proteins to determine their structures.
Such tools should not be trusted. When you do any kind of structure prediction for yourself your results will not just be wrong; if you use a so-called knowledge-based method, the data on which the predictions are based will themselves be wrong already. The most successful protein structure prediction technique, comparative (or homology) modelling, is, at its best, only as good as the structure used as a template to build the model. It is important to remember that experimentally determined biomolecular structures are themselves models, albeit ones constrained by large amounts of real-world data.
Many of the best knowledge-based bioinformatic methods depend not only on whether or not your uncharacterized query protein has an easily identified homologue, but also on the number of members of its family of homologues already available. Unless a very poor method is chosen, the quantity and quality of the existing template data are often far more important than the modelling program used.
Here is a possible sensible programme of steps to use in the process of characterizing a newly sequenced gene product:
Literature search
Sequence search
Motif search
Sequence alignment
Secondary structure prediction
Tertiary structure prediction
Refinement
Checking
It might seem unusual to make the first stage of a bioinformatics investigation a trip to the library, but this part of the journey often turns out to be more important than the others. Researchers in biological fields can avoid many serious errors by simply getting to know the area they are investigating first. Apart from anything else, someone may already have built a model of your protein(s) of interest.
These days most biomedical literature databases are cross-linked with bioinformatics resources; for example, a paper in which several slightly varying isoforms of a given enzyme are compared biochemically might link to the corresponding entries for their actual protein sequences in a genomic database, or the co-ordinates of their constituent atoms in a structure database.
Avoid stupid errors
work might not be original
Cheap
Only need an Internet connection
Abstracts free
Journals are human-curated
real biological knowledge
Literature now linked extensively with other databases
Structure databases
Sequence databases
Expensive
obtaining reprints can be costly
Error prone
Humans blindly copy others' mistakes
Exceptions
Your sequence might be an outlier in a family
Your sequence might be from a related, but distinct family
If you intend to build a comparative model it is essential to have a known template protein structure against which to model your own unknown one. A good tool for finding a good structure match, in the absence of structure information about your own, unknown (query) sequence, is (PSI)-BLAST. PSI-BLAST is a clever derivative of BLAST, the most commonly used bioinformatics program.
The first section of this page at the NCBI, where BLAST was developed, gives an outline of the PSI-BLAST approach. As you might imagine, this method is at its best when a good match for your sequence has already been sequenced, and better still when that match is part of a family of similar proteins. Of course, good sequence matches are not a great deal of use for comparative models if there is no structural information on the sequences identified.
If you learn about no other algorithms in bioinformatics, learn about these.
BLAST
most commonly used bioinformatics program
fast, local, originally ungapped
many refinements
FASTA
Lipman and Pearson
Sensitive
Word-based first pass
Several nearby hits required
Can miss multiple module matches
Smith-Waterman
Rigorous
most sensitive
dedicated hardware can be used for more speed
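To make the "rigorous" end of the list above concrete, here is a minimal sketch of Smith-Waterman local alignment scoring. The scoring values (match 2, mismatch -1, linear gap -2) are assumptions chosen for illustration; real searches use substitution matrices such as BLOSUM or PAM and affine gap penalties. The function returns only the best local score, not the alignment itself.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Illustrative scoring scheme only (an assumption, not a standard matrix).
    """
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                               # start a fresh local alignment
                previous[j - 1] + substitution,  # align a[i-1] with b[j-1]
                previous[j] + gap,               # gap in b
                current[j - 1] + gap,            # gap in a
            )
            best = max(best, current[j])
        previous = current
    return best


# The shared segment "CDE" scores 3 matches x 2 = 6
print(smith_waterman_score("ACDEFG", "XXCDEXX"))
```

Every cell of the table is filled in, which is why the method is guaranteed to find the optimum and also why it is slow compared with the word-based heuristics of BLAST and FASTA.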
Perhaps the quickest and "cheapest" discriminator of protein function is motif search. Motifs, short loosely-defined patterns of polypeptide sequence, are both sensitive (few false negatives) and specific (few false positives) identifiers of function.
There are several very useful motif databases, with search facilities. One of the best known is Prosite.
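As a rough illustration of what a motif search does, the sketch below converts a Prosite-style pattern into a regular expression and scans a sequence with it. The helper names (prosite_to_regex, find_motifs) and the test sequence are hypothetical, and only the common parts of the pattern syntax are handled (x, [..], {..}, repeat counts and the < and > anchors); anchors written inside square brackets are not supported. The example pattern, N-{P}-[ST]-{P}, is the familiar N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple Prosite-style pattern into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {P} -> [^P]
        element = element.replace("(", "{").replace(")", "}")   # x(2,4) -> x{2,4}
        element = element.replace("x", ".")                      # x = any residue
        element = element.replace("<", "^").replace(">", "$")    # terminal anchors
        parts.append(element)
    return "".join(parts)


def find_motifs(sequence, pattern):
    """Return (start, matched_text) for every match, overlapping ones included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence)]


# Hypothetical test sequence; the second Asn is followed by Pro, so it is rejected
print(find_motifs("MKNGSAANPTQ", "N-{P}-[ST]-{P}."))  # [(2, 'NGSA')]
```

Short patterns like this one match by chance quite often, so hits should be treated as hints to be checked against the literature rather than as conclusions.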
Alignment is the most important stage in any comparative modelling process. Not only are alignment algorithms used to find the sequence of a suitable structure template, but, once a good match has been found, your query sequence must be lined up with that of the identified template. If the wrong parts of the query protein are aligned with the wrong parts of the target (template) protein then your model-building will be seriously awry.
Heuristic ("rule-of-thumb") methods such as BLAST tend to be used for finding matches, and complete dynamic programming methods, such as the Needleman and Wunsch or Smith-Waterman methods, are used for alignment of matched sequences. This short summary of a presentation by Christopher Dwan of The Center for Computational Genetics and Bioinformatics at the University of Minnesota explains clearly and concisely the difference between using the BLAST and Smith-Waterman algorithms for sequence search. Read it.
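For readers who have not met dynamic programming before, the following is a minimal sketch of Needleman and Wunsch style global alignment with a traceback. The scores (match 1, mismatch -1, linear gap -1) are assumptions for illustration only; production tools use substitution matrices and affine gap penalties, and where several alignments tie this sketch simply reports one of them.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # leading gaps in b
    for j in range(1, m + 1):
        score[0][j] = j * gap          # leading gaps in a
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


print(needleman_wunsch("ACGT", "AGT"))  # (2, 'ACGT', 'A-GT')
```

The alignment returned is optimal only with respect to the scoring scheme supplied, which is a property of the arithmetic rather than of the proteins.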
Automated alignment methods, even (especially) those guaranteed to find mathematically optimum alignments of sequences, do not necessarily make the best structural alignments. This is one of the main reasons why all the best protein predictions involve human intervention. Before you make any attempt to build a comparative (homology) model you should perform a manual alignment.
BLAST outputs local alignments
Needleman and Wunsch
global optimum guaranteed
Can miss multiple module matches
Smith-Waterman
optimum alignments guaranteed
computationally expensive
too much for most searches
Multiple alignment-based techniques are the most accurate, but
always do a manual alignment if you can
corrects errors in pairwise alignments by averaging
clarifies internal structure of protein families
highlights conserved residues
which in turn highlight important structural elements
can be used for more sensitive further analysis
profiles
PSI-BLAST
original manual multiple alignments foundation of bioinformatics
generation of scoring matrices
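The point above about multiple alignments highlighting conserved residues can be sketched very simply. The function below scores each column of an alignment by the fraction of sequences sharing its most common residue. The four-sequence alignment is invented for illustration, and this crude measure ignores residue similarity and sequence weighting, both of which real profile methods take into account.

```python
from collections import Counter


def column_conservation(alignment, gap="-"):
    """Return, per column, the fraction of rows holding the commonest residue."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*alignment):
        residues = [c for c in column if c != gap]
        if not residues:
            scores.append(0.0)          # column consisting only of gaps
            continue
        commonest_count = Counter(residues).most_common(1)[0][1]
        scores.append(commonest_count / len(column))
    return scores


# Hypothetical toy alignment
toy_alignment = ["MKVLA", "MRVLG", "MKV-A", "MKILA"]
print(column_conservation(toy_alignment))  # [1.0, 0.75, 0.75, 0.75, 0.75]
```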
The question you should ask when doing any kind of bioinformatic analysis (or, indeed, any kind of science) is not "Is my answer wrong?", but "How wrong is my answer?". For some purposes, rough estimates are perfectly useful. If you have used other methods shrewdly, protein structure prediction can offer insights unavailable by any other technique.
One of the few things that scientists in the field of protein folding seem to agree on is that secondary structure forms early on in this process.
Identification of secondary structural elements makes the topology of your structure more obvious, so that similar ones can be identified in a topology database such as TOPS. Prediction of the positions and lengths of secondary structure elements can be used as a prelude to "docking" these secondary structural elements against each other.
Information about the division of primary structure into secondary structure "chunks" is valuable in the construction or refinement of primary structure alignments. When homologues are aligned and compared, primary structure differences are far more common than secondary structure differences. Secondary structure can be a better guide to the correct correspondence between parts of two proteins' respective tertiary structures.
Lastly, in the absence of a useful known structural match for your query sequence, a secondary structure prediction can be used to make some kind of intelligent guess about the higher order structure of your protein.
Here is a brief and slightly dated review of secondary structure prediction (as it stood in the mid-90s).
context dependent
early stage in folding, but…
…does not feed reliably through to tertiary structure
In some kinds of modelling secondary structural elements may be "docked" together
elements may be predicted to identify the topology of a structure
TOPS server
generally, only 50% of a structure is alpha-helix or beta-sheet
beta-strands have necessarily longer-range associations
Read the following sections about sequence signals for secondary structure motifs in conjunction with slides 19–22.
amphipathic helices interact with both protein core and solvent
characteristic hydrophobicity profiles
prolines disrupt the middles of helices
manual prediction of semi-exposed alpha-helix
period of 3.6
conserved hydrophobics at
i
i plus 3
i plus 4
i plus 7
edge strands alternate hydrophobic/hydrophilic
centre strands all hydrophobic
strands are extended so there are few residues per core span
manual prediction of half-buried beta strand
period of 2
conserved hydrophobics at
i
i plus 2
i plus 4
i plus 6
i plus 8
manual prediction of completely buried beta strand
found in alpha-beta proteins
conserved hydrophobics run continuously
gapped in multiple alignments
small polar residues
Ala
Gly (v. small so flexible)
Ser
Thr
Prolines rarer in other kinds of secondary structure
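The periodicities listed above (about 3.6 residues for a semi-exposed helix, 2 for a half-buried strand) can be tested numerically with a hydrophobic moment calculation. The sketch below is a simplification resting on stated assumptions: residues are scored +1 or -1 using an arbitrary hydrophobic set rather than a published hydropathy scale, and the two peptides are invented. A rotation of 100 degrees per residue corresponds to a period of 3.6, and 180 degrees to a period of 2.

```python
import math

# Assumption: a crude binary classification, not a published scale
HYDROPHOBIC = set("AVLIMFWC")


def hydrophobic_moment(sequence, degrees_per_residue):
    """Length-normalised hydrophobic moment at the given angular period."""
    angle = math.radians(degrees_per_residue)
    x = y = 0.0
    for n, residue in enumerate(sequence):
        h = 1.0 if residue in HYDROPHOBIC else -1.0
        x += h * math.cos(n * angle)
        y += h * math.sin(n * angle)
    return math.hypot(x, y) / len(sequence)


helix_like = "LKKLLKKL"   # hydrophobics at i, i+3, i+4, i+7
strand_like = "VKVKVKVK"  # hydrophobics at i, i+2, i+4, i+6

for name, peptide in [("helix-like", helix_like), ("strand-like", strand_like)]:
    print(name,
          round(hydrophobic_moment(peptide, 100), 2),
          round(hydrophobic_moment(peptide, 180), 2))
```

A segment whose moment is clearly larger at 100 degrees than at 180 degrees is behaving like an amphipathic helix; the reverse suggests an alternating edge strand. A completely buried strand, being hydrophobic throughout, gives a strong signal at neither.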
Secondary structure prediction is a classic problem in bioinformatics and many different approaches have been devised. They can be broken down into various categories, though secondary structure prediction methods are often hybrids of multiple approaches. This is typical of bioinformatics, where it is often necessary to make pragmatic choices between multiple and varied approaches to formally insoluble problems. (By "formally insoluble" I mean problems that have been shown mathematically to be intractable—they can't be solved correctly by any computer in a meaningful length of time.)
Chou and Fasman [ChouAndFasman78] and Garnier et al. [GarnierOsguthorpeRobson78] devised long-standing secondary structure prediction methods based on analyses of the then-small protein structure databases. They asked what the "rules" are for obtaining helices and what the rules are for obtaining sheets, and encoded them. Some knowledge-based methods are actually comparative methods and simply align the unknown sequence against a similar one for which there is already experimentally obtained secondary structure data. These can, of course, be very reliable.
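The statistical idea behind these early methods can be sketched as follows: count how often each residue type occurs in helices in a set of known structures, and divide by the overall fraction of residues that are helical. A value above 1 marks a helix former and a value below 1 a helix breaker. The two sequences and their "H"/"C" state strings below are invented toy data, not real structures, and the function is an illustration of the principle rather than the published procedure.

```python
from collections import Counter


def helix_propensities(sequences, states, helix_symbol="H"):
    """Return {residue: propensity} from sequences and matching state strings."""
    total = Counter()
    in_helix = Counter()
    for sequence, state in zip(sequences, states):
        if len(sequence) != len(state):
            raise ValueError("sequence and state string lengths differ")
        for residue, s in zip(sequence, state):
            total[residue] += 1
            if s == helix_symbol:
                in_helix[residue] += 1
    overall_fraction = sum(in_helix.values()) / sum(total.values())
    return {residue: (in_helix[residue] / total[residue]) / overall_fraction
            for residue in total}


# Hypothetical toy "database": H = helix, C = anything else
toy_sequences = ["AEAKG", "GPAEL"]
toy_states = ["HHHHC", "CCHHH"]
propensities = helix_propensities(toy_sequences, toy_states)
print(round(propensities["A"], 2), propensities["G"])
```

With so few observations the numbers mean nothing, which is exactly the difficulty these methods faced when the structure databases were small.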
In a further class of methods, knowledge obtained from databases is encoded in patterns of probabilities stored in a non-linear computer model. Most of the more reliable modern methods are based on such approaches. An example of such a model is a neural net. Neural nets handle signals in streams of input data in ways which are crudely similar to the responses of networks of excitable cells in chordates, for example the neurons in our central nervous system.
Neural networks are popular in secondary structure prediction. The so-called "input layer" of a neural net is provided with a signal consisting of the one-dimensional peptide sequences of all known structures, and the behaviour of the intervening network elements is adjusted so that the "output layer" of the same net produces the desired output signal: a corresponding sequence of secondary structure assignments. This adjustment process is referred to as learning. When it is complete and the input layer is presented with a previously "unseen" peptide sequence it should output a sequence of secondary structure assignments: "residues 1 to 10 are helix, residues 11 to 14 are loop…" and so on.
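How a peptide sequence becomes a numerical "signal" for an input layer is worth seeing once. The sketch below assumes one common arrangement, which the description above does not specify: each residue is presented together with a few neighbours on either side (a sliding window), and each window position is encoded as 20 zeros with a single 1 marking the amino acid present. Only the encoding is shown; the network and its training are left out, and the function name and window size are illustrative choices.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def encode_windows(sequence, half_width=2):
    """One-hot encode a window centred on each residue of the sequence.

    Window positions that fall beyond either end of the sequence, and any
    non-standard residue letters, are left as all zeros.
    """
    vectors = []
    for centre in range(len(sequence)):
        vector = []
        for offset in range(-half_width, half_width + 1):
            position = centre + offset
            one_hot = [0] * len(AMINO_ACIDS)
            if 0 <= position < len(sequence):
                index = AMINO_ACIDS.find(sequence[position])
                if index >= 0:
                    one_hot[index] = 1
            vector.extend(one_hot)
        vectors.append(vector)
    return vectors


inputs = encode_windows("ACD", half_width=1)
print(len(inputs), len(inputs[0]))  # 3 windows, each 3 x 20 = 60 values
```

Each vector would be paired with the known secondary structure state of its central residue during learning, and presented without that label when predicting.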
Some methods use more than one computational approach and combine them in some way internally to produce their output. Others use multiple methods externally and make some kind of summary of the output: a consensus (see Consensus Methods below).
Perhaps the highest scoring methods of secondary structure prediction are those which "poll" a number of different sources for a prediction and display a filtered combination of the results obtained.
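A consensus of this kind can be as simple as a majority vote at each residue position. The sketch below assumes that every method reports one state letter per residue (H, E or C here) and marks a position with "?" when no state wins more than half the votes. The three prediction strings are invented, and real consensus servers weight and filter their sources in more elaborate ways.

```python
from collections import Counter


def consensus(predictions, min_agreement=0.5, unknown="?"):
    """Combine equal-length prediction strings by per-residue majority vote."""
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must cover the same residues")
    result = []
    for column in zip(*predictions):
        state, votes = Counter(column).most_common(1)[0]
        # Keep the winning state only if it has a strict majority
        result.append(state if votes / len(column) > min_agreement else unknown)
    return "".join(result)


# Hypothetical output from three different prediction methods
print(consensus(["HHHCCEEE", "HHCCCEEC", "HHHHCEEE"]))  # HHHCCEEE
```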
There are many techniques which have been developed to tackle the actual business of modelling a protein fold. Many of these require specialized knowledge or are highly variable in their performance. WHATIF is a suite of bioinformatics programs which has at its core a homology modelling module. This module is both reasonably accurate and just friendly enough for a non-structural biologist to apply. Most importantly it gives very good feedback to the user on the quality of the models it creates; it answers the question: "How wrong is my answer?".
WHATIF's comparative modelling module is conservative in more than one sense of the word. Given an alignment between the query sequence and a suitable template, WHATIF substitutes query residues into the appropriate parts of the template sequence. It is up to the user to produce a good alignment. As it edits these in and determines their most likely arrangement within the "new" template framework, the module leaves the template's conserved residues as unchanged as possible and substitutes changed query residues according to a database of known fragments from a carefully chosen subset of existing known structures. This subset of known structures is, of course, chosen for its quality.
In fact, the methods used to check for the quality of these template structures when they are originally submitted by experimentalists are the same as the quality checking routine which WHATIF applies to the models it generates. These techniques are also applied to experimentally obtained structures before their submission to the PDB and are built into WHATCHECK, the structure validation component of WHATIF.
Once a comparative model has been built by analogy with existing structures, using both structural components of the template itself and of other known proteins, physico-chemical modelling methods can be used to correct mistakes in this model. Obvious errors such as putting two atoms in the same space and impossible bond angles can be dealt with first, followed by tackling more subtle problems such as improbably high energy structures. Using techniques such as energy minimization for "improving" the structures of models is dangerous. Frequently it fixes local problems at the expense of creating local errors. Often it just makes a model worse.
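The most obvious kind of error mentioned above, two atoms occupying the same space, is also the easiest to detect. The sketch below flags any pair of atoms closer together than a cut-off distance. The cut-off of 2.0 Å, the atom labels and the coordinates are all assumptions for illustration; a real check would exclude covalently bonded neighbours and use cut-offs that depend on the atom types involved.

```python
import math
from itertools import combinations


def find_clashes(atoms, min_distance=2.0):
    """Return (name1, name2, distance) for atom pairs closer than min_distance.

    atoms is a list of (name, x, y, z) tuples with coordinates in angstroms.
    Assumption: the list contains only pairs that are not bonded to each other.
    """
    clashes = []
    for (name1, *xyz1), (name2, *xyz2) in combinations(atoms, 2):
        distance = math.dist(xyz1, xyz2)
        if distance < min_distance:
            clashes.append((name1, name2, distance))
    return clashes


# Hypothetical coordinates
toy_atoms = [("A", 0.0, 0.0, 0.0), ("B", 0.5, 0.0, 0.0), ("C", 5.0, 0.0, 0.0)]
print(find_clashes(toy_atoms))  # [('A', 'B', 0.5)]
```

Detecting a clash is a check, not a refinement: the function reports the problem and leaves the decision about how to fix it to the modeller.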
Here is an excellent review of the process of homology modelling by Gert Vriend.
Sequence search may fail to identify a match against which to model your protein sequence. If your sequence actually codes for a novel fold this is good, because any pure comparative modelling effort would be doomed. If there is a protein of similar tertiary structure (fold) to your query protein, but dissimilar primary structure (sequence), it might be identified by a fold recognition program. The mother of all fold-recognition programs is THREADER.
Unfortunately fold recognition is a branch of bioinformatics characterized more by promise and innovation than actual success. The best that can be said is that, with test sequences, THREADER, for example, is very good at ranking known structural homologues near the top of its ratings (in the absence of a strong sequence match).
A very large proportion of all the structures in a cell are transmembrane proteins. The peculiar properties of proteins which live in a lipid environment make them especially difficult to investigate by X-ray crystallography or NMR spectroscopy. There are therefore very few transmembrane protein structures available. This is doubly unfortunate since the urgent need to model such functionally critical structures is frustrated by the shortage of good structures to model uncrystallized transmembrane sequences against.
Protein stability in a lipid membrane requires
hydrophobic sidechains
regular secondary structures
Two architectures which meet these requirements are
bundles of hydrophobic alpha-helices
beta barrels
These templates can therefore be used to account for a large number of transmembrane protein structures. Slide 33 shows two examples of the latter.
Since these structures are all found in membranes, the well-known (and relatively large) molecular components can be used to constrain models. We also have available to us a great deal of sequence data and a variety of other experimental techniques, such as electron microscopy (e.m.), that can be used to determine certain important pieces of structural data about these species.
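The requirement for hydrophobic sidechains in membrane-spanning helices suggests a simple sequence-based screen: average a hydropathy value over a sliding window and look for stretches that stay high. The sketch below uses the Kyte and Doolittle hydropathy scale with a 19-residue window and a threshold of 1.6, values commonly quoted for this purpose but treated here as assumptions to be tuned. The test sequence is invented, and the approach is not expected to find beta barrels, whose strands alternate rather than being uniformly hydrophobic.

```python
# Kyte and Doolittle hydropathy values
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=19):
    """Mean hydropathy of every full window; unknown residues count as 0."""
    return [
        sum(KYTE_DOOLITTLE.get(r, 0.0) for r in sequence[i:i + window]) / window
        for i in range(len(sequence) - window + 1)
    ]


def candidate_tm_windows(sequence, window=19, threshold=1.6):
    """Start positions of windows whose mean hydropathy reaches the threshold."""
    profile = hydropathy_profile(sequence, window)
    return [i for i, value in enumerate(profile) if value >= threshold]


# Hypothetical sequence: a 19-residue hydrophobic stretch between charged tails
toy = "D" * 10 + "L" * 19 + "K" * 10
print(candidate_tm_windows(toy))
```

Runs of consecutive window starts, like the one reported here, point to a single candidate membrane-spanning segment rather than to several.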
The alternative to threading, when there really is no useful homologue detectable or available, is to build a model of your protein from first principles. Obtaining the three-dimensional structure of a protein solely by applying physico-chemical laws to the atoms of its residues (ab initio protein structure prediction) is a problem of provably immense difficulty.
There is a biennial competition between protein modelling scientists called CASP (Critical Assessment of Techniques for Protein Structure Prediction). Its aim is to compare the effectiveness of different bioinformatics groups' approaches to protein structure prediction. Modelling groups are given sequences of crystallized, but as-yet-to-be-determined, protein structures by experimentalists. The modellers then have to build models of these sequences' tertiary structures "blind", that is, in the absence of experimental data.
When, eventually, the "blind" models are compared to the "true" structures, comparative/homology modelling is so consistently superior in its results that ab initio modellers compete in a separate part of the competition (as do those devising methods for fold identification).
During the lecture I presented an example of Rosetta ab initio modelling which has had some limited success with particular folding problems (slides 47–50).
Until recently there have been two main criteria for experimentalists' choice of target:
the biological importance of a molecule and
ease with which it could be crystallized and its structure determined
There are efforts now to provide a more comprehensive knowledge of the sum total of all structures and/or the realm of available protein architectures—"foldspace".
Crystallizing proteins is labour intensive and something of a "black art". Higher throughput and more systematic approaches are being devised to increase the output of structures in total. If these are combined with comparative modelling then we can add a new criterion when choosing structures for crystallization: how informative a structure is. If we believe a gene codes for a novel fold and/or has multiple homologues of unknown structure, we might place a higher priority on determining its structure—even though it might not code for a product of great biological significance. In this way we could obtain a greater coverage of fold templates from which we could model a much larger number of useful structures.
Literature and sequence search often tell us more, and more reliably, about the product of a gene we have sequenced than models do.
The ab initio protein folding problem is insoluble.
All models should be checked as much as possible and refined as little as possible.
Structural genomics may render modelling (semi-)redundant.
[ChouAndFasman78] Adv. Enzymolog. Relat. Areas Mol. Biol., "Prediction of the secondary structure of proteins from their amino acid sequence", P. Y. Chou, G. D. Fasman, 1978, 47, 45-147.
[GarnierGibratRobson96] Methods Enzymol., "GOR method for predicting secondary structure from amino acid sequence", J. Garnier, J.-F. Gibrat, B. Robson, 1996, 266, 540-553.