
MRes Biomolecular Sciences Lecture Notes: 4. Protein Structure Prediction


Table of Contents
Structure prediction works—and doesn't work
The Process of structure prediction
Secondary structure prediction
Tertiary structure prediction
Structural genomics
Summary
Bibliography

This is the fourth of seven sets of notes (numbered 0 to 6) designed to summarize and supplement the content of a series of lectures given by Damian Counsell at Imperial College London as part of the MRes course in Biomolecular Sciences.

These notes and associated materials will be made available under a modified form of the Open Content Licence.

Damian Counsell is at the Medical Research Council's Human Genome Mapping Project Resource Centre, Cambridge, UK.


Structure prediction works—and doesn't work

Structure determination is difficult, sequencing is not. When we wish to investigate a gene product, knowing its structure can give us clues about its function, and more importantly help us to make an intelligent choice of the next experiment to perform in order to obtain the most information about the molecule. Site-directed mutagenesis, for example, can be highly informative, but it saves a great deal of time and effort if we can make a rational choice of which residues to mutate. Computational protein structure prediction can help us to do this.

Protein structure prediction is a relatively young area of bioinformatics research, but a fast-growing one. Despite this, modelling and prediction tools have already been built specifically for naïve users to produce approximate models of the three-dimensional structures of protein gene products, solely from their newly acquired nucleotide sequence. Generating such models by bioinformatic means is, of course, a great deal easier and less expensive than cloning, expressing and crystallizing proteins to determine their structures.

Such tools should not be trusted. When you do any kind of structure prediction for yourself, your results will not just be wrong; if you use a so-called knowledge-based method, the data on which the predictions are based will itself be wrong already. The most successful protein structure prediction technique, comparative (or homology) modelling, is, at its best, only as good as the quality of the structure used as a template to build the model. It is important to remember that experimentally determined biomolecular structures are themselves models, albeit ones constrained by vast quantities of real-world data.

Many of the best knowledge-based bioinformatic methods depend not only on whether or not your uncharacterized query protein has an easily identified homologue, but also on the number of members of its family of homologues already available. Unless a very poor method is chosen, the quantity and quality of the existing template data are often far more important than the modelling program used.


The Process of structure prediction

Here is a possible sensible programme of steps to use in the process of characterizing a newly sequenced gene product:


Stages of bioinformatic structure prediction

  • Literature search

  • Sequence search

  • Motif search

  • Sequence alignment

  • Secondary structure prediction

  • Tertiary structure prediction

  • Refinement

  • Checking




Advantages of literature searches

It might seem unusual to make the first stage of a bioinformatics investigation a trip to the library, but this part of the journey often turns out to be more important than the others. Researchers in biological fields can avoid many serious errors by simply getting to know the area they are investigating first. Apart from anything else, someone may already have built a model of your protein(s) of interest.

These days most biomedical literature databases are cross-linked with bioinformatics resources; for example, a paper in which several slightly varying isoforms of a given enzyme are compared biochemically might link to the corresponding entries for their actual protein sequences in a genomic database, or the co-ordinates of their constituent atoms in a structure database.

  • Avoid stupid errors

    • work might not be original

  • Cheap

    • Only need an Internet connection

    • Abstracts free

  • Journals are human-curated

    • real biological knowledge

  • Literature now linked extensively with other databases

    • Structure databases

    • Sequence databases




Disadvantages of literature searches

  • Expensive

    • obtaining reprints can be costly

  • Error prone

    • Humans blindly copy others' mistakes

  • Exceptions

    • Your sequence might be an outlier in a family

    • Your sequence might be from a related, but distinct family




Sequence search

If you intend to build a comparative model it is essential to have a known template protein structure against which to model your own unknown one. A good tool for finding a good structure match, in the absence of structure information about your own, unknown (query) sequence, is (PSI)-BLAST. PSI-BLAST is a clever derivative of BLAST, the most commonly used bioinformatics program.

The first section of this page at the NCBI, where BLAST was developed, gives an outline of the PSI-BLAST approach. As you might imagine, this method is at its best when a good match for your sequence has already been sequenced, and better still when that match is part of a family of similar proteins. Of course, good sequence matches are not a great deal of use for comparative models if there is no structural information on the sequences identified.

If you learn about no other algorithms in bioinformatics, learn about these.

  • BLAST

    • most commonly used bioinformatics program

    • fast, local, originally ungapped

    • many refinements

  • FASTA

    • Lipman and Pearson

    • Sensitive

    • Word-based first pass

      • Several nearby hits required

    • Can miss multiple module matches

  • Smith-Waterman

    • Rigorous

    • most sensitive

      • dedicated hardware can be used for more speed
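
The "word-based first pass" listed under FASTA above can be illustrated with a minimal Python sketch. It is not FASTA itself: the function names, the word length k=2 and the min_hits threshold are assumptions chosen for illustration. The idea shown is only that short exact word matches are counted per diagonal (query position minus subject position), and that a diagonal is reported only when several hits fall on it.

```python
from collections import defaultdict


def word_hits_by_diagonal(query, subject, k=2):
    """Count exact k-letter word matches on each diagonal (i - j)."""
    # Index every k-letter word of the subject by its start position.
    positions = defaultdict(list)
    for j in range(len(subject) - k + 1):
        positions[subject[j:j + k]].append(j)

    # Look up each query word and record which diagonal the hit lies on.
    diagonals = defaultdict(int)
    for i in range(len(query) - k + 1):
        for j in positions.get(query[i:i + k], []):
            diagonals[i - j] += 1
    return dict(diagonals)


def best_diagonal(query, subject, k=2, min_hits=2):
    """Return (diagonal, hits) for the richest diagonal, or None if too sparse."""
    hits = word_hits_by_diagonal(query, subject, k)
    if not hits:
        return None
    diagonal, count = max(hits.items(), key=lambda item: item[1])
    return (diagonal, count) if count >= min_hits else None


if __name__ == "__main__":
    # The query occurs in the subject shifted by two residues.
    print(best_diagonal("ACDEFGHIK", "WWACDEFGHIK"))
```

A real search program would then rescore and extend the best diagonals with a substitution matrix; this sketch stops at the counting step.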




Motif search

Perhaps the quickest and "cheapest" discriminator of protein function is motif search. Motifs, short loosely-defined patterns of polypeptide sequence, are both sensitive (few false negatives) and specific (few false positives) identifiers of function.

There are several very useful motif databases, with search facilities. One of the best known is Prosite.
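
As a rough illustration of what a motif search does, the following Python sketch converts a pattern written in Prosite-style syntax into a regular expression and scans a sequence with it. The function names and the example sequence are invented for illustration, and only the commonest parts of the pattern syntax are handled (x, [ABC], {ABC}, repeat counts and the < and > anchors), so treat it as a sketch rather than a replacement for the Prosite search tools.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple Prosite-style pattern into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = repeat = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        if element.endswith(")"):        # repeat count such as x(2) or x(2,4)
            element, _, count = element[:-1].partition("(")
            repeat = "{" + count + "}"
        if element == "x":               # any residue
            core = "."
        elif element.startswith("{"):    # any residue except those listed
            core = "[^" + element[1:-1] + "]"
        else:                            # a single residue or an [ABC] set
            core = element
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_motif(sequence, pattern):
    """Return the start positions (0-based) of all matches, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [match.start() for match in regex.finditer(sequence)]


if __name__ == "__main__":
    # N-glycosylation site pattern; the sequence is made up.
    print(prosite_to_regex("N-{P}-[ST]-{P}."))
    print(find_motif("MKNGSAANPTQ", "N-{P}-[ST]-{P}."))
```

Short patterns like this one match by chance very often, which is one reason motif hits should be treated as hints rather than conclusions.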


Sequence alignment

Alignment is the most important stage in any comparative modelling process. Not only are alignment algorithms used to find the sequence of a suitable structure template, but, once a good match has been found, your query sequence must be lined up with that of the identified template. If the wrong parts of the query protein are aligned with the wrong parts of the target (template) protein then your model-building will be seriously awry.

Heuristic ("rule-of-thumb") methods such as BLAST tend to be used for finding matches, and complete, dynamic programming methods such as the Needleman and Wunsch or Smith-Waterman methods for aligning matched sequences. This short summary of a presentation by Christopher Dwan of The Center for Computational Genetics and Bioinformatics at the University of Minnesota explains clearly and concisely the difference between using the BLAST and Smith-Waterman algorithms for sequence search. Read it.

Automated alignment methods, even (especially) those guaranteed to find mathematically optimum alignments of sequences, do not necessarily make the best structural alignments. This is one of the main reasons why all the best protein predictions involve human intervention. Before you make any attempt to build a comparative (homology) model you should perform a manual alignment.

  • BLAST outputs local alignments

  • Needleman and Wunsch

    • global optimum guaranteed

    • Can miss multiple module matches

  • Smith-Waterman

    • optimum alignments guaranteed

    • computationally expensive

      • too much for most searches

  • Multiple alignment-based techniques are the most accurate, but

  • always do a manual alignment if you can
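
The difference between the global (Needleman and Wunsch) and local (Smith-Waterman) dynamic programming methods listed above can be seen in a few lines of Python. This is a minimal sketch that returns only the optimum score, not the alignment itself; the simple match/mismatch/gap scores are assumptions made for illustration, where real programs use substitution matrices and more elaborate gap penalties.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Optimum alignment score by dynamic programming.

    local=False: global alignment (Needleman and Wunsch style).
    local=True:  local alignment (Smith-Waterman style).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    if not local:
        # A global alignment must pay for gaps at the ends of the sequences.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap

    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(score[i - 1][j - 1] + pair,   # align two residues
                       score[i - 1][j] + gap,        # gap in b
                       score[i][j - 1] + gap)        # gap in a
            if local:
                # A local alignment may restart anywhere, so never go below 0.
                cell = max(cell, 0)
            score[i][j] = cell
            best = max(best, cell)

    # Global: score of aligning everything. Local: best-scoring region.
    return best if local else score[-1][-1]


if __name__ == "__main__":
    query, subject = "WWWACDEFWWW", "KKACDEFKK"
    print("global:", alignment_score(query, subject))
    print("local: ", alignment_score(query, subject, local=True))
```

In the example the two sequences share only a central module. The local score reflects that module alone, while the global score is dragged down by the unrelated flanking residues.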




Multiple alignment

  • corrects errors in pairwise alignments by averaging

  • clarifies internal structure of protein families

  • highlights conserved residues

    • which in turn highlight important structural elements

  • can be used for more sensitive further analysis

    • profiles

    • PSI-BLAST

  • original manual multiple alignments foundation of bioinformatics

    • generation of scoring matrices
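
To show how a multiple alignment highlights conserved residues, here is a small Python sketch that reports, for each column, the commonest residue and the fraction of sequences that carry it. The function name, the use of "-" for gaps and the toy alignment are all assumptions made for illustration; real conservation scores usually also take residue similarity into account.

```python
from collections import Counter


def column_conservation(alignment):
    """Return (commonest residue, fraction of rows carrying it) per column."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")

    result = []
    for column in zip(*alignment):
        residues = [residue for residue in column if residue != "-"]
        if not residues:
            result.append(("-", 0.0))    # a column consisting only of gaps
            continue
        residue, count = Counter(residues).most_common(1)[0]
        # Gaps count against conservation: divide by the number of rows.
        result.append((residue, count / len(column)))
    return result


if __name__ == "__main__":
    toy_alignment = ["ACDW", "ACEW", "A-DW"]   # made-up sequences
    for position, (residue, fraction) in enumerate(column_conservation(toy_alignment)):
        print(position, residue, round(fraction, 2))
```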




How much should you trust your answers?

The question you should ask when doing any kind of bioinformatic analysis (or, indeed, any kind of science) is not "Is my answer wrong?", but "How wrong is my answer?". For some purposes, rough estimates are perfectly useful. If you have used other methods shrewdly, protein structure prediction can offer insights unavailable by any other technique.


Secondary structure prediction

One of the few things that scientists in the field of protein folding seem to agree on is that secondary structure forms early on in this process.

Identification of secondary structural elements makes the topology of your structure more obvious, so that similar ones can be identified in a topology database such as TOPS. Prediction of the positions and lengths of secondary structure elements can be used as a prelude to "docking" these secondary structural elements against each other.

Information about the division of primary structure into secondary structure "chunks" is valuable in the construction or refinement of primary structure alignments. When homologues are aligned and compared, primary structure differences are far more common than secondary structure differences. Secondary structure can be a better guide to the correct correspondence between parts of two proteins' respective tertiary structures.

Lastly, in the absence of a useful known structural match for your query sequence, a secondary structure prediction can be used to make some kind of intelligent guess about the higher order structure of your protein.


Overview of secondary structure prediction

Here is a brief and slightly dated review of secondary structure prediction (as it stood in the mid-90s).

  • context dependent

  • early stage in folding, but…

  • …does not feed reliably through to tertiary structure

  • In some kinds of modelling secondary structural elements may be "docked" together

  • elements may be predicted to identify the topology of a structure

    • TOPS server

  • generally, only 50% of a structure is alpha-helix or beta-sheet

  • beta-strands have necessarily longer-range associations



Read the following sections about the sequence signals for secondary structure motifs in conjunction with slides 19–22.


Signals for alpha helices

  • amphipathic helices interact with both protein core and solvent

    • characteristic hydrophobicity profiles

  • prolines disrupt the middles of helices



  • manual prediction of semi-exposed alpha-helix

    • period of 3.6

    • conserved hydrophobics at

      • i

      • i plus 3

      • i plus 4

      • i plus 7
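
The i, i plus 3, i plus 4, i plus 7 rule follows from the helix period: at 3.6 residues per turn each residue is rotated 100 degrees from the last, so those positions all lie on roughly the same face. The Python sketch below shows the arithmetic and scans a sequence for the pattern. The choice of which residues count as hydrophobic is an assumption made for illustration, as are the function names and the example sequence.

```python
# Assumed hydrophobic set for illustration; other definitions are in use.
HYDROPHOBIC = set("AVLIMFWC")

HELIX_OFFSETS = (0, 3, 4, 7)
DEGREES_PER_RESIDUE = 100   # 360 degrees / 3.6 residues per turn


def angle_from_first(offset):
    """Angle around the helix axis relative to residue i, in (-180, 180]."""
    return (offset * DEGREES_PER_RESIDUE + 180) % 360 - 180


def helix_face_starts(sequence, offsets=HELIX_OFFSETS):
    """Positions i where every residue at i + offset is hydrophobic."""
    span = max(offsets)
    return [i for i in range(len(sequence) - span)
            if all(sequence[i + offset] in HYDROPHOBIC for offset in offsets)]


if __name__ == "__main__":
    # All four positions fall within 60 degrees of residue i.
    print([angle_from_first(offset) for offset in HELIX_OFFSETS])
    print(helix_face_starts("GSLKKLLKKLGS"))   # made-up sequence
```

With a multiple alignment, the same test would be applied to conserved hydrophobic columns rather than to a single sequence.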




Signals for beta strands

  • edge strands alternate hydrophobic/hydrophilic

  • centre strands all hydrophobic

  • strands are extended so there are few residues per core span



  • manual prediction of half-buried beta strand

    • period of 2

    • conserved hydrophobics at

      • i

      • i plus 2

      • i plus 4

      • i plus 6

      • i plus 8



  • manual prediction of completely buried beta strand

    • found in alpha-beta proteins

    • conserved hydrophobics run continuously
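
The two beta strand signals above (alternating hydrophobics for a half-buried strand, a continuous run for a completely buried one) can be expressed as a simple test on a candidate segment. This Python sketch is deliberately strict and is only an illustration: the hydrophobic residue set, the function name, the minimum length and the example segments are all assumptions.

```python
# Assumed hydrophobic set for illustration; other definitions are in use.
HYDROPHOBIC = set("AVLIMFWC")


def strand_burial(segment):
    """Classify a candidate strand by its hydrophobic pattern."""
    if len(segment) < 4:
        raise ValueError("need at least four residues to judge a pattern")

    flags = [residue in HYDROPHOBIC for residue in segment]
    if all(flags):
        return "completely buried"       # hydrophobics run continuously

    even, odd = flags[0::2], flags[1::2]
    one_face_even = all(even) and not any(odd)
    one_face_odd = all(odd) and not any(even)
    if one_face_even or one_face_odd:
        return "half-buried"             # period of 2: i, i+2, i+4, ...
    return "unclear"


if __name__ == "__main__":
    for segment in ("VIVLV", "VTVSVKV", "TVSVK", "GGKDE"):   # made-up segments
        print(segment, strand_burial(segment))
```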




Signals for coils

  • gapped in multiple alignments

  • small polar residues

    • Ala

    • Gly (v. small so flexible)

    • Ser

    • Thr

  • Prolines rarer in other kinds of secondary structure




Theory of automated secondary structure prediction methods

Secondary structure prediction is a classic problem in bioinformatics and many different approaches have been devised. They can be broken down into various categories, though secondary structure prediction methods are often hybrids of multiple approaches. This is typical of bioinformatics, where it is often necessary to make pragmatic choices between multiple and varied approaches to formally insoluble problems. (By "formally insoluble" I mean problems that have been shown mathematically to be intractable—they can't be solved correctly by any computer in a meaningful length of time.)


Knowledge-based methods

Chou and Fasman [ChouAndFasman78] and Garnier et al. [GarnierOsguthorpeRobson78] devised long-standing secondary structure prediction methods based on analyses of the then-small protein structure databases. They asked what the "rules" are for obtaining helices and what the rules are for obtaining sheets, and encoded them. Some knowledge-based methods are actually comparative methods and simply align the unknown sequence against a similar one for which there is already experimentally obtained secondary structure data. These can, of course, be very reliable.
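
To give a flavour of how "rules" can be extracted from a structure database, the Python sketch below computes a propensity for each residue type: how much more often it is found in a given state (here helix, "H") than would be expected from its overall frequency. This is in the spirit of the statistical approach described above, not a reimplementation of either published method; the function name, the state letters and the tiny data set are invented for illustration.

```python
from collections import Counter


def propensities(sequences, states, state="H"):
    """Propensity of each residue type for one secondary structure state.

    propensity = (share of the state's residues that are of this type)
                 / (share of all residues that are of this type)
    Values above 1 mean the residue favours the state.
    """
    total = Counter()
    in_state = Counter()
    for sequence, assignment in zip(sequences, states):
        if len(sequence) != len(assignment):
            raise ValueError("sequence and assignment lengths differ")
        for residue, label in zip(sequence, assignment):
            total[residue] += 1
            if label == state:
                in_state[residue] += 1

    n_total = sum(total.values())
    n_state = sum(in_state.values())
    if n_state == 0:
        raise ValueError("no residues observed in state " + state)
    return {residue: (in_state[residue] / n_state) / (total[residue] / n_total)
            for residue in total}


if __name__ == "__main__":
    # Made-up "database": one sequence with helix (H) and coil (C) labels.
    toy_sequences = ["AAEEGGPP"]
    toy_states = ["HHHHCCCC"]
    for residue, value in sorted(propensities(toy_sequences, toy_states).items()):
        print(residue, value)
```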


Machine learning methods

In these, knowledge obtained from databases is encoded in patterns of probabilities stored in a non-linear computer model. Most of the more reliable modern methods are based on such approaches. An example of such a model is a neural net. Neural nets handle signals in streams of input data in ways which are crudely similar to the responses of networks of excitable cells in chordates, for example the neurons in our central nervous system.

Neural networks are popular in secondary structure prediction. The so-called "input layer" of a neural net is provided with a signal consisting of the one-dimensional peptide sequences of all known structures, and the behaviour of the intervening network elements is adjusted so that the "output layer" of the same net produces the desired output signal: a corresponding sequence of secondary structure assignments. This adjustment process is referred to as learning. When it is complete and the input layer is presented with a previously "unseen" peptide sequence, it should output a sequence of secondary structure assignments: "residues one to ten are helix, residues 10 to 14 loop…" and so on.


Hybrid methods

Some methods use more than one computational approach and combine them in some way internally to produce their output. Others use multiple methods externally and make some kind of summary of the output: a consensus (see Consensus methods below).


Consensus methods

Perhaps the highest scoring methods of secondary structure prediction are those which "poll" a number of different sources for a prediction and display a filtered combination of the results obtained.


Example: JPRED, a secondary structure prediction consensus server

  • hosted at the European Bioinformatics Institute

  • polls a "jury" of methods:

    • PHD

    • Predator

    • DSC

    • NNSP

    • ZPRED

    • MULPRED
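
The idea of polling a "jury" can be sketched as a per-residue majority vote. The Python below is only an illustration of that idea and is not a description of how JPRED actually combines its methods; the function name, the H/E/C state letters, the tie-breaking rule and the example predictions are all assumptions.

```python
from collections import Counter


def consensus(predictions, tie="C"):
    """Majority vote per residue over equal-length prediction strings.

    Where the top two states tie, fall back to the `tie` state (coil here).
    """
    if len({len(prediction) for prediction in predictions}) != 1:
        raise ValueError("need one or more predictions of equal length")

    result = []
    for column in zip(*predictions):
        ranked = Counter(column).most_common()
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            result.append(tie)            # no clear winner
        else:
            result.append(ranked[0][0])   # state with the most votes
    return "".join(result)


if __name__ == "__main__":
    # Made-up outputs from three hypothetical predictors.
    jury = ["HHHHCCEE",
            "HHHCCCEE",
            "HHCCCEEE"]
    print(consensus(jury))
```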




Tertiary structure prediction

WHATIF as an example comparative (homology) modelling program

There are many techniques which have been developed to tackle the actual business of modelling a protein fold. Many of these require specialized knowledge or are highly variable in their performance. WHATIF is a suite of bioinformatics programs which has at its core a homology modelling module. This module is both reasonably accurate and just friendly enough for someone who is not a structural biologist to apply. Most importantly, it gives very good feedback to the user on the quality of the models it creates; it answers the question: "How wrong is my answer?".

WHATIF's comparative modelling module is conservative in more than one sense of the word. Given an alignment between the query sequence and a suitable template, WHATIF substitutes query residues into the appropriate parts of the template sequence. It is up to the user to produce a good alignment. As it edits these in and determines their most likely arrangement within the "new" template framework, the module leaves the template's conserved residues as unchanged as possible and substitutes changed query residues according to a database of known fragments from a carefully chosen subset of existing known structures. This subset of known structures is, of course, chosen for its quality.


Checking and refinement

In fact, the methods used to check for the quality of these template structures when they are originally submitted by experimentalists are the same as the quality checking routine which WHATIF applies to the models it generates. These techniques are also applied to experimentally obtained structures before their submission to the PDB and are built into WHATCHECK, the structure validation component of WHATIF.

Once a comparative model has been built by analogy with existing structures, using both structural components of the template itself and of other known proteins, physico-chemical modelling methods can be used to correct mistakes in this model. Obvious errors such as putting two atoms in the same space and impossible bond angles can be dealt with first, followed by tackling more subtle problems such as improbably high energy structures. Using techniques such as energy minimization for "improving" the structures of models is dangerous. Frequently it fixes local problems at the expense of creating local errors. Often it just makes a model worse.
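
The most obvious kind of error mentioned above, two atoms occupying the same space, is easy to test for. The Python sketch below flags pairs of atoms that are closer than a cut-off distance. It is an illustration only and not how WHATIF or WHATCHECK work: the 2.0 ångström cut-off, the atom labels and the coordinates are assumptions, and a real checker would exclude covalently bonded pairs and use atom-specific radii.

```python
import math
from itertools import combinations


def find_clashes(atoms, min_distance=2.0):
    """Return (name1, name2, distance) for atom pairs closer than min_distance.

    atoms: list of (name, (x, y, z)) with coordinates in angstroms.
    Bonded atoms are NOT excluded in this sketch.
    """
    clashes = []
    for (name1, xyz1), (name2, xyz2) in combinations(atoms, 2):
        distance = math.dist(xyz1, xyz2)   # Euclidean distance (Python 3.8+)
        if distance < min_distance:
            clashes.append((name1, name2, round(distance, 2)))
    return clashes


if __name__ == "__main__":
    # Made-up coordinates: the first two atoms overlap badly.
    toy_atoms = [("A", (0.0, 0.0, 0.0)),
                 ("B", (0.5, 0.0, 0.0)),
                 ("C", (5.0, 0.0, 0.0))]
    print(find_clashes(toy_atoms))
```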

Here is an excellent review of the process of homology modelling by Gert Vriend.


What if… you can't find a homologue?

Sequence search may fail to identify a match against which to model your protein sequence. If your sequence actually codes for a novel fold this is good, because any pure comparative modelling effort would be doomed. If there is a protein of similar tertiary structure (fold) to your query protein, but dissimilar primary structure (sequence), it might be identified by a fold recognition program. The mother of all fold-recognition programs is THREADER.

Unfortunately, fold recognition is a branch of bioinformatics characterized more by promise and innovation than actual success. The best that can be said is that, with test sequences, THREADER, for example, is very good at ranking known structural homologues near the top of its ratings (in the absence of a strong sequence match).


Membrane proteins

A very large proportion of all the structures in a cell are transmembrane proteins. The peculiar properties of proteins which live in a lipid environment make them especially difficult to investigate by X-ray crystallography or NMR spectroscopy. There are therefore very few transmembrane protein structures available. This is doubly unfortunate since the urgent need to model such functionally critical structures is frustrated by the shortage of good structures to model uncrystallized transmembrane sequences against.

Protein stability in a lipid membrane requires

  • hydrophobic sidechains

  • regular secondary structures



Two architectures which meet these requirements are

  • bundles of hydrophobic alpha-helices

  • beta barrels



These templates can therefore be used to account for a large number of transmembrane protein structures. Slide 33 shows two examples of the latter.

Since these structures are all found in membranes, the well-known (and relatively large) molecular components can be used to constrain models. We also have available to us a great deal of sequence data and a variety of other experimental techniques, such as electron microscopy (e.m.), that can be used to determine certain important pieces of structural data about these species.


Ab initio protein structure prediction

The alternative to threading, when there really is no useful homologue detectable or available, is to build a model of your protein from first principles. Obtaining the three-dimensional structure of a protein solely by applying physico-chemical laws to the atoms of its residues (ab initio protein structure prediction) is a problem of provably immense difficulty.

There is a biennial competition between protein modelling scientists called CASP (Critical Assessment of Techniques for Protein Structure Prediction). Its aim is to compare the effectiveness of different bioinformatics groups' approaches to protein structure prediction. Experimentalists give the modelling groups the sequences of proteins that have been crystallized but whose structures have yet to be determined. The modellers then have to build models of these sequences' tertiary structures "blind", that is, in the absence of experimental data.

When, eventually, the "blind" models are compared to the "true" structures, comparative/homology modelling is so consistently superior in its results that ab initio modellers compete in a separate part of the competition (as do those devising methods for fold identification).
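
One simple way to put a number on how far a "blind" model is from the "true" structure is the root-mean-square deviation (RMSD) between corresponding atoms. The Python sketch below assumes that the two coordinate sets have already been superposed and list the same atoms in the same order; it is an illustration and not the scoring scheme used by the CASP assessors. The function name and coordinates are invented.

```python
import math


def rmsd(model, reference):
    """Root-mean-square deviation between two pre-superposed coordinate sets.

    model, reference: equal-length lists of (x, y, z) in angstroms,
    with atoms in the same order in both lists.
    """
    if len(model) != len(reference) or not model:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum((a - b) ** 2
                  for atom_m, atom_r in zip(model, reference)
                  for a, b in zip(atom_m, atom_r))
    return math.sqrt(squared / len(model))


if __name__ == "__main__":
    # Made-up coordinates: the second atom is displaced by 2 angstroms.
    model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
    print(rmsd(model, reference))
```

A low RMSD over a few well-chosen atoms can hide large errors elsewhere, so the number is only as meaningful as the choice of atoms and superposition behind it.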

During the lecture I presented an example of Rosetta ab initio modelling which has had some limited success with particular folding problems (slides 47–50).


Structural genomics

Until recently there have been two main criteria for experimentalists' choice of target:



There are efforts now to provide a more comprehensive knowledge of the sum total of all structures and/or the realm of available protein architectures—"foldspace".

Crystallizing proteins is labour intensive and something of a "black art". Higher throughput and more systematic approaches are being devised to increase the output of structures in total. If these are combined with comparative modelling then we can add a new criterion when choosing structures for crystallization: how informative a structure is. If we believe a gene codes for a novel fold and/or has multiple homologues of unknown structure, we might place a higher priority on determining its structure—even though it might not code for a product of great biological significance. In this way we could obtain a greater coverage of fold templates from which we could model a much larger number of useful structures.


Summary




Bibliography

Journal articles

[ChouAndFasman78] Adv. Enzymol. Relat. Areas Mol. Biol., "Prediction of the secondary structure of proteins from their amino acid sequence", P. Y. Chou, G. D. Fasman, 1978, 47, 45-147.

[GarnierGibratRobson96] Methods Enzymol., "GOR method for predicting secondary structure from amino acid sequence", J. Garnier, J.-F. Gibrat, B. Robson, 1996, 266, 540-553.

[GarnierOsguthorpeRobson78] J. Mol. Biol., "Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins", J. Garnier, D. J. Osguthorpe, B. Robson, 1978, 120, 97-120.

[Kneller90] J. Mol. Biol., "Improvements in protein secondary structure prediction by an enhanced neural network", Kneller, 1990, 214, 171-182.

