An Introduction to Protein Structure with WHAT IF

Structure prediction works---and doesn’t work

Structure determination is difficult; sequencing is not. When we wish to investigate a gene product, knowing its structure can give us clues about its function. More importantly, this knowledge can help us to make an intelligent choice of the next experiment to perform to characterise the molecule. Site-directed mutagenesis, for example, can be a powerful experimental technique for probing protein function, but it saves time and effort if we can make a rational choice of which residues to mutate. Computational protein structure prediction can inform such choices.


Protein structure prediction is a relatively young area of bioinformatics research, but a fast-growing one. Bioinformaticians have already built modelling and prediction tools that naive users can apply to produce approximate models of the three-dimensional structures of protein gene products from the newly acquired nucleotide sequences of their genes. Generating such models by bioinformatic means is, of course, a great deal easier and less expensive than cloning, expressing and crystallizing proteins to determine their structures.


Such tools should not be trusted. When you do any kind of structure prediction for yourself, your results will not just be wrong; if you use a so-called knowledge-based method, the data on which the predictions are based will already be wrong. The most successful protein structure prediction technique, comparative modelling, is, at its best, only as good as the structure(s) used as a template to build the model. It is important to remember that experimentally determined biomolecular structures are themselves models, albeit ones constrained by vast numbers of real-world data.


Many of the best knowledge-based bioinformatic methods depend not only on whether your uncharacterized query protein has an easily identified homologue, but also on the number of members of its family of homologues already available. Unless a very poor method is chosen, the quantity and quality of the existing template data are often more important than the modelling program used.

The Process of structure prediction

Here is a sensible programme of steps for characterizing a newly sequenced gene product:

The Stages of bioinformatic structure prediction


Literature search

It might seem unusual to make the first stage of a bioinformatics investigation a trip to the library, but this part of the journey often turns out to be more important than the others. Researchers in biological fields can avoid many serious errors by simply getting to know the area they are investigating first. Apart from anything else, someone may already have built a model of your protein(s) of interest.


These days most biomedical literature databases are cross-linked with bioinformatics resources. For example, a paper in which several slightly varying isoforms of a given enzyme are compared biochemically might link to the corresponding entries for their actual protein sequences in a genomic database, or the co-ordinates of their constituent atoms in a structure database.

The Advantages of doing a literature search


The Disadvantages of literature search

Sequence search

If you intend to build a comparative model it is essential to have a known template protein structure against which to model your own query sequence of unknown structure. A good tool for finding potential structure matches, in the absence of structure information about your query sequence, is (PSI)-BLAST. PSI-BLAST is a clever derivative of BLAST, the most commonly used bioinformatics program.


You can visit the NCBI's BLAST pages for an outline of the PSI-BLAST approach. As you might imagine, this method is at its best when there is a good match for your sequence in the relevant database(s), and better still when that match is part of a family of similar proteins. Of course, good sequence matches are not a great deal of use for comparative models if there is no structural information on the sequences identified.


The three most common alignment algorithms for sequence search are as follows (in order of popularity):


BLAST

FASTA

Smith-Waterman

Motif search

Perhaps the quickest and "cheapest" discriminator of protein function is motif search. Motifs, short tightly-defined patterns of polypeptide sequence, are both sensitive (few false negatives) and specific (few false positives) identifiers of function.


There are several very useful motif databases with search facilities. One of the best known is Prosite:


http://ca.expasy.org/prosite/
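
To make the idea of a motif search concrete, here is a minimal sketch that translates a Prosite-style pattern into a regular expression and scans a sequence with it. It handles only the core pattern syntax (x for any residue, [..] for allowed residues, {..} for forbidden residues, repeat counts and the < and > anchors). The function names and the test sequence are invented for illustration; the pattern N-{P}-[ST]-{P} is the familiar N-glycosylation site pattern.

```python
import re

def prosite_to_regex(pattern):
    """Translate a Prosite-style pattern into a Python regular expression.

    Only the core syntax is covered; this is an illustrative sketch,
    not a complete Prosite parser.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # '<' pins the match to the N-terminus, '>' to the C-terminus.
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        element = element.strip("<>")
        # Optional repeat count such as x(2) or x(2,4).
        repeat = ""
        counted = re.fullmatch(r"(.+?)\((\d+(?:,\d+)?)\)", element)
        if counted:
            element, repeat = counted.group(1), "{" + counted.group(2) + "}"
        if element == "x":
            core = "."                          # any residue
        elif element.startswith("{"):
            core = "[^" + element[1:-1] + "]"   # any residue except these
        else:
            core = element                      # one residue or an [ABC] set
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)

def find_motif(pattern, sequence):
    """Return (1-based position, matched text) for every hit, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]

# Hypothetical sequence, made up for this example.
print(find_motif("N-{P}-[ST]-{P}.", "MKNGSAQNPTLNVTK"))
# prints [(3, 'NGSA'), (12, 'NVTK')]
```

The asparagine at position 8 is not reported because it is followed by a proline, which the {P} element forbids.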

Sequence alignment

Alignment is the most important stage in any comparative modelling process. Not only are alignment algorithms used to find the sequence of a suitable structure template, but, once a good match has been found, your query sequence must be lined up with that of the identified template. If the wrong parts of the query protein are aligned with the wrong parts of the target (template) protein then your model-building will be seriously awry.


Heuristic (rule-of-thumb) methods such as BLAST tend to be used for finding matches, while complete dynamic programming methods, such as the Needleman-Wunsch or Smith-Waterman algorithms, tend to be used for aligning the matched sequences.
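
As a rough illustration of what a dynamic programming alignment does, the sketch below implements a bare-bones Needleman-Wunsch global alignment. It makes simplifying assumptions that real tools do not: residues are scored by identity only (rather than with a substitution matrix) and gaps carry a fixed linear penalty. The scoring values and the function name are arbitrary choices for this example.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align sequences a and b; return (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First column and row: a prefix aligned entirely against gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the table: each cell is the best of a pairing or a gap in either sequence.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + pair(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))

print(needleman_wunsch("ACDE", "ADE"))
# prints (1, 'ACDE', 'A-DE')
```

The table has one cell for every pair of positions, which is why complete methods of this kind are slower than heuristic searches on large databases.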


There is an excellent summary by Christopher Dwan of the trade-offs between the BLAST and Smith-Waterman search methods here:


http://www.oreillynet.com/pub/a/network/2001/11/30/speedup.html


Automated alignment methods, even (especially) those guaranteed to find mathematically optimum alignments of sequences, do not necessarily make the best structural alignments. This is one of the main reasons why all the best protein predictions involve human intervention. Before you make any attempt to build a comparative (homology) model you should perform a manual alignment.

Multiple alignment

Secondary structure prediction

One of the few things that scientists in the field of protein folding seem to agree on is that secondary structure forms early in this process.


Identification of secondary structural elements makes the topology of your structure more obvious, so that similar ones can be identified in a topology database such as TOPS. Predicting the positions and lengths of secondary structure elements can be used as a prelude to "docking" these secondary structural elements against each other.


Information about the division of primary structure into secondary structure chunks is valuable in the construction or refinement of primary structure alignments. When homologues are aligned and compared, primary structure differences are far more common than secondary structure differences. Secondary structure can be a better guide to the correct correspondence between parts of two proteins' respective tertiary structures.


Lastly, in the absence of a useful known structural match for your query sequence, a secondary structure prediction can be used to make some kind of intelligent guess about the higher order structure of your protein.

Tertiary structure prediction

WHAT IF as an example comparative (homology) modelling program

Many techniques have been developed to model protein folds. Many of these require specialized knowledge or are highly variable in their performance. WHAT IF is a suite of bioinformatics programs with a relatively user-friendly and relatively powerful comparative modelling module. This module is reasonably accurate and can be used by someone who is not a structural biologist. Most importantly, it gives very good feedback to the user on the quality of the models it creates; it answers the question: How wrong is my answer?


WHAT IF's comparative modelling module is conservative in more than one sense of the word. Given an alignment between the query sequence and a suitable template, WHAT IF substitutes query residues into the appropriate parts of the template. It is up to the user to produce a good alignment. As it edits these residues in and determines their most likely arrangement within the new template framework, the module leaves the template's conserved residues as unchanged as possible and places the changed query residues according to a database of known fragments from a carefully chosen subset of existing known structures. This subset of known structures is, of course, chosen for its quality.
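
The bookkeeping behind "substituting query residues into the template" can be sketched as follows. This is not WHAT IF's code or algorithm; it is a hypothetical helper that only reads a pairwise alignment, column by column, and records which template positions would be kept, which would be substituted, and where the query has residues with no template counterpart. Placing the new side chains, the hard part, is not attempted.

```python
def classify_alignment(query_aln, template_aln):
    """Turn a gapped pairwise alignment into a list of modelling actions.

    Each entry is (template_position, residues, action); template positions
    are 1-based, and None marks a query residue with no template counterpart.
    """
    if len(query_aln) != len(template_aln):
        raise ValueError("aligned sequences must have equal length")
    plan = []
    template_pos = 0
    for q, t in zip(query_aln, template_aln):
        if t != "-":
            template_pos += 1                  # advance along the template
        if q == "-" and t == "-":
            continue                           # empty column, nothing to do
        if t == "-":
            plan.append((None, q, "insert"))   # must be built without a template
        elif q == "-":
            plan.append((template_pos, t, "delete"))
        elif q == t:
            plan.append((template_pos, t, "keep"))   # conserved: leave unchanged
        else:
            plan.append((template_pos, t + "->" + q, "substitute"))
    return plan

# Hypothetical alignment, invented for this example.
for step in classify_alignment("ACDE-G", "A-DQFG"):
    print(step)
```

Every error in the input alignment propagates directly into this plan, which is why the quality of the alignment matters so much.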

Refinement

Once a comparative model has been built by analogy with existing structures, using structural components both of the template itself and of other known proteins, physico-chemical modelling methods can be used to correct mistakes in this model. Obvious errors, such as two atoms occupying the same space or impossible bond angles, can be dealt with first, followed by more subtle problems such as improbably high-energy structures. Using techniques such as energy minimization for "improving" the structures of models is dangerous. Frequently it fixes local problems at the expense of creating errors elsewhere. Often it just makes a model worse.
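
The most obvious class of error, two atoms placed in the same space, is easy to detect. The sketch below flags pairs of non-bonded atoms that sit closer together than a cutoff. The 2.0 Å cutoff, the atom labels and the coordinates are illustrative assumptions, not validated values, and real validation software uses element-specific contact distances.

```python
from itertools import combinations
from math import dist

def find_clashes(atoms, bonded=frozenset(), min_distance=2.0):
    """Return (label_a, label_b, distance) for non-bonded atoms that are too close.

    atoms        -- list of (label, (x, y, z)) with coordinates in angstroms
    bonded       -- set of frozenset({label_a, label_b}) pairs to ignore,
                    since covalently bonded atoms are legitimately close
    min_distance -- illustrative cutoff in angstroms (an assumption)
    """
    clashes = []
    for (name_a, xyz_a), (name_b, xyz_b) in combinations(atoms, 2):
        if frozenset((name_a, name_b)) in bonded:
            continue
        separation = dist(xyz_a, xyz_b)
        if separation < min_distance:
            clashes.append((name_a, name_b, round(separation, 2)))
    return clashes

# Hypothetical atoms: CA1 and CB1 are bonded; O7 has been placed on top of them.
atoms = [("CA1", (0.0, 0.0, 0.0)), ("CB1", (1.5, 0.0, 0.0)), ("O7", (0.5, 0.5, 0.0))]
print(find_clashes(atoms, bonded={frozenset(("CA1", "CB1"))}))
# prints [('CA1', 'O7', 0.71), ('CB1', 'O7', 1.12)]
```

Comparing every pair of atoms is slow for a whole protein, but it is enough to show the principle.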

Checking

The methods WHAT IF uses to check the quality of template structures when they are originally submitted to structure databases (such as the PDB) also apply to the models it generates. These techniques are also applied to experimentally obtained structures before their submission to the PDB and are built into WHATCHECK, the structure validation component of WHAT IF.

What if you can't find a homologue?

Sequence search may fail to identify a match against which to model your protein sequence. If your sequence actually codes for a novel fold, this is the right outcome, because any purely comparative modelling effort would be doomed. If there is a protein of similar tertiary structure (fold) to your query protein but dissimilar primary structure (sequence), it might be identified by a fold recognition program. The mother of all fold-recognition programs is THREADER.


Unfortunately, fold recognition is a branch of bioinformatics characterized more by promise and innovation than by actual success. The best that can be said is that, with test sequences, THREADER, for example, is very good at ranking known structural homologues near the top of its ratings (in the absence of a strong sequence match).

Ab initio protein structure prediction

The alternative to threading, when there really is no useful homologue detectable or available, is to build a model of your protein from first principles. Obtaining the three-dimensional structure of a protein solely by applying physico-chemical laws to its atoms---ab initio protein structure prediction---is a problem of provably immense difficulty.


There is a regular competition between protein modelling scientists, held every two years, called CASP (Critical Assessment of Techniques for Protein Structure Prediction):


http://predictioncenter.llnl.gov/


Its aim is to compare the effectiveness of different bioinformatics groups' approaches to protein structure prediction. Experimentalists give modelling groups the sequences of proteins that have been crystallized but whose structures have not yet been determined. The modellers then have to build models of these sequences' tertiary structures blind, that is, in the absence of experimental data.


When the blind models are eventually compared to the determined structures, comparative (homology) modelling is so consistently superior in its results that ab initio modellers compete in a separate part of the competition (as do those devising methods for fold identification).
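
One simple way to put a number on the difference between a model and a determined structure is the root-mean-square deviation (RMSD) between equivalent atoms. The sketch below assumes that the two coordinate sets have already been superposed and that atoms are paired in the same order (for example, one alpha carbon per residue). It is offered only as an illustration of such a comparison, not as the measure the CASP assessors use.

```python
from math import sqrt

def rmsd(model, reference):
    """Root-mean-square deviation between two pre-superposed coordinate lists.

    model, reference -- equal-length lists of (x, y, z) tuples in angstroms,
                        with atom i of the model equivalent to atom i of the
                        reference (an assumption of this sketch)
    """
    if not model or len(model) != len(reference):
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum((a - b) ** 2
                  for p, q in zip(model, reference)
                  for a, b in zip(p, q))
    return sqrt(squared / len(model))

# Hypothetical coordinates: every model atom is displaced by 1 angstrom along z.
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(model, reference))
# prints 1.0
```

A single number like this hides where a model is wrong, so it complements rather than replaces residue-by-residue checking.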


During the talk I presented an example of ab initio modelling by a system called Rosetta, one of the best ab initio methods. It has had some limited success with particular folding problems.


Structural genomics

Until recently there have been two main criteria for experimentalists' choice of target:



There are efforts now to provide a more comprehensive knowledge of the sum total of all structures and/or the realm of available protein architectures---foldspace.


Crystallizing proteins is labour intensive and something of a black art. Higher throughput and more systematic approaches are being devised to increase the output of structures in total. If these are combined with comparative modelling then we can add a new criterion when choosing structures for crystallization: how informative a structure is. If we believe a gene codes for a novel fold and/or has multiple homologues of unknown structure, we might place a higher priority on determining its structure---even though it might not code for a product of great biological significance. In this way we could obtain a greater coverage of fold templates from which we could model a much larger number of useful structures.
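
The "how informative is this structure?" criterion can be expressed as a simple ranking. The sketch below is purely hypothetical: the gene names, the counts and the ordering rule (predicted novel folds first, then the number of homologues of unknown structure) are invented to illustrate the idea and do not come from any real target-selection scheme.

```python
def rank_targets(candidates):
    """Order candidate genes from most to least informative.

    candidates -- dict mapping a gene name to a dict with two keys:
                  "novel_fold" (bool, our belief that the fold is new) and
                  "unsolved_homologues" (int, homologues of unknown structure)
    The ordering rule is an arbitrary illustrative choice.
    """
    def informativeness(item):
        _, info = item
        return (info["novel_fold"], info["unsolved_homologues"])

    ranked = sorted(candidates.items(), key=informativeness, reverse=True)
    return [name for name, _ in ranked]

# Hypothetical candidates, invented for this example.
candidates = {
    "geneA": {"novel_fold": False, "unsolved_homologues": 12},
    "geneB": {"novel_fold": True, "unsolved_homologues": 3},
    "geneC": {"novel_fold": True, "unsolved_homologues": 9},
}
print(rank_targets(candidates))
# prints ['geneC', 'geneB', 'geneA']
```

Each structure solved near the top of such a list becomes a template for every homologue counted against it.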


Summary

Some points to keep in mind

How wrong is my model?

The question you should ask when doing any kind of bioinformatic analysis (or, indeed, any kind of science) is not "Is my answer wrong?", but "How wrong is my answer?". For some purposes, rough estimates are perfectly useful. If you have used other methods shrewdly, protein structure prediction can offer insights unavailable from any other technique.