MRes Biomolecular Sciences Lecture Notes: 1. The Gene and Bioinformatics

Table of Contents
The Biologies of Bioinformatics
The Gene
What is a gene?
The Central Dogma
The Hierarchy of protein structure
Gene sequencing
Genes and computers
Genetic change
Sequence comparison
Sequence search
Summary
Bibliography

This is the second of seven sets of notes designed to summarize and supplement the content of a series of lectures given by Damian Counsell at Imperial College London as part of the MRes course in Biomolecular Sciences.

These notes and associated materials will be made available under a modified form of the Open Content Licence.

Damian Counsell is at the Medical Research Council's Human Genome Mapping Project Resource Centre, Cambridge, UK.


The Biologies of Bioinformatics

Bioinformatics has most often been thought of as an ancillary part of genetics. In fact, bioinformatics draws on at least four strands of modern biology (see below). It permits us to work with the vast quantities of data these different biologies now create.


Molecular biology

Following the identification of the genetic material as DNA and its characterization (structure determination, elucidation of its copying method and the deciphering, not sequencing, of the genetic code), it became possible to manipulate the genetic material of living cells. Both the science involved in these discoveries and the technologies devised with this knowledge to tackle other biological problems are referred to as molecular biology. Perhaps confusingly for the outsider, the molecules in question are almost always nucleic acids. The information read from nucleic acids (sequenced) in the conventional laboratory (the "wet" lab) is stored, manipulated and analysed with bioinformatic methods.


Evolutionary biology

Evolutionary biology seeks to infer the developmental history of species of living things from our theoretical shared common ancestor to the most complex multicellular organisms. It also seeks to understand the mechanisms by which this process of evolution works.

Bioinformatics provides methods by which sequence data can be used to compare the genetic material of living things directly. In the past biologists relied on more subjective measures of differences and similarity between specimens—such as body shape and colour. Bioinformatic methods can be used as a basis for quantifying the relatedness of individuals and populations more objectively and more precisely.


Structural biology

This is the study of the physical forms of molecules and is most commonly identified with structure determination. When they do this, structural biologists use physico-chemical approaches such as X-ray crystallography, Nuclear Magnetic Resonance (NMR) and circular dichroism (CD) to obtain information about the relative positions of atoms within biomolecules.


Genetics

Genetics is one of the oldest strands of biology. Indeed, the practice of deliberately crossing farmed animals and plants (selective breeding) predates the scientific study of heredity. Classical genetics sought to describe the mechanisms of inheritance in the absence of direct knowledge of the genetic material, nucleic acids. Contemporary geneticists know the nature and content of this material (thanks to molecular biology), can model the molecular changes which take place in genes over time (evolutionary biology), and can also see the direct molecular connection between such changes in genes and the forms their products take (structural biology).


The Gene


What is a gene?

Darwin

In the middle of the 19th century came the first unifying idea of biology: Wallace and Darwin's Theory of Natural Selection. Natural populations vary; some individuals are smaller, some larger, some carry more fat or scales, some less. The essential resources for life (space, food, water) are fixed and finite. Populations, however, increase. When there are no constraints they usually grow geometrically. Darwin reasoned that individuals in a population must eventually compete for these limited resources. Those that succeeded sufficiently to reproduce would pass on their particular variations (their characteristics or traits) to their offspring. Members of the next generation would inherit these traits. While those characteristics best suited to the environment of the population ("the fittest") would survive, those unsuited would not.

Darwin's theory required, therefore, that populations contained variation, that the individuals within populations competed for the needs of life and that the offspring of those individuals that succeeded could inherit their parents' appropriate variations.


Mendel

At around the time that Darwin published his ideas, the foundations of classical genetics were being laid by Gregor Mendel, an Augustinian monk working in the gardens of a monastery in Brünn, now Brno in the Czech Republic.

One of the many reasons why people resisted Darwin's ideas was the general belief that paternal and maternal inherited characteristics were blended in the shared offspring. That is, a dark rabbit and a light rabbit were expected to produce a rabbit whose fur was a colour intermediate between that of each of its parents.

In searching for the rules of heredity, Mendel shrewdly concentrated upon traits with no intermediate form. Working on peas he examined plants which produced either round or wrinkled seeds or whose flowers were either pink or white, but never a mixture. He kept careful records of the results of the many cross-fertilizations of different pea plants.

It is important to remember two general conclusions from this work:

  • genes (he called these—then unknown—units "factors") are indivisible units of information inherited from both parents,

  • the maternal and paternal genes are distributed randomly between offspring.



The indivisible nature of genes can account both for discrete traits, such as the roundness of pea seeds, and for continuously varying characteristics, such as height or human skin pigmentation.

For the simplest traits each individual inherits one unit (gene) from each parent. For complex traits each individual inherits several or many genes from each parent.
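
Mendel's two conclusions can be illustrated with a minimal Python sketch. It assumes a single gene with two forms, written here with the hypothetical symbols R (round) and r (wrinkled), and it assumes that one copy of R is enough to give round seeds, a detail these notes do not go into. The function names are illustrative only.

```python
from collections import Counter
from itertools import product


def cross(parent_a: str, parent_b: str) -> Counter:
    """Count the equally likely offspring genotypes of two parents.

    Each parent is a two-letter string such as "Rr": one indivisible
    factor inherited from each of its own parents.
    """
    offspring = Counter()
    # Each parent passes on one of its two factors at random, so every
    # pairing of one factor from each parent is equally likely.
    for factor_a, factor_b in product(parent_a, parent_b):
        genotype = "".join(sorted(factor_a + factor_b))
        offspring[genotype] += 1
    return offspring


def phenotype(genotype: str, dominant: str = "R") -> str:
    """Assumed rule: a single copy of the dominant factor gives round seeds."""
    return "round" if dominant in genotype else "wrinkled"


genotypes = cross("Rr", "Rr")
phenotypes = Counter()
for genotype, count in genotypes.items():
    phenotypes[phenotype(genotype)] += count

print(dict(genotypes))
print(dict(phenotypes))
```

Because the factors stay intact rather than blending, the wrinkled form reappears unchanged among the offspring of two round-seeded parents.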


Garrod

In his book Inborn Errors of Metabolism, Archibald Garrod, a physician, described the link between single genes inherited in a Mendelian way and single-enzyme biochemical disorders such as alkaptonuria. He was the first to propose that a gene was responsible for the production of a single protein. By 1941 biochemical investigations of irradiated moulds had led Beadle and Tatum to their "one gene, one enzyme" hypothesis.


Crick and Watson

The classical idea of the gene as the indivisible unit of inheritance was illuminated and altered radically when, in 1953, Francis Crick and James Watson first published their proposed structure for the nucleic acid DNA. At that time this molecule had only recently been suspected of carrying genetic information in living things. Both Watson and Crick's model and the status of DNA as the genetic material have since been thoroughly confirmed.

In the middle of the 20th century the structure of DNA and subsequent work (much of it involving Crick) provided the second universal principle in biology: the Central Dogma.

Every previous observation in classical genetics had to be reinterpreted now that we knew the nature of the genetic material, its method of duplication, and the means by which its message was expressed. The tale of the discovery of DNA is itself an exciting detective story outlined in James Watson's indiscreet memoir The Double Helix [Watson99].


The Central Dogma

Now that we know the nature of the genetic material, we know that the information encoded in DNA, the stuff of genes, is passed on by being copied into (transcribed) another nucleic acid, RNA, and then used as a blueprint (translated) for gene products: proteins. Proteins are the doing molecules of living things; they catalyse reactions, form structures and transmit signals. This flow of information, "DNA makes RNA makes protein", is the so-called Central Dogma, perhaps so called because it was originally proposed as "Gospel" on the merest data.

Bioinformatics works because all of the molecules in this chain are polymers, molecules based on a regular repeating structure. These polymers differ from those you might have encountered previously in chemistry in that all the individual repeating units share a general plan, but they are different from each other. While condensing into macromolecules by the same reaction and occupying the same distance along the length of those parent polymers, the individual types of monomer have unique properties. These types (the four different nucleotides of nucleic acids and the twenty different amino acids of proteins) can be thought of as differently coloured (but similarly linked) beads in a chain. While these "beads" might bulge out in different ways from their connecting thread they all have the same thickness. To a computer (more accurately, to a computer programmer) these monomers are letters in an alphabet, assembled in specific order into polymer "volumes".
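
To make the "letters in an alphabet" idea concrete, here is a minimal Python sketch in which a polymer is simply a string. The alphabets are the usual one-letter codes for the four DNA bases and the twenty standard amino acids; the helper names are my own and purely illustrative.

```python
from collections import Counter

DNA_ALPHABET = set("ACGT")
PROTEIN_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY")  # standard one-letter codes


def uses_alphabet(sequence: str, alphabet: set) -> bool:
    """True if every letter of the sequence belongs to the given alphabet."""
    return set(sequence.upper()) <= alphabet


def composition(sequence: str) -> Counter:
    """Count how often each monomer letter occurs in the sequence."""
    return Counter(sequence.upper())


print(uses_alphabet("GATTACA", DNA_ALPHABET))     # a plausible DNA string
print(uses_alphabet("AKVAIL", DNA_ALPHABET))      # contains non-DNA letters
print(uses_alphabet("AKVAIL", PROTEIN_ALPHABET))  # a plausible protein string
print(composition("GATTACA"))
```

Once sequences are plain strings, storing, searching and comparing them become ordinary text-processing tasks.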

The Central Dogma is perhaps the most important (the only?) unifying principle in biology since Darwinism. It is far from universally true, however; it merely provides a framework for thinking about the flow of genetic information. Study of the many exceptions (retroviruses, jumping genes, ribozymes) is always illuminated by observing how these phenomena break the rule "DNA makes RNA makes protein".

The Central Dogma and the nature of protein folding imply that molecular structure is defined solely by gene sequence. Biochemistry has shown us that, broadly, most complex biological molecules are made of simpler molecules arranged in a defined, modular way and that each level of molecular organisation is defined by the previous one. This fact holds the scientific and medical promise of molecular biology. It is also why technologies which might shortcut this process, such as bioinformatics, generate so much excitement.


Gene expression and disease

Because of the flow of information depicted in the Central Dogma, errors in DNA are propagated faithfully into errors in protein. These errors can lead to death (mortality) and disease (morbidity). In developed countries genetic disease is a major cause of both. Slide 13 depicts the transcription of a single faulty monomer from a DNA molecule to a messenger RNA molecule to a translated protein.


DNA structure

The next slide shows the repeating ladder framework of the DNA molecule. The crucial thing to remember is that DNA is a double helix. Each helix is joined to the other by "rungs" of chemical bases. These are represented computationally by the letters A, C, G and T. Each rung in the ladder consists of two bases juxtaposed by non-covalent (hydrogen) bonds and each base has a complementary partner with which it always pairs. If you can read one half of a DNA molecule, you can read the other—complemented and reversed.
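
The "complemented and reversed" rule is easy to express in code. The following is a minimal Python sketch, assuming an upper- or lower-case string containing only the letters A, C, G and T; the function name is illustrative.

```python
# Pairing rule: A pairs with T, and C pairs with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(strand: str) -> str:
    """Return the partner strand, read in its own reading direction."""
    return strand.upper().translate(COMPLEMENT)[::-1]


print(reverse_complement("GATTACA"))
# Applying the rule twice recovers the strand you started with.
print(reverse_complement(reverse_complement("GATTACA")))
```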


Chromosome structure

The 3000 million rungs of your DNA are divided into packages called chromosomes, which are held in a membranous bag. Each of your cells has such a bag, or nucleus, at its core. Chromosomes contain supercoiled (coiled and coiled again) DNA wrapped around various other molecules in structures so bulky and dense that they can be seen under a light microscope.


The Genetic code

Slide 16 shows the genetic code, the equivalence between specific three-letter nucleotide words (triplets called codons) and amino-acid letters. Any molecular biology textbook will explain this code and its elucidation in detail. Note that:

  • more than one codon in DNA can specify the same amino acid in the constructed protein; the code is redundant,

  • codons don't just specify amino acids but can also tell the protein-synthesizing machinery to stop.



Remember that the primary structure of a gene product is usually completely specified by the gene that codes for it.
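
The two points above can be checked with a short Python sketch of the standard genetic code. It assumes that the input is the coding strand of the DNA, read in frame from its first letter; the compact table layout (bases in the order T, C, A, G) is a common convention, and the function names are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon, "*" marking a stop codon.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(codon): residue
    for codon, residue in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna: str) -> str:
    """Copy the coding strand into messenger RNA letters (T becomes U)."""
    return dna.upper().replace("T", "U")


def translate(dna: str) -> str:
    """Read codons three letters at a time until a stop codon appears."""
    dna = dna.upper()
    protein = []
    for start in range(0, len(dna) - 2, 3):
        residue = CODON_TABLE[dna[start:start + 3]]
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


print(transcribe("ATGGGTAAATAA"))
print(translate("ATGGGTAAATAA"))
print(translate("ATGGGCAAGTGA"))  # different codons, same protein
```

The last two calls differ at the DNA level but give the same residues, which is the redundancy noted above.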


The Hierarchy of protein structure


Primary structure

The sequence of the letters of a particular polypeptide is called the primary structure of a protein. There are twenty different letters in the protein alphabet. These may be connected together in virtually any order. This primary structure is held together by covalent (peptide) bonds between immediately adjacent residues. The central carbon of each monomer residue, about which the chain of the necklace can rotate, is called the alpha carbon.


Secondary structure

The characteristic repeating structural arrangements between neighbouring peptide residues (whether adjacent within a strand or between quite separate stretches of the same chain) are called secondary structure. There are three broad categories of secondary structural motif:

  • helices,

  • sheets, and

  • loops



They can be visualized by linking the alpha-carbon atoms of a three-dimensional structure with lines and by smoothing the resulting graph. These motifs are maintained by non-covalent (hydrogen) bonds between nearby residues.


Tertiary structure

The tertiary structure of a protein is the way in which these secondary structural elements are arranged, or folded, to form a stable, functional protein.


Quaternary structure

Multiple folded proteins can be (and, in vivo, usually are) arranged into assemblies of more than one subunit. Their arrangement is referred to as their quaternary structure.


How The Dogma is made flesh

This protein structure hierarchy can be fitted into the Central Dogma—see slide 21.


Gene sequencing

The technology of DNA sequencing is becoming less of an immediate concern for the typical practising biologist. It is often said that we are entering the "post-genomic era" in which the questions about how we acquire sequence data will (for most biologists) be more strategic and political than scientific.

The genome of an organism is not read by putting a sample of its tissue in a machine and turning a handle. In the early stages its genetic material must be "mapped". That is, certain well-characterized genes or physical markers must be located on their respective chromosomes and along the length of those chromosomes. This mapping may be physical, when the literal distance between genes is plotted, or genetic, when the likelihood of two gene elements being inherited together is used to guess how close they are. Although the main public sequencing effort in the Human Genome Project has been based on extensive maps, the biggest part of the recent growth in sequence data has come from overlaying more rapidly acquired data onto pre-existing outlines.

The fast, "brute force" sequencing technique of "shotgun cloning" and assembly begins by randomly fragmenting the DNA to be sequenced. Then these fragments are sub-cloned (identically copied) into easily manipulated organisms like yeast. The newly "hijacked" hosts are then grown up in quantity. They are tricked into copying the added DNA with their own. The hoodwinked cells are broken open and the fragments of interest are amplified (copied many times) with a technique called the Polymerase Chain Reaction (PCR) to be sequenced.


How are nucleic acid fragments sequenced?

Complementary stretches of DNA stick together. A short stretch of "primer" sequence, complementary to the DNA we wish to sequence, is used to start a reaction catalysed by a DNA polymerase. This reaction involves the extension of a new DNA strand against the existing template, a single strand of the helix.

Four reaction mixtures are prepared, each containing a quantity of a faulty version of one of the nucleotides, so that the reaction will stop and an incomplete strand will fall off at the various points where that faulty base appears. The terminating bases are labelled with machine-readable coloured tags. The fragments in the reaction mixtures can then be separated in an electric field according to their lengths. The length of each fragment corresponds to the distance from the start of the message at which that particular letter is found. These relative distances can be read off by robots and stored on disk for assembly into full sequences.
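
The logic of reading a sequence from fragment lengths can be sketched in a few lines of Python. This is an idealized toy, assuming that every possible terminated fragment is recovered and measured exactly; real data are noisier. The function names are my own.

```python
def terminated_lengths(new_strand: str) -> dict:
    """For each base, list the fragment lengths its reaction mixture yields.

    A fragment ends wherever the faulty version of that base was
    incorporated, so its length equals the position of the base.
    """
    lengths = {base: [] for base in "ACGT"}
    for position, base in enumerate(new_strand.upper(), start=1):
        lengths[base].append(position)
    return lengths


def read_sequence(lengths: dict) -> str:
    """Sort all labelled fragments by length and read off their end labels."""
    labelled = [
        (length, base)
        for base, base_lengths in lengths.items()
        for length in base_lengths
    ]
    return "".join(base for _, base in sorted(labelled))


fragments = terminated_lengths("GATTACA")
print(fragments)
print(read_sequence(fragments))
```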

This crude outline does not do justice to the complexity of a major sequencing effort and omits the important issue of the management of the data such sequencing generates. Two important problems with the process have serious bioinformatics implications.


Vector contamination

Sequence data from the organisms used to transfer and host cloned fragments of the actual DNA to be sequenced are still present as corruption in many gene databases. Bioinformatics can be used to screen out this vector DNA. Programs can be provided with a set of DNA sequences known to be found in vector organisms, check any supposedly new sequence data for the presence of these sequences and remove them.
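
A very simplified version of such a screen is sketched below in Python. It assumes exact matches only, whereas practical screening programs tolerate sequencing errors by using alignment. The vector and insert sequences are invented for illustration, and the function names are hypothetical.

```python
def mask_vector(read: str, vector_sequences: list) -> str:
    """Replace every exact occurrence of a known vector sequence with X's."""
    masked = read.upper()
    for vector in vector_sequences:
        vector = vector.upper()
        masked = masked.replace(vector, "X" * len(vector))
    return masked


def trim_masked_ends(masked: str) -> str:
    """Remove masked vector sequence from either end of the read."""
    return masked.strip("X")


# Invented example data: a made-up vector stretch followed by a made-up insert.
VECTOR = "GGATCCTCTAGAGTC"
INSERT = "ATGGCCATTGTAATG"

masked_read = mask_vector(VECTOR + INSERT, [VECTOR])
print(masked_read)
print(trim_masked_ends(masked_read))
```

Masking before trimming keeps the positions of the remaining letters unchanged, which makes it easier to check by eye what was removed.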


Repetitive DNA

Shotgun methods were widely adopted by sequencing projects after their enormous success in completing small genomes. The human genome is different from that of bacteria and many other single-celled organisms. Not only is much of it nonsense, or "selfish" DNA which does not code for proteins or nucleic acid products of any known function, but that nonsense DNA contains many, extensive regions of repetitive sequence. The process of matching overlapping short stretches of such repetitive sequence is computationally very challenging—like reassembling the torn pages of books whose intact text contained long passages where the only words were "rhubarb" and "custard".


How are gene sequences put together?

Sequence assembly is a classic bioinformatics problem. Once a stretch of DNA has been sequenced by this technique we have lots of short stretches of our genetic "book", but no clear guide as to how they fit together. The aim of sequencing is to obtain unbroken readings of adjacent parts of the original sequence, contiguous stretches of DNA data. One of the most important bioinformatics tasks has been the amalgamation of such contigs (for "contiguous segments") from short sequenced fragments.

The computational problem is then one of assembly: putting the pieces together in the most likely order by juxtaposing the overlapping fragments in such a way as to assemble complete stretches of DNA.
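
The idea of juxtaposing overlapping fragments can be illustrated with a greedy merging sketch in Python. This is not how any particular assembly package works; it assumes error-free fragments that all come from the same strand, and it simply merges the pair with the longest exact overlap until nothing more can be merged. The names and the minimum overlap of three letters are arbitrary choices.

```python
def overlap(left: str, right: str, min_length: int = 3) -> int:
    """Length of the longest suffix of left that is also a prefix of right."""
    for length in range(min(len(left), len(right)), min_length - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


def greedy_assemble(fragments: list, min_length: int = 3) -> list:
    """Repeatedly merge the two pieces with the longest exact overlap."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates
    while len(contigs) > 1:
        best_length, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    length = overlap(left, right, min_length)
                    if length > best_length:
                        best_length, best_i, best_j = length, i, j
        if best_length == 0:
            break  # no remaining overlaps: the contigs stay separate
        merged = contigs[best_i] + contigs[best_j][best_length:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]
    return contigs


reads = ["ACGTTAGC", "ATGGCGTA", "GCGTACGT"]
print(greedy_assemble(reads))
```

A greedy strategy like this is easily misled by repeated sequence, because a repeat offers several equally good overlaps and the sketch has no way to choose between them.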

The original purpose of one of the most popular bioinformatics software packages, the so-called Staden programs, was to perform this contig assembly for small to medium-scale sequencing projects. With a system like this fragments of DNA can be assembled automatically but, even at this low level, human intervention in an otherwise automatic bioinformatics process is crucial. A user can monitor the whole process at a variety of levels. (S)he can "drill down" from the bird's eye view of the fragments as rectangular blocks to the sequence level itself.

It is not possible to assemble a genome as large as the human genome without an existing map. This question of assembly has become the latest of many areas of contention between the public [WaterstonEtAl02] and private [MyersEtAl02] human genome sequencing efforts.


Genes and computers

Computing power is growing faster than finished sequence data. Raw processing "muscle" continues to out-run the size of our collection of completed sequences—though not the scale of the bioinformatics tasks we can devise.

While preparing a speech in 1965, Gordon Moore, hardware engineer and one of the founders of the Intel Corporation, observed that when he plotted the growth in computer memory performance, each new chip contained roughly twice as much capacity as its predecessor, and each chip was released within 18-24 months of the previous chip. If this trend continued, he reasoned, computing power would rise exponentially. Moore's observation that transistor density doubled roughly every 18 months is now known as Moore's Law. It has been true for over 26 years. For Intel processors this geometric progression has taken the density of transistors on a chip from 2,300 on the Intel 4004 in 1971 to 7.5 million on the Intel "Pentium II" processor.
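
The arithmetic behind these figures can be checked with a few lines of Python. The transistor counts are those quoted above; the 26-year span is taken from the text and assumes the Pentium II figure dates from about 1997. The function names are illustrative.

```python
import math


def doublings(start_count: float, end_count: float) -> float:
    """Number of times the count must double to grow from start to end."""
    return math.log2(end_count / start_count)


def doubling_time_years(start_count: float, end_count: float,
                        years: float) -> float:
    """Average time per doubling over the whole period."""
    return years / doublings(start_count, end_count)


TRANSISTORS_4004 = 2_300
TRANSISTORS_PENTIUM_II = 7_500_000
YEARS_ELAPSED = 26  # assumed span, 1971 to about 1997

n_doublings = doublings(TRANSISTORS_4004, TRANSISTORS_PENTIUM_II)
implied_time = doubling_time_years(TRANSISTORS_4004, TRANSISTORS_PENTIUM_II,
                                   YEARS_ELAPSED)
# What a strict 18-month doubling would have predicted over the same span.
predicted_18_months = TRANSISTORS_4004 * 2 ** (YEARS_ELAPSED / 1.5)

print(f"doublings: {n_doublings:.1f}")
print(f"implied doubling time: {implied_time:.2f} years")
print(f"18-month prediction: {predicted_18_months:.3g} transistors")
```

On these two data points the implied doubling time comes out at a little over two years, slower than the 18-month rule of thumb but still a geometric progression.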

There is a limit to the size of the human genome. It will not grow significantly over the course of generations. Meanwhile every so-called physical barrier to the growth of computing power has been sidestepped by the development of new technologies. And computers are so cheap, common and interconnected that really difficult problems can be tackled by building clusters of networked machines or by harnessing spare processor cycles on individual users' desktop machines scattered across the planet.


Genetic change

In this part of the lecture (slides 28-31) I simply outlined the many different forms of genetic change and mutation that occur naturally. It is not necessary to memorize this biological information. Simply understand that there are many ways genetic information can vary or become corrupted in the process of copying. In fact, it is common for a mutation in DNA to have negligible or no effect on any protein. This may happen because the mutated DNA does not code for a protein (true for well over 90% of our human DNA), because it changes a codon to another one coding for the same amino-acid residue (note the use of the word "redundant" above when the genetic code was described) or because the resulting codon change codes for a sufficiently similar residue rather than a markedly different one or a stop codon. Even if such a mutation shows itself in a gene product, that is, it is expressed, this does not mean that it results in any advantage or disadvantage to the individual carrying it; it may be selectively neutral.
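
The effect of a single-letter change within a coding region can be worked out mechanically from the genetic code. The Python sketch below assumes the standard code and a codon taken from the coding strand; the labels "silent", "missense" and "nonsense" are the usual textbook terms rather than ones used in these notes, and the function name is illustrative. Deciding whether a changed residue is "sufficiently similar" would need a similarity measure and is not attempted here.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon, "*" marking a stop codon.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(codon): residue
    for codon, residue in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_point_mutation(codon: str, position: int, new_base: str) -> str:
    """Classify the effect of changing one letter (position 0, 1 or 2)."""
    mutated = codon[:position] + new_base + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    if before == after:
        return "silent"      # same residue: the code is redundant
    if before == "*":
        return "stop-lost"   # a stop codon now codes for a residue
    if after == "*":
        return "nonsense"    # a residue codon has become a stop codon
    return "missense"        # a different residue is specified


print(classify_point_mutation("GGT", 2, "C"))  # GGT and GGC both give G
print(classify_point_mutation("GTT", 0, "A"))  # V becomes I
print(classify_point_mutation("TAT", 2, "A"))  # Y becomes a stop
```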


Sequence comparison

It is often important to biomedical scientists to know how related two genes or proteins are to each other. Classically in biology, when there is no way of testing relatedness directly, this has been done by holding corresponding structures (for example beaks) side-by-side and comparing them. The equivalent method of examining sequences for relatedness is to align them.


How do you compare sequences?

Imagine you are interested in comparing the same enzyme or structural protein in two different species. Say, you are interested in the gene (nucleic acid sequence) or gene product (polypeptide sequence) of haemoglobin in mouse and human. The letters representing their constituent residues can be matched up and the juxtaposed pairs of base or residue symbols (those letters) are scored according to how "similar" they are considered to be. I will talk later about the ways in which this similarity is quantified.

The scoring systems most commonly used today to decide this are empirical, that is, they depend on looking at previous alignments and calculating how often each particular base/residue type substitutes for each other base/residue type. In similarity comparisons, higher scores are given to letters which are most frequently swapped for each other. In turn, in computational alignment programs these scores are used to calculate the optimum alignment between two sequences, that is, the one which gives the highest score, either locally, in discrete regions where two sequences are most similar, or globally, over the full extents of the two sequences.
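
As an illustration of how scores lead to an optimum global alignment, here is a minimal Python sketch of the classic dynamic-programming approach (usually credited to Needleman and Wunsch). To keep it short it does not use an empirical scoring table: it assumes a flat score of +1 for identical letters, -1 for different letters and -2 for each gap position. Those numbers, and the function name, are arbitrary choices for the example.

```python
def global_align(a: str, b: str, match: int = 1, mismatch: int = -1,
                 gap: int = -2):
    """Return (score, aligned_a, aligned_b) for the best global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the table: each cell holds the best score for the two prefixes.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + pair(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one best alignment.
    aligned_a, aligned_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            aligned_a.append(a[i - 1]); aligned_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(a[i - 1]); aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-"); aligned_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(aligned_a)), "".join(reversed(aligned_b))


for result in (global_align("AKVAIL", "AKIAIL"), global_align("GSK", "GSGK")):
    print(result[0])
    print(result[1])
    print(result[2])
```

Replacing the flat match and mismatch scores with a lookup in an empirical substitution table would change only the small `pair` function.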


What is a sequence alignment?

Alignments of related sequences are recorded as traces. The trace of an alignment is a symbolic representation of the correspondence between two or more compared sequences. Each base (DNA) or residue (protein) monomer is represented by a letter of the alphabet. These are assumed to be evenly spaced along a molecule's length, the single dimension of its primary structure. In standard notation the single-letter codes are printed in order, in a fixed-width font.

Although alignments are often performed in order to estimate whether or not two sequences are related through a common functional ancestor (that is, to determine whether or not they are homologous), sequence alignments are built on the assumption that the two sequences in question are descended from a common ancestor.

When a residue in one of two aligned sequences is identical to its counterpart in the other the corresponding amino-acid letter codes are vertically aligned: a match.

When a residue appears to have been deleted since the divergence of the two sequences, its "absence" is labelled in the other sequence by a dash in a corresponding position.

When a residue appears to have been inserted into an "original" a dash appears opposite in the "unchanged" sequence.

Since the dashes referred to above correspond to "gaps" in one or other sequence, the business of inserting such spacers is known as gapping.

When one sequence is gapped relative to another, a deletion in sequence a can be seen as an insertion in sequence b. Their relationship is symmetric. Indeed, the two types of mutation are referred to together as indels. If we imagine that at some point one of the sequences was identical to its primitive homologue, then a trace can represent the three ways divergence could occur.

A trace can represent a substitution:

         
           AKVAIL
           AKIAIL
         
      


A trace can represent a deletion:


           VCGMD
           VCG-D




A trace can represent an insertion:

         
           GS-K
           GSGK
         
      


For obvious reasons I do not represent a silent mutation (see above). Equally, the genetic changes which are represented may themselves obscure earlier changes. Good alignment methods must allow for this possibility.
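
Reading a two-row trace column by column is straightforward to automate. The Python sketch below assumes that the first row is the "original" sequence and the second the changed one, which is an interpretation rather than something a trace itself records. The function name and the output format are my own.

```python
def describe_trace(original: str, changed: str) -> list:
    """List the differences in a two-row trace as (column, kind, detail)."""
    if len(original) != len(changed):
        raise ValueError("both rows of a trace must have the same length")
    events = []
    for column, (x, y) in enumerate(zip(original, changed), start=1):
        if x == "-":
            events.append((column, "insertion", y))
        elif y == "-":
            events.append((column, "deletion", x))
        elif x != y:
            events.append((column, "substitution", x + ">" + y))
        # identical letters are matches and are not reported
    return events


print(describe_trace("AKVAIL", "AKIAIL"))
print(describe_trace("VCGMD", "VCG-D"))
print(describe_trace("GS-K", "GSGK"))
```

Swapping the two rows turns every reported insertion into a deletion and vice versa, which is the symmetry described above.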

This kind of diagram best represents genetic differences due to point mutations (single-letter changes). As I have explained, real mutations often insert or delete several residues. If an alignment method assumes that a multiple-residue (or multiple-base) indel is caused by the accumulation of single mutations, then it will fail to identify biologically correct alignments.
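
One widely used way of allowing for multi-residue indels, not described in these notes, is an "affine" gap penalty: opening a gap is expensive but extending it is cheap, so one long gap costs less than the same number of separate single gaps. The Python sketch below simply compares the two costing schemes; the penalty values and function names are arbitrary assumptions.

```python
def linear_gap_cost(length: int, per_residue: int = 4) -> int:
    """Every gapped position costs the same, as if each were a new event."""
    return per_residue * length


def affine_gap_cost(length: int, open_cost: int = 4,
                    extend_cost: int = 1) -> int:
    """Opening a gap is costly; each further position is cheap."""
    if length == 0:
        return 0
    return open_cost + extend_cost * (length - 1)


for gap_length in (1, 2, 4, 8):
    print(gap_length, linear_gap_cost(gap_length), affine_gap_cost(gap_length))
```

With these values a single gap costs the same under both schemes, but a four-residue gap is much cheaper under the affine scheme, reflecting the idea that it arose in one event.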


Alignment is the central problem in bioinformatics

Good alignments are essential for all these bioinformatics tasks and more:

  • searching for matching sequences and/or homologues,

  • comparing sequences,

  • building models of protein structures,

  • molecular phylogenetics




Sequence search

Currently, a lot of bioinformatics work is concerned with the technology of databases. These databases include both public repositories of gene data, for example GenBank or the Protein Data Bank, and private databases belonging to individual research groups involved in gene mapping projects or those held by biotech companies. Making such databases accessible via open standards like those of the Web is essential since biologists tend to use a range of computer platforms.

The EMBL/DDBJ/GenBank databases are the most comprehensive and widely used of all the sequence databases. The three main country bases, in the UK, Japan and the USA respectively, collaborate in keeping their "master" databases of public nucleic acid data as an up-to-date pool. There are many different file formats for the storage of sequence data.

Generally, human-readable entries that can be printed out onto paper without complicated layout and nesting are referred to as "flat file" formats.

Usually there is a header section at the top of such a file with a description of the entry, the species it came from, the way the sequence was obtained and information about the authors of the original publication in which the sequence first appeared, plus a variety of other useful metadata (information about the information). The sequence itself follows the header. Fortunately the standard EMBL flat-file format is labelled fairly clearly and there is a friendly and informative interface to the whole database called SRS (slide 44) where all of the header fields are described in more detail. Have a look over this once to get a feel for the various ways the database might be queried.
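
To give a feel for how such a flat file is read by a program, here is a minimal Python sketch. It relies only on the general layout of EMBL-style entries: header lines begin with a two-letter code, the sequence follows an "SQ" line, and "//" ends the entry. The sample entry is entirely invented and simplified (including its identifier and accession number), and the parser is an illustration, not a complete reader for the real format.

```python
SAMPLE_ENTRY = """\
ID   EXAMPLE1   standard; RNA; HUM; 24 BP.
XX
AC   X00000;
XX
DE   Invented example entry for illustration only.
XX
OS   Homo sapiens (human)
XX
SQ   Sequence 24 BP; 5 A; 5 C; 8 G; 6 T; 0 other;
     atggccattg taatgggccg ctga                                          24
//
"""


def parse_flat_file_entry(text: str):
    """Return (header_fields, sequence) for one EMBL-style entry."""
    fields = {}
    sequence_parts = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):
            break  # end of entry
        if in_sequence:
            # Keep only the letters; drop spacing and the running count.
            sequence_parts.append("".join(ch for ch in line if ch.isalpha()))
            continue
        code, value = line[:2], line[5:].strip()
        if code == "SQ":
            in_sequence = True
        elif code.strip() and code != "XX":  # "XX" lines are just spacers
            fields.setdefault(code, []).append(value)
    return fields, "".join(sequence_parts).upper()


header, sequence = parse_flat_file_entry(SAMPLE_ENTRY)
print(header["DE"])
print(header["OS"])
print(sequence, len(sequence))
```

For real work an established parsing library is a safer choice than a hand-written reader like this one.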


Summary




Bibliography

Books

[Watson99] James Watson and Steve Jones, The Double Helix, Penguin, 1999, ISBN 0140268774.


Journal articles

[WaterstonEtAl02] Robert S. Waterston, Eric H. Lander, John E. Sulston, "On the sequencing of the human genome", Proceedings of the National Academy of Sciences, 2002, 99, 3712-3716.

[MyersEtAl02] Eugene W. Myers, Granger G. Sutton, Hamilton O. Smith, Mark D. Adams, J. Craig Venter, "On the sequencing of the human genome", Proceedings of the National Academy of Sciences, 2002, 99, 4145-4146.

