Home Page

Description

The goal of the Q Assembler project is to develop software for assembling multiple viral quasispecies genomes sequenced in parallel using recently developed sequencing-by-synthesis technology [1]. Q Assembler is a collaboration between the Public Health Agency of Canada's National Microbiology Laboratory, Canada's Michael Smith Genome Sciences Centre, the University of Manitoba, the University of Maryland, the University of Pittsburgh, and 454 Life Sciences.

Background

Many important viruses, such as HIV, the SARS Coronavirus, Hepatitis C, and the Influenza virus, possess high mutation, recombination, and replication rates. These viruses generate "clouds" of sequence variants called viral quasispecies within infected hosts. Diversity and evolution of viral quasispecies are influenced by host-viral interactions [2]. Characterization of quasispecies genome populations from infected individuals is a first step in studying such interactions. A recent proof-of-concept study by researchers at 454, CuraGen, and Yale suggests that parallel sequencing and identification of the sequence variation present within a population of viral quasispecies is feasible [3] using the sequencing-by-synthesis technology recently developed by 454 Life Sciences and incorporated in the GS20 sequencer. To realize the potential for sequencing and assembly of quasispecies populations using this technology, it is necessary to develop and validate a robust methodology for genome-scale quasispecies assembly. We expect the bulk of the challenge to lie in the design and construction of the quasispecies assembler.

The Quasispecies Assembly Problem

Assembling and characterizing any quasispecies genome population poses a substantial computational challenge [4],[5]. Current assembly programs such as Phred/Phrap, TIGR Assembler, and 454's Newbler Assembler are designed to connect reads into a single consensus sequence, so they are not appropriate for simultaneously assembling multiple genome sequences. These programs assume, for example, that base mismatches represent base-calling errors or internal repeats rather than legitimate sequence variation within a population of input sequences. In addition, assembly is complicated by rearrangements and the existence of true internal repeats, which makes connecting fragments into correct genomic sequences highly challenging. The deep coverage capability of the GS20™ can aid greatly in addressing the former problem; however, the GS20's limited unidirectional read length (~100 bp per read) and lack of mate-pair information present a serious challenge in dealing with the latter. Any quasispecies sequence assembler intended for use with GS20 sequencing technology must take these factors into account.

Assembly Strategy

The problem of simultaneously assembling multiple highly similar, yet distinct genome sequences is not novel. Indeed, this situation is encountered routinely in determining the haplotype of diploid eukaryotic DNA (i.e., the mapping of polymorphisms to the correct chromosome). In regions where sufficient sequence variation exists between reads, a technique known as correlated differences can be applied to segregate the two distinct sequences [6],[7]. This technique uses repeatedly occurring high-quality base call mismatches to segregate and connect sequencing reads. The same strategy can be applied to the separation of quasispecies sequences, although in general the sequences can be separated only to a degree, because intervening stretches of highly similar sequence break the connection between variable regions. In addition, the greater number of distinct sequences in quasispecies assembly relative to haplotyping, and the lack of foreknowledge about the total number of members present in the quasispecies population, will compound the difficulty of applying this technique to resolving viral quasispecies sequences.
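
As a rough illustration of the correlated-differences idea, the Python sketch below finds alignment columns where two or more bases each recur in several reads, then greedily groups reads that agree at those columns. It is not the project's implementation. It assumes reads are already placed on a common coordinate system as (start, sequence) pairs with no gaps, it ignores base quality scores, and the function names and the min_support threshold are hypothetical.

```python
from collections import Counter, defaultdict

def variant_columns(reads, min_support=2):
    """Positions where at least two different bases each occur in
    min_support or more reads (recurring mismatches, not one-off errors)."""
    columns = defaultdict(Counter)
    for start, seq in reads:
        for offset, base in enumerate(seq):
            columns[start + offset][base] += 1
    return sorted(
        pos for pos, counts in columns.items()
        if sum(1 for n in counts.values() if n >= min_support) >= 2
    )

def signature(read, positions):
    """Bases that a read carries at the variant positions it covers."""
    start, seq = read
    return {p: seq[p - start] for p in positions if start <= p < start + len(seq)}

def compatible(sig_a, sig_b):
    """True if the signatures share a variant position and agree on all shared ones."""
    shared = sig_a.keys() & sig_b.keys()
    return bool(shared) and all(sig_a[p] == sig_b[p] for p in shared)

def segregate(reads, min_support=2):
    """Greedily group reads whose variant bases agree (order-dependent sketch)."""
    positions = variant_columns(reads, min_support)
    groups = []  # each entry: (merged signature, member reads)
    for read in reads:
        sig = signature(read, positions)
        for group_sig, members in groups:
            if compatible(group_sig, sig):
                group_sig.update(sig)
                members.append(read)
                break
        else:
            groups.append((dict(sig), [read]))
    return positions, [members for _, members in groups]

# Toy input: reads drawn from two sequences that differ at positions 2 and 9.
example_reads = [
    (0, "ACGTAC"), (0, "ACCTAC"),
    (1, "CGTACGTAC"), (1, "CCTACGTAG"),
    (6, "GTACGT"), (6, "GTAGGT"),
]
example_positions, example_groups = segregate(example_reads)
```

In this toy input the two middle reads span both variant positions, so they link the left-hand and right-hand reads into two groups. A read that covers no variant position shares nothing with any group and ends up alone, which mirrors the way stretches of highly similar sequence break the connection between variable regions.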

Our current thinking is that comparative assembly offers the most promising approach to this problem. In this procedure, sequence reads are aligned to a reference genome rather than being assembled de novo using the standard overlap-layout-consensus paradigm. Quasispecies sequences are too diverse to align to any single reference genome, so we propose to modify the comparative assembly method with a "phylogenetic partitioning" step: input reads would be aligned initially to a representative sequence from each major clade. Each group of reads would then be realigned to subtypes of that clade, and so on. Our initial studies suggest that this approach can successfully segregate the reads into groups that approximately represent their parent genomes, within which the final assembly can occur.
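
The phylogenetic partitioning step might be sketched in Python roughly as follows. This is only an illustration under simplifying assumptions: a real pipeline would use a proper sequence aligner, whereas here the "alignment" is an ungapped mismatch count over every placement of the read on the reference. The reference tree layout (a dict mapping a clade name to a reference sequence and a dict of child clades), the function names, and the toy sequences are all hypothetical.

```python
def best_mismatches(read, reference):
    """Fewest mismatches over all ungapped placements of read on reference."""
    span = len(reference) - len(read) + 1
    if span <= 0:
        return len(read)  # read longer than reference: treat as a poor match
    return min(
        sum(a != b for a, b in zip(read, reference[i:i + len(read)]))
        for i in range(span)
    )

def partition(reads, clades, path=()):
    """Assign each read to its closest reference, then recurse into subtypes.

    clades maps a name to (reference_sequence, child_clades_dict).
    Returns a dict from a tuple path such as ("A", "A2") to a list of reads.
    Ties go to the clade listed first.
    """
    bins = {}
    for read in reads:
        name = min(clades, key=lambda n: best_mismatches(read, clades[n][0]))
        bins.setdefault(name, []).append(read)
    result = {}
    for name, members in bins.items():
        _, children = clades[name]
        if children:
            result.update(partition(members, children, path + (name,)))
        else:
            result[path + (name,)] = members
    return result

# Toy reference tree: clade A has two subtypes, clade B has none.
example_tree = {
    "A": ("ACGTACGTACGTACGT", {
        "A1": ("ACGTACGTACGTACGT", {}),
        "A2": ("ACGTACCTACGTACGT", {}),
    }),
    "B": ("TTGACCATGGCATTGA", {}),
}
example_bins = partition(["GTACGTAC", "TACCTACG", "CCATGGCA"], example_tree)
```

Each leaf bin would then be handed to a conventional comparative assembler. Because a read is committed to one clade at each level, an early misassignment cannot be corrected lower in the tree; a fuller design might keep near-tied candidates.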

Reference and Test Data

We have obtained GS20 sequence reads from overlapping PCR products spanning the entire HIV genomes of two individuals, and we will use these reads to develop the assembly strategy and methodology. To aid testing and validation, the NML HIV and Human Genetics Laboratory has PCR products of the HIV gag region (including part of the 5'-LTR and part of protease, ~2 kb in length) and fully sequenced clones (30 to 90 clones per sample) of the same PCR products from more than 200 patient samples. These samples represent diverse HIV subtypes, mostly clades A, C, and D and recombinant subtypes, and were sequenced using standard Sanger sequencing methodology. The methods and strategy we develop will be tested by sequencing the same gag-region PCR products using the GS20 sequencer. The quasispecies genomes assembled with the resulting GS20 methods will be compared with and validated against the cloned sequences.
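
One simple way to picture the comparison against cloned sequences is to report, for each assembled sequence, the closest clone and its edit distance. The Python sketch below assumes plain Levenshtein distance as the measure and uses hypothetical function names and toy sequences; the project's actual validation criteria are not specified here.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # match or substitute
            ))
        previous = current
    return previous[-1]

def nearest_clone(assembled, clones):
    """Return (clone_name, distance) for the clone closest to assembled.

    clones maps a clone name to its Sanger sequence; ties go to the
    clone listed first.
    """
    return min(
        ((name, edit_distance(assembled, seq)) for name, seq in clones.items()),
        key=lambda pair: pair[1],
    )

# Toy example with two clones of the same amplicon.
example_clones = {"clone1": "ACGTACGT", "clone2": "ACGTTCGA"}
example_match = nearest_clone("ACGTACG", example_clones)
```

For ~2 kb amplicons and 30 to 90 clones per sample this quadratic-time distance is still practical, although a banded or alignment-based identity measure would be a reasonable substitute.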

Development Team

  • National Microbiology Laboratory / University of Manitoba:
    • Morag Graham
    • Ma Luo
    • Ben Liang
    • Gary Van Domselaar
    • Michael Domaratzki
    • Shaun Tyler
  • TIGR / University of Maryland / University of Pittsburgh:
    • Elodie Ghedin
    • Mihai Pop
    • Steven Salzberg
  • Michael Smith Genome Sciences Centre:
    • Steven Jones
    • Asim Siddiqui
    • Matthew Bainbridge
    • Rene Warren
  • 454 Life Sciences / Roche:
    • Lei Du
    • Jolene Osterberger

Status

Q Assembler is pre-alpha. There are currently no releases.

License

Q Assembler is being developed under the GNU General Public License.

Contact

Questions, comments, and requests to participate should be directed to:

Gary Van Domselaar, PhD
Head of Bioinformatics
National Microbiology Laboratory
Public Health Agency of Canada
1015 Arlington St., Suite H-3570
Winnipeg, MB, Canada R3E 3R2
Phone: +1 204 784 5994
Fax: +1 204 789 2018
gary_van_domselaar [at] phac-aspc.gc.ca
gary.vandomselaar [at] gmail.com

References

  1. Margulies M. et al. (2005) Genome sequencing in microfabricated high-density picolitre reactors. Nature 437:376-80.

  2. Domingo E, Baranowski E, Ruiz-Jarabo CM, Martin-Hernandez AM, Saiz JC, Escarmis C. (1998) Quasispecies Structure and Persistence of RNA Viruses. Emerg Infect Dis. 4:521-7.

  3. Simons JF et al. (2005) Ultra-deep sequencing of HIV from drug resistant patients. XIV International HIV Drug Resistance Workshop, Quebec City, Canada, June 7-11, 2005.

  4. Chen K, Pachter L (2005) Bioinformatics for whole-genome shotgun sequencing of microbial communities. PLoS Comp. Biol. 1: e24.

  5. Edwards RA, Rohwer F. (2005) Viral metagenomics. Nat. Rev. Microbiol. 3:504-10.

  6. Pop M. (2004) Shotgun sequence assembly. In: Zelkowitz M. (ed.) Advances in Computers, vol. 60.

  7. Lancia G. et al. (2001) SNPs problems, complexity and algorithms. In: 9th Annual European Symposium on Algorithms (BRICS), University of Aarhus, Denmark.