PredictBias

Identification of genomic and pathogenicity islands in prokaryotic genomes
 
Bias analysis

PredictBias calculates %GC bias, dinucleotide bias and codon bias over clusters of six consecutive ORFs. Clusters are taken across the whole genome with a sliding window that shifts by one ORF at a time. The dinucleotide and codon bias analyses are based on formulas published by Samuel Karlin [1].
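
The sliding window described above can be sketched in Python. This is only an illustration, not the PredictBias implementation: the function name `orf_clusters` and the placeholder ORF names are hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def orf_clusters(orfs: Sequence[T], size: int = 6) -> List[Sequence[T]]:
    """Return every window of `size` consecutive ORFs, shifting by one ORF."""
    if size <= 0:
        raise ValueError("size must be positive")
    # A genome with fewer than `size` ORFs yields no clusters.
    return [orfs[i:i + size] for i in range(len(orfs) - size + 1)]


# Toy input: eight placeholder ORF names (hypothetical).
orfs = [f"orf{i}" for i in range(1, 9)]
clusters = orf_clusters(orfs)
print(len(clusters))  # number of windows is len(orfs) - size + 1
print(clusters[0])
```

With a window of six ORFs, a genome of N ORFs gives N - 5 clusters.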

%GC bias:

%GC Bias (Cluster) = %GC (Cluster) - %GC (Genome)

* Each ORF cluster with a significant %GC bias (|%GC bias| > 3.5) relative to the genome %GC is marked in the tabular output.
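
A minimal Python sketch of the %GC bias calculation follows. It assumes that %GC (Cluster) is computed over the concatenated ORF sequences of a cluster, that %GC (Genome) is computed over the whole genome sequence, and that ambiguous bases are ignored; none of these details is stated above. The function names are hypothetical.

```python
from typing import Iterable, Tuple


def gc_percent(seq: str) -> float:
    """G+C percentage of a sequence, counting only A, C, G and T."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def gc_bias(cluster_orfs: Iterable[str], genome_seq: str,
            threshold: float = 3.5) -> Tuple[float, bool]:
    """Return (%GC bias, is_significant) for one ORF cluster."""
    # Assumption: the cluster %GC is taken over the concatenated ORFs.
    cluster_seq = "".join(cluster_orfs)
    bias = gc_percent(cluster_seq) - gc_percent(genome_seq)
    return bias, abs(bias) > threshold


# Toy sequences (not real genome data).
bias, significant = gc_bias(["GGCC", "GCGC"], "ATGC" * 10)
print(bias, significant)
```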

Dinucleotide bias:

δ*(F, G) = (1/16) Σxy |ρ*xy(F) - ρ*xy(G)|, summed over all 16 dinucleotides xy, following Karlin [1].

Where:

    δ*(F, G) = dinucleotide relative abundance difference, or dinucleotide bias.

    ρ*xy(F) = dinucleotide relative abundance profile for all ORFs and their reverse complements in a gene cluster.

    ρ*xy(G) = dinucleotide relative abundance profile for all ORFs and their reverse complements in the genome.

    ρ*xy = f*xy / (f*x f*y), where f*x is the frequency of mononucleotide x and f*xy is the frequency of dinucleotide xy.

* Each ORF cluster with a significant dinucleotide bias deviation (> 3) from the mean is marked in the tabular output.

DNBias dev. = DNBias (Cluster) - Mean DNBias (Genome)
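
As a rough illustration of the dinucleotide bias, the Python sketch below computes Karlin's δ* [1] as the mean absolute difference between the ρ*xy profiles of a cluster and of the genome, with frequencies counted over each ORF and its reverse complement. The function names are hypothetical, bases other than A, C, G and T are skipped by assumption, and any rescaling that PredictBias may apply to the value is not stated above, so none is applied here.

```python
from itertools import product
from typing import Dict, Iterable

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def rho_star(orfs: Iterable[str]) -> Dict[str, float]:
    """Symmetrized dinucleotide relative abundances rho*_xy for a set of ORFs."""
    mono = dict.fromkeys(BASES, 0)
    di = {x + y: 0 for x, y in product(BASES, repeat=2)}
    for orf in orfs:
        orf = orf.upper()
        # Count on the ORF and on its reverse complement.
        for strand in (orf, reverse_complement(orf)):
            for base in strand:
                if base in mono:
                    mono[base] += 1
            for i in range(len(strand) - 1):
                pair = strand[i:i + 2]
                if pair in di:
                    di[pair] += 1
    n_mono, n_di = sum(mono.values()), sum(di.values())
    if n_mono == 0 or n_di == 0:
        return dict.fromkeys(di, 0.0)
    rho = {}
    for pair, count in di.items():
        fx, fy = mono[pair[0]] / n_mono, mono[pair[1]] / n_mono
        rho[pair] = (count / n_di) / (fx * fy) if fx and fy else 0.0
    return rho


def dinucleotide_bias(cluster_orfs: Iterable[str],
                      genome_orfs: Iterable[str]) -> float:
    """delta*(F, G): mean absolute difference over the 16 dinucleotides."""
    rho_f, rho_g = rho_star(cluster_orfs), rho_star(genome_orfs)
    return sum(abs(rho_f[p] - rho_g[p]) for p in rho_f) / 16


# Toy sequences (not real genome data).
print(dinucleotide_bias(["ATATATAT"], ["ATGCGTACCGTTAGC"]))
```

Because both strands are counted, the profile is symmetric: ρ*xy for a dinucleotide equals ρ*xy for its reverse complement (for example AC and GT).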

Codon bias:

B(F|G) = Σa pa(F) [ Σ(x,y,z)=a |f(x, y, z) - g(x, y, z)| ], following Karlin [1].

Where:

    B(F|G) = codon bias of the gene cluster relative to the genome.

    pa(F) = average amino acid frequencies in ORF cluster F.

    f(x, y, z) = average codon frequencies for the nucleotide triplets (x, y, z) in ORF cluster F, normalized to 1 in each amino-acid family (all codons translated to amino acid a).

    g(x, y, z) = average codon frequencies for the nucleotide triplets (x, y, z) in genome G, normalized to 1 in each amino-acid family (all codons translated to amino acid a).

* Each ORF cluster with a significant codon bias deviation (> 6) from the mean is marked in the tabular output.

CDNBias dev. = CDNBias (Cluster) - Mean CDNBias (Genome)
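
The codon bias can be sketched in Python along the lines of Karlin's B(F|G) [1]: for each amino acid a, the absolute differences between cluster and genome codon frequencies (each normalized to 1 within the codon family of a) are summed and weighted by the amino acid frequency pa(F). This sketch makes several assumptions that are not stated above: ORFs are in frame from their first base, codon counts are pooled across the ORFs of a cluster, stop codons and codons with ambiguous bases are skipped, and no rescaling is applied. The function names are hypothetical.

```python
from collections import Counter
from itertools import product
from typing import Dict, Iterable, Tuple

# Standard genetic code, codons ordered with bases T, C, A, G.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(_BASES, repeat=3), _AMINO)}


def codon_counts(orfs: Iterable[str]) -> Counter:
    """Pooled sense-codon counts over a set of in-frame ORFs."""
    counts: Counter = Counter()
    for orf in orfs:
        orf = orf.upper()
        for i in range(0, len(orf) - 2, 3):
            codon = orf[i:i + 3]
            if CODON_TABLE.get(codon, "*") != "*":
                counts[codon] += 1
    return counts


def family_frequencies(counts: Counter) -> Tuple[Dict[str, float], Counter]:
    """Codon frequencies normalized to 1 within each amino acid family."""
    totals: Counter = Counter()
    for codon, n in counts.items():
        totals[CODON_TABLE[codon]] += n
    freqs = {c: n / totals[CODON_TABLE[c]] for c, n in counts.items()}
    return freqs, totals


def codon_bias(cluster_orfs: Iterable[str], genome_orfs: Iterable[str]) -> float:
    """B(F|G) = sum_a p_a(F) * sum_{codons of a} |f(codon) - g(codon)|."""
    f, f_totals = family_frequencies(codon_counts(cluster_orfs))
    g, _ = family_frequencies(codon_counts(genome_orfs))
    n_codons = sum(f_totals.values())
    if n_codons == 0:
        return 0.0
    bias = 0.0
    for amino_acid, total in f_totals.items():
        p_a = total / n_codons  # amino acid frequency in the cluster
        family = [c for c, a in CODON_TABLE.items() if a == amino_acid]
        bias += p_a * sum(abs(f.get(c, 0.0) - g.get(c, 0.0)) for c in family)
    return bias


# Toy example: a cluster using only GCT for alanine, against a genome
# that uses the four alanine codons equally.
print(codon_bias(["GCTGCTGCT"], ["GCTGCCGCAGCG"]))
```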

Insertion sites:

Some tRNA genes are hot spots for the integration of foreign DNA, including PAIs. PredictBias shows the tRNA positions along the genome in both auto and manual analysis modes. In auto mode, a potential GI or PAI that contains a tRNA gene is marked 'Y' in the Insertion element column of the output. In manual mode, a vertical line labelled tRNA on the bar plot marks the position of each tRNA.

Mobility factors:

PAIs often carry cryptic or functional mobility genes, such as phage-like integrase genes or transposase genes. PredictBias shows the positions of integrases and transposases along the genome in both auto and manual analysis modes. In auto mode, a potential GI or PAI that contains an integrase or transposase gene is marked 'Y' in the Insertion element column of the output. In the graphical display, the positions of integrases and transposases are marked by vertical lines labelled integrase and transposase respectively.

Note: The locations of tRNAs, transposases and integrases are determined from the input GenBank genome file by keyword search.
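
The keyword search mentioned in the note could look like the Python sketch below. The exact keywords and fields that PredictBias searches are not given above, so everything here is an assumption: the function `classify_feature` is hypothetical, tRNAs are recognized by their GenBank feature key, and integrases and transposases by a case-insensitive match on the product description.

```python
from typing import Optional


def classify_feature(feature_type: str, product: str) -> Optional[str]:
    """Label a GenBank feature as 'tRNA', 'integrase', 'transposase' or None."""
    # Assumption: tRNA genes are identified by the feature key itself,
    # so that products such as 'tRNA synthetase' are not mislabelled.
    if feature_type == "tRNA":
        return "tRNA"
    description = product.lower()
    for keyword in ("integrase", "transposase"):
        if keyword in description:
            return keyword
    return None


# Toy feature records (hypothetical product descriptions).
for ftype, prod in [("tRNA", "tRNA-Leu"),
                    ("CDS", "Phage integrase family protein"),
                    ("CDS", "IS3 family transposase"),
                    ("CDS", "DNA polymerase III subunit alpha")]:
    print(ftype, prod, "->", classify_feature(ftype, prod))
```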

Genomes analyzed


This is the parent page, which lists the organisms for which bias analysis is available. It can be accessed by clicking Genomes analyzed at the top of the home page. The results are displayed in a table of eleven columns, as shown in Fig. 1.


Fig-1

  1. Organism: Name of the organism for which bias analysis was done.

  2. King: The kingdom to which the organism belongs (B=Bacteria, A=Archaea).

  3. Length: Length of the genome nucleotide sequence.

  4. ORFs: Number of ORFs in the genome (genes, pseudogenes, undefined genes, tRNA, rRNA, misc. RNA). Only protein-coding ORFs are used for bias analysis.

  5. %GC: G+C percentage in genome.

  6. Mean %GC bias: Mean %GC bias of all gene clusters (6 consecutive ORFs) in the genome; e.g. a genome with 4214 ORFs has 4209 gene clusters.

     Mean %GC bias = ( Σ %GC bias (Cluster) ) / Number of clusters, summed over all clusters

  7. Mean Dinucleotide bias (DNBias): Mean dinucleotide bias of all gene clusters.

     Mean DNBias = ( Σ DNBias (Cluster) ) / Number of clusters, summed over all clusters

  8. Mean Dinucleotide bias deviation (DNBias dev): Mean deviation (MD) of all gene clusters from the mean dinucleotide bias.

     Mean DNBias dev = ( Σ |DNBias (Cluster) - Mean DNBias| ) / Number of clusters, summed over all clusters

  9. Mean Codon bias (CDNBias): Mean codon bias of all gene clusters.

     Mean CDNBias = ( Σ CDNBias (Cluster) ) / Number of clusters, summed over all clusters

  10. Mean Codon bias deviation (CDNBias dev): Mean deviation (MD) of all gene clusters from the mean codon bias.

     Mean CDNBias dev = ( Σ |CDNBias (Cluster) - Mean CDNBias| ) / Number of clusters, summed over all clusters

  11. RefSeq Id: The unique identifier of each entry in the NCBI RefSeq database. The complete bias analysis for an organism can be accessed by clicking the corresponding RefSeq Id.
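
The genome-level summary columns can be illustrated with a short Python sketch. It assumes that "mean deviation (MD)" is used in its usual sense, the mean absolute deviation from the mean; the function names and the per-cluster values are made up for illustration.

```python
from typing import Sequence


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def mean_deviation(values: Sequence[float]) -> float:
    """Mean absolute deviation of the values from their mean."""
    centre = mean(values)
    return sum(abs(v - centre) for v in values) / len(values)


# Number of clusters for a genome of 4214 ORFs with a window of 6 ORFs.
n_clusters = 4214 - 6 + 1
print(n_clusters)

# Hypothetical per-cluster dinucleotide bias values.
dn_bias = [1.0, 2.0, 3.0, 6.0]
print(mean(dn_bias), mean_deviation(dn_bias))
```

Signed deviations from the mean always sum to zero, which is why the absolute value is taken before averaging.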

Bias results


PredictBias provides two modes for the identification of genomic and pathogenicity islands.

  1. Auto analysis: This is the default mode of PredictBias. PredictBias looks for six or more consecutive ORF clusters in which the codon bias deviation and either the %GC bias or the dinucleotide bias deviation exceed their threshold values, and marks the first ORF of each such cluster as part of a genomic island in the output.

    Bias results are displayed in tabular format, with each potential island's start region, end region, bias values, similarity with the virulence profile, presence of insertion elements and prediction result in adjacent columns (Fig. 2). Users can change the threshold values for %GC, dinucleotide or codon bias for a more stringent search. A Compare genome feature is also available, with which a potential island can be selected and compared with a related non-pathogenic species. Note that the Compare genome feature is meant for investigating the relative arrangement of a region in a non-pathogenic species, and should not be treated as a tool for whole-genome comparison. To assist in the identification of related non-pathogenic species, a phylogenetic tree of bacterial species is also available.

Fig-2: PredictBias results for Escherichia coli strain 536.
 

  2. Manual analysis: In default mode, PredictBias ignores regions with fewer than 6 ORFs having significant bias, as earlier analyses have shown that focusing on GIs rather than on individual putative alien genes helps reduce false positives without missing relevant HGT events [2]. To aid in the detection of small islands, PredictBias provides a bar plot of the compositional bias (y-axis) for each ORF cluster (x-axis) along the genome. The bar plot helps distinguish regions with significant bias from insignificant ones, which is crucial for detecting small islands such as the P4 'prophage' in E. coli 536 (Fig. 3).

Fig-3: PredictBias results for 'P4 prophage' in E. coli 536 genome.
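
The auto analysis rule can be sketched in Python as follows. This is one possible reading, not the PredictBias source: a cluster counts as significant when its codon bias deviation exceeds the threshold and either |%GC bias| or the dinucleotide bias deviation does too, and a candidate island is a run of at least six consecutive significant clusters. The default thresholds (3.5, 3 and 6) are the ones quoted in the Bias analysis section; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class ClusterBias:
    gc_bias: float   # %GC bias of the cluster
    dn_dev: float    # dinucleotide bias deviation
    cdn_dev: float   # codon bias deviation


def is_significant(c: ClusterBias, gc_thr: float = 3.5,
                   dn_thr: float = 3.0, cdn_thr: float = 6.0) -> bool:
    """Codon bias deviation AND (%GC bias OR dinucleotide bias deviation)."""
    return c.cdn_dev > cdn_thr and (abs(c.gc_bias) > gc_thr or c.dn_dev > dn_thr)


def candidate_islands(clusters: Sequence[ClusterBias],
                      min_run: int = 6) -> List[Tuple[int, int]]:
    """Inclusive (start, end) cluster indices of runs of >= min_run significant clusters."""
    islands: List[Tuple[int, int]] = []
    start = None
    for i, cluster in enumerate(clusters):
        if is_significant(cluster):
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_run:
                islands.append((start, i - 1))
            start = None
    # Close a run that extends to the last cluster.
    if start is not None and len(clusters) - start >= min_run:
        islands.append((start, len(clusters) - 1))
    return islands


# Toy data: 2 ordinary clusters, 6 biased, 1 ordinary, 3 biased.
ordinary = ClusterBias(gc_bias=0.5, dn_dev=0.2, cdn_dev=1.0)
biased = ClusterBias(gc_bias=-5.0, dn_dev=1.0, cdn_dev=8.0)
print(candidate_islands([ordinary] * 2 + [biased] * 6 + [ordinary] + [biased] * 3))
```

In the toy data the trailing run of three biased clusters is too short to be reported, which mirrors the reason manual analysis exists for small islands.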

 

Compare Genome

Regions of a bacterial genome that show significant bias can be compared against a closely related organism with this feature. The Compare genome feature is available in both the auto and manual analysis modes of PredictBias. In manual mode, the 'start locus tag' and 'end locus tag' of the genes at the start and end of the region of interest are required. To find the locus tag for a gene cluster, click a bar in the bar plot; a popup window appears, as shown in Fig. 4. Click set as locus tag (start) to use the gene as the start point for genome comparison, or set as locus tag (end) to use it as the end point.


Fig-4: Popup window displaying the locus tag corresponding to the first gene in Cluster 856.
 

Select the genome against which the comparison is to be carried out from the drop-down box and click Go. The results, showing the longest continuous stretches of genes homologous or similar to the query region, are displayed as shown in Fig. 5.

Fig-5: PredictBias results for genome segment comparison of Burkholderia mallei ATCC 23344 with Burkholderia thailandensis E264

Fig-6: PredictBias results for genome segment comparison of Burkholderia mallei ATCC 23344 with Burkholderia thailandensis E264

 

New bias analysis


This feature is useful when the bias analysis for a bacterial genome is not already available at PredictBias. Users can upload a genome file in GenBank format and click submit. The bias results for the query genome are then displayed in both tabular and graphical format.

GenBank format:

References:
 
  1. Karlin, S. (2001) Detecting anomalous gene clusters and pathogenicity islands in diverse bacterial genomes. Trends Microbiol. 9, 335-343.
  2. Waack, S., Keller, O., Asper, R., Brodag, T., Damm, C., Fricke, W. F., Surovcik, K., Meinicke, P. and Merkl, R. (2006) Score-based prediction of genomic islands in prokaryotic genomes using hidden Markov models. BMC Bioinformatics 7, 142.
 
Contact Sachin Pundhir for Bugs/Comments.