CRE distribution analysis Module



Outline of the steps performed by the CRE distribution analysis module:

The program scans the 1 kb upstream region of all 22229 Arabidopsis genes included in AtREA (the reference set) for the user-provided CRE and calculates the percentage of genes in the reference set that contain the CRE in their upstream region. It then scans the upstream sequences of the genes in each functionally defined or expression-defined class (from the selected category) and selects the classes in which the percentage of CRE-containing genes is greater than that in the reference set. For each of these classes the program calculates the probability of finding an equal or greater number of CRE-containing genes in the class by chance (the cumulative hypergeometric P-value). The results are sorted by P-value, and the classes in which the P-value of occurrence of CRE-containing genes is lower than the user-selected P-value threshold are reported.


CRE formats  

Cis-regulatory elements can be provided to the CRE distribution analysis module of AtREA in four different formats. The first step in running this module is to select one of the four formats from the drop-down menu.
In addition to consensus sequences based on the four basic nucleotides (A, C, G and T), AtREA also accepts the following nucleotide codes.

Code    Implication
M       A or C
R       A or G
W       A or T
S       C or G
Y       C or T
K       G or T
V       A or C or G
H       A or C or T
D       A or G or T
B       C or G or T
N       G or A or T or C
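
As an illustration of how these codes can be interpreted, the following minimal Python sketch translates a consensus sequence into a regular expression and lists where it occurs in a sequence. It is not AtREA's own code: the function names consensus_to_regex and find_sites are hypothetical, and reporting overlapping occurrences is an assumption.

```python
import re

# Nucleotide codes from the table above, written as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "[AC]", "R": "[AG]", "W": "[AT]", "S": "[CG]",
    "Y": "[CT]", "K": "[GT]", "V": "[ACG]", "H": "[ACT]",
    "D": "[AGT]", "B": "[CGT]", "N": "[ACGT]",
}


def consensus_to_regex(consensus):
    """Translate a consensus such as RCCGAC into a compiled regex."""
    body = "".join(IUPAC[base] for base in consensus.upper())
    # The lookahead makes overlapping occurrences visible (an assumption).
    return re.compile("(?=(" + body + "))")


def find_sites(consensus, sequence):
    """Return the 0-based start positions of all matches on the given strand."""
    pattern = consensus_to_regex(consensus)
    return [m.start() for m in pattern.finditer(sequence.upper())]


print(find_sites("RCCGAC", "TTACCGACGGGCCGACAA"))
```

Here R matches the A at position 2 and the G at position 10, so both ACCGAC and GCCGAC are reported.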



Consensus Sequence

The CRE sequence to be analysed (for example RCCGAC) should be entered into the consensus input box. In addition to user-defined CREs, known CRE sequences from CRE databases such as PLACE and ATCISDB can also be analysed using AtREA. Links to the lists of known CREs from these databases are included on each search page of AtREA.


 
Consensus

 

 

Position Frequency Matrices


For position frequency matrices, the first line of the input CRE should start with a ">" sign and contain the matrix name. Each of the following lines represents one base position of the CRE (starting from 1). The lines should be separated by newlines (\n) and should contain four columns giving the frequency of occurrence of the A, C, G and T nucleotides (from left to right). The columns should be separated by a single blank space.

The matrix given in the example represents the ACGTGS motif. The first line contains the name ("test_motif"). The second line contains the occurrence frequencies of the A (10), C (0), G (0) and T (0) nucleotides at the first base position. The third line similarly shows the frequencies of A, C, G and T at the second position, which in this case are 0, 10, 0 and 0.

Matrix search also requires a similarity score cutoff to be selected from the "Matrix score cutoff" menu. The values range from 0.6 to 1. Based on this cutoff, the program retains only those matches that score above the cutoff score (maximum possible score X selected cutoff value).

For our example matrix, "test_motif", the maximum possible match score is 55. Selecting the 0.9 cutoff therefore makes the program consider only those matches that score above 49.5, i.e. ACGTGG (55), ACGTGC (55), ACGTGA (50) and ACGTGT (50). The 0.8 cutoff makes the program accept all matches that score above 44, which in this case results in the selection of 34 matches (AAGTGC, AAGTGG, ACATGC, ACATGG, ACCTGC, ACCTGG, ACGAGC, ACGAGG, ACGCGC, ACGCGG, ACGGGC, ACGGGG, ACGTAC, ACGTAG, ACGTCC, ACGTCG, ACGTGA, ACGTGC, ACGTGG, ACGTGT, ACGTTC, ACGTTG, ACTTGC, ACTTGG, AGGTGC, AGGTGG, ATGTGC, ATGTGG, CCGTGC, CCGTGG, GCGTGC, GCGTGG, TCGTGC, TCGTGG).
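
The scoring described above can be reproduced with a short Python sketch. Two things are assumptions here. First, only the first two matrix rows are quoted in the text, so the remaining four rows are reconstructed from the ACGTGS motif and the maximum score of 55. Second, the score of a site is taken to be the sum of the matrix frequencies of its bases, which is consistent with the quoted scores but is not confirmed as AtREA's exact method. The function names are hypothetical.

```python
from itertools import product


def parse_pfm(text):
    """Parse a ">name" header followed by one "A C G T" row per position."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    name = lines[0].lstrip(">").strip()
    rows = [[float(x) for x in ln.split()] for ln in lines[1:]]
    return name, rows


def score(rows, site):
    """Sum the frequency of the observed base at each position (assumed scoring)."""
    return sum(row["ACGT".index(base)] for row, base in zip(rows, site))


def matches_above(rows, cutoff):
    """List every site whose score exceeds cutoff * maximum possible score."""
    threshold = cutoff * sum(max(row) for row in rows)
    kmers = ("".join(p) for p in product("ACGT", repeat=len(rows)))
    return sorted(k for k in kmers if score(rows, k) > threshold)


# Rows 3 to 6 are a reconstruction, not quoted from the documentation.
EXAMPLE = """>test_motif
10 0 0 0
0 10 0 0
0 0 10 0
0 0 0 10
0 0 10 0
0 5 5 0
"""

name, rows = parse_pfm(EXAMPLE)
print(name, matches_above(rows, 0.9))
print(len(matches_above(rows, 0.8)))
```

With this reconstructed matrix the 0.9 cutoff yields the four sites listed above and the 0.8 cutoff yields 34 sites.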




Position Frequency matrix



Matrix score cutoff:
  

CRE pairs

The input CRE can also be supplied as a structured pair, in which the first and second CREs as well as the minimum and maximum distance between them can be specified.



CRE Pairs

First CRE consensus sequence    :

Second CRE consensus sequence :

Minimum allowed distance:


Maximum allowed distance:
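
A structured pair search can be sketched in Python as follows. The documentation does not say how the distance is measured, so this sketch assumes it is the number of bases between the end of the first CRE and the start of the second, with the second CRE downstream of the first. The function names starts and find_pairs are hypothetical.

```python
import re

IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "M": "[AC]", "R": "[AG]",
         "W": "[AT]", "S": "[CG]", "Y": "[CT]", "K": "[GT]", "V": "[ACG]",
         "H": "[ACT]", "D": "[AGT]", "B": "[CGT]", "N": "[ACGT]"}


def starts(consensus, seq):
    """Return 0-based start positions of a consensus, overlaps included."""
    pattern = "(?=(" + "".join(IUPAC[b] for b in consensus.upper()) + "))"
    return [m.start() for m in re.finditer(pattern, seq.upper())]


def find_pairs(first, second, min_dist, max_dist, seq):
    """Return (first_start, second_start) tuples with an allowed gap.

    Assumption: distance = bases between the end of the first CRE and the
    start of the second CRE.
    """
    hits = []
    for i in starts(first, seq):
        end_first = i + len(first)
        for j in starts(second, seq):
            if min_dist <= j - end_first <= max_dist:
                hits.append((i, j))
    return hits


# ACGTG at position 0, three spacer bases, RCCGAC at position 8.
print(find_pairs("ACGTG", "RCCGAC", 0, 5, "ACGTGTTTGCCGACAA"))
```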


 


Multiple CRE consensus sequences

AtREA can also receive a combination of different CREs as input. In the current version, only CREs in the consensus format are allowed for a multiple CRE search. The individual CREs must be separated by a hash (#) sign. In this case the program scans the sequences for the individual CREs and selects the sequences that show co-occurrence of all the input CREs.

Multiple CREs 
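
The co-occurrence rule can be illustrated with a minimal Python sketch. The promoter identifiers and sequences below are invented for the example, and the function names contains and co_occurring are hypothetical. Only the "#" separator and the requirement that every CRE be present come from the description above.

```python
import re

IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "M": "[AC]", "R": "[AG]",
         "W": "[AT]", "S": "[CG]", "Y": "[CT]", "K": "[GT]", "V": "[ACG]",
         "H": "[ACT]", "D": "[AGT]", "B": "[CGT]", "N": "[ACGT]"}


def contains(consensus, seq):
    """True if the consensus occurs at least once in the sequence."""
    pattern = "".join(IUPAC[b] for b in consensus.upper())
    return re.search(pattern, seq.upper()) is not None


def co_occurring(query, promoters):
    """Return IDs of promoters that contain every CRE of a '#'-separated query."""
    cres = [c.strip() for c in query.split("#") if c.strip()]
    return [gene for gene, seq in promoters.items()
            if all(contains(c, seq) for c in cres)]


# Made-up promoters: only geneA carries both ACGTG and RCCGAC.
promoters = {
    "geneA": "TTACGTGTTGCCGACTT",
    "geneB": "TTACGTGTTTTTT",
    "geneC": "GGGGG",
}
print(co_occurring("ACGTG#RCCGAC", promoters))
```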

 





Classes

Five different ontology categories can be analysed for enrichment of the user-provided CRE.

Category header   Ontology
GOBP              Gene Ontology Biological Process
GOMF              Gene Ontology Molecular Function
GOCC              Gene Ontology Cellular Component
MIPS              MIPS FUNCAT classes
ARACYC            ARACYC pathways


For each of these categories, all classes from the ontology group that contain at least 8 genes can be studied by selecting the corresponding category header from the "class category menu".

In addition to ontology classes, genes induced or repressed under different conditions, identified from different microarray slides, can also be analysed using AtREA (please see the documentation for details). The option "induced" can be used to study the over-representation of the user-provided CRE in the induced genes from each of the slides in the expression dataset. Selecting the "repressed" option from the class category menu similarly analyses the over-representation of the provided CRE in the repressed genes from each of the slides.

As the characterization of genes as induced or repressed depends on an expression score cutoff, use of the induced and repressed options requires an expression cutoff to be selected from the "Fold Cutoff" menu. The options are score cutoffs in log2 format and range from 1.2 to 4 (i.e. ~1.4 to 16 fold). Selecting the option "induced" from the category menu and 2 from the fold cutoff menu therefore analyses the distribution of the CRE in genes that show 4-fold or greater induction in each slide, whereas selecting the "repressed" option from the category menu and 2 from the fold cutoff menu analyses the distribution of the CRE in genes that show 4-fold or greater repression in each slide of the expression dataset.



   Fold Cutoff (only for expression classes) :
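
The relation between the log2 cutoff and the fold change can be sketched in Python as follows. The sketch assumes that each gene on a slide has a log2 expression ratio, that induced genes are those at or above the cutoff and that repressed genes are those at or below the negative cutoff. The gene names, ratios and function names are hypothetical.

```python
def fold_change(log2_cutoff):
    """Convert a log2 cutoff from the menu into a fold change."""
    return 2 ** log2_cutoff


def select_genes(log2_ratios, log2_cutoff, mode):
    """Pick the induced or repressed genes of one slide.

    Assumption: log2_ratios maps gene -> log2(treatment / control).
    """
    if mode == "induced":
        return [g for g, r in log2_ratios.items() if r >= log2_cutoff]
    if mode == "repressed":
        return [g for g, r in log2_ratios.items() if r <= -log2_cutoff]
    raise ValueError("mode must be 'induced' or 'repressed'")


# Invented ratios for a single slide.
slide = {"gene1": 2.5, "gene2": -3.0, "gene3": 0.4}
print(fold_change(2), select_genes(slide, 2, "induced"))
```

A cutoff of 2 corresponds to a 4-fold change and a cutoff of 4 to a 16-fold change.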

 




Sequence Features

To incorporate information on strand, position or frequency preference (if already known) into the CRE distribution analysis, these features can be specified from the sequence features menu.

  • Position Windows: In addition to the entire 1 kb upstream region (from the TSS), which is the default (0-1000), five different position windows can be selected from this menu. The position window 800-1000 is closest to the TSS, whereas the window 0-200 is most distant from the TSS.

  • Strand: The options in the strand menu are coding, reverse and both. When the both strands option is selected, the program searches both strands for the input CRE and reports presence when the CRE is detected on either strand.

  • Minimum frequency: In AtREA a promoter is assigned one of two states, 0 or 1, depending on the absence or presence of the user-provided CRE. The minimum frequency option lets users select the minimum number of occurrences of the CRE that a promoter must contain for AtREA to consider it a CRE-containing promoter (state 1). Selecting a minimum frequency of 4 makes AtREA consider promoters that contain at least 4 occurrences of the CRE to be CRE-containing (state 1), whereas promoters containing 1, 2 or 3 occurrences are treated as CRE-less promoters (state 0). The options in this menu range from 1 to 8, with 1 as the default value.
NOTE: The choice of minimum frequency should be made on the basis of CRE length or of information about the impact of multiple occurrences of the CRE in a promoter sequence. The results of the "CRE Feature analysis" module may also provide an indication of the impact of multiple occurrences of a CRE in a promoter.
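
The three sequence features can be combined in a minimal Python sketch such as the one below. Several details are assumptions rather than documented AtREA behaviour: promoter index 0 is taken as the base most distant from the TSS, a site counts only if it lies entirely inside the window, the reverse strand is searched with the reverse complement of the consensus, and a site found at the same position on both strands is counted once. All function names are hypothetical.

```python
import re

IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "M": "[AC]", "R": "[AG]",
         "W": "[AT]", "S": "[CG]", "Y": "[CT]", "K": "[GT]", "V": "[ACG]",
         "H": "[ACT]", "D": "[AGT]", "B": "[CGT]", "N": "[ACGT]"}
COMPLEMENT = str.maketrans("ACGTMRWSYKVHDBN", "TGCAKYWSRMBDHVN")


def reverse_complement(consensus):
    """Reverse complement of a consensus, ambiguity codes included."""
    return consensus.upper().translate(COMPLEMENT)[::-1]


def count_sites(consensus, promoter, strand="both", window=(0, 1000)):
    """Count CRE occurrences on the chosen strand(s) inside the window."""
    lo, hi = window
    queries = []
    if strand in ("coding", "both"):
        queries.append(consensus.upper())
    if strand in ("reverse", "both"):
        queries.append(reverse_complement(consensus))
    positions = set()  # a position hit on both strands is counted once
    for query in queries:
        pattern = "(?=(" + "".join(IUPAC[b] for b in query) + "))"
        for m in re.finditer(pattern, promoter.upper()):
            if lo <= m.start() and m.start() + len(query) <= hi:
                positions.add(m.start())
    return len(positions)


def promoter_state(consensus, promoter, min_frequency=1, **features):
    """Return 1 if the promoter has at least min_frequency sites, else 0."""
    return 1 if count_sites(consensus, promoter, **features) >= min_frequency else 0


# Toy promoter: GTCGGC (reverse-strand site) ... GCCGAC (coding-strand site).
toy = "GTCGGC" + "A" * 10 + "GCCGAC"
print(count_sites("RCCGAC", toy, strand="both"))
print(promoter_state("RCCGAC", toy, min_frequency=2, strand="coding"))
```

In the toy promoter the coding strand carries one site and the reverse strand another, so the promoter reaches state 1 at minimum frequency 2 only when both strands are searched.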



Analysis of Enrichment and P-value cutoff


The enrichment of a CRE in a class is estimated using the hypergeometric test (for a brief introduction to the hypergeometric test see this link). For each class (from the selected category) the program calculates the cumulative hypergeometric probability of occurrence of CRE-containing genes, i.e. the probability of observing the same or a greater number of CRE-containing genes in a randomly selected gene set of the same size. As multiple classes are analysed simultaneously, the P-values obtained by the hypergeometric test are adjusted using the Bonferroni correction method. The P-value cutoff option allows the user to select a P-value threshold. Classes from the selected category for which the P-value of enrichment of the input CRE is lower than the specified cutoff are displayed in the output.
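
The test can be written out in a few lines of Python using exact integer arithmetic. This is a sketch, not AtREA's implementation. It assumes that the correction factor is the number of classes passed in, that the corrected P-value is capped at 1, and that the cutoff is compared with the corrected P-value. None of these details is stated explicitly above. The names hypergeom_tail, bonferroni and enriched_classes are hypothetical.

```python
from math import comb


def hypergeom_tail(gc, tc, gr, tr):
    """P(X >= gc) when tc genes are drawn at random from tr genes,
    gr of which contain the CRE."""
    upper = min(tc, gr)
    favourable = sum(comb(gr, k) * comb(tr - gr, tc - k)
                     for k in range(gc, upper + 1))
    return favourable / comb(tr, tc)


def bonferroni(p_value, n_classes):
    """Multiply by the number of classes; capping at 1 is an assumption."""
    return min(1.0, p_value * n_classes)


def enriched_classes(classes, gr, tr, cutoff):
    """classes maps a class name to (gc, tc). Returns rows sorted by P-value."""
    n_classes = len(classes)
    rows = []
    for name, (gc, tc) in classes.items():
        if gc / tc <= gr / tr:  # keep only classes above the reference rate
            continue
        p = hypergeom_tail(gc, tc, gr, tr)
        corrected = bonferroni(p, n_classes)
        if corrected < cutoff:
            rows.append((name, p, corrected))
    return sorted(rows, key=lambda row: row[1])


# 7 CRE-containing genes among 16, against 1901 among 22229 in the reference set.
print(hypergeom_tail(7, 16, 1901, 22229))
```

For 7 CRE-containing genes in a class of 16, with 1901 CRE-containing genes among the 22229 reference genes, the tail probability is of the order of 2e-4.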




Interpretation of outputs


Column  Header                                                                     Example value
1       Class                                                                      GO:0004805 (trehalose-phosphatase activity)
2       Number of CRE containing genes in class (Gc)                               7
3       Number of genes in class (Tc)                                              16
4       Percentage of CRE containing genes in class (Pc = (Gc/Tc) X 100)           43.750
5       Number of CRE containing genes in reference set (Gr)                       1901
6       Number of genes in reference set (Tr)                                      22229
7       Percentage of CRE containing genes in reference set (Pr = (Gr/Tr) X 100)   8.552
8       Ratio (Pc/Pr)                                                              5.12
9       Hypergeometric P-value                                                     0.0001892
10      Corrected P-value                                                          0.04959604
11      Genes which contain input CRE                                              At1g68020, At2g22190, At4g12430, At4g22590, At4g39770, At5g51460, At5g65140

Figure 1


The output of the AtREA CRE distribution module (Figure 1) contains the following columns:

  • The first column contains the name of the functional class or expression slide, along with links to sites that contain more information about the functional class or slide.

  • The next three columns (2-4) contain the number of CRE-containing promoters in the class, the total number of genes in the class and the percentage of promoters from the class that contain the CRE.

  • The next three columns (5-7) contain the same set of data for the reference set (which contains 22229 1 kb upstream sequences).

  • Column 8 contains the ratio of the percentage of CRE-containing promoters in the class to that in the reference set.
  • Column 9 (Hypergeometric P-value) contains the cumulative hypergeometric P-value of observing, by chance, an equal or greater number of CRE-containing promoters in a set containing the same number of genes. Column 10 (Corrected P-value) is obtained by multiplying the cumulative hypergeometric P-value by the total number of classes analysed.
  • The last column lists the genes from the class that contain the user-defined CRE in their upstream sequence.
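
The percentage and ratio columns follow directly from the four counts, as the short Python sketch below shows for the row in Figure 1. The function name derived_columns and the rounding to three and two decimal places are assumptions chosen to match the displayed values.

```python
def derived_columns(gc, tc, gr, tr):
    """Compute Pc, Pr and their ratio from the counts in columns 2, 3, 5 and 6."""
    pc = gc / tc * 100  # percentage of CRE-containing genes in the class
    pr = gr / tr * 100  # percentage of CRE-containing genes in the reference set
    return {"Pc": round(pc, 3), "Pr": round(pr, 3), "Ratio": round(pc / pr, 2)}


# Counts from the GO:0004805 row: Gc = 7, Tc = 16, Gr = 1901, Tr = 22229.
print(derived_columns(7, 16, 1901, 22229))
```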

Results from a sample run (with the DREB motif) of the distribution analysis module in different class categories