Download notes and changelog - Bioinformatics.org

PeakAnnotator 1.0 (tar/gz)

Release notes:

////////////////////////////////////////////////////////////////////////////////
INSTALLATION FROM SOURCE
////////////////////////////////////////////////////////////////////////////////

Go to the PeakAnnotator.src folder and compile the source files using:
g++ -o PeakAnnotator *.cpp
An executable file named PeakAnnotator will be generated.

////////////////////////////////////////////////////////////////////////////////
USAGE
////////////////////////////////////////////////////////////////////////////////

To launch the program, open a terminal window, go to the folder where the
PeakAnnotator executable file is located, and type ./PeakAnnotator to list
the three utilities of the program:

>PeakAnnotator
NDG for each peak finds its closest downstream gene on both strands
TSS for each peak finds the distance to its closest TSS
ODS finds overlaps between two position files

Type PeakAnnotator to get help about the options specific to each utility.

*** utility:
This can be one of "NDG, TSS, ODS"
1. NDG - For each locus, search for its Nearest Downstream Genes on both
the forward and reverse strand.
If the position of the locus is within a gene, the program describes
in which part of that gene the locus is located.
2. TSS - For each locus, find its closest TSS (transcription start site).
In order to do this, the program searches both upstream and downstream for
the closest genes to the genomic coordinate.
3. ODS - Compare two position files to identify overlapping and
unique genomic locations. Uses random regions matched for chromosome and
length to calculate an enrichment over random and a p-value.

*** Peak File
The file lists the genomic coordinates output by a peak calling program
(or obtained in some other way). The format should be tab/space delimited,
where each locus is described by its "chromosome", "start" and "end" location.
PLEASE REMOVE ANY HEADER LINES FROM THE FILE IF THESE ARE PRESENT
THIS FILE SHOULD BE SORTED ACCORDING TO CHROMOSOME AND START LOCATION
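
The two requirements above (no header lines, sorted by chromosome and start) can be met with a small preprocessing step. The following Python sketch is one possible way to do it and is not part of PeakAnnotator. It assumes that a data line is any line whose second and third fields are non-negative integers, and that plain string ordering of chromosome names is acceptable; both are assumptions, and the function name is hypothetical.

```python
def clean_and_sort_peaks(lines):
    """Drop header/blank lines and sort peaks by chromosome, then start."""
    peaks = []
    for line in lines:
        fields = line.split()  # accepts tab- or space-delimited input
        # Assumed heuristic: a data line has integer start and end fields.
        if len(fields) < 3 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue
        peaks.append((fields[0], int(fields[1]), int(fields[2]), fields[3:]))
    # Plain string ordering on the chromosome name (chr10 sorts before chr2).
    peaks.sort(key=lambda p: (p[0], p[1], p[2]))
    # Write the result tab-delimited, keeping any extra columns.
    return ["\t".join([c, str(s), str(e)] + rest) for c, s, e, rest in peaks]


raw = [
    "chrom start end",   # header line, will be dropped
    "chr2 500 600",
    "chr1\t300\t400",
    "chr1 100 200",
]
sorted_peaks = clean_and_sort_peaks(raw)
```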

*** Annotation File
The file lists the features/genes of interest and their location in the genome.
This file should be in BED format, which can be obtained using the UCSC table browser.
The BED format is defined in "http://genome.ucsc.edu/FAQ/FAQformat#format1".

The annotation file requires three fields: chromosome, start and end locations.
However, if the features of interest are genes, it is highly recommended that the
annotation file include the nine additional optional BED fields. These can be
output by the UCSC table browser by selecting "BED-browser extensible data".

Requirements for BED file format - NDG utility:
The following fields (columns) are recommended for the NDG utility:
chrom, chromStart, chromEnd, name, strand, thickStart, thickEnd, blockCount,
blockSizes, blockStarts.

Requirements for BED file format - TSS utility:
The following fields (columns) are required:
chrom, chromStart, chromEnd, strand.

Please note that according to BED format, lower-numbered fields (columns) must
always be present if higher-numbered fields are used. Hence, although the field
"name" is not required for TSS, it should be specified in the file (inserting any
character in column number 4 in the file is sufficient).

*** Output File
An output file name must be specified.

*** Use5endDistance
This is a required parameter for the "TSS" utility.
This parameter defines how the distance is calculated.
If "true", the distance will be calculated between the central position of the peak and the 5'-end of the gene.
If "false", the distance will be calculated between the central position of the peak and the TSS (transcription start site) of the gene.
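
In both modes the distance is measured from the central position of the peak to a reference position on the gene. The Python sketch below illustrates the 5'-end case only. It assumes an integer midpoint, that the 5' end of a forward-strand feature is its start and that of a reverse-strand feature is its end, and a sign convention in which negative values mean the peak lies upstream. None of these details are confirmed by these notes, and the function names are hypothetical.

```python
def peak_center(start, end):
    """Central position of a peak (integer midpoint, an assumption)."""
    return (start + end) // 2


def five_prime_end(feat_start, feat_end, strand):
    """5' end of a feature: its start on '+', its end on '-'."""
    return feat_start if strand == "+" else feat_end


def signed_distance(peak_start, peak_end, feat_start, feat_end, strand):
    """Distance from the peak center to the feature's 5' end.

    Assumed convention: negative = peak is upstream of the 5' end,
    positive = peak is downstream, relative to the feature's strand.
    """
    offset = peak_center(peak_start, peak_end) - five_prime_end(
        feat_start, feat_end, strand
    )
    return offset if strand == "+" else -offset


# Peak 900-1100 has its center at 1000.
d_forward = signed_distance(900, 1100, 1500, 3000, "+")  # 5' end at 1500
d_reverse = signed_distance(900, 1100, 200, 800, "-")    # 5' end at 800
```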

*** Symbol File
This is an optional parameter for the "NDG" and "TSS" utilities.
The symbol file maps accession numbers to gene symbols; these can be obtained using
the BioMart feature of Ensembl or from the UCSC table browser.

*** chrSizeFile
This is an optional parameter for the "ODS" utility.
This file specifies the size of each chromosome. If it is provided, a randomization test is performed
in order to calculate the intersection p-value and the enrichment over random.

*** numRandomDatasets
Number of random datasets to generate when calculating the overlap p-value (default 1000).
Random regions, matched by chromosome and length to the regions in the first file, are intersected with the second file.
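
The randomization described above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the program's actual implementation: intervals are treated as half-open, random start positions are drawn uniformly, the enrichment is taken as observed count divided by mean random count, and the p-value is the add-one empirical estimate. All function names are hypothetical.

```python
import random


def count_overlaps(regions, other_by_chrom):
    """Count regions that overlap at least one region of the other set."""
    count = 0
    for chrom, start, end in regions:
        if any(s < end and start < e for s, e in other_by_chrom.get(chrom, [])):
            count += 1
    return count


def randomize(regions, chrom_sizes, rng):
    """Move each region to a random place on the same chromosome, same length."""
    shuffled = []
    for chrom, start, end in regions:
        length = end - start
        new_start = rng.randint(0, chrom_sizes[chrom] - length)
        shuffled.append((chrom, new_start, new_start + length))
    return shuffled


def overlap_enrichment(first, second, chrom_sizes, num_random=1000, seed=0):
    """Return (observed overlaps, enrichment over random, empirical p-value)."""
    rng = random.Random(seed)
    second_by_chrom = {}
    for chrom, start, end in second:
        second_by_chrom.setdefault(chrom, []).append((start, end))
    observed = count_overlaps(first, second_by_chrom)
    random_counts = [
        count_overlaps(randomize(first, chrom_sizes, rng), second_by_chrom)
        for _ in range(num_random)
    ]
    mean_random = sum(random_counts) / num_random
    at_least = sum(1 for n in random_counts if n >= observed)
    p_value = (at_least + 1) / (num_random + 1)
    enrichment = observed / mean_random if mean_random > 0 else float("inf")
    return observed, enrichment, p_value


first = [("chr1", 100, 200), ("chr1", 5000, 5100)]
second = [("chr1", 150, 250), ("chr1", 5050, 5200)]
observed, enrichment, p_value = overlap_enrichment(
    first, second, {"chr1": 1000000}, num_random=100
)
```

A larger number of random datasets gives a finer-grained p-value at the cost of run time.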

////////////////////////////////////////////////////////////////////////////////
OUTPUT FILES
////////////////////////////////////////////////////////////////////////////////

The output of the "NDG" utility is two tab-delimited files:
**************************************************************

A. "OutputFileName" as specified in the command line

This file describes the closest downstream genes for each genomic locus, and contains the following fields:
1. Chromosome
2. Start
3. End - These first three columns describe the location of the peak in the genome.
4. # Overlapped_Genes - Number of transcripts overlapping the genomic locus.
More details about these genes are reported in the second output file described below.
5. Downstream_FW_Gene - ID of the closest downstream gene on the forward strand.
(6. Symbol - If a symbol file is specified, this field will contain the symbol of the closest downstream gene on the forward strand.)
7. Distance - Peak distance to its closest downstream gene on the forward strand.
8. Downstream_REV_gene - ID of the closest downstream gene on the reverse strand.
(9. Symbol - If a symbol file is specified, this field will contain the symbol of the closest downstream gene on the reverse strand.)
10. Distance - Peak distance to its closest downstream gene on the reverse strand.

B. "Overlap_OutputFileName"

This file describes the transcripts overlapping the peaks, if any such are found.
1. Chromosome
2. Start
3. End - These first three columns describe the location of the peak in the genome.
4. OverlapGene - Overlapping gene ID
(5. Symbol - If a symbol file is specified, this field will contain the overlapping gene symbol.)
6. Overlap_Begin - The part of the gene that the peak's start position overlaps
7. Overlap_Center - The part of the gene that the peak's central position overlaps
8. Overlap_End - The part of the gene that the peak's end position overlaps

The output of the TSS option is a tab-delimited file:
*****************************************************

"OutputFileName" as specified in the command line

This file contains the following fields:
1. Chromosome
2. Start
3. End - These first three columns describe the location of the peak in the genome.
4. Distance - The distance from the peak to its closest TSS.
5. GeneStart - The start location of the closest gene on the genome.
6. GeneEnd - The end location of the closest gene on the genome.
7. ClosestTSS_ID - ID of the closest gene.
(8. Symbol - If a symbol file is specified, this field will contain the symbol of the closest gene.)
9. Strand - Strand of the closest gene.

The output of the "ODS" option is three tab-delimited files:
******************************************************************

A. "OutputFileName" as specified in the command line

Each line in this file describes an overlap event between two genomic loci, and has the following fields:
1. Chromosome
2. peakFile1_Start - Start location of the first genomic locus
3. peakFile1_End - End location of the first genomic locus
4. peakFile1_Name - Name of the first genomic locus (if it exists in the input file)
5. peakFile2_Start - Start location of the second genomic locus
6. peakFile2_End - End location of the second genomic locus
7. peakFile2_Name - Name of the second genomic locus (if it exists in the input file)

B+C. "PeakFileName.unique" - One file for each input peak file, listing the peaks unique to that file.
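
The relationship between the overlap file and the two unique files can be illustrated with a short Python sketch. It is a naive pairwise comparison written for clarity, not the program's algorithm. It assumes half-open intervals and that a peak is "unique" when it overlaps no peak in the other file; the function name is hypothetical.

```python
def overlap_and_unique(peaks1, peaks2):
    """Return (overlap events, peaks unique to file 1, peaks unique to file 2)."""
    pairs = []
    hit1, hit2 = set(), set()
    for i, (c1, s1, e1) in enumerate(peaks1):
        for j, (c2, s2, e2) in enumerate(peaks2):
            # Same chromosome and intersecting half-open intervals.
            if c1 == c2 and s1 < e2 and s2 < e1:
                pairs.append((c1, s1, e1, s2, e2))
                hit1.add(i)
                hit2.add(j)
    unique1 = [p for i, p in enumerate(peaks1) if i not in hit1]
    unique2 = [p for j, p in enumerate(peaks2) if j not in hit2]
    return pairs, unique1, unique2


peaks1 = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
peaks2 = [("chr1", 150, 250), ("chr2", 300, 400)]
pairs, unique1, unique2 = overlap_and_unique(peaks1, peaks2)
```

With input sorted by chromosome and start position, the same result can be obtained with a single sweep over both files instead of comparing every pair.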

Changelog:

PeakAnnotator 1.0 (tar/gz)

Release notes:

////////////////////////////////////////////////////////////////////////////////
INSTALLATION
////////////////////////////////////////////////////////////////////////////////

This folder contains the executable file PeakAnnotator.jar and the archive
PeakAnnotator.src.zip, which contains the source directories.
You can move the "PeakAnnotator.jar" executable file anywhere in your file system
and set the PATH to this location.

////////////////////////////////////////////////////////////////////////////////
USAGE
////////////////////////////////////////////////////////////////////////////////

You should have Java 1.5 or later installed.
In order to launch the program, open a terminal window, go to the folder
where the jar file is located, and type

java -jar -Xmx512m PeakAnnotator.jar <-u utility> [options]

Options include:

help,-?                  displays help information
-u,--utility             utility: NDG, TSS, ODS
-p,--peakFile            input peak file
-a,--annotationFile      input annotation GTF or BED file
-d,--use5EndDistance     if true, the distance is calculated relative to the 5' end; if false, relative to the TSS (default true)
-p2,--peakFile2          input second peak file
-o,--outDir              output folder
-x,--prefix              string to add to output file names
-s,--symbolFile          optional input symbol file
-g,--geneType            gene type for annotation: protein_coding or all
-cs,--chrSizeFile        file indicating chromosome sizes
-r,--numRandomDatasets   number of random datasets to generate when calculating the overlap p-value (default 1000)

Run the program with -u to get help about the options specific to each utility.
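
As a hedged illustration of how the options above combine, the following Python sketch assembles a command line for the TSS utility. Only the flags and the -Xmx512m setting come from these notes. The file names are hypothetical, the function name is hypothetical, and passing the value of -d as the literal word "true" or "false" is an assumption.

```python
def build_tss_command(jar, peak_file, annotation_file, out_dir,
                      use_5_end=True, symbol_file=None):
    """Build the argument list for a TSS run (the command is not executed here)."""
    cmd = [
        "java", "-jar", "-Xmx512m", jar,
        "-u", "TSS",
        "-p", peak_file,
        "-a", annotation_file,
        "-o", out_dir,
        # Assumption: the flag takes the literal values "true" / "false".
        "-d", "true" if use_5_end else "false",
    ]
    if symbol_file is not None:
        cmd += ["-s", symbol_file]  # optional symbol file
    return cmd


# Hypothetical file names, for illustration only.
cmd = build_tss_command("PeakAnnotator.jar", "myPeaks.txt", "genes.gtf",
                        "results", use_5_end=False)
# To actually run it: subprocess.run(cmd, check=True)
```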

*** -u/--utility
This can be one of "NDG, TSS, ODS"
1. NDG - For each locus, search for its Nearest Downstream Genes on both
the forward and reverse strand.
If the position of the locus is within a gene, the program describes
in which part of that gene the locus is located.
2. TSS - For each locus, find its closest TSS (transcription start site).
In order to do this, the program searches both upstream and downstream for
the closest genes to the genomic coordinate.
3. ODS - Compare two position files to identify overlapping and
unique genomic locations.

*** -p/--peakFile
The file lists the genomic coordinates output by a peak calling program
(or obtained in some other way). The format should be tab/space delimited,
where each locus is described by its "chromosome", "start" and "end" location.
This file should be sorted by chromosome and start position.
PLEASE REMOVE ANY HEADER LINES FROM THE FILE IF THESE ARE PRESENT

*** -a/--annotationFile
This is a REQUIRED parameter for the "NDG" and "TSS" utilities.
The file lists the features/genes of interest and their locations in the
genome, in one of two formats:
1. GTF format - can be downloaded from Ensembl ftp site at:
http://www.ensembl.org/info/data/ftp/index.html
GTF FILES ARE EXPECTED TO CONTAIN THE SUFFIX ".gtf"

The GTF format is recommended unless you are interested in annotating your
peaks relative to features other than genes. In that case you can use the BED
file format described below.

2. BED format - can be downloaded from the UCSC table browser tool.
The BED format is defined in "http://genome.ucsc.edu/FAQ/FAQformat#format1".

Requirements for BED file format - NDG utility:
The following fields (columns) are required:
chrom, chromStart, chromEnd, name, strand, thickStart, thickEnd, blockCount,
blockSizes, blockStarts.

Requirements for BED file format - TSS utility:
The following fields (columns) are required:
chrom, chromStart, chromEnd, strand.

*** -d/--use5EndDistance (default true)
This is a required parameter for the "TSS" utility.
This parameter defines how the distance is calculated.
If "true", the distance will be calculated between the central position of the peak and the 5'-end of the gene.
If "false", the distance will be calculated between the central position of the peak and the TSS (transcription start site) of the gene.

*** -p2/--peakFile2
This is a REQUIRED parameter for the "ODS" utility.
The format is the same as for the first peakFile (refer to the -p/--peakFile help).

*** -o/--outDir
This is a REQUIRED parameter for PeakAnnotator.
An output directory must be specified where PeakAnnotator can write result files.

*** -x/--prefix
String to add to output file names, for example when the same peak files are to be
analyzed using different parameters.

*** -s/--symbolFile
This is an optional parameter for the "NDG" and "TSS" utilities.
The symbol file maps accession numbers to gene symbols; these can be obtained using
the BioMart feature of Ensembl or from the UCSC table browser.
This option is necessary when using a BED format annotation file, since these files do not
contain gene symbols. A symbol file is not required for Ensembl GTF annotation files.

*** -g/--geneType
When the annotation file is in GTF format, the user has the option to choose the
category of genes considered for annotation: either "protein_coding" or "all".
"all" includes protein coding as well as non-protein coding genes such as miRNAs
and other non-coding RNAs.

*** -cs/--chrSizeFile
This is an optional parameter for the "ODS" utility.
This file specifies the size of each chromosome. If it is provided, a randomization test is performed
in order to calculate the intersection p-value and the enrichment over random.

*** -r/--numRandomDatasets
Number of random datasets to generate when calculating the overlap p-value (default 1000).
Random regions, matched by chromosome and length to the regions in the first file, are intersected with the second file.

////////////////////////////////////////////////////////////////////////////////
OUTPUT FILES
////////////////////////////////////////////////////////////////////////////////

The output of the "NDG" utility is three tab-delimited files:
**************************************************************

A. "peakFileName.ndg.peakFileNameSuffix"

For example, if the input peak file is "myPeaks.test", the output file will be
"myPeaks.ndg.test".
This file identifies the closest downstream genes for each locus, and contains
the following fields:
1. Chromosome
2. Start
3. End - These first three columns describe the genomic location of the peak.
4. # Overlapped_Genes - Number of transcripts overlapping the genomic locus.
Details about these genes are reported in the second output file described below.
5. Downstream_FW_Gene - ID of the closest downstream gene on the forward strand.
6. Symbol - Symbol of the closest downstream gene on the forward strand.
7. Distance - Peak distance to its closest downstream gene on the forward strand.
8. Downstream_REV_gene - ID of the closest downstream gene on the reverse strand.
9. Symbol - Symbol of the closest downstream gene on the reverse strand.
10. Distance - Peak distance to its closest downstream gene on the reverse strand.

B. "peakFileName.overlap.peakFileNameSuffix"

For example, if the input peak file is "myPeaks.test", the output file will be
"myPeaks.overlap.test".
This file describes the transcripts overlapping the peaks, if any such are found.
1. Chromosome
2. Start
3. End - These first three columns describe the genomic location of the peak.
4. OverlapGene - Overlapping gene ID
5. Symbol - Overlapping gene symbol
6. Overlap_Begin - In which part of the gene does the peak's start position overlap
7. Overlap_Center - In which part of the gene does the peak's central position overlap
8. Overlap_End - In which part of the gene does the peak's end position overlap

C. "peakFileName.summary.peakFileNameSuffix"

For example, if the input peak file is "myPeaks.test", the output file will be
"myPeaks.summary.test".
This file contains the following fields:
1. Chromosome
2. Start
3. End - These first three columns describe the genomic location of the peak.
4. OverlapGene - Overlapping gene Symbol.
5. Downstream Gene - Nearest downstream gene.
6. Distance - Peak distance to its nearest downstream gene.

The output of the TSS option is a tab-delimited file:
*****************************************************

"peakFileName.tss.peakFileNameSuffix"

For example, if the input peak file is "myPeaks.test", the output file will be
"myPeaks.tss.test"

This file contains the following fields:
1. Chromosome
2. Start
3. End - These first three columns describe the genomic location of the peak.
4. Distance - The distance from the peak to its closest TSS.
5. GeneStart - The start location of the closest gene on the genome.
6. GeneEnd - The end location of the closest gene on the genome.
7. ClosestTSS_ID - ID of the closest gene.
8. Symbol - Symbol of the closest gene.
9. Strand - Strand of the closest gene.
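
The NDG and TSS output file names follow the pattern "peakFileName.<tag>.peakFileNameSuffix", where the tag is ndg, overlap, summary or tss. The Python sketch below derives such a name from an input peak file name. It is an illustration only: the function name is hypothetical, the -x/--prefix option is not handled, and the behaviour for a file name without a suffix is an assumption.

```python
import os


def output_name(peak_file, tag):
    """Insert the utility tag between the file name and its suffix."""
    base = os.path.basename(peak_file)
    stem, suffix = os.path.splitext(base)  # "myPeaks.test" -> ("myPeaks", ".test")
    return stem + "." + tag + suffix


tss_name = output_name("myPeaks.test", "tss")
ndg_name = output_name("myPeaks.test", "ndg")
```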

The output of the "ODS" option is three tab-delimited files:
******************************************************************

A. "peakFile1_peakFile2.overlap.txt"

For example, if the input peak files are "myPeaks1.txt" and "myPeaks2.txt",
the output file will be "myPeaks1_myPeaks2.overlap.txt"

Each line in this file describes an overlap event between two genomic loci, and
has the following fields:
1. Chromosome
2. peakFile1_Start - Start location of the first genomic locus
3. peakFile1_End - End location of the first genomic locus
4. peakFile1_Name - Name of the first genomic locus (if it exists in the input file)
5. peakFile2_Start - Start location of the second genomic locus
6. peakFile2_End - End location of the second genomic locus
7. peakFile2_Name - Name of the second genomic locus (if it exists in the input file)

B+C. Unique files - One file for each input peak file, listing the peaks unique to that file.

Changelog: