PeakAnnotator 1.0 (tar/gz)

Release notes:

////////////////////////////////////////////////////////////////////////////////
INSTALLATION FROM SOURCE
////////////////////////////////////////////////////////////////////////////////

Go to the PeakAnnotator.src folder and compile the source files using
g++ -o PeakAnnotator *.cpp 
An executable file named PeakAnnotator will be generated.

////////////////////////////////////////////////////////////////////////////////
USAGE
////////////////////////////////////////////////////////////////////////////////

To launch the program, open a terminal window, 
go to the folder where the PeakAnnotator executable file is located, and type:

./PeakAnnotator

This lists the three utilities of the program:

>PeakAnnotator
NDG for each peak finds its closest downstream gene on both strands
TSS for each peak finds the distance to its closest TSS
ODS finds overlaps between two position files

Type PeakAnnotator  to get help about the options specific to each utility.


*** utility: 
This can be one of "NDG, TSS, ODS"
1. NDG - For each locus, search for its Nearest Downstream Genes on both 
the forward and reverse strand. 
If the position of the locus is within a gene, the program describes 
in which part of that gene the locus is located. 
2. TSS - For each locus, find its closest TSS (transcription start site). 
In order to do this, the program searches both upstream and downstream for 
the closest genes to the genomic coordinate.
3. ODS - Compare two position files to identify overlapping and 
unique genomic locations. Uses random regions matched for chromosome and 
length to calculate an enrichment over random and a p-value.

*** Peak File 
The file lists the genomic coordinates output by a peak calling program 
(or obtained in some other way). The format should be tab/space delimited, 
where each locus is described by its "chromosome", "start" and "end" location.
PLEASE REMOVE ANY HEADER LINES FROM THE FILE IF THESE ARE PRESENT
THIS FILE SHOULD BE SORTED ACCORDING TO CHROMOSOME AND START LOCATION
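
The two requirements above (no header lines, sorted by chromosome and start) 
can be met with a small preprocessing step. The following is a minimal Python 
sketch and is not part of PeakAnnotator. The function name 
clean_and_sort_peaks and the rule used to detect header lines (start and end 
are not both integers) are assumptions made for illustration. Chromosome names 
are compared as plain strings, which may differ from the order expected by 
other tools.

```python
def clean_and_sort_peaks(lines):
    """Drop header/blank lines and sort peaks by chromosome, then start.

    Assumption: a data line has at least three whitespace-separated
    fields, and fields 2 and 3 are integers (start and end).
    """
    peaks = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # blank or malformed line
        try:
            start, end = int(fields[1]), int(fields[2])
        except ValueError:
            continue  # treated as a header line, e.g. "chrom start end"
        peaks.append((fields[0], start, end, fields[3:]))
    # String order on chromosome names, numeric order on start.
    peaks.sort(key=lambda p: (p[0], p[1]))
    return ["\t".join([c, str(s), str(e)] + rest) for c, s, e, rest in peaks]


example = [
    "chrom\tstart\tend",
    "chr2\t500\t600",
    "chr1\t900\t950",
    "chr1\t100\t200",
]
sorted_peaks = clean_and_sort_peaks(example)
```

The returned lines are tab-delimited and can be written to a new peak file.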

*** Annotation File
The file lists the features/genes of interest and their location in the genome.
This file should be in BED format, which can be obtained using the UCSC table browser. 
The BED format is defined in "http://genome.ucsc.edu/FAQ/FAQformat#format1".

The annotation file requires three fields: chromosome, start and end locations. 
However, if the features of interest are genes, it is highly recommended that the 
annotation file include the nine additional optional BED fields. These can be
output by the UCSC table browser by selecting "BED-browser extensible data". 

Requirements for BED file format - NDG utility:
The following fields (columns) are recommended for the NDG utility: 
chrom, chromStart, chromEnd, name, strand, thickStart, thickEnd, blockCount, 
blockSizes, blockStarts.

Requirements for BED file format - TSS utility:
The following fields (columns) are required: 
chrom, chromStart, chromEnd, strand.

Please note that according to BED format, lower-numbered fields (columns) must 
always be present if higher-numbered fields are used. Hence, although the field 
"name" is not required for TSS, it should be specified in the file (inserting any 
character in column number 4 in the file is sufficient). 
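
As an illustration of the note above, the sketch below checks that a BED line 
has enough columns to reach the strand field. It relies only on the standard 
BED column order (chrom, chromStart, chromEnd, name, score, strand) and says 
nothing about how PeakAnnotator itself parses the file. The function name 
check_bed_line_for_tss is hypothetical.

```python
def check_bed_line_for_tss(line):
    """Return (chrom, start, end, strand) from a tab-delimited BED line.

    Raises ValueError if the line is too short to contain a strand
    column or if the values are not plausible.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 6:
        # strand is column 6, so columns 4 (name) and 5 (score) must be
        # filled with placeholders even if they are not used.
        raise ValueError("need at least 6 BED columns to reach 'strand'")
    chrom, start, end, strand = fields[0], int(fields[1]), int(fields[2]), fields[5]
    if start > end:
        raise ValueError("chromStart is greater than chromEnd")
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    return chrom, start, end, strand


# "." and "0" are placeholder values for the unused name and score columns.
record = check_bed_line_for_tss("chr1\t1000\t5000\t.\t0\t+")
```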

*** Output File
An output file name must be specified. 

*** Use5endDistance 
This is a required parameter for the "TSS" utility.
This parameter defines how the distance is calculated.
If "true", the distance will be calculated between a central position of a peak, and the 5'-end of a gene.
If "false", the distance will be calculated between a central position of a peak, and the TSS (transcription start site) of a gene.
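
The "true" case can be sketched as follows. This is an illustration under 
stated assumptions, not the program's actual code: the 5'-end is taken to be 
the gene start on the forward strand and the gene end on the reverse strand, 
the peak centre is the integer midpoint, and the sign convention (negative 
means the peak lies upstream of the 5'-end) is chosen here for the example 
only. The function names are hypothetical.

```python
def peak_center(start, end):
    """Integer midpoint of a peak."""
    return (start + end) // 2


def distance_to_five_prime_end(peak_start, peak_end, gene_start, gene_end, strand):
    """Signed distance from the peak centre to the 5'-end of a gene.

    Assumed convention: negative values are upstream of the 5'-end,
    positive values are downstream, both in the gene's own orientation.
    """
    five_prime = gene_start if strand == "+" else gene_end
    offset = peak_center(peak_start, peak_end) - five_prime
    return offset if strand == "+" else -offset


# Peak centred at 1000, forward-strand gene starting at 1500.
upstream_fw = distance_to_five_prime_end(900, 1100, 1500, 3000, "+")
# Same peak, reverse-strand gene whose 5'-end is at 800.
upstream_rev = distance_to_five_prime_end(900, 1100, 200, 800, "-")
```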

*** Symbol File
This is an optional parameter for the "NDG" and "TSS" utilities.
The symbol file maps accession numbers to gene symbols; these can be obtained using 
the BioMart feature of Ensembl or from the UCSC table browser. 

*** chrSizeFile
This is an optional parameter for the "ODS" utility.
This file specifies the size of each chromosome. If it is provided, a randomization test is performed
to calculate the intersection p-value and the enrichment over random.

*** numRandomDatasets
Number of random datasets to generate when calculating the overlap p-value (default 1000).
Random regions, matched by chromosome and length to the regions in the first file, are intersected with the second file.
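
The randomization described above can be sketched in Python as follows. This 
is a minimal illustration, not PeakAnnotator's implementation. The empirical 
p-value formula (count + 1) / (n + 1), the half-open overlap test and all 
function names are assumptions. Chromosome sizes are passed as a dictionary 
because the format of the chromosome size file is not described here.

```python
import random


def overlaps(a, b):
    """True if two (chrom, start, end) regions share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def count_overlapping(first, second):
    """Number of regions in `first` that overlap any region in `second`."""
    return sum(any(overlaps(r, s) for s in second) for r in first)


def shuffle_matched(regions, chrom_sizes, rng):
    """Random regions with the same chromosome and length as the input."""
    shuffled = []
    for chrom, start, end in regions:
        length = end - start
        new_start = rng.randint(0, chrom_sizes[chrom] - length)
        shuffled.append((chrom, new_start, new_start + length))
    return shuffled


def randomization_test(first, second, chrom_sizes, num_random=1000, seed=0):
    """Return (observed overlaps, enrichment over random, empirical p-value)."""
    rng = random.Random(seed)
    observed = count_overlapping(first, second)
    random_counts = [
        count_overlapping(shuffle_matched(first, chrom_sizes, rng), second)
        for _ in range(num_random)
    ]
    at_least = sum(c >= observed for c in random_counts)
    p_value = (at_least + 1) / (num_random + 1)  # assumed formula
    mean_random = sum(random_counts) / num_random
    enrichment = observed / mean_random if mean_random > 0 else float("inf")
    return observed, enrichment, p_value


chrom_sizes = {"chr1": 1_000_000}
first = [("chr1", 100, 200), ("chr1", 5000, 5100)]
second = [("chr1", 150, 250), ("chr1", 5050, 5200)]
observed, enrichment, p_value = randomization_test(first, second, chrom_sizes)
```

A larger number of random datasets gives a finer-grained p-value at the cost 
of a longer run time.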


////////////////////////////////////////////////////////////////////////////////
OUTPUT FILES
////////////////////////////////////////////////////////////////////////////////

The output of the "NDG" utility is two tab delimited files:
**************************************************************

A. "OutputFileName" as specified in the command line

This file describes the closest downstream genes for each genomic locus, and contains the following fields:
	1. Chromosome 
	2. Start
	3. End 	- These first three columns describe the location of the peak in the genome.
	4. # Overlapped_Genes - Number of transcripts overlapping the genomic locus. 
	   More details about these genes are reported in the second output file described below. 
	5. Downstream_FW_Gene	- ID of the closest downstream gene on the forward strand.
	6. (Symbol	- If a symbol file is specified, this field will contain the symbol of the closest downstream gene on the forward strand.)
	7. Distance	- Peak distance to its closest downstream gene on the forward strand.
	8. Downstream_REV_gene	- ID of the closest downstream gene on the reverse strand.
	9. (Symbol	- If a symbol file is specified, this field will contain the symbol of the closest downstream gene on the reverse strand.)
	10. Distance	- Peak distance to its closest downstream gene on the reverse strand.
	
B. "Overlap_OutputFileName"

This file describes the transcripts overlapping the peaks, if any such are found.
	1. Chromosome 
	2. Start
	3. End	- These first three columns describe the location of the peak in the genome.
	4. OverlapGene	- Overlapping gene ID
	5. (Symbol	- If a symbol file is specified, this field will contain the overlapping gene symbol.)
	6. Overlap_Begin	- In which part of the gene does the peak's start position overlap
	7. Overlap_Center	- In which part of the gene does the peak's central position overlap
	8. Overlap_End	- In which part of the gene does the peak's end position overlap


The output of the TSS option is a tab-delimited file:
*****************************************************

"OutputFileName" as specified in the command line

This file contains the following fields:
	1. Chromosome 
	2. Start
	3. End	- These first three columns describe the location of the peak in the genome.
	4. Distance	- The distance from the peak to its closest TSS.
	5. GeneStart 	- The start location of the closest gene on the genome.
	6. GeneEnd 	- The end location of the closest gene on the genome.
	7. ClosestTSS_ID	- ID of the closest gene.
	8. (Symbol	- If a symbol file is specified, this field will contain the symbol of the closest gene.)
	9. Strand	- Strand of closest gene.

The output of the "ODS" option is three tab delimited files:
******************************************************************

A. "OutputFileName" as specified in the command line

Each line in this file describes an overlap event between two genomic loci, and has the following fields: 
	1. Chromosome
	2. peakFile1_Start 	- Start location of the first genomic locus
	3. peakFile1_End	- End location of the first genomic locus
	4. peakFile1_Name	- Name of the first genomic locus (if it exists in the input file)
	5. peakFile2_Start	- Start location of the second genomic locus
	6. peakFile2_End	- End location of the second genomic locus
	7. peakFile2_Name	- Name of the second genomic locus (if it exists in the input file)

B+C. "PeakFileName.unique" - one file for each genomic input file, which describes the unique peaks.
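
The relation between the overlap file and the two unique files can be 
illustrated with the Python sketch below. It is not the program's algorithm: 
the all-pairs comparison, the half-open overlap test and the function name 
overlap_and_unique are assumptions chosen for clarity rather than speed.

```python
def overlap_and_unique(peaks1, peaks2):
    """Split two peak lists into overlap rows and unique peaks.

    Each peak is a (chrom, start, end, name) tuple. Returns
    (rows, unique1, unique2), where each row has the seven fields of
    the overlap file listed above.
    """
    rows = []
    hit1, hit2 = set(), set()
    for i, (chrom1, start1, end1, name1) in enumerate(peaks1):
        for j, (chrom2, start2, end2, name2) in enumerate(peaks2):
            if chrom1 == chrom2 and start1 < end2 and start2 < end1:
                rows.append((chrom1, start1, end1, name1, start2, end2, name2))
                hit1.add(i)
                hit2.add(j)
    # Peaks that never took part in an overlap are "unique" to their file.
    unique1 = [p for i, p in enumerate(peaks1) if i not in hit1]
    unique2 = [p for j, p in enumerate(peaks2) if j not in hit2]
    return rows, unique1, unique2


peaks1 = [("chr1", 100, 200, "a1"), ("chr1", 900, 950, "a2")]
peaks2 = [("chr1", 150, 300, "b1"), ("chr2", 100, 200, "b2")]
rows, unique1, unique2 = overlap_and_unique(peaks1, peaks2)
```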


PeakAnnotator 1.0 (tar/gz)

Release notes:

////////////////////////////////////////////////////////////////////////////////
INSTALLATION
////////////////////////////////////////////////////////////////////////////////

This folder contains the executable file PeakAnnotator.jar and the archive 
PeakAnnotator.src.zip, which contains the source directories.  
You can move the executable file "PeakAnnotator.jar" anywhere in your file system 
and add that location to your PATH.

////////////////////////////////////////////////////////////////////////////////
USAGE
////////////////////////////////////////////////////////////////////////////////

You should have Java 1.5 or later installed.
In order to launch the program, open a terminal window, go to the folder
where the jar file is located, and type

java -jar -Xmx512m peakAnnotator.jar <-u utility> [options]

Options include:

help,-?                  displays help information
-u,--utility             utility: NDG, TSS, ODS
-p,--peakFile            input peak file
-a,--annotationFile      input annotation GTF or BED file
-d,--use5EndDistance     If true the distance will be calculated relative to 5' end, if false relative to tss (default true)
-p2,--peakFile2          input second peak file
-o,--outDir              output folder
-x,--prefix              string to add to output file names
-s,--symbolFile          optional input symbol file
-g,--geneType            gene type for annotation: protein_coding or all
-cs,--chrSizeFile        file indicating chromosome sizes
-r,--numRandomDatasets   number of random datasets to generate when calculating overlap p value (default 1000)

Press -u  to get help about the options specific to each utility.
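
As an illustration of how the options above combine, the following Python 
sketch assembles a command line for the TSS utility without running it. The 
helper build_command is hypothetical, and the file names "genes.gtf" and 
"results" are placeholders. The flags and the jar name are the ones shown in 
the usage line above.

```python
def build_command(utility, options, jar="peakAnnotator.jar", max_heap="512m"):
    """Assemble the java command line as a list of arguments.

    `options` maps a flag from the option list (e.g. "-p") to its value.
    """
    cmd = ["java", "-jar", f"-Xmx{max_heap}", jar, "-u", utility]
    for flag, value in options.items():
        cmd += [flag, value]
    return cmd


# Placeholder file names; replace them with real paths.
tss_cmd = build_command(
    "TSS",
    {"-p": "myPeaks.test", "-a": "genes.gtf", "-o": "results"},
)
command_line = " ".join(tss_cmd)
```

The resulting list can be passed to subprocess.run, or the joined string can 
be pasted into a terminal.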

*** -u/--utility
This can be one of "NDG, TSS, ODS"
1. NDG - For each locus, search for its Nearest Downstream Genes on both 
the forward and reverse strand. 
If the position of the locus is within a gene, the program describes 
in which part of that gene the locus is located. 
2. TSS - For each locus, find its closest TSS (transcription start site). 
In order to do this, the program searches both upstream and downstream for 
the closest genes to the genomic coordinate.
3. ODS - Compare two position files to identify overlapping and 
unique genomic locations.

*** -p/--peakFile 
The file lists the genomic coordinates output by a peak calling program 
(or obtained in some other way). The format should be tab/space delimited, 
where each locus is described by its "chromosome", "start" and "end" location.
This file should be sorted by chromosome and start position.
PLEASE REMOVE ANY HEADER LINES FROM THE FILE IF THESE ARE PRESENT

*** -a/--annotationFile
This is a REQUIRED parameter for the "NDG" and "TSS" utilities.
The file lists the features/genes of interest and their locations in the 
genome, in one of two formats:
1. GTF format - can be downloaded from Ensembl ftp site at: 
		http://www.ensembl.org/info/data/ftp/index.html
		GTF FILES ARE EXPECTED TO CONTAIN THE SUFFIX ".gtf"

The GTF format is recommended unless you are interested in annotating your 
peaks relative to features other than genes. In that case you can use the BED 
file format described below.
		
2. BED format - can be downloaded from the UCSC table browser tool. 
		The BED format is defined in "http://genome.ucsc.edu/FAQ/FAQformat#format1".

Requirements for BED file format - NDG utility:
The following fields (columns) are required: 
chrom, chromStart, chromEnd, name, strand, thickStart, thickEnd, blockCount, 
blockSizes, blockStarts.

Requirements for BED file format - TSS utility:
The following fields (columns) are required: 
chrom, chromStart, chromEnd, strand.

Please note that according to BED format, lower-numbered fields (columns) must 
always be present if higher-numbered fields are used. Hence, although the field 
"name" is not required for TSS, it should be specified in the file (inserting any 
character in column number 4 in the file is sufficient). 

*** -d/--use5EndDistance (default true)
This is a required parameter for the "TSS" utility.
This parameter defines how the distance is calculated.
If "true", the distance will be calculated between a central position of a peak, and the 5'-end of a gene.
If "false", the distance will be calculated between a central position of a peak, and the TSS (transcription start site) of a gene.

*** -p2/--peakFile2
This is a REQUIRED parameter for the "ODS" utility.
The format is the same as for the first peakFile (refer to the -p/--peakFile help).

*** -o/--outDir
This is a REQUIRED parameter for PeakAnnotator.
An output directory must be specified where PeakAnnotator can write result files. 

*** -x/--prefix
String to add to output file names, for example when the same peak files are to be 
analyzed using different parameters.

*** -s/--symbolFile
This is an optional parameter for the "NDG" and "TSS" utilities.
The symbol file maps accession numbers to gene symbols; these can be obtained using 
the BioMart feature of Ensembl or from the UCSC table browser. 
This option is necessary when using a BED format annotation file, since these files do not 
contain gene symbols. A symbol file is not required for Ensembl GTF annotation files. 

*** -g/--geneType
When the annotation file is in GTF format, the user has the option to choose the 
category of genes considered for annotation: either "protein_coding" or "all". 
"all" includes protein coding as well as non-protein coding genes such as miRNAs 
and other non-coding RNAs.

*** -cs/--chrSizeFile
This is an optional parameter for the "ODS" utility.
This file specifies the size of each chromosome. If it is provided, a randomization test is performed
to calculate the intersection p-value and the enrichment over random.

*** -r/--numRandomDatasets
Number of random datasets to generate when calculating the overlap p-value (default 1000).
Random regions, matched by chromosome and length to the regions in the first file, are intersected with the second file.


////////////////////////////////////////////////////////////////////////////////
OUTPUT FILES
////////////////////////////////////////////////////////////////////////////////

The output of the "NDG" utility is three tab delimited files:
**************************************************************

A. "peakFileName.ndg.peakFileNameSuffix"

For example, if the input peak file is "myPeaks.test", the output file will be 
"myPeaks.ndg.test". 
This file identifies the closest downstream genes for each locus, and contains 
the following fields:
	1. Chromosome 
	2. Start
	3. End 	- These first three columns describe the genomic location of the peak.
	4. # Overlapped_Genes - Number of transcripts overlapping the genomic locus. 
	   Details about these genes are reported in the second output file described below. 
	5. Downstream_FW_Gene	- ID of the closest downstream gene on the forward strand.
	6. Symbol	- Symbol of the closest downstream gene on the forward strand.
	7. Distance	- Peak distance to its closest downstream gene on the forward strand.
	8. Downstream_REV_gene	- ID of the closest downstream gene on the reverse strand.
	9. Symbol	- Symbol of the closest downstream gene on the reverse strand.
	10. Distance	- Peak distance to its closest downstream gene on the reverse strand.
	
B. "peakFileName.overlap.peakFileNameSuffix"

For example, if the input peak file is "myPeaks.test", the output file will be 
"myPeaks.overlap.test". 
This file describes the transcripts overlapping the peaks, if any such are found.
	1. Chromosome 
	2. Start
	3. End	- These first three columns describe the genomic location of the peak.
	4. OverlapGene	- Overlapping gene ID
	5. Symbol	- Overlapping gene symbol
	6. Overlap_Begin	- In which part of the gene does the peak's start position overlap
	7. Overlap_Center	- In which part of the gene does the peak's central position overlap
	8. Overlap_End	- In which part of the gene does the peak's end position overlap

C. "peakFileName.summary.peakFileNameSuffix"

For example, if the input peak file is "myPeaks.test", the output file will be 
"myPeaks.summary.test". 
This file contains the following fields:
	1. Chromosome 
	2. Start
	3. End 	- These first three columns describe the genomic location of the peak.
	4. OverlapGene	- Overlapping gene Symbol.
	5. Downstream Gene	- Nearest downstream gene.
	6. Distance	- Peak distance to its nearest downstream gene.

The output of the TSS option is a tab-delimited file:
*****************************************************

"peakFileName.tss.peakFileNameSuffix"

For example, if the input peak file is "myPeaks.test", the output file will be 
"myPeaks.tss.test"

This file contains the following fields:
	1. Chromosome 
	2. Start
	3. End	- These first three columns describe the genomic location of the peak.
	4. Distance	- The distance from the peak to its closest TSS.
	5. GeneStart 	- The start location of the closest gene on the genome.
	6. GeneEnd 	- The end location of the closest gene on the genome.
	7. ClosestTSS_ID	- ID of the closest gene.
	8. Symbol	- Symbol of the closest gene.
	9. Strand	- Strand of closest gene.

The output of the "ODS" option is three tab delimited files:
******************************************************************

A. "peakFile1_peakFile2.overlap.txt"

For example, if the input peak files are "myPeaks1.txt" and "myPeaks2.txt", 
the output file will be "myPeaks1_myPeaks2.overlap.txt"
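
The file-naming patterns given in this section can be reproduced with the 
short Python sketch below. The function names are hypothetical, and the 
sketch covers only the examples shown here. It does not model the effect of 
the -o/--outDir or -x/--prefix options, and it assumes that the ODS overlap 
file always ends in ".txt".

```python
import os


def annotated_name(peak_file, tag):
    """Insert a tag ("ndg", "overlap", "summary" or "tss") before the suffix."""
    stem, suffix = os.path.splitext(os.path.basename(peak_file))
    return f"{stem}.{tag}{suffix}"


def ods_overlap_name(peak_file1, peak_file2):
    """Join the two peak file names (without suffix) for the ODS overlap file."""
    stem1 = os.path.splitext(os.path.basename(peak_file1))[0]
    stem2 = os.path.splitext(os.path.basename(peak_file2))[0]
    return f"{stem1}_{stem2}.overlap.txt"


ndg_name = annotated_name("myPeaks.test", "ndg")
tss_name = annotated_name("myPeaks.test", "tss")
ods_name = ods_overlap_name("myPeaks1.txt", "myPeaks2.txt")
```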

Each line in this file describes an overlap event between two genomic loci, and 
has the following fields: 
	1. Chromosome
	2. peakFile1_Start 	- Start location of the first genomic locus
	3. peakFile1_End	- End location of the first genomic locus
	4. peakFile1_Name	- Name of the first genomic locus (if it exists in the input file)
	5. peakFile2_Start	- Start location of the second genomic locus
	6. peakFile2_End	- End location of the second genomic locus
	7. peakFile2_Name	- Name of the second genomic locus (if it exists in the input file)

B+C. Unique files - one file for each genomic input file, which describes the unique peaks

