GO2MSIG GO based GSEA gene set generator

GO2MSIG User Guide and Installation Instructions (29/10/2013)

1 Quick Start
2 Algorithm, data sources and output
3 Installation and testing
4 Usage and examples
5 Appendix A - species contained in the NCBI gene2go table
6 Appendix B - GO2MSIG switches and defaults

1 Quick Start

GO2MSIG generates collections of gene sets in MSigDB format based on the Gene Ontology (GO) project hierarchy and gene association data, for use with the Gene Set Enrichment Analysis (GSEA) implementation available at the Broad Institute. This enables rapid creation of gene set collections for multiple species.

The easiest way to use GO2MSIG is at the website:

http://www.go2msig.org/cgi-bin/go2msig.cgi

The website is set up to use sensible defaults. NCBI gene2go is a good database to use if your organism is contained in that dataset (Appendix A). If not, the GO annotation database is also available. The default set of evidence codes is the same as that used for the prebuilt GO based gene set collections provided at MSigDB. For many less well characterised species almost all GO term associations carry the automated electronic annotation code 'IEA', and in that case including all evidence codes is more appropriate.

2 Algorithm, data sources and output

2.1 Data sources

GO2MSIG needs a data source describing the GO term hierarchy (the 'term source') and a data source describing which genes are associated with which terms (the 'association source'). The term source can be any one of: a local MySQL installation of the GO database, an externally hosted MySQL based GO database (for instance the one at mysql.ebi.ac.uk:4085), an OBO file containing the GO hierarchy, or a cached GO hierarchy previously generated by GO2MSIG from one of the preceding data sources.

The association source can be any one of: a local MySQL installation of the GO database, an externally hosted MySQL based GO database, a GO project GAF file, an Affymetrix or Agilent array annotation file, or a locally hosted MySQL version of the NCBI gene2go and geneinfo tables.

Using an externally hosted GO database is the easiest option since it requires no downloading and installation of GO term data, but can be slow.

Consequently a basic install requires simply that GO2MSIG and a number of Perl libraries are present, whereas more fully featured installs require the local MySQL databases. For the species included in the NCBI gene2go tables (Appendix A), gene2go is the cleanest source of data available in terms of consistent use of gene identifiers.

2.2 Algorithm

A GO project ontology is represented by a directed acyclic graph, each term being a node in the graph. Each parent term can have multiple children, and each child term can have multiple parents. There is a single root term for each of the three ontologies, ‘molecular function’, ‘biological process’ and ‘cellular component’.

GO2MSIG builds gene sets based on GO terms in two steps. The first is identifying which genes from the organism in question are associated with which GO terms. The second is propagating these gene associations up towards the root term. For instance, if a gene is associated with the 'calcium transport' molecular function, it should also be associated with the 'ion transport' and 'transport' functions. These higher level associations are often not captured in the primary sources of gene association data, where only the most specific GO term membership is annotated. They are, however, crucial for detecting biologically relevant patterns in data sets.
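
As an illustration of the propagation step, the following Python sketch pushes each directly annotated gene up to every ancestor of its term. It is not the GO2MSIG implementation (which is written in Perl); the function name, the dictionaries and the gene names are hypothetical and chosen only to mirror the 'calcium transport' example above.

```python
from collections import defaultdict

def propagate(direct, parents):
    """Return term -> set of genes, including genes inherited from child terms.

    direct  : term -> set of genes annotated directly to that term
    parents : term -> list of parent terms (a term may have several parents)
    """
    propagated = defaultdict(set)
    for term, genes in direct.items():
        # Walk from the annotated term up through every ancestor once.
        stack = [term]
        seen = set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            propagated[current].update(genes)
            stack.extend(parents.get(current, ()))
    return dict(propagated)

# Toy hierarchy (hypothetical): child term -> parent terms.
parents = {
    "calcium transport": ["ion transport"],
    "ion transport": ["transport"],
    "transport": [],
}
# Direct annotations only name the most specific term for each gene.
direct = {"calcium transport": {"geneA"}, "ion transport": {"geneB"}}

gene_sets = propagate(direct, parents)
for term in sorted(gene_sets):
    print(term, sorted(gene_sets[term]))
```

In this toy case 'transport' has no direct annotations, yet it ends up containing both genes, which is the behaviour the second step is meant to provide.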

Gene set collections built in this way will likely contain multiple GO terms with identical gene associations. Such duplicate gene sets can affect the accuracy of the key GSEA false discovery rate statistic and it is necessary that these are removed from the gene set collection. GO2MSIG prunes the result so that each unique gene set (by gene content) is recorded once.
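
A minimal Python sketch of this pruning step is shown below, assuming the gene sets are held as a dictionary of term name to gene set. The function and variable names are hypothetical; the sketch only demonstrates grouping terms by identical gene content so that each unique set is kept once.

```python
def collapse_duplicates(gene_sets):
    """Group terms that share identical gene content.

    Returns a list of (terms, genes) pairs, one per unique gene set.
    """
    by_content = {}
    for term, genes in gene_sets.items():
        # frozenset makes the gene content usable as a dictionary key.
        by_content.setdefault(frozenset(genes), []).append(term)
    return [(sorted(terms), set(content)) for content, terms in by_content.items()]

# Hypothetical propagated gene sets: two terms have the same members.
gene_sets = {
    "ion transport": {"geneA", "geneB"},
    "transport": {"geneA", "geneB"},
    "calcium transport": {"geneA"},
}

unique_sets = collapse_duplicates(gene_sets)
for terms, genes in unique_sets:
    print("; ".join(terms), sorted(genes))
```

Keeping the full list of term names for each unique set is what allows the output description to list every GO term that shares that gene content.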

2.3 Output

GO2MSIG produces output in either MSigDB .gmt or .gmx format. The description field in the output gene set file contains a list of ALL GO terms with that set of gene associations. The URL link field (which can only reference one term) contains a link to whichever of the GO terms has the shortest distance between it and the root term - in other words the most general of the terms associated with that gene set.

During analysis of the results from a GO based gene set analysis the experimenter is likely to want to home in on more specific terms that show statistically significant changes. In this implementation a rough guide is provided by calculating how many terms exist between the GO term in question and the root of the ontology tree, taking the shortest path. This distance is shown in brackets at the end of each term name in the description field.
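
The distance calculation can be sketched in Python with a breadth-first search from the root, which reaches every term by its shortest path first. The term names and the children dictionary are hypothetical, and counting edges (so that the root has distance 0) is an assumption; GO2MSIG may count the intervening terms differently. The bracketed label format is likewise only an approximation of the output described above.

```python
from collections import deque

def distance_from_root(root, children):
    """Return term -> length of the shortest path from the root (in edges)."""
    distance = {root: 0}
    queue = deque([root])
    while queue:
        term = queue.popleft()
        for child in children.get(term, ()):
            # The first visit in a breadth-first search is the shortest path.
            if child not in distance:
                distance[child] = distance[term] + 1
                queue.append(child)
    return distance

# Hypothetical graph: C is reachable via A (2 steps) and via B and D (3 steps).
children = {"root": ["A", "B"], "A": ["C"], "B": ["D"], "D": ["C"]}
distance = distance_from_root("root", children)

# For a group of terms with identical gene content, pick the most general one
# and label every term with its distance in brackets.
group = ["C", "A"]
most_general = min(group, key=lambda term: distance[term])
labels = [f"{term} ({distance[term]})" for term in group]
print(most_general, labels)
```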

3 Installation and testing

3.1 Basic program installation on Linux

GO2MSIG is written in Perl and as such does not need to be compiled. A number of Perl libraries that are not part of the standard distribution need to be present. These are:

Text::CSV
DBI
Getopt::Long
GO::Parser
MLDBM
Fcntl
Text::Wrap

Some of these will likely be installed as standard with Perl on your Linux distribution; others will need to be installed from the distribution repositories or from CPAN.

On an Ubuntu 10.04 system the non-standard modules can be installed by issuing

apt-get install libgo-perl libdbi-perl libtext-csv-perl

Place GO2MSIG into any appropriate directory in your PATH, or cd into a directory containing it and issue ./go2msig to invoke it.

3.2 Basic testing

A quick check of whether the required libraries are available is to issue

go2msig -help

This will either display help information or an error indicating which required libraries are not present.

To test without having to install any databases, download additional files or access remote databases, you can use the cached version of the GO ontology contained in this directory along with the supplied E. coli gene annotation file. From this directory issue:

go2msig -termsource cache -cachefile examples/testcache -assocsource gaf -assocfile examples/ecoli_gaf -tax 511145 -e all

This should display a collection of E. coli gene sets generated from the cached GO ontology data and the E. coli GAF file.

If you have access to a GO database mirror at your institution or elsewhere you can test with it. The example below uses the mirror at ebi.ac.uk as the term source and the supplied E. coli GAF file as the association source.

go2msig -termsource godb -godb 'dbi:mysql:go_latest:mysql.ebi.ac.uk:4085;mysql_compression=1' -gouser 'go_select' -gopass 'amigo' -assocsource gaf -assocfile examples/gene_association.ecocyc -tax 83333 -e all

Similarly to use the ebi.ac.uk mirror as both the term source and as the association source, issue:

go2msig -termsource godb -godb 'dbi:mysql:go_latest:mysql.ebi.ac.uk:4085;mysql_compression=1' -gouser 'go_select' -gopass 'amigo' -assocsource godb -tax 511145 -e all

3.3 Optional local database installation - GO database

For frequent usage a local MySQL installation of the GO database will be much quicker than remote queries. GO2MSIG requires the 'assocdb' version of the GO database available at http://archive.geneontology.org/latest-full/. Ensure you have the MySQL server and command line client installed and running. On an Ubuntu 10.04 system this can be done with:

apt-get install mysql-server mysql-client

Download the current go_yyyymm-assocdb-tables.tar.gz and follow instructions in the README file to install.

Create the database user used by GO2MSIG. GO2MSIG assumes the local installation is called 'mygo', the user is 'gouser' and the password is 'amigo'. Bring up a mysql command line prompt for a user which can create and populate databases. e.g. for systems without a mysql root user password, as root, issue:

mysql

The prompt should switch to 'mysql>'. To create the 'gouser' user and assign access rights, issue:

create user 'gouser'@'localhost' identified by 'amigo' ;
grant select on mygo.* to 'gouser'@'localhost' ;

3.4 Optional local database installation - NCBI gene tables

The table gene2go available from the NCBI is a curated collection of GO gene association data for the species listed in Appendix A. Unlike some of the gene association data available in the GO database, this data set has the advantage that it uses a consistent gene identifier, the Entrez gene ID. In conjunction with the NCBI geneinfo table it is possible to generate gene sets for these species using consistently either the gene ID or the standard gene symbol. However, unlike the GO MySQL database, the gene2go and geneinfo tables are available only in the form of tab separated text files, so it is necessary to generate the MySQL tables and load the data using the instructions below. These create a database called 'bioannotation' which contains the two tables.

Download gene2go.gz and gene_info.gz from ftp://ftp.ncbi.nih.gov/gene/DATA/. Uncompress these:

gunzip gene2go.gz
gunzip gene_info.gz

Create the bioannotation database and its tables. From the GO2MSIG distribution folder issue:

echo "create database bioannotation" | mysql
cat bioannotation/create_tables.sql | mysql bioannotation

This will create the database, tables and indexes.

Bring up a mysql> prompt as above. Assuming you have already created the user 'gouser' above, to assign access rights for the user to access the bioannotation database issue:

grant select on bioannotation.* to 'gouser'@'localhost' ;

To load the data issue:

use bioannotation ;
LOAD DATA LOCAL INFILE 'gene_info' INTO TABLE geneinfo FIELDS ESCAPED BY '' IGNORE 1 LINES;
LOAD DATA LOCAL INFILE 'gene2go' INTO TABLE gene2go FIELDS ESCAPED BY '' IGNORE 1 LINES;

Any warnings which occur can be displayed by issuing 'show warnings'. If these consist of truncation warnings to the modification date or other_designations that isn't a problem.

3.5 Local database testing

To generate a collection of E. coli gene sets using a local GO database installed as described above, issue:

go2msig -termsource godb -assocsource godb -tax 511145 -e all

To generate a similar collection of E. coli gene sets using the local GO database as a term source, and a local MySQL based gene2go table installed as described above, issue:

go2msig -termsource godb -assocsource ncbi -tax 511145 -geneid id -e all

This will generate a collection of E. coli gene sets using the Entrez gene ID as the identifier.

If you have also installed a MySQL based geneinfo table you can test this by generating the same gene set using the gene symbol as the identifier:

go2msig -termsource godb -assocsource ncbi -tax 511145 -geneid symbol -e all

4 Usage and examples

4.1 Simple gene set construction

Basic generation of gene set collections from the various databases or array annotation files is illustrated above in the testing examples, sections 3.2 and 3.5. In addition to the parameter values shown, those examples all use the default values for gene set maximum and minimum size cutoffs, required ontologies, and output file format. Program switches for these features are -maxgenes, -mingenes, -ontology and -format respectively. It is also possible to produce gene sets without propagating the gene associations from specific GO terms to their more general parent terms using the -nochild switch. Details on usage of these switches can be found by issuing go2msig -help. Output from this is reproduced in Appendix B.

By default the program uses the same subset of allowed evidence codes as the GO based gene set collections provided at MSigDB. For many less well characterised species almost all GO term associations have the automated electronic annotation code 'IEA' and so using the '-e all' switch (as for the examples above) would be advised in this case.
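
The effect of the evidence code and size cutoff switches can be sketched in Python as follows. The default evidence codes and the default cutoffs of 10 and 700 genes are those listed in Appendix B; the function names, the record layout of (gene, term, evidence code) and the example data are hypothetical, and negated code lists are not handled here.

```python
# Default evidence codes as listed for the -evidence switch in Appendix B.
DEFAULT_EVIDENCE = {"IDA", "IPI", "IMP", "IGI", "IEP", "ISS", "TAS", "EXP"}

def filter_associations(associations, evidence=DEFAULT_EVIDENCE):
    """Keep (gene, term, code) records whose evidence code is allowed.

    Passing evidence="all" keeps every record, like the '-e all' switch.
    """
    if evidence == "all":
        return list(associations)
    return [record for record in associations if record[2] in evidence]

def within_size_limits(gene_sets, mingenes=10, maxgenes=700):
    """Drop gene sets smaller than mingenes or larger than maxgenes."""
    return {term: genes for term, genes in gene_sets.items()
            if mingenes <= len(genes) <= maxgenes}

# Hypothetical association records.
records = [("g1", "GO:0000001", "IEA"), ("g2", "GO:0000001", "IDA")]
curated_only = filter_associations(records)
everything = filter_associations(records, evidence="all")

# Hypothetical gene sets of 5, 10 and 701 genes.
sets = {"small": set(range(5)), "ok": set(range(10)), "huge": set(range(701))}
kept = within_size_limits(sets)
print(curated_only, len(everything), sorted(kept))
```

For a species annotated almost entirely with 'IEA', the default filter in this sketch would discard nearly every record, which is why '-e all' is suggested in that situation.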

4.2 Advanced usage

The user can supply a mapping file to translate identifiers used in the original association source to identifiers of the user's choice. The mapping file is a tab separated key value list where the key (1st column) is the identifier as it exists in the association source, and the value (2nd column) is the identifier to be output in the final result. If the same key exists multiple times in the map file with different values, the original identifier will be expanded out into each value. By default, if an identifier in the association source is not represented in the mapping file, the original identifier will be output.

One example where this is useful is when the association source uses inconsistent identifiers. Say for instance the user wishes to generate a collection of gene sets for Rhodococcus jostii RHA1 from the GO database. A basic query would be:

go2msig -assocsource godb -tax 101510 -e all

The majority of the genes output are identified by an abstract ID of the form RO#####. However a small fraction are instead identified by a gene symbol. The Rhodococcus RHA1 array annotation file available from GEO (GPL3918) maps probe IDs consistently to the RO##### form of the gene ID. Thus it would be ideal to provide a mapping file that could translate the small number of gene symbols in the gene sets to the standard gene ID format for compatibility with the annotation file when used in GSEA. The annotation file from GEO itself lists a number of gene symbols and a basic mapping file can be extracted from this. More complete mapping would require some level of manual curation. Example files are available in the examples directory. The go2msig command would be:

go2msig -assocsource godb -tax 101510 -e all -mapfile examples/rha1_mapfile.txt

If the -repress switch is set when a user mapping file is being used, an identifier in the association source that is not present in the mapping file is excluded from the gene set. One example usage of this is generating gene sets using the Affymetrix E. coli 2 array annotation file as the association source. The array contains probes made to the gene complement of 4 different E. coli strains. The array annotation file maps GO terms to probe identifiers, and so if used in default fashion without a mapping file the gene sets will contain probe ids. If the user wishes to use the GSEA ability to 'collapse' probe ids to gene ids then a chip file can be supplied to GSEA that maps the probe ids to the Entrez gene ids for the E. coli species required. In this case the gene sets need to contain Entrez gene ids, not probe ids. Because the array represents multiple species some of the probe ids will not correspond to genes for the particular E. coli species in question. Using the 'repress' flag in conjunction with a mapping file that maps the probe ids to gene ids, probes to genes not present in the E. coli species in question are omitted from the resultant gene sets.
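
The mapping rules described in this section (expansion of repeated keys, pass-through of unmapped identifiers by default, and removal of unmapped identifiers with -repress) can be sketched in Python as below. This is an illustration only, not the GO2MSIG code; the function names, probe ids and gene ids are hypothetical.

```python
def load_map(lines):
    """Parse tab separated key/value lines into key -> list of values."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if "\t" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split("\t")[:2]
        values = mapping.setdefault(key, [])
        if value not in values:
            values.append(value)
    return mapping

def translate(genes, mapping, repress=False):
    """Translate identifiers; unmapped ones are kept unless repress is set."""
    translated = set()
    for gene in genes:
        if gene in mapping:
            # A key listed several times expands into every mapped value.
            translated.update(mapping[gene])
        elif not repress:
            translated.add(gene)
    return translated

# Hypothetical map file content: probe1 maps to two gene ids.
mapping = load_map(["probe1\t100\n", "probe1\t101\n", "probe2\t200\n"])
gene_set = {"probe1", "probe2", "probe3"}

default_result = translate(gene_set, mapping)
repressed_result = translate(gene_set, mapping, repress=True)
print(sorted(default_result), sorted(repressed_result))
```

In the default result the unmapped 'probe3' survives alongside the translated ids, whereas with repress set only mapped identifiers remain.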

The examples directory contains an E. coli annotation file and a user map file that will generate an E. coli MG1655 specific gene set from the E. coli array annotation file. The annotation file is supplied compressed. From within the examples directory, uncompress it with:

gunzip E_coli_2.na30.annot.txt.gz

Running the command without the -repress switch as below will map probe ids to Entrez gene IDs where possible, but will display probe ids for those probes without an E. coli MG1655 gene mapping. e.g.:

go2msig -assocsource affy -assocfile examples/E_coli_2.na30.annot.txt -mapfile examples/k12_user_map

The presence of both unmapped probe ids and Entrez gene ids is obvious in the output. Instead, running the command with the -repress switch will produce gene sets containing exclusively E. coli MG1655 gene ids:

go2msig -assocsource affy -assocfile examples/E_coli_2.na30.annot.txt -mapfile examples/k12_user_map -repress

4.3 Using a cache

It is possible to cache the term source data in a local file to speed up gene set builds. This is particularly useful if you wish to use a remote GO database for term source, but have a local gene association source such as a GAF file. To build a local cache file use the -q makecache switch and provide the cache file name root with -cachefile filename.

4.4 Other query commands

It is possible to interrogate the available databases for a list of species they contain gene associations for. To list the species available from the ncbi database:

go2msig -assocsource ncbi -q species

Similarly for the go database:

go2msig -assocsource godb -q species

The latter will list hundreds of thousands of species, so it's best to filter for those of interest, e.g. to find all members of the genus Rhodococcus:

go2msig -assocsource godb -q species | grep -i 'rhodococcus'

4.5 Warning messages

GO term IDs are sometimes replaced in the GO ontology with new versions that supersede the old ones. Where a term ID present in an association has been superseded by a new version (determined from synonym information in the term source), the GO term ID in the output gene sets will be replaced with the new version. A warning message will be displayed, e.g. "replacing term GO:0005498 with preferred version GO:0032934".

If the gene association data includes term IDs that are obsolete according to the current source of term information, or do not exist in the current source of term information then a warning message will also be displayed. In this case the obsolete or nonexistent term will not be output in the final gene sets as it cannot be placed into the GO term hierarchy.

Appendix A - species contained in the NCBI gene2go table

The following species are annotated in the NCBI gene2go table. Where sufficient annotations are available, prebuilt gene sets for these species are available at http://www.go2msig.org/cgi-bin/prebuilt.cgi. Data for many thousands of other species is available in the GO annotation database.

#Taxon Full Name
176299 Agrobacterium fabrum str. C58
234826 Anaplasma marginale str. St. Maries
212042 Anaplasma phagocytophilum str. HZ
3702   Arabidopsis thaliana
227321 Aspergillus nidulans FGSC A4
198094 Bacillus anthracis str. Ames
9913   Bos taurus
6239   Caenorhabditis elegans
195099 Campylobacter jejuni RM1221
246194 Carboxydothermus hydrogenoformans Z-2901
227377 Coxiella burnetii RSA 493
214684 Cryptococcus neoformans var. neoformans JEC21
7955   Danio rerio
243164 Dehalococcoides ethenogenes 195
352472 Dictyostelium discoideum AX4
7227   Drosophila melanogaster
205920 Ehrlichia chaffeensis str. Arkansas
511145 Escherichia coli str. K-12 substr. MG1655
9031   Gallus gallus
243231 Geobacter sulfurreducens PCA
9606   Homo sapiens
265669 Listeria monocytogenes serotype 4b str. F2365
243233 Methylococcus capsulatus str. Bath
10090  Mus musculus
222891 Neorickettsia sennetsu str. Miyayama
40149  Oryza meridionalis
4536   Oryza nivara
4529   Oryza rufipogon
39946  Oryza sativa Indica Group
39947  Oryza sativa Japonica Group
36329  Plasmodium falciparum 3D7
223283 Pseudomonas syringae pv. tomato str. DC3000
10116  Rattus norvegicus
246200 Ruegeria pomeroyi DSS-3
559292 Saccharomyces cerevisiae S288c
284812 Schizosaccharomyces pombe 972h-
211586 Shewanella oneidensis MR-1
999953 Trypanosoma brucei brucei strain 927/4 GUTat10.1

Appendix B - GO2MSIG switches and defaults

-ontology [list of ontologies] : Takes a comma separated list of ontologies,
possible values are 'cc', 'mf' and 'bp' for the cellular component,
molecular function or biological process ontologies respectively. Default
is 'cc,mf,bp'.

-assocsource ['ncbi'|'godb'|'affy'|'agilent'|'gaf'] : Specify source of
gene association data. The options are a local mysql install of the ncbi
gene2go table, a mysql install of the GO database (local or remote), an
Affymetrix array annotation file, an Agilent array annotation file, or a GO
gene annotation file. Default is 'ncbi'.

-assocfile [filename] The file containing the mapping between GO terms and
probesets or genes. This is used with the -assocsource 'affy', 'agilent' or
'gaf' options.

-query ['geneset'|'species'|'makecache'] : 'geneset' will generate a gene
set collection in msigdb format. 'species' will return a list of species
that have associated GO annotations in the database being searched.
'makecache' will generate a cache of the GO ontology from whichever
termsource is selected, providing a dramatic speed up of future searches if
using a slow database server or large OBO file as the term source. Default
is 'geneset'.

-cachefile [filename] : Root of the filename to use for the cached file.
Four files are generated, filename.termnames.cache,
filename.children.cache, filename.obsids.cache and filename.altids.cache.

-termsource ['godb'|'obofile'|'cache'] : Specifies the source of the GO
ontology hierarchy. Primary sources are a GO database, or an OBO file. If a
cache has previously been generated using the -q makecache switch then
'cache' can be specified. Default is 'godb'.

-obofile [filename] : Name of the OBO file if using -termsource obofile.

-godb [database connector] : Connection string for the GO database if one
is being used. This is used in conjunction with the -gouser and -gopass
switches. The example switches for connection to the EBI implementation
would be: -godb
'dbi:mysql:go_latest:mysql.ebi.ac.uk:4085;mysql_compression=1' -gouser
'go_select' -gopass 'amigo'. If not set on the command line go2msig
defaults to using the standard local install of the GO db as described in
the installation instructions.

-gouser [username] : Username for the GO mysql database.

-gopass [password] : Password for the GO mysql database.

-evidence [list of evidence codes] : takes a comma separated list of
evidence codes which are searched for. This is ignored when using
Affymetrix or Agilent annotation files as the association source. Can be
'all' for all codes. Can be negated by prefixing codes with ! (in which
case you may need to put the code list in quotes). Full list of codes is
EXP, IC, IDA, IEA, IEP, IGC, IGI, IMP, IPI, ISA, ISM, ISO, ISS, NAS, ND,
NR, RCA, TAS, IBD, IBA, IKR, IRD. Default is 'IDA, IPI, IMP, IGI, IEP,
ISS, TAS, EXP'.

-taxid [ncbi taxon id] : The ncbi taxon number of the species/strain for
which the gene set collection is being built.

-format ['gmt'|'gmx'] : Selects gene matrix format (gmx) or gene matrix
transposed (gmt) format for the output. See the GSEA data format
documentation for an explanation of these. Default is 'gmt'.

-maxgenes [maximum number of genes] : Gene sets where the number of genes
is greater than this value are excluded. Default value is 700.

-mingenes [minimum number of genes] : Gene sets where the number of genes
is less than this value are excluded. Default value is 10.

-nochild : If this option is set then only genes directly associated with
GO terms are included in the gene set. If the option is not set then genes
associated with child terms of the GO term in question will also be
included in the set. Default is unset.

-mapfile [mapfile name] : Optional file which contains tab separated key
value pairs for mapping the identifiers (derived from the originating NCBI
or GO database, or chip annotation file) in the final gene sets to the
value defined in the user supplied map file. The NCBI/GO value is used as
the key. If no key exists and the -repress switch is NOT set, the existing
value will be output. If the same key exists more than once in the mapping
file with different values the original identifier will be expanded into
each of the relevant values.

-repress: By default if an original gene identifier does not have an entry
in the mapfile it is left untranslated. If the repress flag is used it will
instead be removed from output. This can be used to extract single species
gene sets from affymetrix arrays that contain probes for multiple species.
Default is unset.

-geneid ['id'|'symbol'] : When obtaining associations from the ncbi gene2go
table, output either the gene id (direct from the gene2go table), or
translate the gene id to the gene symbol (using the geneinfo table).
Alternatively if obtaining gene associations from an OBO file, use the
symbol column, or the id column as the source of the gene identifier.
Default is 'symbol'.

-help : display this message
