The easiest way to use GO2MSIG is at the website:
http://www.go2msig.org/cgi-bin/go2msig.cgi
The website is set up to use sensible defaults. NCBI gene2go is a good database to use if your organism is contained in that dataset (Appendix A). If not, the GO annotation database is also available. The default set of evidence codes is the same as that used for the prebuilt GO based gene set collections provided at MSigDB. For many less well characterised species almost all GO term associations carry the automated electronic annotation code 'IEA', and in that case including all evidence codes is more appropriate.
The association data source can be either a local MySQL installation of the GO database, an externally hosted MySQL based GO database, a GO project GAF file, an Affymetrix or Agilent array annotation file, or a locally hosted MySQL version of the NCBI gene2go and geneinfo tables.
Using an externally hosted GO database is the easiest option since it requires no downloading and installation of GO term data, but can be slow.
Consequently a basic install requires simply that GO2MSIG and a number of Perl libraries are present, whereas more fully featured installs require the local MySQL databases. For the species included in the NCBI gene2go tables (Appendix A), gene2go is the cleanest source of data available in terms of consistent use of gene identifiers.
GO2MSIG builds gene sets based on GO terms in two steps. The first is identifying which genes from the organism in question are associated with which GO terms. The second is propagating these gene associations up towards the root term. For instance, if a gene is associated with the 'calcium transport' molecular function, it should also be associated with the 'ion transport' and 'transport' functions. These higher level associations are often not captured in the primary sources of gene association data, where only the most specific GO term membership is annotated. They are, however, crucial for detecting biologically relevant patterns in data sets.
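The propagation step can be illustrated with a minimal Python sketch. This is not GO2MSIG's own code (the program itself is written in Perl); the function name `propagate`, the toy hierarchy and the gene names `geneA` and `geneB` are assumptions made for illustration only.

```python
from collections import defaultdict

def propagate(direct, parents):
    """Return term -> set of genes, including genes inherited from descendant terms.

    direct:  dict mapping a GO term to the genes annotated directly to it
    parents: dict mapping a GO term to its immediate parent terms
    """
    full = defaultdict(set)
    for term, genes in direct.items():
        # Walk from the annotated term up to the root, visiting each ancestor once.
        stack, seen = [term], set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            full[current].update(genes)
            stack.extend(parents.get(current, ()))
    return dict(full)

# Toy hierarchy based on the example in the text (hypothetical genes).
parents = {
    "calcium transport": ["ion transport"],
    "ion transport": ["transport"],
    "transport": [],
}
direct = {"calcium transport": {"geneA"}, "ion transport": {"geneB"}}

gene_sets = propagate(direct, parents)
print(sorted(gene_sets["transport"]))  # ['geneA', 'geneB']
```

A gene annotated only to 'calcium transport' therefore also appears in the 'ion transport' and 'transport' gene sets.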
Gene set collections built in this way will likely contain multiple GO terms with identical gene associations. Such duplicate gene sets can affect the accuracy of the key GSEA false discovery rate statistic and it is necessary that these are removed from the gene set collection. GO2MSIG prunes the result so that each unique gene set (by gene content) is recorded once.
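As a rough illustration of pruning by gene content, the Python sketch below keeps one term for each distinct set of genes. Which of several identical terms GO2MSIG actually retains is not stated here, so keeping the first one encountered is an assumption, as are the function name and the example data.

```python
def prune_duplicates(gene_sets):
    """Keep one term per distinct gene content; later duplicates are dropped."""
    seen = set()
    unique = {}
    for term, genes in gene_sets.items():
        key = frozenset(genes)  # order-independent fingerprint of the gene content
        if key in seen:
            continue
        seen.add(key)
        unique[term] = set(genes)
    return unique

# Hypothetical collection in which two terms share identical gene content.
collection = {
    "ion transport": {"a", "b"},
    "cation transport": {"b", "a"},
    "transport": {"a", "b", "c"},
}
pruned = prune_duplicates(collection)
print(list(pruned))  # ['ion transport', 'transport']
```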
During analysis of the results from a GO based gene set analysis the experimenter is likely to want to home in on more specific terms that show statistically significant changes. In this implementation a rough guide is provided by calculating how many terms exist between the GO term in question and the root of the ontology tree, taking the shortest path. This distance is shown in brackets at the end of each term name in the description field.
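A shortest-path distance of this kind can be computed with a breadth-first search, as in the Python sketch below. The sketch counts the number of steps (edges) from a term to the root; the exact counting convention used by GO2MSIG is not specified in this document, so treat that as an assumption. The term names are hypothetical.

```python
from collections import deque

def depth_to_root(term, parents):
    """Number of steps on the shortest path from term to a root (a term with no parents)."""
    queue = deque([(term, 0)])
    seen = {term}
    while queue:
        current, dist = queue.popleft()
        ups = parents.get(current, [])
        if not ups:
            return dist  # breadth-first search reaches the nearest root first
        for parent in ups:
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, dist + 1))
    return None

# Hypothetical hierarchy: 'd' reaches the root via 'c' (2 steps) or via 'b' and 'a' (3 steps).
toy_parents = {"d": ["b", "c"], "b": ["a"], "a": ["root"], "c": ["root"], "root": []}
print(depth_to_root("d", toy_parents))  # 2
```

Larger values indicate more specific terms, which is why the distance is useful as a rough guide when choosing terms to examine.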
Text::CSV DBI Getopt::Long GO::Parser MLDBM Fcntl Text::Wrap
Some of these will likely be installed as standard with perl on your Linux distribution, others will need to be installed from the distribution repositories, or from CPAN.
On an Ubuntu 10.04 system the non-standard modules can be installed by issuing
apt-get install libgo-perl libdbi-perl libtext-csv-perl
Place GO2MSIG into any appropriate directory in your PATH, or cd into a directory containing it and issue ./go2msig to invoke it.
go2msig -help
This will either display help information or an error indicating which required libraries are not present.
To test without having to install any databases, download additional files or access remote databases, you can use the cached version of the GO ontology contained in this directory along with the supplied E. coli gene annotation file. From this directory issue:
go2msig -termsource cache -cachefile examples/testcache -assocsource gaf -assocfile examples/ecoli_gaf -tax 511145 -e all
This should display a collection of E. coli gene sets generated from the cached GO ontology data and the E. coli GAF file.
If you have access to a GO database mirror at your institution or elsewhere you can test with it. The example below uses the mirror at ebi.ac.uk as the term source and the supplied E. coli GAF file as the association source.
go2msig -termsource godb -godb 'dbi:mysql:go_latest:mysql.ebi.ac.uk:4085;mysql_compression=1' -gouser 'go_select' -gopass 'amigo' -assocsource gaf -assocfile examples/gene_association.ecocyc -tax 83333 -e all
Similarly to use the ebi.ac.uk mirror as both the term source and as the association source, issue:
go2msig -termsource godb -godb 'dbi:mysql:go_latest:mysql.ebi.ac.uk:4085;mysql_compression=1' -gouser 'go_select' -gopass 'amigo' -assocsource godb -tax 511145 -e all
apt-get install mysql-server mysql
Download the current go_yyyymm-assocdb-tables.tar.gz and follow instructions in the README file to install.
Create the database user used by GO2MSIG. GO2MSIG assumes the local installation is called 'mygo', the user is 'gouser' and the password is 'amigo'. Bring up a mysql command line prompt for a user which can create and populate databases. e.g. for systems without a mysql root user password, as root, issue:
mysql
The prompt should switch to 'mysql>'. To create the 'gouser' user and assign access rights, issue:
create user 'gouser'@'localhost' identified by 'amigo' ;
grant select on mygo.* to 'gouser'@'localhost' ;
Download gene2go.gz and gene_info.gz from ftp://ftp.ncbi.nih.gov/gene/DATA/. Uncompress these:
gunzip gene2go.gz
gunzip gene_info.gz
Create the bioannotation database and its tables. From the GO2MSIG distribution folder issue:
echo "create database bioannotation" | mysql
cat bioannotation/create_tables.sql | mysql bioannotation
This will create the database, tables and indexes.
Bring up a mysql> prompt as above. Assuming you have already created the user 'gouser' above, to assign access rights for the user to access the bioannotation database issue:
grant select on bioannotation.* to 'gouser'@'localhost' ;
To load the data issue:
use bioannotation ;
LOAD DATA LOCAL INFILE 'gene_info' INTO TABLE geneinfo FIELDS ESCAPED BY '' IGNORE 1 LINES;
LOAD DATA LOCAL INFILE 'gene2go' INTO TABLE gene2go FIELDS ESCAPED BY '' IGNORE 1 LINES;
Any warnings which occur can be displayed by issuing 'show warnings'. If these consist of truncation warnings to the modification date or other_designations that isn't a problem.
go2msig -termsource godb -assocsource godb -tax 511145 -e all
To generate a similar collection of E. coli gene sets using the local GO database as a term source, and a local MySQL based gene2go table installed as described above, issue:
go2msig -termsource godb -assocsource ncbi -tax 511145 -geneid id -e all
This will generate a collection of E. coli gene sets using the Entrez gene ID as the identifier.
If you have also installed a MySQL based geneinfo table you can test this by generating the same gene set using the gene symbol as the identifier:
go2msig -termsource godb -assocsource ncbi -tax 511145 -geneid symbol -e all
By default the program uses the same subset of allowed evidence codes as the GO based gene set collections provided at MSigDB. For many less well characterised species almost all GO term associations have the automated electronic annotation code 'IEA' and so using the '-e all' switch (as for the examples above) would be advised in this case.
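The effect of the evidence code selection can be sketched in Python as follows. The default code set is the one listed in the option reference; the function name, the tuple layout and the example annotations are hypothetical and do not reflect GO2MSIG's internal data structures.

```python
# Default evidence codes as listed for the -evidence option.
DEFAULT_CODES = {"IDA", "IPI", "IMP", "IGI", "IEP", "ISS", "TAS", "EXP"}

def filter_by_evidence(annotations, codes=None):
    """Filter (gene, term, evidence_code) tuples.

    codes: None for the default set, the string 'all', or an iterable of codes.
    """
    if codes == "all":
        return list(annotations)
    allowed = DEFAULT_CODES if codes is None else set(codes)
    return [a for a in annotations if a[2] in allowed]

# Hypothetical annotations for a poorly characterised species: mostly IEA.
annotations = [
    ("geneA", "GO:0006810", "IEA"),
    ("geneB", "GO:0006810", "IEA"),
    ("geneC", "GO:0006810", "IDA"),
]
print(len(filter_by_evidence(annotations)))         # 1
print(len(filter_by_evidence(annotations, "all")))  # 3
```

With the default codes only one of the three associations survives, which shows why '-e all' is advisable when most annotations are electronic.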
The user can supply a mapping file to translate identifiers used in the original association source to identifiers of the user's choice. The mapping file is a tab separated key value list where the key (1st column) is the identifier as it appears in the association source, and the value (2nd column) is the identifier to be output in the final result. If the same key exists multiple times in the map file with different values, the original identifier will be expanded into each value. By default, if an identifier in the association source is not represented in the mapping file, the original identifier will be output. A mapping file is useful, for example, when the association source uses inconsistent identifiers. Say, for instance, the user wishes to generate a collection of gene sets for Rhodococcus jostii RHA1 from the GO database. A basic query would be:
go2msig -assocsource godb -tax 101510 -e all
The majority of the genes output are identified by an abstract ID of the form RO#####. However a small fraction are instead identified by a gene symbol. The Rhodococcus RHA1 array annotation file available from GEO (GPL3918) maps probe IDs consistently to the RO##### form of the gene ID. Thus it would be ideal to provide a mapping file that could translate the small number of gene symbols in the gene sets to the standard gene ID format for compatibility with the annotation file when used in GSEA. The annotation file from GEO itself lists a number of gene symbols and a basic mapping file can be extracted from this. More complete mapping would require some level of manual curation. Example files are available in the examples directory. The go2msig command would be:
go2msig -assocsource godb -tax 101510 -e all -mapfile examples/rha1_mapfile.txt
If the -repress switch is set when a user mapping file is being used, an identifier in the association source that is not present in the mapping file is excluded from the gene set. One example usage of this is generating gene sets using the Affymetrix E. coli 2 array annotation file as the association source. The array contains probes made to the gene complement of 4 different E. coli strains. The array annotation file maps GO terms to probe identifiers, and so if used in default fashion without a mapping file the gene sets will contain probe ids. If the user wishes to use the GSEA ability to 'collapse' probe ids to gene ids then a chip file can be supplied to GSEA that maps the probe ids to the Entrez gene ids for the E. coli species required. In this case the gene sets need to contain Entrez gene ids, not probe ids. Because the array represents multiple species some of the probe ids will not correspond to genes for the particular E. coli species in question. Using the 'repress' flag in conjunction with a mapping file that maps the probe ids to gene ids, probes to genes not present in the E. coli species in question are omitted from the resultant gene sets.
The examples directory contains an E. coli annotation file and a user map file that will generate an E. coli MG1655 specific gene set from the E. coli array annotation file. You will need to uncompress this with:
gunzip E_coli_2.na30.annot.txt.gz
Running the command without the -repress switch as below will map probe ids to Entrez gene IDs where possible, but will display probe ids for those probes without an E. coli MG1655 gene mapping. e.g.:
go2msig -assocsource affy -assocfile examples/E_coli_2.na30.annot.txt -mapfile examples/k12_user_map
The presence of both unmapped probe ids and Entrez gene ids is obvious in the output. Instead, running the command with the -repress switch will produce gene sets containing exclusively E. coli MG1655 gene ids:
go2msig -assocsource affy -assocfile examples/E_coli_2.na30.annot.txt -mapfile examples/k12_user_map -repress
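The mapping rules described above (expansion of repeated keys, pass-through of unmapped identifiers by default, and their removal when -repress is set) can be summarised in a short Python sketch. The function names, probe ids and gene ids are invented for illustration, and removing duplicates from the translated list is an assumption rather than documented behaviour.

```python
from collections import defaultdict

def load_map(lines):
    """Parse tab separated key/value lines; a key may map to several values."""
    mapping = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        key, value = fields[0], fields[1]
        if value not in mapping[key]:
            mapping[key].append(value)
    return dict(mapping)

def translate(identifiers, mapping, repress=False):
    """Translate identifiers; unmapped ones are kept unless repress is True."""
    out = []
    for ident in identifiers:
        if ident in mapping:
            out.extend(mapping[ident])  # expand into every mapped value
        elif not repress:
            out.append(ident)           # default: keep the original identifier
    return list(dict.fromkeys(out))     # drop duplicates, keep order (assumption)

# Hypothetical map file content and probe ids.
map_lines = ["probe1\t1001\n", "probe2\t1002\n", "probe2\t1003\n"]
mapping = load_map(map_lines)
probes = ["probe1", "probe2", "probe3"]

print(translate(probes, mapping))                # ['1001', '1002', '1003', 'probe3']
print(translate(probes, mapping, repress=True))  # ['1001', '1002', '1003']
```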
go2msig -assocsource ncbi -q species
Similarly for the go database:
go2msig -assocsource godb -q species
The latter will list hundreds of thousands of species, so it's best to filter for those of interest, e.g. to find all members of the genus Rhodococcus:
go2msig -assocsource godb -q species | grep -i 'rhodococcus'
If the gene association data includes term IDs that are obsolete according to the current source of term information, or do not exist in the current source of term information then a warning message will also be displayed. In this case the obsolete or nonexistent term will not be output in the final gene sets as it cannot be placed into the GO term hierarchy.
#Taxon  Full Name
176299  Agrobacterium fabrum str. C58
234826  Anaplasma marginale str. St. Maries
212042  Anaplasma phagocytophilum str. HZ
3702    Arabidopsis thaliana
227321  Aspergillus nidulans FGSC A4
198094  Bacillus anthracis str. Ames
9913    Bos taurus
6239    Caenorhabditis elegans
195099  Campylobacter jejuni RM1221
246194  Carboxydothermus hydrogenoformans Z-2901
227377  Coxiella burnetii RSA 493
214684  Cryptococcus neoformans var. neoformans JEC21
7955    Danio rerio
243164  Dehalococcoides ethenogenes 195
352472  Dictyostelium discoideum AX4
7227    Drosophila melanogaster
205920  Ehrlichia chaffeensis str. Arkansas
511145  Escherichia coli str. K-12 substr. MG1655
9031    Gallus gallus
243231  Geobacter sulfurreducens PCA
9606    Homo sapiens
265669  Listeria monocytogenes serotype 4b str. F2365
243233  Methylococcus capsulatus str. Bath
10090   Mus musculus
222891  Neorickettsia sennetsu str. Miyayama
40149   Oryza meridionalis
4536    Oryza nivara
4529    Oryza rufipogon
39946   Oryza sativa Indica Group
39947   Oryza sativa Japonica Group
36329   Plasmodium falciparum 3D7
223283  Pseudomonas syringae pv. tomato str. DC3000
10116   Rattus norvegicus
246200  Ruegeria pomeroyi DSS-3
559292  Saccharomyces cerevisiae S288c
284812  Schizosaccharomyces pombe 972h-
211586  Shewanella oneidensis MR-1
999953  Trypanosoma brucei brucei strain 927/4 GUTat10.1
-ontology [list of ontologies] : Takes a comma separated list of ontologies. Possible values are 'cc', 'mf' and 'bp' for the cellular component, molecular function and biological process ontologies respectively. Default is 'cc,mf,bp'.

-assocsource ['ncbi'|'godb'|'affy'|'agilent'|'gaf'] : Specifies the source of gene association data. The options are a local MySQL install of the NCBI gene2go table, a MySQL install of the GO database (local or remote), an Affymetrix array annotation file, an Agilent array annotation file, or a GO gene annotation file. Default is 'ncbi'.

-assocfile [filename] : The file containing the mapping between GO terms and probesets or genes. This is used with the -assocsource 'affy', 'agilent' or 'gaf' options.

-query ['geneset'|'species'|'makecache'] : 'geneset' will generate a gene set collection in MSigDB format. 'species' will return a list of species that have associated GO annotations in the database being searched. 'makecache' will generate a cache of the GO ontology from whichever termsource is selected, providing a dramatic speed up of future searches if using a slow database server or large OBO file as the term source. Default is 'geneset'.

-cachefile [filename] : Root of the filename to use for the cached file. Four files are generated: filename.termnames.cache, filename.children.cache, filename.obsids.cache and filename.altids.cache.

-termsource ['godb'|'obofile'|'cache'] : Specifies the source of the GO ontology hierarchy. Primary sources are a GO database or an OBO file. If a cache has previously been generated using the -q makecache switch then 'cache' can be specified. Default is 'godb'.

-obofile [filename] : Name of the OBO file if using -termsource obofile.

-godb [database connector] : Connection string for the GO database if one is being used. This is used in conjunction with the -gouser and -gopass switches. The example switches for connection to the EBI implementation would be: -godb 'dbi:mysql:go_latest:mysql.ebi.ac.uk:4085;mysql_compression=1' -gouser 'go_select' -gopass 'amigo'. If not set on the command line, go2msig defaults to using the standard local install of the GO database as described in the installation instructions.

-gouser [username] : Username for the GO MySQL database.

-gopass [password] : Password for the GO MySQL database.

-evidence [list of evidence codes] : Takes a comma separated list of evidence codes which are searched for. This is ignored when using Affymetrix or Agilent annotation files as the association source. Can be 'all' for all codes. Can be negated by prefixing codes with ! (in which case you may need to put the code list in quotes). Full list of codes is EXP, IC, IDA, IEA, IEP, IGC, IGI, IMP, IPI, ISA, ISM, ISO, ISS, NAS, ND, NR, RCA, TAS, IBD, IBA, IKR, IRD. Default is 'IDA, IPI, IMP, IGI, IEP, ISS, TAS, EXP'.

-taxid [ncbi taxon id] : The NCBI taxon number of the species/strain for which the gene set collection is being built.

-format ['gmt'|'gmx'] : Selects gene matrix (gmx) or gene matrix transposed (gmt) format for the output. See the GSEA data format documentation for an explanation of these. Default is 'gmt'.

-maxgenes [maximum number of genes] : Gene sets where the number of genes is greater than this value are excluded. Default value is 700.

-mingenes [minimum number of genes] : Gene sets where the number of genes is less than this value are excluded. Default value is 10.

-nochild : If this option is set then only genes directly associated with GO terms are included in the gene set. If the option is not set then genes associated with child terms of the GO term in question will also be included in the set. Default is unset.

-mapfile [mapfile name] : Optional file which contains tab separated key value pairs for mapping the identifiers (derived from the originating NCBI or GO database, or chip annotation file) in the final gene sets to the value defined in the user supplied map file. The NCBI/GO value is used as the key. If no key exists and the -repress switch is NOT set, the existing value will be output. If the same key exists more than once in the mapping file with different values, the original identifier will be expanded into each of the relevant values.

-repress : By default, if an original gene identifier does not have an entry in the mapfile it is left untranslated. If the repress flag is used it will instead be removed from the output. This can be used to extract single species gene sets from Affymetrix arrays that contain probes for multiple species. Default is unset.

-geneid ['id'|'symbol'] : When obtaining associations from the NCBI gene2go table, output either the gene id (direct from the gene2go table), or translate the gene id to the gene symbol (using the geneinfo table). Alternatively if obtaining gene associations from an OBO file, use the symbol column, or the id column as the source of the gene identifier. Default is 'symbol'.

-help : Display this message.
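To show how the -mingenes and -maxgenes limits interact with the default 'gmt' output, the Python sketch below writes one tab separated line per gene set (name, description, then the genes). The function name, the example terms and the 'na' placeholder for a missing description are assumptions. GO2MSIG's real output may differ in detail.

```python
def to_gmt(gene_sets, descriptions, mingenes=10, maxgenes=700):
    """Render gene sets as GMT lines, skipping sets outside the size limits."""
    lines = []
    for name, genes in gene_sets.items():
        if not (mingenes <= len(genes) <= maxgenes):
            continue  # too small or too large: excluded from the collection
        fields = [name, descriptions.get(name, "na")] + sorted(genes)
        lines.append("\t".join(fields))
    return "\n".join(lines)

# Hypothetical gene sets; mingenes is lowered so the small example passes the filter.
example_sets = {"TERM_A": {"g1", "g2", "g3"}, "TERM_B": {"g1"}}
example_descriptions = {"TERM_A": "example term (2)"}
gmt = to_gmt(example_sets, example_descriptions, mingenes=2)
print(gmt)
```

Here TERM_B is dropped because it contains fewer genes than the minimum, and TERM_A is written as a single line.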