genealiases

Copyright 2008 Roney S. Coimbra

How to contact the author: roney.s.coimbra@ufrnet.br

genealiases is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, version 3 of the License.

genealiases is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with genealiases (file: COPYING).  If not, see <http://www.gnu.org/licenses/>.



Before you start, replace the first line of every program with the path to your Perl interpreter.

example:
#!/usr/bin/perl -w

Make sure that the following Perl modules are installed on your system:

LWP::Simple
Lingua::EN::Inflect
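
A quick way to confirm that both modules can be loaded is sketched below. It is not part of the distribution; the helper name module_available is only illustrative.

```perl
use strict;
use warnings;

# Return 1 if the named module can be loaded by this Perl, 0 otherwise.
sub module_available {
    my ($module) = @_;
    # Turn "LWP::Simple" into "LWP/Simple.pm", the form require expects
    # when it is given a string instead of a bareword.
    (my $file = "$module.pm") =~ s{::}{/}g;
    return eval { require $file; 1 } ? 1 : 0;
}

for my $module (qw(LWP::Simple Lingua::EN::Inflect)) {
    printf "%-22s %s\n", $module,
        module_available($module) ? 'installed' : 'MISSING';
}
```

Any module reported as MISSING can usually be installed from CPAN before running genealiases.pl.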

The package is composed of a series of programs that are launched in cascade by the controller, genealiases.pl. Examples of input and output files are provided in the folder “Example”.

genealiases.pl

This interactive program will ask you for the parameters below and launch the other programs in cascade:

Arguments: 

list of canonical gene names and their aliases (official symbol followed by its aliases and three unrelated official symbols, tab separated; one gene per line; you can test the program using test10genes, in folder Example)

baseline abstract set (gene name, PMID, title + abstract; tab separated; one per line - you can use baseline_abstracts.tab, in folder Aditional_files of this distribution)

process baseline abstract set? (stores results in DBM format)

dictionary of terms to be excluded by default (you can use excluded_terms.txt, in folder Aditional_files of this distribution. You can also edit it to add your own terms, one per line)

maximum number of abstracts per query to be fetched from Medline at NCBI - default = 100

use filters (yes or no) - default = yes

select entries with at least X abstracts in PubMed - default = 1

select words occurring in at least ms items (genes, aliases, etc.) - default = 2

cut-off value for baseline - default = 0.05

t value for filter 2 - default = 0.15
k value for filter 2 - default = 1.5

A term passes this filter if:
1) the term does not occur in the baseline and its frequency >= t + (k / number of abstracts), or
2) the term occurs in the baseline, its frequency in the baseline <= baseline cut-off, and (term frequency - term frequency in baseline) >= t + (k / number of abstracts).

For entities with 5 or fewer abstracts, “number of abstracts” is set to 5.
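
The two conditions of filter 2 can be sketched in Perl as follows. This is an illustration written from the description above, not code taken from the package. The function name passes_filter2 and its named arguments are hypothetical, and frequencies are assumed to be fractions between 0 and 1, which this README does not state.

```perl
use strict;
use warnings;

# Illustrative sketch of filter 2; all names here are hypothetical.
# Arguments: freq, baseline_freq (undef if the term is absent from the
# baseline), n_abstracts, and optionally t, k and cutoff.
sub passes_filter2 {
    my (%arg) = @_;
    my $t      = defined $arg{t}      ? $arg{t}      : 0.15;
    my $k      = defined $arg{k}      ? $arg{k}      : 1.5;
    my $cutoff = defined $arg{cutoff} ? $arg{cutoff} : 0.05;

    # Entities with 5 or fewer abstracts are treated as having 5.
    my $n = $arg{n_abstracts} < 5 ? 5 : $arg{n_abstracts};
    my $threshold = $t + $k / $n;

    my $freq     = $arg{freq};
    my $baseline = $arg{baseline_freq};

    # Condition 1: term does not occur in the baseline.
    if (!defined $baseline) {
        return ($freq >= $threshold) ? 1 : 0;
    }
    # Condition 2: term is rare in the baseline and enriched in the entity.
    return ($baseline <= $cutoff && $freq - $baseline >= $threshold) ? 1 : 0;
}

# With 10 abstracts and default t and k, the threshold is 0.15 + 1.5/10 = 0.30.
print passes_filter2(freq => 0.40, n_abstracts => 10), "\n";
print passes_filter2(freq => 0.40, baseline_freq => 0.20, n_abstracts => 10), "\n";
```

In the two calls above, the first term passes through condition 1, while the second is rejected because its baseline frequency exceeds the cut-off of 0.05.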

discard words present in the abstracts of more than 1/f4 of all entries - default = 1
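
One possible reading of the 1/f4 rule is sketched below: a word is kept only if the number of entries whose abstracts contain it does not exceed (number of entries) / f4. The function name keep_words and the input layout (a hash of per-word entry counts) are assumptions made for this example, not the package's internal format.

```perl
use strict;
use warnings;

# Illustrative sketch; names and data layout are hypothetical.
# $entry_count_for: hash ref mapping word => number of entries whose
#                   abstracts contain that word
# $n_entries:       total number of entries
# $f4:              the f4 parameter (default 1)
sub keep_words {
    my ($entry_count_for, $n_entries, $f4) = @_;
    $f4 = 1 unless defined $f4;
    my $limit = $n_entries / $f4;
    my @kept = grep { $entry_count_for->{$_} <= $limit } keys %$entry_count_for;
    my @sorted = sort @kept;
    return @sorted;
}

# Made-up counts for 4 entries; with f4 = 2 the limit is 2 entries.
my %count = (cell => 4, kinase => 2, apoptosis => 1);
print join(', ', keep_words(\%count, 4, 2)), "\n";   # prints "apoptosis, kinase"
```

Under this reading, the default f4 = 1 sets the limit to the total number of entries, so no word is discarded by this step.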

The output comprises: 

1) one folder per official gene symbol containing:
 	*.tab (text corpora);
 	*.wordfreq (table of term frequencies X gene name/aliases)

2) *.jac (Jaccard distances)

3) *.res (alias classification: “ambiguous” or “synonym”; internal controls are labelled “unrelated”)
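
For readers unfamiliar with the measure stored in the *.jac files, the standard Jaccard distance between two sets is 1 - |A ∩ B| / |A ∪ B|. The sketch below computes it for two lists of terms. Which sets genealiases actually compares is not specified in this README, so the term lists and the function name jaccard_distance are assumptions for illustration only.

```perl
use strict;
use warnings;

# Standard Jaccard distance between two lists treated as sets.
sub jaccard_distance {
    my ($set_a, $set_b) = @_;
    my %in_a  = map { $_ => 1 } @$set_a;
    my %in_b  = map { $_ => 1 } @$set_b;
    my %union = (%in_a, %in_b);
    my $n_union = keys %union;
    # Two empty sets are treated as identical (a choice made for this sketch).
    return 0 if $n_union == 0;
    my $n_inter = grep { $in_b{$_} } keys %in_a;
    return 1 - $n_inter / $n_union;
}

# Made-up term sets: 2 shared terms out of 5 distinct terms overall.
my @official = qw(kinase tumor apoptosis);
my @alias    = qw(kinase apoptosis membrane receptor);
printf "%.2f\n", jaccard_distance(\@official, \@alias);   # prints 0.60
```

A distance of 0 means the two term sets are identical and a distance of 1 means they share no terms.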