Immigrant: The database of immigrant genes

Information:

The Immigrant genes database contains genes that have immigrated or been transferred from one species to another. Such knowledge is useful for studies that rely on phylogenetic and/or sequence similarity information.

Currently the database contains genes predicted to have immigrated from Proteobacteria or Cyanobacteria to eukaryotes. Many of these are mitochondrial and chloroplast genes and genes that have been transferred from these organelles to the nucleus.

The database is divided into sections, where each section represents the transfer of genes from one group of organisms to another (e.g., alpha Proteobacterial genes that have been transferred to eukaryotes).

The method for predicting these genes is not perfect and can miss some immigrants. At present we are trying to minimize false positives more than false negatives. Some potential false negatives can be found by looking at genes that were not predicted to be immigrants but that are annotated in Swissprot as functioning in the mitochondrion or chloroplast. A gene that functions in an organelle was not necessarily transferred from a bacterium, but many of these genes probably came from Proteobacteria or Cyanobacteria. Here is a list of some genes that were not predicted by our methods but had a Swissprot entry with a TRANSIT key in the feature table, the string "mitoch" or "chloro" in the SUBCELLULAR LOCATION, or the string "mitoch" or "chloro" in the OG field. A TRANSIT key indicates that the protein has a transit peptide (mitochondrial, chloroplastic, thylakoid, cyanelle or microbody). The OG field indicates which non-nuclear DNA the gene is encoded on. The "SUBCELLULAR LOCATION" topic of the comment block describes the subcellular location of the mature protein.
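The three Swissprot criteria described above can be sketched in Python. This is only an illustration, not the code used to build the list: it assumes the entries are available as Swissprot flat-file text (two-letter line codes, "//" between entries), and every function name below is hypothetical.

```python
# Sketch only: flag Swissprot flat-file entries that carry organelle evidence
# (TRANSIT feature key, or "mitoch"/"chloro" in SUBCELLULAR LOCATION or OG)
# but are absent from a set of predicted immigrants.
# All function names are hypothetical, chosen for this illustration.

ORGANELLE_STRINGS = ("mitoch", "chloro")


def split_entries(flat_text):
    """Yield each entry as a list of lines; entries end with a '//' line."""
    entry = []
    for line in flat_text.splitlines():
        if line.startswith("//"):
            if entry:
                yield entry
            entry = []
        elif line.strip():
            entry.append(line)
    if entry:
        yield entry


def mentions_organelle(text):
    """True if the text contains 'mitoch' or 'chloro' (case-insensitive)."""
    text = text.lower()
    return any(s in text for s in ORGANELLE_STRINGS)


def organelle_evidence(entry):
    """Return the set of criteria (by name) that this entry satisfies."""
    reasons = set()
    in_subcellular = False
    for line in entry:
        code, content = line[:2], line[5:]
        # The feature key sits in columns 6-13 of an FT line;
        # continuation lines leave those columns blank.
        if code == "FT" and line[5:13].strip() == "TRANSIT":
            reasons.add("TRANSIT")
        elif code == "OG" and mentions_organelle(content):
            reasons.add("OG")
        elif code == "CC":
            # A new comment topic starts with '-!-'; other CC lines continue it.
            if content.startswith("-!-"):
                in_subcellular = content.startswith("-!- SUBCELLULAR LOCATION")
            if in_subcellular and mentions_organelle(content):
                reasons.add("SUBCELLULAR LOCATION")
    return reasons


def entry_name(entry):
    """Return the entry name from the ID line, or None if there is none."""
    for line in entry:
        if line.startswith("ID"):
            return line[5:].split()[0]
    return None


def candidate_false_negatives(flat_text, predicted_immigrants):
    """Map entry name -> criteria met, for entries not already predicted."""
    candidates = {}
    for entry in split_entries(flat_text):
        name = entry_name(entry)
        reasons = organelle_evidence(entry)
        if reasons and name not in predicted_immigrants:
            candidates[name] = reasons
    return candidates
```

Calling candidate_false_negatives(text, predicted) with the flat-file text and a set of already predicted entry names returns the remaining entries together with the criteria each one met, which makes it easy to see why a gene was listed.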

The general reason for false negatives is that the bacterial relative is not sufficiently similar to the eukaryotic gene because of differing mutation rates. Other horizontal gene transfers often confound the prediction as well. For example, a mitochondrial gene could be transferred to a bacterial parasite of eukaryotes, making the gene look more similar to the gene in this bacterium than to the one in the Proteobacteria. In some cases the gene in the Proteobacteria could have been transferred to another bacterium or vice versa. Even if this happened before the mitochondrion evolved, these bacterial genes could be closer to each other than the proteobacterial gene is to the mitochondrial gene.

If you find false positives/negatives or have questions please let me know.

Overview of how the immigrant genes are predicted:

For each bacterial relative (e.g., each gene in the Rickettsia prowazekii genome) the following steps are performed:

The bacterial protein is used as a BLAST query against a database containing all the proteins in Swissprot, Trembl and Trembl_new.
The alignments from the BLAST output with e-values less than or equal to 1e-3 are extracted into a multiple sequence alignment.
If there are fewer than 100 sequences, they are re-aligned using clustalw.
Columns in the alignment containing more than 50% gaps are removed. Sequences that have residues in fewer than 75% of the remaining columns are then removed.
A bootstrapped neighbor-joining tree is calculated from this alignment using clustalw.
The tree is rooted at the point that is equidistant from the two farthest points of the tree.
If a branch in the tree contains ONLY the query, zero or more other Proteobacteria, and eukaryotes, the eukaryotic genes in that branch are predicted to be immigrant genes from Proteobacteria to eukaryotes.
Various manual methods are used to remove false positives and add false negatives.
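
Two of the steps above, the alignment filtering and the branch rule, can be sketched in Python. This is an illustration under stated assumptions, not the pipeline's actual code: the alignment is assumed to be a dict of equal-length gapped strings, the tree is assumed to be already rooted and written as nested tuples of leaf names, group_of is a hypothetical mapping from leaf name to "proteobacteria", "eukaryote" or "other", and the sketch reports the largest qualifying branch that contains the query. The function names are made up for the example; BLAST and clustalw are not called.

```python
# Sketch only: alignment filtering and the "ONLY query, Proteobacteria and
# eukaryotes" branch rule. Data structures and names are assumptions.


def trim_alignment(alignment, max_gap_fraction=0.5, min_coverage=0.75, gap="-"):
    """Drop gappy columns, then drop sequences with too few residues.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    """
    if not alignment:
        return {}
    names = list(alignment)
    length = len(alignment[names[0]])
    # Keep a column unless MORE than 50% of its characters are gaps.
    kept = [
        col for col in range(length)
        if sum(alignment[n][col] == gap for n in names) / len(names)
        <= max_gap_fraction
    ]
    if not kept:
        return {}
    trimmed = {n: "".join(alignment[n][col] for col in kept) for n in names}
    # Keep a sequence only if it has residues in at least 75% of kept columns.
    return {
        n: seq for n, seq in trimmed.items()
        if (len(seq) - seq.count(gap)) / len(kept) >= min_coverage
    }


def leaves(tree):
    """Return the leaf names of a nested-tuple tree (a leaf is a string)."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def predict_immigrants(tree, query, group_of):
    """Return eukaryotic leaves of the largest qualifying branch, or set().

    A branch qualifies if it contains the query, all of its leaves are
    Proteobacteria or eukaryotes, and at least one leaf is a eukaryote.
    """
    members = leaves(tree)
    if query not in members:
        return set()
    eukaryotes = {m for m in members if group_of[m] == "eukaryote"}
    only_allowed = all(
        group_of[m] in ("proteobacteria", "eukaryote") for m in members
    )
    if only_allowed and eukaryotes:
        return eukaryotes
    if isinstance(tree, str):
        return set()
    # Otherwise look for a smaller branch that still contains the query.
    for child in tree:
        found = predict_immigrants(child, query, group_of)
        if found:
            return found
    return set()
```

For example, with the tree ((("query", "euk_mito"), "proteo2"), ("other_bact", "euk_nuclear")), the whole tree does not qualify because it contains a non-proteobacterial bacterium, but the branch holding the query, euk_mito and proteo2 does, so only euk_mito is reported.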