[BiO BB] ortholog

Wed Sep 12 15:49:14 EDT 2007

>I'm trying to find orthologs for thousands of genes in
>ENSEMBL. It seems hard to find a clear and direct way to achieve this. Any
>suggestion is welcome (or you can suggest a better database for
>orthologs)!
>

I'm not sure whether these are competitive with the web-based tools yet,
but I'm developing a set of scripts for automated search and analysis. If
anyone cares to comment on the strengths or limitations of existing tools,
it may help me fill some gaps (and make these things useful to others).
Essentially everything uses the NCBI eutils facilities, supplemented with
some local databases or rules. For example, I have a set of scripts that
find a contig in the dog genome and put it into something called ex_fasta.
Then I bury a series of blast searches and text formatting in a one-liner
(the option names are a bit odd because I make them up from prior
combinations as needed, in a task-specific way):

$progpath/findhomologues -de_novo_stuff ex_fasta

The above also creates a bmp file with a bunch of annotations and clustalw 
alignments between blast hits to various databases including some local 
repeat and probe collections.
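
For readers who want to reproduce the contig-fetching step, here is a
minimal sketch of building an NCBI eutils efetch request for a region of
a contig. It is not the script used above (that is not shown here); the
function name and the choice of the "nuccore" database are assumptions
for illustration, and the accession and coordinates are taken from the
sample output later in this message. It only builds the URL and does not
contact NCBI.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities efetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accession, start, stop, db="nuccore"):
    """Build an efetch URL for a FASTA sub-sequence (hypothetical helper)."""
    params = {
        "db": db,              # assumed database name
        "id": accession,
        "seq_start": start,    # 1-based start coordinate
        "seq_stop": stop,      # 1-based end coordinate
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


url = efetch_fasta_url("NW_876253.1", 47189155, 47195387)
print(url)
```

Saving the response body of that URL to a file would give something
comparable to ex_fasta, assuming ex_fasta is plain FASTA.
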
Right now, I'm adding a rule-based alignment and annotation system. I've
got a collection of Perl regex patterns in an XML file, along with
bibliographic info (where each came from, etc.), that I can parse into
something simple:
./yaxml -parse rule_source.xml -rules > algn_rules

$ cat algn_rules
ATG >rule|1|DNA Start Codon
(?<=TATA.*)(GT.*?AT)(?=.*ATAAA) >rule|4|DNA Composite Introns
ATG(...)*?(TAG|TAA|TGA) >rule|5|DNA Euk ORF
MGSGSSS >rule|9|PEPTIDE N-myristoylation pattern
[CA](AG|GTA|GTG)AGT >rule|10|DNA? splice donor
[CT]+[A-Z][CT]A{0,1}G >rule|11|DNA? splice acceptor
N[^P][ST][^P] >rule|14|PEPTIDE Glycosylation site
[ST].N. >rule|15|PEPTIDE Glycosylation site
Y..[LI].{6,8}Y..[LI] >rule|16|PEPTIDE ITAM,Fc cytoplasmic tail
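
The rules listing above has a simple shape: a pattern, a space, then a
FASTA-style label of the form ">rule|ID|TYPE name". A minimal sketch of
reading that format in Python follows. The Rule type and parse_rule_line
are hypothetical names, not part of yaxml or rules_annotater, and the
split on " >rule|" assumes no pattern contains that string. Rule 4 is
left out of the sample because its lookbehind is variable-width, which
Python's re module does not accept.

```python
import re
from typing import NamedTuple


class Rule(NamedTuple):
    rule_id: int
    kind: str      # e.g. "DNA", "DNA?", "PEPTIDE"
    name: str      # free-text description
    pattern: str   # regular expression source


def parse_rule_line(line):
    """Split 'PATTERN >rule|ID|TYPE name' into its parts."""
    pattern, _, label = line.rstrip("\n").partition(" >rule|")
    rule_id, _, desc = label.partition("|")
    kind, _, name = desc.partition(" ")
    return Rule(int(rule_id), kind, name, pattern)


sample = """\
ATG >rule|1|DNA Start Codon
ATG(...)*?(TAG|TAA|TGA) >rule|5|DNA Euk ORF
[CA](AG|GTA|GTG)AGT >rule|10|DNA? splice donor
N[^P][ST][^P] >rule|14|PEPTIDE Glycosylation site
"""

rules = [parse_rule_line(l) for l in sample.splitlines() if l.strip()]
for r in rules:
    re.compile(r.pattern)  # raises re.error if Python cannot use the pattern
    print(r.rule_id, r.kind, r.name, r.pattern)
```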

These are then used as alignment cues:

$progpath/rules_annotater -clean -which 1 -fastas o2_fasta -rules $progpath/align_rules > r3nunu2

The output, as text or graphical bmp files, is either alignments or just
stats:

$ $progpath/mm_align_tool -fastas o2_fasta -rules r3nunu -rules r3nunu2 -use_rule 4 -stats -align -output notes
For Rules set 0:>ref|NW_876253.1|Cfa11_WGA39_2:47189155-47195387 Canis familiaris chromosome 11 genomic contig, whole genome shotgun sequence
388        >rule|2|DNA Stop Codon
344        >rule|11|DNA? splice acceptor
189        >rule|4|DNA Composite Introns
128        >rule|1|DNA Start Codon
58         >rule|5|DNA Euk ORF
34         >rule|6|DNA Euk spliced ORF
15         >rule|12|DNA? polyadenlyation signal
6          >rule|3|DNA TATA box
3          >rule|10|DNA? splice donor
For Rules set 1:>gb|AACN010493556.1|:1-1146 Canis familiaris ctg19866850213054, whole genome shotgun sequence
72         >rule|11|DNA? splice acceptor
60         >rule|2|DNA Stop Codon
24         >rule|1|DNA Start Codon
21         >rule|4|DNA Composite Introns
6          >rule|5|DNA Euk ORF
5          >rule|6|DNA Euk spliced ORF
1          >rule|10|DNA? splice donor
1          >rule|12|DNA? polyadenlyation signal
1          >rule|3|DNA TATA box
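
A per-rule tally like the one above can be approximated in a few lines
of Python. This is only a sketch under assumptions: the rule_stats
function and the toy sequence are made up for illustration, and it
counts non-overlapping regex matches, which may not be how
mm_align_tool counts hits.

```python
import re

# A few of the rules from the listing, keyed by their labels.
rules = {
    "rule|1|DNA Start Codon": "ATG",
    "rule|5|DNA Euk ORF": "ATG(...)*?(TAG|TAA|TGA)",
    "rule|10|DNA? splice donor": "[CA](AG|GTA|GTG)AGT",
}


def rule_stats(seq, rules):
    """Count non-overlapping matches per rule, most frequent first."""
    counts = {
        label: sum(1 for _ in re.finditer(pattern, seq))
        for label, pattern in rules.items()
    }
    hits = [(n, label) for label, n in counts.items() if n]
    # Sort by descending count, then by label to keep ties stable.
    return sorted(hits, key=lambda t: (-t[0], t[1]))


# Toy sequence invented for this example, not real genome data.
seq = "CCATGAAATGCCCTAAGGCAGAGTTTATGTGA"
for n, label in rule_stats(seq, rules):
    print(f"{n:<10} >{label}")
```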

I'm still debugging this, but the initial alignment with rules was about
what I expected. Now I'm working on automating the analysis and
interpretation. I've also got a set of test scripts that, for example, grab
two random and distinct pieces of dog or human genome and try to align or
otherwise "match" them, which is handy as a control and for finding
sequences that occur a lot.
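
As a rough illustration of the control idea, here is one way to pick two
random pieces of a sequence that are guaranteed not to overlap. It
assumes "distinct" means non-overlapping and that both pieces come from
the same sequence; the function name and parameters are hypothetical and
are not taken from the test scripts.

```python
import random


def two_distinct_windows(seq_len, window, rng=random):
    """Return two non-overlapping (start, end) windows of equal length."""
    if seq_len < 2 * window:
        raise ValueError("sequence too short for two non-overlapping windows")
    # Pick two offsets in the space left after reserving both windows,
    # then shift the second window past the first so they cannot overlap.
    free = seq_len - 2 * window
    a, b = sorted(rng.randint(0, free) for _ in range(2))
    first = (a, a + window)
    second = (b + window, b + 2 * window)
    return first, second


rng = random.Random(0)  # fixed seed so a control run can be repeated
(s1, e1), (s2, e2) = two_distinct_windows(6233, 500, rng)
print((s1, e1), (s2, e2))
```

The resulting half-open coordinates can be used to slice a sequence
string and feed the two pieces to whatever alignment step is being
tested.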
