[BiO BB] ortholog
Mike Marchywka
marchywka at hotmail.com
Wed Sep 12 15:49:14 EDT 2007
>I'm trying to find out thousands of genes' ortholog from
>ENSEMBL. Seems hard to get a clear and direct way to achieve it. Any
>suggestion is invited (or you can suggest a better database for orthologs)!
>
I'm not sure if these are competitive yet with the web-based tools, but I'm
developing a bunch of scripts for automated search and analysis. If anyone
cares to comment on strengths or limitations of existing tools, it may help
me fill some voids (and make these things useful to others).
Essentially everything uses the NCBI eutils facilities, supplemented with
some local databases or rules. For example, I have a bunch of scripts to find
a contig in the dog genome and put it into something called ex_fasta. Then I
bury a bunch of blast searches and text formatting into a one-liner (the
option names are a bit odd because I make them up out of prior combinations
as needed in a task-specific way):
$progpath/findhomologues -de_novo_stuff ex_fasta
The above also creates a bmp file with a bunch of annotations and clustalw
alignments between blast hits to various databases including some local
repeat and probe collections.
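As a rough illustration of the contig-fetching step mentioned above, here is a minimal Python sketch that builds an NCBI eutils efetch URL for a FASTA sub-range. It is not the author's script: the function name `efetch_fasta_url`, the choice of `db="nuccore"`, and the accession and coordinates (borrowed from the sample output later in this post) are assumptions made for the example.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities efetch endpoint.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_fasta_url(accession, start=None, stop=None, db="nuccore"):
    """Build an efetch URL for a FASTA record (hypothetical helper).

    start/stop are 1-based coordinates; both must be given to request
    a sub-range of the record.
    """
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    if start is not None and stop is not None:
        params["seq_start"] = start
        params["seq_stop"] = stop
    return EFETCH + "?" + urlencode(params)

url = efetch_fasta_url("NW_876253.1", 47189155, 47195387)
print(url)

# Downloading needs network access, so it is left commented out:
#   from urllib.request import urlopen
#   with open("ex_fasta", "wb") as out:
#       out.write(urlopen(url).read())
```

Building the URL separately from the download keeps the request easy to inspect before anything is sent to NCBI.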
Right now, I'm adding a rule-based alignment and annotation system. I've got
a collection of Perl regex patterns in an XML file, along with biblio info
(where it came from, etc.), that I can parse into something simple:
./yaxml -parse rule_source.xml -rules > algn_rules
$ cat algn_rules
ATG >rule|1|DNA Start Codon
(?<=TATA.*)(GT.*?AT)(?=.*ATAAA) >rule|4|DNA Composite Introns
ATG(...)*?(TAG|TAA|TGA) >rule|5|DNA Euk ORF
MGSGSSS >rule|9|PEPTIDE N-myristoylation pattern
[CA](AG|GTA|GTG)AGT >rule|10|DNA? splice donor
[CT]+[A-Z][CT]A{0,1}G >rule|11|DNA? splice acceptor
N[^P][ST][^P] >rule|14|PEPTIDE Glycosylation site
[ST].N. >rule|15|PEPTIDE Glycosylation site
Y..[LI].{6,8}Y..[LI] >rule|16|PEPTIDE ITAM,Fc cytoplasmic tail
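For readers who want to experiment with a rule file in this format, the following Python sketch parses `PATTERN >rule|ID|label` lines and reports where each pattern matches. It is only an approximation of what the tools above might do; `parse_rules` and `find_hits` are hypothetical names, and the test sequence is made up. Note that Python's `re` module rejects the variable-length lookbehind in rule 4, so the sketch records such rules as skipped rather than failing.

```python
import re

# A few lines copied from the rule listing above.
RULES_TEXT = """\
ATG >rule|1|DNA Start Codon
(?<=TATA.*)(GT.*?AT)(?=.*ATAAA) >rule|4|DNA Composite Introns
ATG(...)*?(TAG|TAA|TGA) >rule|5|DNA Euk ORF
N[^P][ST][^P] >rule|14|PEPTIDE Glycosylation site
"""

def parse_rules(text):
    """Return (rules, skipped): compiled rules and ids that failed to compile."""
    rules, skipped = [], []
    for line in text.splitlines():
        # Split at the last " >rule|" so the pattern itself is left untouched.
        pattern, sep, tail = line.rpartition(" >rule|")
        if not sep:
            continue  # not a rule line
        rule_id, _, label = tail.partition("|")
        try:
            rules.append((int(rule_id), label, re.compile(pattern)))
        except re.error:
            # e.g. variable-length lookbehind, unsupported by Python's re
            skipped.append(int(rule_id))
    return rules, skipped

def find_hits(rules, seq):
    """Return sorted (start, end, rule_id, label) tuples for every match."""
    hits = []
    for rule_id, label, rx in rules:
        for m in rx.finditer(seq):
            hits.append((m.start(), m.end(), rule_id, label))
    return sorted(hits)

rules, skipped = parse_rules(RULES_TEXT)
hits = find_hits(rules, "CCATGAAATAGCC")  # toy DNA sequence, not real data
for start, end, rule_id, label in hits:
    print(start, end, rule_id, label)
print("skipped rules:", skipped)
```

The sketch applies every rule to the sequence regardless of the DNA/PEPTIDE tag in the label; a real tool would presumably filter on that tag first.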
And use them as alignment cues:
$progpath/rules_annotater -clean -which 1 -fastas o2_fasta -rules $progpath/align_rules > r3nunu2
The results are then output to text or graphical bmp files, as either
alignments or just stats:
$ $progpath/mm_align_tool -fastas o2_fasta -rules r3nunu -rules r3nunu2 -use_rule 4 -stats -align -output notes
For Rules set 0:>ref|NW_876253.1|Cfa11_WGA39_2:47189155-47195387 Canis familiaris chromosome 11 genomic contig, whole genome shotgun sequence
388 >rule|2|DNA Stop Codon
344 >rule|11|DNA? splice acceptor
189 >rule|4|DNA Composite Introns
128 >rule|1|DNA Start Codon
58 >rule|5|DNA Euk ORF
34 >rule|6|DNA Euk spliced ORF
15 >rule|12|DNA? polyadenlyation signal
6 >rule|3|DNA TATA box
3 >rule|10|DNA? splice donor
For Rules set 1:>gb|AACN010493556.1|:1-1146 Canis familiaris ctg19866850213054, whole genome shotgun sequence
72 >rule|11|DNA? splice acceptor
60 >rule|2|DNA Stop Codon
24 >rule|1|DNA Start Codon
21 >rule|4|DNA Composite Introns
6 >rule|5|DNA Euk ORF
5 >rule|6|DNA Euk spliced ORF
1 >rule|10|DNA? splice donor
1 >rule|12|DNA? polyadenlyation signal
1 >rule|3|DNA TATA box
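A per-rule tally like the one above can be approximated in a few lines of Python. This is a sketch under assumptions, not the output logic of mm_align_tool: it counts non-overlapping regex matches, uses three patterns copied from the rule listing earlier in this post, and runs on a made-up sequence. The name `rule_stats` and the tie-breaking order (by rule id) are illustrative choices.

```python
import re
from collections import Counter

# (rule id, label) -> pattern, copied from the rule listing in this post.
RULES = {
    (1, "DNA Start Codon"): r"ATG",
    (5, "DNA Euk ORF"): r"ATG(...)*?(TAG|TAA|TGA)",
    (10, "DNA? splice donor"): r"[CA](AG|GTA|GTG)AGT",
}

def rule_stats(rules, seq):
    """Return 'COUNT >rule|ID|label' lines, highest count first."""
    counts = Counter()
    for key, pattern in rules.items():
        # finditer yields non-overlapping matches, scanning left to right
        counts[key] = sum(1 for _ in re.finditer(pattern, seq))
    # Sort by descending count, then by rule id (an arbitrary tie-break).
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0][0]))
    return [f"{n} >rule|{rid}|{label}" for (rid, label), n in ordered if n]

stats = rule_stats(RULES, "ATGAAATAGCAGAGTATG")  # toy sequence
for line in stats:
    print(line)
```

Counting overlapping matches instead would need a lookahead-wrapped pattern, and could give noticeably different totals for short motifs.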
I'm still debugging this, but the initial alignment with rules was about what
I expected; now I'm working on automating the analysis and interpretation.
I've also got a bunch of test scripts that, for example, grab two random and
distinct pieces of dog or human genome and try to align or otherwise "match"
them, which is handy as a control and for finding sequences that occur a lot.
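The random-control idea can be sketched as follows. This is one possible way to pick two random, non-overlapping windows from a sequence, not the author's test script; the function name `two_distinct_windows`, the window length, and the toy sequence are all assumptions for illustration.

```python
import random

def two_distinct_windows(seq, window, rng):
    """Pick two random non-overlapping windows of equal length.

    Returns ((start1, end1), (start2, end2)) as half-open intervals,
    with the first window always to the left of the second.
    """
    slack = len(seq) - 2 * window
    if slack < 0:
        raise ValueError("sequence too short for two non-overlapping windows")
    # Draw two offsets in the 'slack' space; shifting the larger one by
    # one window length guarantees the windows cannot overlap.
    a, b = sorted(rng.randint(0, slack) for _ in range(2))
    return (a, a + window), (b + window, b + 2 * window)

rng = random.Random(0)          # fixed seed so the control is repeatable
genome = "ACGT" * 50            # stand-in for a real contig
(s1, e1), (s2, e2) = two_distinct_windows(genome, 20, rng)
piece1, piece2 = genome[s1:e1], genome[s2:e2]
print((s1, e1), (s2, e2))
```

The two pieces could then be fed to the same alignment step as real candidates, giving a baseline for how often unrelated sequence "matches" by chance.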