Hi Dan,

Thanks for the advice. I am afraid that what you have suggested is basically too hard for me to do without a lot of work for each entry. I don't know a way to automate the 90% match for thousands of entries vs. thousands of possible targets. I am thinking I could probably pull down a BLAST program and work with that, but that is more effort than I want to put in right now. Is there some "easy" way to do this?

I also don't see a way to analyze abstracts as you have described without looking at each one individually. Is there a way to automate something like this?

I have found several online servers which do predictions. That is working fairly well for me, and I think that is what I will go with. I will send a summary of what I found and used when I finish.

Thanks again,
Ethan

-----Original Message-----
From: biodevelopers-bounces+ethan.strauss=promega.com at bioinformatics.org [mailto:biodevelopers-bounces+ethan.strauss=promega.com at bioinformatics.org] On Behalf Of dmb at mrc-dunn.cam.ac.uk
Sent: Wednesday, November 23, 2005 3:07 PM
To: General discussions about software development in bioinformatics
Cc: biodevelopers at bioinformatics.org
Subject: Re: [Biodevelopers] Subcellular localization?

> Hi,
> I have a list of about 2500 accession numbers from Genbank Refseq.
> All of them are human coding sequences and I can easily get the
> complete sequence and other information from Genbank, but I can't
> figure out a way to get subcellular localization information. I have
> pulled some data from UniProt and from DBSubLoc
> (http://www.bioinfo.tsinghua.edu.cn/dbsubloc.html) and have been able
> to match about 10% of my sequences to subcellular localizations from
> these databases, but that still leaves about 90% unknown. One problem
> is that I can't find a way to match Genbank Accession # with the IDs
> in Swiss-Prot and DBSubLoc. I have just gone on sequence identity (so
> far I only call it a match when it is 100% identical).
> Do you have any ideas about how I can get subcellular localization
> info for the rest of my sequences?
> Thanks for any help or suggestions!
> Ethan

Hey Ethan.

What coverage do you get if you move to 90% identity matching? Dual localisation and 'localisation shift' in evolution could cause you problems, but my feeling is that very similar sequences will have similar localisations.

For the remainder you could try running software to predict localisation signal peptides. I have no idea which software is best, or how reliable these can be (it's a whole subfield in itself), but it is probably worth investigating as part of an overall assignment strategy.

One strategy I was thinking about for minimum effort (to help a colleague of mine) was to use the GI to PMID links in the Gene database at the NCBI, and then look up keywords in the article abstracts (or keywords section) in PubMed. So you could find all abstracts that mention words to do with experimental localisation techniques (I don't have a list, but we should make one somewhere - biowiki?) and specific localisations, and then link all those abstracts to genes. This is a very rough and ready approach, but it gives you (hopefully) a lot of data, so you can measure reliability of assignment by 'weight' of data for a certain gene.

So you may find 'fluorescence tagging' + 'endoplasmic reticulum' in a certain paper, which is linked to 5 genes by the Gene database at the NCBI.

Additionally you could use pre-computed GO annotation of PubMed articles to link to genes by PMID.

I think if done right these three approaches (homology, signal sequences, literature mining) should help you a lot, but I haven't tried any of them personally.

All the best,
Dan.

P.S. The GI <-> ACCN mapping is a perennial problem. Try searching UniParc?

> Ethan Strauss Ph.D.
> Bioinformatics Scientist
> Promega Corporation
> 2800 Woods Hollow Rd.
> Madison, WI 53711
> 608-274-4330
> 800-356-9526
> ethan.strauss at promega.com
>
> _______________________________________________
> Biodevelopers mailing list
> Biodevelopers at bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/biodevelopers
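
On the question of automating the abstract analysis: the keyword approach Dan describes could be sketched roughly as below. This is only an illustration under stated assumptions, not something anyone in the thread has run. The keyword lists, the function name score_localisations, and the example data are all hypothetical, and the sketch assumes the abstracts and the gene-to-PMID links have already been downloaded into plain Python dictionaries. Matching is simple lowercase substring search, so it will miss synonyms and will count negated statements as hits.

```python
from collections import Counter, defaultdict

# Hypothetical keyword lists; the thread notes that no agreed list exists yet.
TECHNIQUE_TERMS = ("fluorescence tagging", "immunofluorescence", "gfp fusion")
LOCALISATION_TERMS = ("endoplasmic reticulum", "nucleus", "mitochondri", "golgi")


def score_localisations(abstracts, gene_to_pmids):
    """Count, per gene, the linked abstracts that mention both an
    experimental technique term and a localisation term.

    abstracts:     dict mapping PMID -> abstract text
    gene_to_pmids: dict mapping gene identifier -> list of PMIDs
    Returns a dict mapping gene -> Counter of localisation term -> count.
    """
    scores = defaultdict(Counter)
    for gene, pmids in gene_to_pmids.items():
        for pmid in pmids:
            text = abstracts.get(pmid, "").lower()
            # Skip abstracts with no sign of a localisation experiment.
            if not any(term in text for term in TECHNIQUE_TERMS):
                continue
            for loc in LOCALISATION_TERMS:
                if loc in text:
                    scores[gene][loc] += 1
    return scores


# Made-up example data, purely for illustration.
example_abstracts = {
    "1": "Fluorescence tagging showed the protein in the endoplasmic reticulum.",
    "2": "GFP fusion constructs localised to the endoplasmic reticulum and Golgi.",
    "3": "We cloned the gene and measured expression.",
}
example_links = {"geneA": ["1", "2"], "geneB": ["3"]}

example_scores = score_localisations(example_abstracts, example_links)
for gene, counts in example_scores.items():
    print(gene, counts.most_common())
```

The count per localisation plays the role of the 'weight' of evidence: a gene supported by several independent abstracts for the same compartment is a stronger assignment than one supported by a single paper that is linked to many genes.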