> Hi,
> I have a list of about 2500 accession numbers from Genbank Refseq.
> All of them are human coding sequences and I can easily get the complete
> sequence and other information from Genbank, but I can't figure out a
> way to get subcellular localization information. I have pulled some data
> from UniProt and from DBSubLoc
> (http://www.bioinfo.tsinghua.edu.cn/dbsubloc.html) and have been able to
> match about 10% of my sequences to subcellular localizations from these
> databases, but that still leaves about 90% unknown. One problem is that
> I can't find a way to match Genbank Accession # with the IDs in
> Swiss-Prot and DBSubLoc. I have just gone on sequence identity (So far I
> only call it a match when it is 100% identical).
> Do you have any ideas about how I can get subcellular localization info
> for the rest of my sequences?
> Thanks for any help or suggestions!
> Ethan

Hey Ethan.

What coverage do you get if you move to 90% identity matching? Dual
localisation and 'localisation shift' in evolution could cause you
problems, but my feeling is that very similar sequences will have
similar localisations.

For the remainder you could try running software to predict localisation
signal peptides. I have no idea which software is best, or how reliable
these tools can be (it's a whole subfield in itself), but it is probably
worth investigating as part of an overall assignment strategy.

One strategy I was thinking about for minimum effort (to help a
colleague of mine) was to use the GI to PMID links in the Gene database
at the NCBI, and then look up keywords in the article abstracts (or
keywords section) in PubMed. So you could find all abstracts that
mention words to do with experimental localisation techniques (I don't
have a list, but we should make one somewhere, maybe biowiki?) and
specific localisations, and then link all those abstracts to genes.
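
A minimal sketch of the keyword idea above, assuming you have already
pulled the gene to PMID links and the abstract texts into Python dicts.
The term lists, the function name and the toy data are all made up for
illustration, and plain substring matching is only the simplest possible
stand-in for real text mining.

```python
from collections import Counter, defaultdict

# Illustrative keyword lists (assumptions, not a curated vocabulary).
TECHNIQUE_TERMS = ["fluorescence tagging", "gfp fusion", "immunofluorescence"]
LOCALISATION_TERMS = ["endoplasmic reticulum", "nucleus", "mitochondri", "golgi"]


def score_localisations(gene_to_pmids, abstracts):
    """Count, per gene, the abstracts that support each localisation.

    gene_to_pmids: dict of gene ID -> iterable of PMIDs (e.g. Gene links).
    abstracts: dict of PMID -> abstract text.
    An abstract counts as evidence only if it mentions at least one
    technique term and the localisation term (case-insensitive substring).
    """
    evidence = defaultdict(Counter)
    for gene, pmids in gene_to_pmids.items():
        for pmid in set(pmids):  # ignore duplicate links to the same paper
            text = abstracts.get(pmid, "").lower()
            if not any(term in text for term in TECHNIQUE_TERMS):
                continue  # no experimental technique mentioned, skip
            for loc in LOCALISATION_TERMS:
                if loc in text:
                    evidence[gene][loc] += 1
    return evidence


if __name__ == "__main__":
    # Made-up toy data, purely to show the shape of the inputs.
    gene_to_pmids = {"geneA": [1, 2], "geneB": [2, 3]}
    abstracts = {
        1: "Fluorescence tagging shows the protein in the endoplasmic reticulum.",
        2: "Immunofluorescence places it in the endoplasmic reticulum and Golgi.",
        3: "A review of nucleus import pathways.",
    }
    for gene, counts in score_localisations(gene_to_pmids, abstracts).items():
        print(gene, counts.most_common())
```

The per-gene counts are a crude 'weight' of evidence: a localisation
seen in several technique-mentioning abstracts is more believable than
one seen once, and a paper linked to many genes should probably be
trusted less for each of them.
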
This is a very rough and ready approach, but it gives you (hopefully) a
lot of data, so you can measure reliability of assignment by 'weight' of
data for a certain gene. So you may find 'fluorescence tagging' +
'endoplasmic reticulum' in a certain paper, which is linked to 5 genes
by the Gene database at the NCBI. Additionally you could use
pre-computed GO annotation of PubMed articles to link to genes by PMID.

I think if done right these three approaches (homology, signal
sequences, literature mining) should help you a lot, but I haven't tried
any of them personally.

All the best,
Dan.

P.S. The GI <-> ACCN mapping is a perennial problem. Try searching
UniParc?

> Ethan Strauss Ph.D.
> Bioinformatics Scientist
> Promega Corporation
> 2800 Woods Hollow Rd.
> Madison, WI 53711
> 608-274-4330
> 800-356-9526
> ethan.strauss at promega.com