Hi Tristan,

I don't have much experience with this stuff, but since no one else has piped up:

Assuming these are human sequences, I'd take a non-redundant set of human protein sequences and compare each EST sequence against the protein sequences using blastx. A blastn against annotated CDS (nucleotide) sequences would probably be preferable, but I'm not sure where you can get a nicely curated set of those, and the protein sequences would probably do the trick.

Bioperl has modules to parse the blast output (and to run it, in fact, but since it's just one format and one big query file, I'd just run it manually). With Bioperl parsing, it's easy to write a Perl script that tallies up the best hits by protein ID and reports totals. You can then examine the results to see whether the most redundant hits are good choices.

Am I missing something, though? What's the advantage of assembling them first?

--Nancy

*************************************
Nancy F. Hansen, PhD
nhansen at nhgri.nih.gov
Bioinformatics Group
NIH Intramural Sequencing Center (NISC)
8717 Grovemont Circle, Rm. 152L
Gaithersburg, MD 20877
Phone: (301) 435-1560
Fax: (301) 435-6170

On Tue, 16 Dec 2003 ssml-general-request at bioinformatics.org wrote:

> Message: 1
> Date: Tue, 16 Dec 2003 09:20:15 -0000 (GMT)
> From: "Dan Bolser" <dmb at mrc-dunn.cam.ac.uk>
> To: <ssml-general at bioinformatics.org>
> Subject: [ssml] [Fwd: Re: Request to mailing list ssml-general rejected]
>
> Using the current state of the art bioinformatics tools/software, what is the
> preferred method of *identifying EST sequences* for the subtraction procedure of a
> cDNA library ?
>
> In order to decrease the abundant messages which dominate cDNA libraries, I hope
> to identify the longest, most abundant, and annotatable (based on e.g. swissprot)
> ESTs. I would like to get expert opinions on how to most effectively go about it.
> I have several thousand ESTs and would, for at least this first round, like to
> identify 96 clones which are the most abundant/longest/annotatable.
>
> Approaches I have considered are :
>
> 1. Running the entire dataset through CAP3 to produce contigs. Then take the
> consensus sequence for each contig and run a blastp against swissprot to see if it is
> annotatable.
> 2. Running an all against all blast search using the ESTs as both the query and
> the database. Additionally, one could make the database a combination of both the
> ESTs and swissprot, thus indicating not only which sequences have
> similar/identical matches within the EST database, but also whether they have a
> homolog in swissprot
>
> Does anything exist in bioperl which performs the necessary sequence analysis for
> subtraction of a cDNA library?
>
> BTW, if these are not the correct listserv/bulletin boards for such a query,
> please let me know the preferred location.
>
> Thank you and Happy Holidays!
>
> Tristan Fiedler
>
> --
> Tristan J. Fiedler, Ph.D.
> Postdoctoral Research Fellow - Walsh Laboratory
> NIEHS Marine & Freshwater Biomedical Sciences Center
> Rosenstiel School of Marine & Atmospheric Sciences
> University of Miami
>
> tfiedler at rsmas.miami.edu
> t.fiedler at umiami.edu (alias)
> 305-361-4626
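
P.S. For what it's worth, here is a rough sketch of the tallying step suggested at the top of this message (best blastx hit per EST, counted by protein ID). It is written in Python rather than Perl/Bioperl, and it assumes blastx was run with tab-separated tabular output in the usual 12-column layout (query, subject, % identity, alignment length, mismatches, gap opens, query start/end, subject start/end, e-value, bit score). The function name tally_best_hits and the 1e-5 e-value cutoff are illustrative choices only, not part of any library.

```python
from collections import Counter


def tally_best_hits(lines, max_evalue=1e-5):
    """Count how many ESTs have each protein as their best blastx hit.

    `lines` is any iterable of rows in BLAST tabular format (assumed
    12 tab-separated columns, with e-value and bit score last).
    """
    best = {}  # EST id -> (bit score, protein id) of its best hit so far
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # not a complete hit row
        est_id, protein_id = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue  # hit too weak to count (cutoff is an assumption)
        if est_id not in best or bitscore > best[est_id][0]:
            best[est_id] = (bitscore, protein_id)
    # Each EST contributes one vote, for its best-scoring protein.
    return Counter(protein_id for _, protein_id in best.values())
```

Calling tally_best_hits(open("ests_vs_proteins.tsv")).most_common(96) would then list the 96 proteins hit by the most ESTs (the file name here is hypothetical). Counting one vote per EST keeps a single EST with many weak secondary hits from inflating the totals.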