[BiO BB] cDNA Library Subtraction - Bioinformatics
Chris Dwan (CCGB)
cdwan at mail.ahc.umn.edu
Sat Dec 13 17:27:16 EST 2003
> Using the current state of the art bioinformatics tools/software, what is
> the preferred method of *identifying EST sequences* for the subtraction
> procedure of a cDNA library ?
This is an interesting question to me, since the answer is so clearly a
protocol combining a variety of existing tools rather than a single
general tool. BioPerl is an excellent framework for scripting such
> In order to decrease the abundant messages which dominate cDNA libraries,
> I hope to identify the longest, most abundant, and annotatable (based on
> e.g. swissprot) ESTs. I would like to get expert opinions on how to most
> effectively go about it. I have several thousand ESTs and would, for at
> least this first round, like to identify 96 clones which are the most
We have a pipeline to perform almost exactly the opposite analysis (find
novel genes with no obvious homologs), some parts of which might be
useful to you:
1) Perform quality analysis on each EST
- Trim low-quality bases from both ends until the sequence is
at most K percent ambiguous bases. K varies depending on the
- Look for the primer / linker site at each end of the sequence and
remove it if found. Leaving these in makes for *great*
anchors for spurious assemblies of contigs.
- BLAST against E. coli as well as popular phage sequences to look for
- BLAST against the human chromosomes to look for contamination
2) Assemble contigs
- We use phrap, mostly because we have some expertise with it.
There are other options available. My opinion is that it's better
to use a tool that is well understood at your lab than to try to
learn an unknown that may or may not be better.
3) (optional) Go through the contigs and break up those that do
not have good support across their entire length. This is currently
a real pain, but hopefully we'll have an automated system trained
"real soon now."
4) BLASTX contigs vs PIR-NREF (again, a local favorite). Remove anything
that can be annotated this way from further steps.
5) TBLASTX contigs vs. NCBI NT.
6) Further manual analysis using HMMER and other tools.
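As a rough illustration of the first two bullets of step 1, here is a minimal Python sketch (not the pipeline described above). The function names, the default K of 5 percent, the treatment of only N as ambiguous, the exact-match linker search, and the 30-base search window are all assumptions made for the example. A production version would more likely work from base quality scores and tolerate mismatches in the linker.

```python
def trim_ambiguous_ends(seq, max_ambiguous_pct=5.0, ambiguous="Nn"):
    """Trim both ends until at most max_ambiguous_pct of the bases are ambiguous.

    Hypothetical helper: each pass removes the ambiguous base nearest to
    either end, together with everything outside it.
    """
    start, end = 0, len(seq)
    n_amb = sum(1 for base in seq if base in ambiguous)
    while n_amb > 0 and 100.0 * n_amb / (end - start) > max_ambiguous_pct:
        # Locate the outermost ambiguous base on each side.
        first = next(i for i in range(start, end) if seq[i] in ambiguous)
        last = next(i for i in range(end - 1, start - 1, -1) if seq[i] in ambiguous)
        if first - start <= (end - 1) - last:
            start = first + 1   # cut from the 5' end through 'first'
        else:
            end = last          # cut from 'last' through the 3' end
        n_amb -= 1
    return seq[start:end]


def strip_linkers(seq, five_prime, three_prime, window=30):
    """Remove an exact-match linker found near either end of the sequence.

    Hypothetical helper: 'window' is how far from each end the linker
    is allowed to start (5') or finish (3').
    """
    upper = seq.upper()
    pos = upper.find(five_prime.upper(), 0, window + len(five_prime))
    if pos != -1:
        seq = seq[pos + len(five_prime):]
        upper = seq.upper()
    tail_start = max(0, len(seq) - window - len(three_prime))
    pos = upper.rfind(three_prime.upper(), tail_start)
    if pos != -1:
        seq = seq[:pos]
    return seq
```

Calling strip_linkers before trim_ambiguous_ends, or the other way round, is a design choice this sketch does not settle; it depends on how noisy the read ends are.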
> Does anything exist in bioperl which performs the necessary sequence
> analysis for subtraction of a cDNA library?
All of these piece-parts can be scripted using BioPerl, but I'm not aware
of any single general tool that does exactly what you're looking for.
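To give a flavor of what such scripting looks like for the original question (the most abundant, longest, annotatable sequences), here is a small sketch in plain Python rather than BioPerl. It assumes the BLASTX results were saved in the 12-column tab-delimited BLAST output, with the query ID in column 1 and the e-value in column 11. The function names, the e-value cutoff of 1e-10, and the contig table mapping each contig ID to (number of ESTs, length) are hypothetical; building that table from the assembler output is not shown.

```python
def annotated_queries(blast_tab_lines, max_evalue=1e-10):
    """Return the IDs of queries with at least one hit at or below max_evalue."""
    hits = set()
    for line in blast_tab_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.rstrip("\n").split("\t")
        if float(fields[10]) <= max_evalue:
            hits.add(fields[0])
    return hits


def pick_abundant_contigs(contigs, annotated, n=96):
    """Pick up to n annotated contigs, most ESTs first, then longest first.

    contigs: dict mapping contig ID -> (est_count, length); hypothetical layout.
    """
    candidates = [(cid, info) for cid, info in contigs.items() if cid in annotated]
    # Sort by descending EST count, then descending length, then ID for stability.
    candidates.sort(key=lambda item: (-item[1][0], -item[1][1], item[0]))
    return [cid for cid, _ in candidates[:n]]
```

Ranking on EST count before length is only one way to weigh the criteria from the question; a combined score would be equally easy to swap in.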
I am very interested to hear about how other shops do their EST analysis
University of Minnesota