[BiO BB] cDNA Library Subtraction - Bioinformatics

Sat Dec 13 17:27:16 EST 2003

> Using the current state of the art bioinformatics tools/software, what is
> the preferred method of *identifying EST sequences* for the subtraction
> procedure of a cDNA library ?

This is an interesting question to me, since the answer is so clearly a
protocol combining a variety of existing tools rather than a single
general tool.  BioPerl is an excellent framework for scripting such
protocols.

> In order to decrease the abundant messages which dominate cDNA libraries,
> I hope to identify the longest, most abundant, and annotatable (based on
> e.g. swissprot) ESTs.  I would like to get expert opinions on how to most
> effectively go about it.  I have several thousand ESTs and would, for at
> least this first round, like to identify 96 clones which are the most
> abundant/longest/annotatable.

We have a pipeline to perform almost exactly the opposite analysis (find
novel genes with no obvious homologs), some parts of which might be
useful to you:

1) Perform quality analysis on each EST
   - Trim low-quality bases from both ends of each read until the
     remaining sequence is at most K percent ambiguous bases.  K
     varies depending on the experiment.
   - Look for the primer / linker site at each end of the sequence and
     remove it if found.  Leaving these in makes for *great*
     anchors for spurious assemblies of contigs.
   - BLAST against E. coli as well as popular phage sequences to look for
     obvious contamination.
   - BLAST against the human chromosomes to look for contamination.

2) Assemble contigs
   - We use phrap, mostly because we have some expertise with it.
     There are other options available.  My opinion is that it's better
     to use a tool that is well understood at your lab than to try to
     learn an unknown that may or may not be better.

3) (optional) go through the contigs and break up those that do
   not have good support across their entire length.  This is currently
   a real pain, but hopefully we'll have an automated system trained
   "real soon now."

4) BLASTX contigs vs. PIR-NREF (again, a local favorite).  Remove
   anything that can be annotated this way from further steps.

5) TBLASTX contigs vs. NCBI NT.

6) Further manual analysis using HMMER and other tools.
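
To make the end-trimming part of step 1 concrete, here is a minimal
Python sketch.  It is not our pipeline code.  It assumes that
"ambiguous" means N calls only, and the greedy rule (look at a small
window at each end and drop one base from the worse end) is just one
plausible reading of "trim from both ends".  The names trim_ends,
max_ambiguous_pct and window are made up for illustration.

```python
# Assumption: only N/n count as ambiguous; extend this set for IUPAC codes.
AMBIGUOUS = set("Nn")


def trim_ends(seq, max_ambiguous_pct, window=10):
    """Trim bases from the ends of seq until at most max_ambiguous_pct
    percent of the remaining bases are ambiguous.

    At each step the end whose terminal `window` bases hold more
    ambiguous calls loses one base; ties trim the 3' end.
    """
    start, end = 0, len(seq)
    n_count = sum(1 for base in seq if base in AMBIGUOUS)

    while end > start and 100.0 * n_count / (end - start) > max_ambiguous_pct:
        left = sum(1 for b in seq[start:min(start + window, end)]
                   if b in AMBIGUOUS)
        right = sum(1 for b in seq[max(end - window, start):end]
                    if b in AMBIGUOUS)
        if left > right:
            # Drop one base from the 5' end.
            if seq[start] in AMBIGUOUS:
                n_count -= 1
            start += 1
        else:
            # Drop one base from the 3' end.
            end -= 1
            if seq[end] in AMBIGUOUS:
                n_count -= 1

    return seq[start:end]


if __name__ == "__main__":
    print(trim_ends("NNNNACGTACGTACGTNN", max_ambiguous_pct=0))
```

Note that with a strict K, an ambiguous base in the middle of a read
makes this rule trim all the way in to it from one side, so a real
implementation would probably want quality scores rather than N counts
alone.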

> Does anything exist in bioperl which performs the necessary sequence
> analysis for subtraction of a cDNA library?

All of these piece-parts can be scripted using BioPerl, but I'm not aware
of any single general tool that does exactly what you're looking for.
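
As one illustration of scripting a single piece (step 4 above), the
sketch below splits contigs into "annotatable" and "everything else"
based on BLASTX hits.  It is written in Python rather than BioPerl, and
it assumes the BLAST output is the 12-column tab-delimited format
(query id in column 1, e-value in column 11).  The function names and
the 1e-10 cutoff are illustrative assumptions, not recommendations.
For a subtraction experiment the annotated set is presumably the one to
keep, whereas our pipeline carries the other set forward.

```python
def annotated_queries(tabular_lines, max_evalue=1e-10):
    """Return the set of query ids with at least one hit whose e-value
    is at or below max_evalue.

    Assumes 12-column tab-delimited BLAST output:
    query, subject, %identity, length, mismatches, gap opens,
    q.start, q.end, s.start, s.end, e-value, bit score.
    """
    hits = set()
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # not a hit line in the assumed format
        evalue_text = fields[10]
        # Defensive: treat an abbreviated "e-100" as "1e-100".
        if evalue_text.lower().startswith("e"):
            evalue_text = "1" + evalue_text
        if float(evalue_text) <= max_evalue:
            hits.add(fields[0])
    return hits


def split_contigs(contig_ids, tabular_lines, max_evalue=1e-10):
    """Partition contig ids into (annotated, unannotated), keeping the
    input order."""
    annotated = annotated_queries(tabular_lines, max_evalue)
    with_hit = [c for c in contig_ids if c in annotated]
    without_hit = [c for c in contig_ids if c not in annotated]
    return with_hit, without_hit


if __name__ == "__main__":
    # Made-up example data, not real BLAST results.
    example = [
        "contig1\tsubjA\t98.0\t120\t2\t0\t1\t360\t5\t124\t1e-50\t200",
        "contig2\tsubjB\t35.0\t60\t39\t0\t1\t180\t10\t69\t0.5\t30.1",
    ]
    print(split_contigs(["contig1", "contig2", "contig3"], example))
```

Contigs that never appear in the BLAST output (contig3 in the made-up
data) land in the unannotated list along with those whose hits miss the
cutoff.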

I am very interested to hear about how other shops do their EST analysis
these days.

-Chris Dwan
 University of Minnesota