[Bioclusters] Nomenclature (was Re: Call for information.)
Chris Dwan (CCGB)
bioclusters@bioinformatics.org
Fri, 19 Apr 2002 14:51:35 -0500 (CDT)
> Proposal 0: blast human chromosome 22 (query) against the genome of
> the pufferfish (database). Both sequences repeat-masked, E-value 10^4.
I think that there are two things to remember when we're building this
set of "standard" tests:
* BLAST is an approximation. A lot of folks are out there comparing
their accelerated / MPI / super-sensitive / java-implemented
homology search algorithm against NCBI's BLAST. When there are
differences, it's impossible to tell whether they represent
added sensitivity, lost specificity, or simply noise.
It's really shocking to me that peer-reviewed publications accept
and publish what amount to comparisons without controls.
Instead, BOTH approximations should be compared to some complete
algorithm implementing a search in the same space. For BLAST, one
of these is the 1981 work by Smith & Waterman. For others, it's
less well defined.
Better yet would be a comparison to some well annotated set of
biologically "correct" homologies. These would be laboratory
verified, biologist approved homologs. This methodology was used
for creating the original PAM and BLOSUM matrixes, but seems to have
fallen by the wayside with the data explosion of the last decade.
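To make the idea of a "complete" control concrete, here is a minimal sketch of the Smith-Waterman local alignment score from the 1981 paper mentioned above. The scoring values (match/mismatch/gap) are illustrative assumptions, not parameters from the post; the point is that the full dynamic-programming matrix is filled, so the optimal local alignment score is guaranteed, unlike BLAST's heuristic seeding.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Exhaustive local alignment score (Smith & Waterman, 1981).

    Fills the full (len(a)+1) x (len(b)+1) DP matrix, so unlike a
    heuristic like BLAST it cannot miss the optimal local alignment.
    Scoring parameters here are illustrative placeholders.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```

The O(n*m) cost is exactly why this is only practical as a control on small test sets, not as a production search.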
* (more important) We computer folks need to address biologically
interesting questions, not just computationally interesting ones.
Sure, chromosomes are the biggest biological strings we've got, and
BLAST is the hammer that's at the top of the toolbox. Does that
really make it the appropriate tool for the task?
Unless you're pretty clever about your BLAST parameters, all this
test is going to show is some well-documented weaknesses of BLAST
when you hit it with queries of ridiculous size.
Those weaknesses are there because it was never designed (the
algorithm, not the implementation) with chromosomes in mind.
"Local Alignments" were the goal, not large scale genomic
archeology.
That said, here are some of the questions that I ask when I get hold
of a new, great, wonderful sequence-based homology tool. I run ALL of
these and see where the new tool shines. Then I describe the tool to
my users and watch to see if there's any interest.
None of the ones I've tried are good at all of these cases. Some are
good for none. :) I don't claim that it's a complete list, but it's
a start.
0) Query: EST (500-800bp, single-pass sequencing, meaning they're
positively riddled with errors)
Target: EST-unigene set from a single organism. I like to use
Medicago. About 140,000 EST reads, which collapse to between
30,000 and 40,000 contigs.
Search: Forward strand vs. forward strand. Theoretically,
we know the reading frame for mRNA based clones.
Results Criteria: Accuracy vs. Smith & Waterman results
Performance Criteria: Response time for a single query; throughput
for large batch queries.
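The accuracy criterion above can be scored by treating the exhaustive search's hit set as ground truth and asking what fraction of reference hits the tool found (sensitivity) and what fraction of its reported hits are real (precision). This is a hypothetical helper, not part of any tool named in the post; hit identifiers are assumed to be comparable IDs such as target sequence names.

```python
def hit_accuracy(tool_hits, reference_hits):
    """Score a heuristic tool's hit set against an exhaustive control.

    `reference_hits` plays the role of the Smith-Waterman results from
    the post; both arguments are collections of hit identifiers.
    Returns (sensitivity, precision).
    """
    tool, ref = set(tool_hits), set(reference_hits)
    tp = len(tool & ref)
    sensitivity = tp / len(ref) if ref else 1.0
    precision = tp / len(tool) if tool else 1.0
    return sensitivity, precision
```

A hit missing from the reference set could still be biologically real noise in the control, which is why the post prefers laboratory-verified homologs as the gold standard.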
1) Query: Protein
Target: NCBI NR
Just like above. This one doesn't separate things out at
2) Query: EST reads
Target: Whole Chromosome (I generally use Arabidopsis, since we're
a plant lab).
Performance Criteria: Response time. Batch throughput
Results Criteria: Hits found vs. a target set where I used a
sliding window to chop up the chromosome into sequences of
10,000bp. You'd probably be surprised at the differences in the
hits.
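The target set for test 2 can be built with something like the following sketch. The 10,000 bp window size comes from the post; the 50% overlap (step) is an assumption, added because hits spanning a window boundary would otherwise be lost.

```python
def sliding_windows(seq, size=10_000, step=5_000):
    """Chop a chromosome into overlapping fixed-size windows.

    `size` (10,000 bp) is from the post; `step` is an assumed 50%
    overlap so alignments straddling a window edge still fall wholly
    inside some window. Yields (start_offset, subsequence) pairs.
    """
    for start in range(0, max(len(seq) - size, 0) + 1, step):
        yield start, seq[start:start + size]
```

Keeping the start offset with each window makes it possible to map hit coordinates back onto the original chromosome afterward.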
3) Query: Whole chromosome
Target: Whole chromosome
Performance Criteria: Response time.
Results Criteria: Chop up both chromosomes. This one gets
really hairy when you try to define the "right" answers. You
start to encounter all those good "shadowing" and normalization
questions that get really confusing.
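Several of the test cases above distinguish single-query response time from batch throughput, and those are genuinely different measurements (a tool with heavy startup or database-load cost can have poor latency but fine throughput). A minimal timing harness, assuming `search_fn` is a placeholder callable standing in for whatever tool is under test:

```python
import time

def latency_and_throughput(search_fn, queries):
    """Measure single-query latency and whole-batch throughput.

    `search_fn` is a hypothetical stand-in for the homology tool
    under test, not a real BLAST binding. Returns
    (seconds for one query, queries completed per second).
    """
    t0 = time.perf_counter()
    search_fn(queries[0])
    latency = time.perf_counter() - t0

    t0 = time.perf_counter()
    for q in queries:
        search_fn(q)
    elapsed = time.perf_counter() - t0
    return latency, len(queries) / elapsed
```

In practice one would repeat both measurements and report medians, since a single wall-clock sample on a shared cluster node is noisy.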
Thanks for listening.
-Chris Dwan
University of Minnesota