[Bioclusters] Nomenclature (was Re: Call for information.)
Chris Dwan (CCGB)
bioclusters@bioinformatics.org
Fri, 19 Apr 2002 14:51:35 -0500 (CDT)
> Proposal 0: blast human chromosome 22 (query) against the genome of
> the pufferfish (database). Both sequences repeat-masked, E-value 10^4.
I think that there are two things to remember when we're building this
set of "standard" tests:
* BLAST is an approximation. A lot of folks are out there comparing
their accelerated / MPI / super-sensitive / java-implemented
homology search algorithm against NCBI's BLAST. When there are
differences, it's impossible to tell whether they represent
added sensitivity, lost specificity, or simply noise.
It's really shocking to me that peer-reviewed publications accept
and publish what amount to comparisons without controls.
Instead, BOTH approximations should be compared to some complete
algorithm implementing a search in the same space. For BLAST, one
of these is the 1981 work by Smith & Waterman. For others, it's
less well defined.
Better yet would be a comparison to some well annotated set of
biologically "correct" homologies. These would be laboratory
verified, biologist approved homologs. This methodology was used
for creating the original PAM and BLOSUM matrixes, but seems to have
fallen by the wayside with the data explosion of the last decade.
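To make the idea of a "complete" control concrete, here is a minimal sketch of the Smith-Waterman local alignment score from the 1981 paper mentioned above. The scoring values (match/mismatch/gap) are illustrative assumptions, not parameters from the post; the point is that the full dynamic-programming matrix is filled, so the optimal local alignment score is guaranteed, unlike BLAST's heuristic seeding.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Exhaustive local alignment score (Smith & Waterman, 1981).

    Fills the full (len(a)+1) x (len(b)+1) DP matrix, so unlike a
    heuristic like BLAST it cannot miss the optimal local alignment.
    Scoring parameters here are illustrative placeholders.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```

The O(n*m) cost is exactly why this is only practical as a control on small test sets, not as a production search.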
* (more important) We computer folks need to address biologically
interesting questions, not just computationally interesting ones.
Sure, chromosomes are the biggest biological strings we've got, and
BLAST is the hammer that's at the top of the toolbox. Does that
really make it the appropriate tool for the task?
Unless you're pretty clever about your BLAST parameters, all this
test is going to show is some well-documented weaknesses of BLAST
when you hit it with queries of ridiculous size.
Those weaknesses are there because it was never designed (the
algorithm, not the implementation) with chromosomes in mind.
"Local Alignments" were the goal, not large scale genomic
archeology.
That said, here are some of the questions that I ask when I get hold
of a new, great, wonderful sequence-based homology tool. I run ALL of
these and see where the new tool shines. Then I describe the tool to
my users and watch to see if there's any interest.
None of the ones I've tried are good at all of these cases. Some are
good for none. :) I don't claim that it's a complete list, but it's
a start.
0) Query: EST (500-800bp, single-pass sequencing, meaning they're
positively riddled with errors)
Target: EST-unigene set from a single organism. I like to use
Medicago. About 140,000 EST reads, which collapse to between
30,000 and 40,000 contigs.
Search: Forward strand vs. forward strand. Theoretically,
we know the reading frame for mRNA based clones.
Results Criteria: Accuracy vs. Smith & Waterman results
Performance Criteria: Response time for a single query; throughput
for large batch queries.
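The accuracy criterion above can be scored by treating the exhaustive search's hit set as ground truth and asking what fraction of reference hits the tool found (sensitivity) and what fraction of its reported hits are real (precision). This is a hypothetical helper, not part of any tool named in the post; hit identifiers are assumed to be comparable IDs such as target sequence names.

```python
def hit_accuracy(tool_hits, reference_hits):
    """Score a heuristic tool's hit set against an exhaustive control.

    `reference_hits` plays the role of the Smith-Waterman results from
    the post; both arguments are collections of hit identifiers.
    Returns (sensitivity, precision).
    """
    tool, ref = set(tool_hits), set(reference_hits)
    tp = len(tool & ref)
    sensitivity = tp / len(ref) if ref else 1.0
    precision = tp / len(tool) if tool else 1.0
    return sensitivity, precision
```

A hit missing from the reference set could still be biologically real noise in the control, which is why the post prefers laboratory-verified homologs as the gold standard.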
1) Query: Protein
Target: NCBI NR
Just like above. This one doesn't separate things out at
2) Query: EST reads
Target: Whole Chromosome (I generally use Arabidopsis, since we're
a plant lab).
Performance Criteria: Response time. Batch throughput
Results Criteria: Hits found vs. a target set where I used a
sliding window to chop up the chromosome into sequences of
10,000bp. You'd probably be surprised at the differences in the
hits.
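The target set for test 2 can be built with something like the following sketch. The 10,000 bp window size comes from the post; the 50% overlap (step) is an assumption, added because hits spanning a window boundary would otherwise be lost.

```python
def sliding_windows(seq, size=10_000, step=5_000):
    """Chop a chromosome into overlapping fixed-size windows.

    `size` (10,000 bp) is from the post; `step` is an assumed 50%
    overlap so alignments straddling a window edge still fall wholly
    inside some window. Yields (start_offset, subsequence) pairs.
    """
    for start in range(0, max(len(seq) - size, 0) + 1, step):
        yield start, seq[start:start + size]
```

Keeping the start offset with each window makes it possible to map hit coordinates back onto the original chromosome afterward.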
3) Query: Whole chromosome
Target: Whole chromosome
Performance Criteria: Response time.
Results Criteria: Chop up both chromosomes. This one gets
really hairy when you try to define the "right" answers. You
start to encounter all those good "shadowing" and normalization
questions that get really confusing.
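Several of the test cases above distinguish single-query response time from batch throughput, and those are genuinely different measurements (a tool with heavy startup or database-load cost can have poor latency but fine throughput). A minimal timing harness, assuming `search_fn` is a placeholder callable standing in for whatever tool is under test:

```python
import time

def latency_and_throughput(search_fn, queries):
    """Measure single-query latency and whole-batch throughput.

    `search_fn` is a hypothetical stand-in for the homology tool
    under test, not a real BLAST binding. Returns
    (seconds for one query, queries completed per second).
    """
    t0 = time.perf_counter()
    search_fn(queries[0])
    latency = time.perf_counter() - t0

    t0 = time.perf_counter()
    for q in queries:
        search_fn(q)
    elapsed = time.perf_counter() - t0
    return latency, len(queries) / elapsed
```

In practice one would repeat both measurements and report medians, since a single wall-clock sample on a shared cluster node is noisy.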
Thanks for listening.
-Chris Dwan
University of Minnesota