[Bioclusters] Apple/Genentech BLAST

Thu, 23 May 2002 15:10:48 -0500 (CDT)

Ivo,

How is this benchmark useful?

The number of available sequenced chromosomes grows very
slowly. Unless I'm missing something really fundamental, that means
that the number of possible searches of this type must also grow
fairly slowly (as the square of a slowly growing number, in fact).

If we cache results, I think chromosome-on-chromosome BLASTs will
never be a dominant part of the computational load.  This is good
unless we simply want to show how hard we can make the servers work by
doing PRECISELY the same job over and over again.

If I'm wrong about this, I hope that someone will let me know in what
way.  

Anyway, assuming that this *is* the sort of job we want to run, can
we be smarter than just dumping in two sequences (one formatdb'd) and
letting it run?  I would rather benchmark the smart way of doing it,
at the very least.

The run you describe will return the top 500 (by default) local
alignments between the two chromosomes.  This is fine and good, but it
represents a tiny fraction of the similarities that are interesting to
the genomic scientists with whom I work.

An example:  one of my users really likes to make chromosome /
chromosome similarity maps (sparse matrices) for what he calls "genome
archeology." This is THE classic reason for doing genome on genome
BLASTs.  We do it by breaking each chromosome into overlapping chunks
of arbitrary size (say, 10,000bp with 1,000bp overlap) and then doing
an "all vs. all" set of BLASTs.
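
A minimal sketch of the chunking step, assuming Python and a
chromosome already loaded into memory as a plain string.  The function
name overlapping_chunks is made up for illustration, and the defaults
are just the example numbers above:

```python
from typing import Iterator, Tuple


def overlapping_chunks(
    seq: str, size: int = 10_000, overlap: int = 1_000
) -> Iterator[Tuple[int, str]]:
    """Yield (start_offset, chunk) pairs that cover seq.

    Consecutive chunks share `overlap` bases, so an alignment that
    straddles a chunk boundary still falls inside one chunk as long
    as it is shorter than the overlap.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    if not seq:
        return
    step = size - overlap  # distance between consecutive chunk starts
    start = 0
    while True:
        yield start, seq[start:start + size]
        if start + size >= len(seq):
            break  # this chunk reached the end of the sequence
        start += step
```

Keeping the start offset with each chunk lets hit coordinates be mapped
back onto the whole chromosome afterwards.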

This way of setting up the problem has three benefits:
------------------------------------------------------
 * It avoids any weirdness resulting from BLASTing at the megabase
   scale (which I've observed, but never taken the time to figure out
   in detail because the workaround is so simple).

 * It turns the problem from a worst case scenario for BLAST into one
   that's trivially parallelizable.

 * It produces a far richer set of results than the top 500 local
   alignments overall.
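
To illustrate the "trivially parallelizable" point: every (query
chunk, target chunk) pair is an independent job, so the work can be
dealt out to workers with no coordination between them.  A rough
sketch, assuming Python; the function names and chunk labels are
hypothetical, and the actual BLAST invocation is left out on purpose:

```python
from itertools import product
from typing import List, Sequence, Tuple

Job = Tuple[str, str]  # (query chunk id, target chunk id)


def all_vs_all_jobs(
    query_chunks: Sequence[str], target_chunks: Sequence[str]
) -> List[Job]:
    """Enumerate every query/target chunk pair as an independent job."""
    return list(product(query_chunks, target_chunks))


def split_among_workers(jobs: Sequence[Job], n_workers: int) -> List[List[Job]]:
    """Deal jobs out round-robin; no job depends on any other."""
    if n_workers <= 0:
        raise ValueError("n_workers must be positive")
    return [list(jobs[i::n_workers]) for i in range(n_workers)]


if __name__ == "__main__":
    jobs = all_vs_all_jobs(["q0", "q1", "q2"], ["t0", "t1"])
    for worker, share in enumerate(split_among_workers(jobs, 2)):
        print(worker, share)
```

In practice one would more likely formatdb the target chunks once and
run each query chunk against that database, but the independence of
the jobs is the same either way.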

As a side benefit, benchmarking on this scenario maps *precisely*
to a bioinformatic problem that *will* continue to grow without bound:
cDNA library vs. genome target searches.

An interesting (and more thorough) system benchmark would look for an
optimal chunk size for the system in question.
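
One way such a sweep could be scripted, as a sketch only: time the
whole job at each candidate chunk size and keep the fastest.  This
assumes Python; run_benchmark is a hypothetical callable that performs
the full chunk-and-BLAST run for a given chunk size, and the timer
argument exists only so the clock can be swapped out:

```python
import time
from typing import Callable, Dict, Iterable, Tuple


def sweep_chunk_sizes(
    run_benchmark: Callable[[int], None],
    sizes: Iterable[int],
    timer: Callable[[], float] = time.perf_counter,
) -> Tuple[int, Dict[int, float]]:
    """Time run_benchmark at each chunk size.

    Returns (fastest size, {size: elapsed seconds}).
    """
    timings: Dict[int, float] = {}
    for size in sizes:
        started = timer()
        run_benchmark(size)  # hypothetical: the complete run at this size
        timings[size] = timer() - started
    if not timings:
        raise ValueError("no chunk sizes given")
    best = min(timings, key=timings.get)
    return best, timings
```

Wall-clock time is only one possible figure of merit; the same loop
could record throughput or hits found instead.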

Please let me know what I've missed.  

  Respectfully,
  Chris Dwan

> I wonder if anyone could do the following benchmark (on a dual-G4) and 
> publish the results?
> 
> Blast human chromosomes 21 and 22 (query sequences) against the genome 
> of pufferfish (database).  Use default parameters for blast (i.e., no 
> word length of 40, etc.) and an E-value of 10^{-4}.  If possible, 
> submit both jobs independently and simultaneously, so that one CPU is 
> blasting chr 21, while the other CPU is blasting chr 22.  Thanks!