Ivo,

How is this benchmark useful? The number of available sequenced chromosomes grows very slowly. Unless I'm missing something really fundamental, that means the number of possible searches of this type must also grow fairly slowly (quadratically in a slowly growing quantity, in fact). If we cache results, I think chromosome-on-chromosome BLASTs will never be a dominant part of the computational load. That is good, unless we simply want to show how hard we can make the servers work by doing PRECISELY the same job over and over again. If I'm wrong about this, I hope that someone will let me know in what way.

Anyway, assuming that this *is* the sort of job we want to run: can we be smarter than just dumping in two sequences (one formatdb'd) and letting it run? I would rather benchmark the smart way of doing it, at the very least. The run you describe will return the top 500 (by default) local alignments between the two chromosomes. This is fine and good, but it represents a tiny fraction of the similarities that are interesting to the genomic scientists with whom I work.

An example: one of my users really likes to make chromosome/chromosome similarity maps (sparse matrices) for what he calls "genome archeology." This is THE classic reason for doing genome-on-genome BLASTs. We do it by breaking each chromosome into overlapping chunks of arbitrary size (say, 10,000bp with 1,000bp overlap) and then doing an "all vs. all" set of BLASTs.

This way of setting up the problem has three benefits:
------------------------------------------------------

* It avoids any weirdness resulting from BLASTing at the megabase scale (which I've observed, but never taken the time to figure out in detail because the workaround is so simple).
* It turns the problem from a worst-case scenario for BLAST into one that's trivially parallelizable.
* It produces a far richer set of results than the top 500 local alignments overall.
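For the curious, the chunking step is trivial to script. Here is a minimal sketch (the function name and the plain sequence-string input are my own; the 10,000bp / 1,000bp defaults are just the example figures above, and both are tunable):

```python
def chunk_sequence(seq, chunk_size=10_000, overlap=1_000):
    """Split a sequence string into overlapping chunks.

    Returns a list of (start_offset, subsequence) pairs so that
    BLAST hit coordinates can later be mapped back onto the
    original chromosome. Consecutive chunks share `overlap` bases
    so alignments spanning a chunk boundary are not lost.
    """
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(seq), step):
        chunks.append((start, seq[start:start + chunk_size]))
        if start + chunk_size >= len(seq):
            break  # this chunk already reaches the end
    return chunks
```

Each chunk would then be written out (e.g. as its own FASTA record) and fed into the all-vs-all BLAST runs; the offsets let you reassemble the sparse similarity matrix afterwards.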
As a side benefit, benchmarking on this scenario maps *precisely* onto a bioinformatic problem that *will* continue to grow without bound: cDNA library vs. genome target searches.

An interesting (and more thorough) system benchmark would look for an optimal chunk size for the system in question.

Please let me know what I've missed.

Respectfully,

Chris Dwan

> I wonder if anyone could do the following benchmark (on a dual-G4) and
> publish the results?
>
> Blast human chromosomes 21 and 22 (query sequences) against the genome
> of pufferfish (database). Use default parameters for blast (i.e., no
> word length of 40, etc.) and an E-value of 10^{-4}. If possible,
> submit both jobs independently and simultaneously, so that one CPU is
> blasting chr 21, while the other CPU is blasting chr 22. Thanks!