Hi Chris,

thanks for your detailed comments.

"Chris Dwan (CCGB)" <cdwan@mail.ahc.umn.edu> wrote on Thu, 23 May 2002:

> Ivo,
>
> How is this benchmark useful?

See below.

> The number of available sequenced chromosomes grows very
> slowly. Unless I'm missing something really fundamental,

See below.

> that means
> that the number of possible searches of this type must also grow
> fairly slowly (slow squared, in fact).

Slow squared can be fast. :-) For example, if slow = linear, then slow
squared = quadratic. But that is not my main point, see below.

> Anyway, assuming that this *is* the sort of job we want to run.

We need to define what we mean by "the sort of job." I mean that 90% of
our cluster load comes from Blast. It doesn't mean that 90% of the jobs
are Blast jobs, but it means that Blast is one of the slowest jobs in
our case. Hence, it makes sense (for us) to use "some sort" of Blast
job as a benchmark.

Instead of chr 21 and 22 and pufferfish I could have picked (almost)
any other triple of sequences, but I picked chr 22 and 21 versus
pufferfish for the following simple reasons:

- Human and pufferfish have an evolutionary distance that is typical
  for our applications. Right now we focus on human-mouse comparisons,
  but in the near future we will move to more distant (from human)
  organisms, and pufferfish is a good example in this respect.

- Those sequences have a length that seemed optimal for a benchmark:
  the jobs will run for a few hours, not only for a few minutes, and
  not for a few days.

The first point indicates why the benchmark Blast comparison of human
with chimpanzee (with word size up to 40) done by Apple is not too
relevant for us, and why we would like to see a benchmark with two more
distant organisms.

> Can we be smarter than just dumping in two sequences (one formatdb'd) and
> letting it run? I would rather benchmark the smart way of doing it,
> at the very least.

Well, I didn't state we should do this comparison in a non-smart way.
If you read my posts from the past, you will find that we always run
Blast jobs in the mode that you call the smart way. I actually don't
know if it is smart or not, but we always cut the query sequence(s)
into fragments of, say, 1001 kb, overlapping by 1 kb, and then fuse the
output in the end. Sorry for not having repeated this in my previous
email, and sorry for all the confusion that this may have caused.

> The run you describe will return the top 500 (by default) local
> alignments between the two chromosomes.

Again, I am sorry that I haven't specified all the details. We always
want to get "all" local alignments below the specified E-value, so we
typically use 50,000 for the -b and -v flags. 50,000 is typically
enough in our examples: if we found that the number of local alignments
had reached 50,000, we would increase that number and run that
particular Blast job again, but so far that has never happened in our
analyses.

> represents a tiny fraction of the similarities that are interesting to
> the genomic scientists with whom I work.

Of course, you are right; sorry for the confusion, see above.

> An example: one of my users really likes to make chromosome /
> chromosome similarity maps (sparse matrixes) for what he calls "genome
> archeology." This is THE classic reason for doing genome on genome
> BLASTs. We do it by breaking each chromosome into overlapping chunks
> of arbitrary size (say, 10,000bp with 1,000bp overlap) and then doing
> an "all vs. all" set of BLASTs.

That is exactly what we do (see above, and see my previous posts),
except that we choose larger chunk sizes, at least 101 kb and often
1001 kb. And we choose both the chunk size and the overlap size in a
problem-dependent way.

> This way of setting up the problem has three benefits:

Thanks for spelling this out in great detail.
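The cut-and-fuse scheme described above (fixed-size fragments with a
small overlap, so that hits near a cut are not lost, and fragment
offsets recorded so hit coordinates can be mapped back) can be sketched
roughly as follows. This is an illustrative sketch, not our production
script; the function name and the default sizes are my own choices:

```python
def chunk_sequence(seq, chunk=1_001_000, overlap=1_000):
    """Cut a query sequence into overlapping fragments.

    Returns (offset, fragment) pairs; a hit at position p within a
    fragment lies at offset + p in the full sequence, which is how the
    per-fragment Blast outputs can be fused back together at the end.
    """
    step = chunk - overlap
    fragments = []
    for start in range(0, len(seq), step):
        fragments.append((start, seq[start:start + chunk]))
        if start + chunk >= len(seq):
            break
    return fragments

# Toy usage: a 20 bp "chromosome", 8 bp chunks, 2 bp overlap.
seq = "ACGT" * 5
frags = chunk_sequence(seq, chunk=8, overlap=2)
# Every fragment is an exact slice of the original sequence.
assert all(seq[s:s + 8] == f for s, f in frags)
```

Fusing then amounts to shifting each fragment's hit coordinates by its
offset and deduplicating hits that fall entirely inside an overlap.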
> As a side benefit, benchmarking on this scenario maps *precisely*
> to a bioinformatic problem that *will* continue to grow without bound:

Okay, great, then we perfectly agree that "that sort of Blast job"
could be a useful benchmark, so let's do it. Didn't Jeff Bizzaro
recently acquire a dual-G4 machine?

> An interesting (and more thorough) system benchmark would look for an
> optimal chunk size for the system in question.

Great point! But of course that is problem dependent!

Also, I forgot to mention that we usually Blast only repeat-masked
sequences, which reduces the running time (and also the memory
requirement) substantially.

Again, thanks for your great comments.

Best regards,

Ivo
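One way to make the chunk-size trade-off concrete: for a sequence of
length L, chunk size c, and overlap v, covering it takes roughly
ceil((L - v) / (c - v)) fragments, and an all-vs-all comparison runs
one Blast job per pair of fragments. A back-of-the-envelope sketch,
using round illustrative lengths (not the exact chromosome sizes):

```python
import math

def n_chunks(length, chunk, overlap):
    """Number of overlapping fragments needed to cover a sequence."""
    return max(1, math.ceil((length - overlap) / (chunk - overlap)))

# Round illustrative lengths, not exact chromosome sizes.
chr21, chr22 = 45_000_000, 49_000_000

for chunk in (10_000, 101_000, 1_001_000):
    n1 = n_chunks(chr21, chunk, 1_000)
    n2 = n_chunks(chr22, chunk, 1_000)
    print(f"chunk={chunk:>9,}: {n1} x {n2} = {n1 * n2:,} pairwise jobs")
```

Smaller chunks mean many more (but shorter and less memory-hungry)
jobs, which is exactly why the optimal chunk size depends on both the
problem and the system.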