> What do you do when the target databases change in size (as they do
> with every update) - have you developed a formula for adjusting the
> runtimes then?

Short answer: No, and I don't need to.  Only one of the clusters on
which I run requires a time estimate, and there is no penalty for
overestimating other than scheduling priority (I'm charged for what I
use, not what I reserve).  Therefore, I say "12 hours" for all blastn
vs NT or blastx vs Uniref, and "1 hour" for all the rest (an
assortment of chromosomes, TIGR gene indices, and full-length cDNA
sequences).

My questions would be "how accurate do you need to be?" and "is there
a penalty for overestimating?"  I don't use the runtime queries I
shared for anything other than after-the-fact analysis.

> I have actually found that the size of the target database is a much
> stronger predictor of the wall time the job takes than the query
> size.  Times seem pretty consistent across different length queries
> run against the same target (I randomly generate my test queries
> now).

This is true as long as your query sequences are small (under
~10,000 bp).  My queries are BACs, up to 160,000 bp in length.  There
is a large runtime difference between a query of 10,000 bp and one of
100,000 bp.

A plot of wallclock runtime in seconds as a function of query size in
bp for BLASTN (for a variety of processors, all single thread, lots of
uncontrolled variables, some limitations may apply):

http://ccgb.umn.edu/~cdwan/benchmarks/image007.gif

It's pretty well known that (within certain reasonable limits) blastn
is limited by how efficiently you can get the target out of storage,
and blastx is limited by the clock speed of the processor.

Once upon a time, I tried to find a query length that would maximize
query bp analyzed per second.  It turned out to be a more efficient
use of my time to go looking for more processors instead.  :)

-Chris Dwan
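Since randomly generated test queries came up above, here is a minimal
sketch of one way to produce them in Python.  Everything here is
illustrative rather than the script actually used on the cluster, and
the uniform base composition is an assumption -- real genomic sequence
is biased, which can itself affect BLAST timings.

```python
import random

def random_query(length, seed=None):
    # Uniform, independent bases (illustrative assumption; real
    # genomes have skewed composition and repeats).
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length))

def as_fasta(name, seq, width=60):
    # Wrap the sequence into FASTA format at the given line width.
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">%s\n%s" % (name, "\n".join(lines))

# Example: a reproducible 100 bp test query (the seed pins it down).
print(as_fasta("test_query_100bp", random_query(100, seed=42)))
```

Fixing the seed makes a benchmark repeatable across runs, while varying
it gives an ensemble of queries of the same length.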
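If you do want a formula rather than a flat "12 hours", a simple
least-squares fit of walltime against query size (per target database)
is about as fancy as it needs to be for scheduling purposes.  This is
only a sketch: the sample numbers below are invented for illustration
(they are not read off the plot linked above), and the 1.5x padding
factor is an arbitrary safety margin.

```python
def fit_line(xs, ys):
    # Ordinary least-squares fit y = a + b * x; returns (a, b).
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

# (query_bp, walltime_seconds) pairs -- made-up illustrative data,
# as if measured against one fixed target database.
samples = [(10_000, 120.0), (40_000, 390.0),
           (100_000, 930.0), (160_000, 1470.0)]
a, b = fit_line([q for q, _ in samples], [t for _, t in samples])

# Pad the prediction before handing it to the scheduler: here,
# overestimating only costs priority, while underestimating can get
# the job killed.
def estimate(query_bp, padding=1.5):
    return padding * (a + b * query_bp)
```

Because blastn times depend so strongly on the target, you would fit
one (a, b) pair per target database and redo the fit when the database
is updated.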