[Bioclusters] BLAST job time estimates

Micha Bayer bioclusters@bioinformatics.org
08 Jun 2004 09:57:06 +0100


Thanks Chris, that's useful to know. 

What do you do when the target databases change in size (as they do with
every update) - have you developed a formula for adjusting the runtimes
then?

Also, what is runtime in this context - CPU time or walltime?

Since writing the original message I have done a few test runs myself,
and I have found that the size of the target database is a much
stronger predictor of a job's walltime than the query size. Times seem
pretty consistent across queries of different lengths run against the
same target (I now generate my test queries randomly).
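
For what it's worth, a first stab at such an adjustment formula might
look like the sketch below. It fits a least-squares line through the
blastn bin averages from Chris's table quoted further down, then
rescales a prediction by the ratio of new to old database size. The
proportional scaling with database size is purely an assumption (neither
set of tests has established it), the runtime units are whatever the
pipeline recorded, and the function names are made up for illustration.

```python
# (length bin in units of 10,000 bp, avg runtime) copied from the
# blastn vs NCBI NT table quoted below. Runtime units are whatever the
# pipeline recorded; the table does not say.
BLASTN_BINS = [
    (0, 246.0), (1, 424.75), (2, 803.0), (3, 790.6667),
    (4, 1063.8333), (5, 1217.8), (6, 1354.4), (7, 1630.75),
    (8, 1886.8), (9, 2065.2857), (10, 2299.0), (11, 2523.0),
    (12, 2714.9565), (13, 3016.5714), (14, 3264.6667), (16, 3817.0),
]


def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_runtime(length_bp, slope, intercept):
    """Predicted runtime for a query, using the same 10,000 bp bins."""
    return slope * (length_bp // 10000) + intercept


def rescale_for_db_size(runtime, old_db_size, new_db_size):
    """ASSUMPTION: runtime grows in proportion to target database size."""
    return runtime * new_db_size / old_db_size


if __name__ == "__main__":
    slope, intercept = fit_line(BLASTN_BINS)
    base = predict_runtime(85000, slope, intercept)
    # Hypothetical update that grows the target database by 10%.
    print(slope, intercept, base, rescale_for_db_size(base, 1.0, 1.1))
```

The fit is unweighted, so a bin with a single contig counts as much as
a bin with twenty; weighting by num_contigs would be a simple
refinement. The database sizes can be in any unit (residues, bytes) as
long as old and new use the same one.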

cheers
Micha


On Mon, 2004-06-07 at 17:04, Chris Dwan wrote:
> > It looks like I am stuck with doing the time prediction because we are
> > plugging into an existing cluster with existing rules, much as I would
> > like to avoid this issue altogether.... :-)
> 
> I find that BLAST run time prediction is pretty consistent (within 5% 
> or so) based only on query length, provided that you're allowed to run 
> a set of tests on the exact target in question, on the exact machines 
> in question.  I've got an instrumented version of the EnsEMBL pipeline 
> which saves runtimes (and queue waits, and all sorts of other goodies) 
> for later perusal.   On an analysis containing 627 contigs from 
> Medicago truncatula the times for blastn vs NCBI NT on our Xserves 
> (bins of 10,000bp length) look like this:
> 
> mysql> select count(distinct(contig_id)) as num_contigs,
>               floor(length / 10000) as bp, avg(runtime), std(runtime),
>               run_queue
>          from contig, input_id_analysis
>         where input_id = name and analysis_id = 3
>           and run_queue = "CCGB_XSERVE"
>         group by bp, run_queue order by run_queue, bp;
> +-------------+------+--------------+--------------+-------------+
> | num_contigs | bp   | avg(runtime) | std(runtime) | run_queue   |
> +-------------+------+--------------+--------------+-------------+
> |           3 |    0 |     246.0000 |       6.5320 | CCGB_XSERVE |
> |           4 |    1 |     424.7500 |      39.3407 | CCGB_XSERVE |
> |           1 |    2 |     803.0000 |       0.0000 | CCGB_XSERVE |
> |           3 |    3 |     790.6667 |      65.6523 | CCGB_XSERVE |
> |           6 |    4 |    1063.8333 |      64.2117 | CCGB_XSERVE |
> |           5 |    5 |    1217.8000 |      90.1341 | CCGB_XSERVE |
> |           5 |    6 |    1354.4000 |      65.0372 | CCGB_XSERVE |
> |           8 |    7 |    1630.7500 |      70.1334 | CCGB_XSERVE |
> |           5 |    8 |    1886.8000 |      70.8220 | CCGB_XSERVE |
> |           7 |    9 |    2065.2857 |      99.2928 | CCGB_XSERVE |
> |          20 |   10 |    2299.0000 |      99.3700 | CCGB_XSERVE |
> |          18 |   11 |    2523.0000 |     125.7763 | CCGB_XSERVE |
> |          23 |   12 |    2714.9565 |     157.0911 | CCGB_XSERVE |
> |          14 |   13 |    3016.5714 |      81.9658 | CCGB_XSERVE |
> |           6 |   14 |    3264.6667 |      64.0356 | CCGB_XSERVE |
> |           1 |   16 |    3817.0000 |       0.0000 | CCGB_XSERVE |
> +-------------+------+--------------+--------------+-------------+
> 
> BLASTX vs Uniref looks like:
> 
> mysql> select count(distinct(contig_id)) as num_contigs,
>               floor(length / 10000) as bp, avg(runtime), std(runtime),
>               run_queue
>          from contig, input_id_analysis
>         where input_id = name and analysis_id = 14
>           and run_queue = "CCGB_XSERVE"
>         group by bp, run_queue order by run_queue, bp;
> +-------------+------+--------------+--------------+-------------+
> | num_contigs | bp   | avg(runtime) | std(runtime) | run_queue   |
> +-------------+------+--------------+--------------+-------------+
> |           4 |    0 |      96.2500 |      18.9126 | CCGB_XSERVE |
> |           5 |    1 |     515.2000 |      85.6491 | CCGB_XSERVE |
> |           2 |    2 |     810.0000 |       2.0000 | CCGB_XSERVE |
> |           2 |    3 |    1326.0000 |     115.0000 | CCGB_XSERVE |
> |           3 |    4 |    1931.6667 |      46.6571 | CCGB_XSERVE |
> |           5 |    5 |    2712.2000 |     150.0325 | CCGB_XSERVE |
> |           3 |    6 |    3104.0000 |      99.5624 | CCGB_XSERVE |
> |           6 |    7 |    3799.5000 |     218.6342 | CCGB_XSERVE |
> |           3 |    8 |    5052.0000 |     205.0870 | CCGB_XSERVE |
> |           7 |    9 |    5697.5714 |     480.6186 | CCGB_XSERVE |
> |           7 |   10 |    6887.2857 |     385.0632 | CCGB_XSERVE |
> |          19 |   11 |    7707.6316 |     342.8089 | CCGB_XSERVE |
> |          18 |   12 |    8812.0000 |     502.8817 | CCGB_XSERVE |
> |          10 |   13 |    9638.8000 |     726.1260 | CCGB_XSERVE |
> |           6 |   14 |   10457.5000 |     742.9995 | CCGB_XSERVE |
> |           2 |   16 |   13521.0000 |      58.0000 | CCGB_XSERVE |
> +-------------+------+--------------+--------------+-------------+
> 
> -Chris Dwan
>    The University of Minnesota
> 
-- 
--------------------------------------------------
Dr Micha M Bayer
Grid Developer, BRIDGES Project
National e-Science Centre, Glasgow Hub
246c Kelvin Building
University of Glasgow
Glasgow G12 8QQ
Scotland, UK
Email: michab@dcs.gla.ac.uk
Project home page: http://www.brc.dcs.gla.ac.uk/projects/bridges/
Personal Homepage: http://www.brc.dcs.gla.ac.uk/~michab/
Tel.: +44 (0)141 330 2958