[Bioclusters] BLAST job time estimates

Joe Landman bioclusters@bioinformatics.org
Tue, 08 Jun 2004 06:49:07 -0400


On Tue, 2004-06-08 at 04:57, Micha Bayer wrote:
> Thanks Chris, that's useful to know. 
> 
> What do you do when the target databases change in size (as they do with
> every update) - have you developed a formula for adjusting the runtimes
> then?

If you make observations across multiple query and database sizes, you
might be able to fit a functional form to it.  The problem of course is
that there are so many influencing factors.  The search is at least O(N)
in the number of sequences in the library.  Each query search is a
hashed version of the query word against the library, and this hash
generation is something like O(l) where l is the length of the query. 
The search is likely to be close to O(N*l).  

You could construct a very simple time estimator model

	Time(t,l[i])	= (a*(N_seq))*(b[i]*l[i])

and fit your data to it.  N_seq are the number of sequences in the
database, l[i] is the length of the i th query sequence.  The problem is
that you will have many b[i]'s.  You would want to try a case where you
had N_seq=1 and a few others to place constraints on some of the
constants.

Even then, after a good fit, there are still multiple factors that would
influence run time.  The biggest factor would be the index database size
as compared to the available memory size.  If you overflow local ram,
the mmap function will flush pages, and you will introduce disk I/O for
your indices, which could be a significant performance inhibitor,
depending upon how much I/O is needed.

To alleviate this, use the "-v N" switch on formatdb, where N is a size
in MB.  This fragments the database index into approximately N megabyte
size segments.  This gives you a similar effect to the optimization
technique named "blocking".

> 
> Also, what is runtime in this context - CPU time or walltime?

Should be wall time.

> 
> Since I wrote the original message I have run a few test runs myself and
> I have actually found that the size of the target database is a much
> stronger predictor of the wall time the job takes than the query size.
> Times seem pretty consistent across different length queries run against
> the same target (I randomly generate my test queries now).

See above.  How large are your databases?

> 
> cheers
> Micha
> 
> 
> On Mon, 2004-06-07 at 17:04, Chris Dwan wrote:
> > > It looks like I stuck with doing the time prediction because we are
> > > plugging into an existing cluster with existing rules, much as I would
> > > like to avoid this issue altogether.... :-)
> > 
> > I find that BLAST run time prediction is pretty consistent (within 5% 
> > or so) based only on query length, provided that you're allowed to run 
> > a set of tests on the exact target in question, on the exact machines 
> > in question.  I've got an instrumented version of the EnsEMBL pipeline 
> > which saves runtimes (and queue waits, and all sorts of other goodies) 
> > for later perusal.   On an analysis containing 627 contigs from 
> > Medicago truncatula the times for blastn vs NCBI NT on our Xserves 
> > (bins of 10,000bp length) look like this:
> > 
> > mysql> select count(distinct(contig_id)) as num_contigs, floor(length / 
> > 10000) as bp, avg(runtime), \
> >               std(runtime), run_queue  from contig, input_id_analysis 
> > where input_id = name and analysis_id = 3 \
> >              and run_queue = "CCGB_XSERVE" group by bp, run_queue order 
> > by run_queue, bp;
> > +-------------+------+--------------+--------------+-------------+
> > | num_contigs | bp   | avg(runtime) | std(runtime) | run_queue   |
> > +-------------+------+--------------+--------------+-------------+
> > |           3 |    0 |     246.0000 |       6.5320 | CCGB_XSERVE |
> > |           4 |    1 |     424.7500 |      39.3407 | CCGB_XSERVE |
> > |           1 |    2 |     803.0000 |       0.0000 | CCGB_XSERVE |
> > |           3 |    3 |     790.6667 |      65.6523 | CCGB_XSERVE |
> > |           6 |    4 |    1063.8333 |      64.2117 | CCGB_XSERVE |
> > |           5 |    5 |    1217.8000 |      90.1341 | CCGB_XSERVE |
> > |           5 |    6 |    1354.4000 |      65.0372 | CCGB_XSERVE |
> > |           8 |    7 |    1630.7500 |      70.1334 | CCGB_XSERVE |
> > |           5 |    8 |    1886.8000 |      70.8220 | CCGB_XSERVE |
> > |           7 |    9 |    2065.2857 |      99.2928 | CCGB_XSERVE |
> > |          20 |   10 |    2299.0000 |      99.3700 | CCGB_XSERVE |
> > |          18 |   11 |    2523.0000 |     125.7763 | CCGB_XSERVE |
> > |          23 |   12 |    2714.9565 |     157.0911 | CCGB_XSERVE |
> > |          14 |   13 |    3016.5714 |      81.9658 | CCGB_XSERVE |
> > |           6 |   14 |    3264.6667 |      64.0356 | CCGB_XSERVE |
> > |           1 |   16 |    3817.0000 |       0.0000 | CCGB_XSERVE |
> > +-------------+------+--------------+--------------+-------------+
> > 
> > BLASTX vs Uniref looks like:
> > 
> > mysql> select count(distinct(contig_id)) as num_contigs, floor(length / 
> > 10000) as bp, avg(runtime), \
> >               std(runtime), run_queue  from contig, input_id_analysis 
> > where input_id = name and analysis_id = 14 \
> >              and run_queue = "CCGB_XSERVE" group by bp, run_queue order 
> > by run_queue, bp;
> > +-------------+------+--------------+--------------+-------------+
> > | num_contigs | bp   | avg(runtime) | std(runtime) | run_queue   |
> > +-------------+------+--------------+--------------+-------------+
> > |           4 |    0 |      96.2500 |      18.9126 | CCGB_XSERVE |
> > |           5 |    1 |     515.2000 |      85.6491 | CCGB_XSERVE |
> > |           2 |    2 |     810.0000 |       2.0000 | CCGB_XSERVE |
> > |           2 |    3 |    1326.0000 |     115.0000 | CCGB_XSERVE |
> > |           3 |    4 |    1931.6667 |      46.6571 | CCGB_XSERVE |
> > |           5 |    5 |    2712.2000 |     150.0325 | CCGB_XSERVE |
> > |           3 |    6 |    3104.0000 |      99.5624 | CCGB_XSERVE |
> > |           6 |    7 |    3799.5000 |     218.6342 | CCGB_XSERVE |
> > |           3 |    8 |    5052.0000 |     205.0870 | CCGB_XSERVE |
> > |           7 |    9 |    5697.5714 |     480.6186 | CCGB_XSERVE |
> > |           7 |   10 |    6887.2857 |     385.0632 | CCGB_XSERVE |
> > |          19 |   11 |    7707.6316 |     342.8089 | CCGB_XSERVE |
> > |          18 |   12 |    8812.0000 |     502.8817 | CCGB_XSERVE |
> > |          10 |   13 |    9638.8000 |     726.1260 | CCGB_XSERVE |
> > |           6 |   14 |   10457.5000 |     742.9995 | CCGB_XSERVE |
> > |           2 |   16 |   13521.0000 |      58.0000 | CCGB_XSERVE |
> > +-------------+------+--------------+--------------+-------------+
> > 
> > -Chris Dwan
> >    The University of Minnesota
> > 
> > _______________________________________________
> > Bioclusters maillist  -  Bioclusters@bioinformatics.org
> > https://bioinformatics.org/mailman/listinfo/bioclusters
-- 
Joseph Landman, Ph.D
Scalable Informatics LLC,
email: landman@scalableinformatics.com
web  : http://scalableinformatics.com
phone: +1 734 612 4615