[Bioclusters] BLAST job time estimates

Tim Cutts bioclusters@bioinformatics.org
Tue, 1 Jun 2004 13:10:24 +0100


On 1 Jun 2004, at 11:57 am, Micha Bayer wrote:

> A formula would presumably take into account things like the length and
> number of the input queries, the size and makeup of the target database
> (i.e. number and length of sequences contained in this), the
> similarities between the query and the target sequences and local
> hardware parameters (processors, memory, local network speeds etc).

I think it's very difficult to predict.  I'm pretty certain the 
algorithm is O(n*m) in both memory and time, where n is the total 
query length and m is the size of the database, but turning that into 
a meaningful wall-clock prediction is very difficult indeed, since the 
number of HSPs found will make an enormous difference to both memory 
use and time.  And the number of HSPs you find can vary wildly 
depending on the exact parameters you give to BLAST, even with 
identical input sequences.
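
To make the point concrete, the best you can really do up front is a 
toy model along these lines (a quick Python sketch; the constants are 
entirely invented and would need calibrating on your own hardware, and 
it deliberately ignores the HSP effects above, which is exactly why it 
isn't much use in practice):

    # Toy O(n*m) estimate for a BLAST search: time and memory taken to
    # scale with (total query length) x (total database length).  The
    # per-cell constants are made up for illustration only; HSP-heavy
    # searches will blow an estimate like this apart.

    def estimate_blast_job(query_residues, db_residues,
                           secs_per_cell=1e-9, bytes_per_cell=2e-9):
        cells = query_residues * db_residues
        return {
            "cpu_seconds": cells * secs_per_cell,
            "rss_bytes": cells * bytes_per_cell,
        }

    # e.g. 100 queries of ~500 residues against a 1e9-residue database
    print(estimate_blast_job(100 * 500, 10**9))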

We don't bother with this sort of estimation.  LSF can improve its 
scheduling if you give it such estimates, but we just use LSF's 
fairshare mechanism instead.  If a user is submitting very 
long-running jobs, their priority will dynamically fall off to give 
other users a crack at the CPUs, so it all works out OK in the end.
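
The effect is roughly what the little sketch below shows.  This is not 
LSF's actual dynamic-priority formula (which also weights run time, 
job slots and so on); it's just an illustration of the general idea 
that accumulated CPU time pushes a user's priority down relative to 
their configured share:

    # Illustrative fairshare calculation -- not LSF's real formula.
    # A user's dynamic priority is their share divided by a penalty
    # that grows with the CPU time they have already consumed, so
    # heavy users drift to the back of the queue.

    def dynamic_priority(shares, cpu_seconds_used, cpu_time_factor=0.7):
        return shares / (1.0 + cpu_time_factor * cpu_seconds_used)

    usage = {"alice": 100.0, "bob": 36000.0}   # CPU seconds used so far
    for user, used in usage.items():
        print(user, dynamic_priority(shares=1, cpu_seconds_used=used))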

Rather more important, in our experience, is estimating how much RAM 
the job is going to require.  Memory overcommits are one of our biggest 
problems now, especially on our larger SMP boxes.  We have a 32-way 
machine with 192 GB of memory, which *regularly* runs out of virtual 
memory.  The LSF queue that services that machine now has an esub in 
place to force people to provide LSF with an estimate of how much 
memory the job will use, but there's still no way we can force them to 
be accurate!  We've ended up putting strict memory use limits on the 
LSF queues, and jobs which exceed those limits get killed.
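
In case it's useful to anyone else, the esub boils down to something 
like the following (a rough Python sketch, not our production script; 
the LSB_SUB_* names are from memory, so check the LSF documentation 
for your version, and the rusage[mem=...] check is just an 
illustration):

    #!/usr/bin/env python
    # Rough esub sketch: reject submissions with no memory reservation.
    # LSF runs the esub at bsub time; LSB_SUB_PARM_FILE points at a file
    # of the job's submission options, and exiting with the value of
    # LSB_SUB_ABORT_VALUE rejects the job.  Names are from memory --
    # check the documentation for your LSF version.

    import os, re, sys

    parm_file = os.environ.get("LSB_SUB_PARM_FILE")
    abort = int(os.environ.get("LSB_SUB_ABORT_VALUE", "97"))

    opts = {}
    if parm_file:
        for line in open(parm_file):
            if "=" in line:
                key, value = line.rstrip("\n").split("=", 1)
                opts[key] = value.strip('"')

    # Insist on an rusage[mem=...] resource request (bsub -R).
    if not re.search(r"rusage\[[^]]*mem=", opts.get("LSB_SUB_RES_REQ", "")):
        sys.stderr.write("Please reserve memory, e.g. "
                         "bsub -R 'rusage[mem=4000]'\n")
        sys.exit(abort)

    sys.exit(0)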

Tim

-- 
Dr Tim Cutts
Informatics Systems Group
Wellcome Trust Sanger Institute
Hinxton, Cambridge, CB10 1SA, UK