[Bioclusters] BLAST job time estimates

Joe Landman bioclusters@bioinformatics.org
Tue, 08 Jun 2004 09:17:56 -0400


On Tue, 2004-06-08 at 07:12, Micha Bayer wrote:

[...]

> The BLAST manual says that databases can be loaded into memory but there
> does not seem to be a way of forcing this - it seems to be up to the OS
> to decide whether it loads the db into memory or not. 

Yes.  The database indices are mmap'ed in.  mmap is a mechanism whereby
a file is accessed by mapping its pages into memory.  Subsequent reads
of the file should then be served from pages already in memory, as long
as you have sufficient memory to hold the database index.  Whether the
pages stay resident is determined by the OS at run time.  You can
tune/tweak the kernel, but in the end you are bound by the OS here.
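As an illustration of the mechanism (not of BLAST itself), here is a
minimal Python sketch using the standard mmap module; the file and its
contents below are stand-ins, not an actual BLAST database index:

```python
import mmap
import os
import tempfile

# Create a small throwaway file standing in for a database index.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"ACGT" * 1024)

with open(path, "rb") as f:
    # Map the whole file read-only.  Pages are faulted in on first
    # access; later reads of the same region hit the page cache
    # instead of going back to disk.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    first = mm[:4]    # first touch: may incur a page fault
    second = mm[:4]   # repeat access: served from memory
    mm.close()

os.unlink(path)
```

Whether those pages remain cached between runs is, as noted above,
entirely up to the OS's memory management.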

> On my machine here (Linux RH9) it does not seem to load the database
> into memory regardless of its size. I have tried the time command
> recently with my BLAST runs, which conveniently also records page
> faults, and I get the following output when I run a query against
> ecoli.nt (which is pathetically small, a few mb tops, and should easily
> fit into my 1gb memory):
> 
> >/usr/bin/time -v -- blastall -p blastn -d ecoli.nt -i test.txt -o
> test.out
>         Command being timed: "blastall -p blastn -d ecoli.nt -i test.txt
> -o test.out"
>         User time (seconds): 0.01
>         System time (seconds): 0.02
>         Percent of CPU this job got: 8%
>         Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.34
>         Average shared text size (kbytes): 0
>         Average unshared data size (kbytes): 0
>         Average stack size (kbytes): 0
>         Average total size (kbytes): 0
>         Maximum resident set size (kbytes): 0
>         Average resident set size (kbytes): 0
>         Major (requiring I/O) page faults: 792
>         Minor (reclaiming a frame) page faults: 621
>         Voluntary context switches: 0
>         Involuntary context switches: 0
>         Swaps: 0
>         File system inputs: 0
>         File system outputs: 0
>         Socket messages sent: 0
>         Socket messages received: 0
>         Signals delivered: 0
>         Page size (bytes): 4096
>         Exit status: 0
> 
> To me the number of page faults suggests clearly that the db is not in
> memory. Does that mean I cannot ever get the db into memory and on Linux
> all BLAST searches will take a huge performance hit because of this?

No, this means that your job is too short for meaningful measurement. 
Are you sure it is not failing?  0.34s total execution time makes me
quite suspicious.

try

	strace blastall -p blastn -d ecoli.nt -i test.txt -o test.out

and see if it generates an error.  Also, look in test.out to make sure
it worked.
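For what it's worth, the "Major/Minor page faults" lines that
/usr/bin/time prints come from the kernel's per-process rusage
counters.  A small Python sketch (Unix-only, standard resource module)
that reads the same counters for the current process:

```python
import resource

# RUSAGE_SELF reports the counters for this process, the same ones
# /usr/bin/time summarizes when the child exits.
usage = resource.getrusage(resource.RUSAGE_SELF)

# ru_majflt: faults that required I/O (page read from disk).
# ru_minflt: faults satisfied by reclaiming a frame already in memory.
print("major (requiring I/O):", usage.ru_majflt)
print("minor (reclaiming a frame):", usage.ru_minflt)
```

A handful of major faults on a first run is normal; what matters is
whether repeat runs against the same database show them dropping,
which indicates the index pages are staying cached.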

> Where does that leave things like mpiBLAST which gets its performance
> increase from the db fitting into memory?

mpiBLAST uses the "-v N" formatdb option I mentioned previously to
force the issue: the database is split into segments small enough to
fit in each node's memory.  It actually works quite well, though the
shared NFS path and the scheduler become the more important factors
inhibiting performance at larger node counts.

> 
> Maybe someone can shed some light on this......
> 
> > See above.  How large are your databases?
> 
> I plan to run the queries against the standard nr and nt databases and
> perhaps whole chromosome dbs as well. nt is currently about 2.6 gb, nr
> about 600 mb.

nr, the last time I downloaded it on May 20th, was 906.8 x 10**6 bytes
(~907 MB).  When you uncompress nt, it is much larger.  If you have
1 GB of RAM, you want to target about 1/3 to 1/2 GB for the index
size.  For nr and nt, try using 

	-v 300 

on the formatdb command line (the volume size is given in millions of
letters), e.g.

	formatdb -i nr -p T -v 300
	formatdb -i nt -p F -v 300

That should give you 3 nr segments, and many nt segments.

> 
> Micha
-- 
Joseph Landman, Ph.D
Scalable Informatics LLC,
email: landman@scalableinformatics.com
web  : http://scalableinformatics.com
phone: +1 734 612 4615