[Bioclusters] blastall and SGE

Aaron Darling bioclusters@bioinformatics.org
Wed, 29 Sep 2004 14:45:09 -0500


> The only work-around that seems to really work is running btblastall 
> on the command line with a database that has been forced to segment 
> into many more segments, rather than 15 (one for every node) into 30 
> or 32 (one for every processor).  This, on the command line, seems to 
> distribute jobs a little more efficiently, as well as utilizing more 
> CPU power than any other run.
>
> Any thoughts would be VERY helpful.
>
Just a word of caution on increasing the segmentation of BLAST 
databases: through our work on mpiBLAST we discovered that the time to 
search a BLAST database grows significantly as it is split into more 
fragments.
For example, the same BLAST database, when split into 100 fragments, 
took 30% longer to search with standard NCBI blastall than the unsplit 
database did.
We originally wanted to do load balancing in mpiBLAST with database 
segmentation alone, but the tremendous overhead for searching a heavily 
fragmented database has prompted us to implement query pipelining for 
load balancing in our upcoming release.
In your case, however, it sounds like the value of claiming unused CPU 
cycles may outweigh the cost of additional database segmentation.
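
A rough way to weigh that trade-off, as a back-of-envelope sketch only: 
the Python below assumes that total search work grows linearly with the 
fragment count, calibrated to the single data point above (100 fragments, 
30% longer). The linear model, the function name relative_wall_time, and 
the idea that equal-sized fragments run one per processor in rounds are 
all illustrative assumptions, not something measured or taken from 
mpiBLAST or NCBI blastall.

```python
import math

# ASSUMPTION: overhead grows linearly with fragment count. The only
# measured point is 100 fragments -> 30% longer, i.e. 0.003 per fragment.
OVERHEAD_PER_FRAGMENT = 0.30 / 100


def relative_wall_time(fragments, processors,
                       overhead_per_fragment=OVERHEAD_PER_FRAGMENT):
    """Estimate wall-clock time relative to a serial search of the
    unsplit database (which is defined as 1.0).

    Hypothetical model: fragments are equal in size, each fragment is
    searched by one processor, and fragments are handed out in rounds.
    """
    if fragments < 1 or processors < 1:
        raise ValueError("fragments and processors must be >= 1")
    # Total CPU work, inflated by the assumed fragmentation overhead.
    total_work = 1.0 + overhead_per_fragment * fragments
    # Time to search a single fragment.
    per_fragment = total_work / fragments
    # Number of rounds needed to get through every fragment.
    rounds = math.ceil(fragments / processors)
    return rounds * per_fragment


if __name__ == "__main__":
    # 15 nodes with 2 processors each, as in the quoted message.
    for n in (15, 30, 100):
        print(n, "fragments:", round(relative_wall_time(n, 30), 4))
```

Under these assumptions, 30 fragments on 30 processors finish in roughly 
half the time of 15 fragments, while 100 fragments come out slower than 
30 because the extra overhead and the uneven final round cost more than 
the finer granularity gains. Real numbers will depend on the database 
and queries, so it is worth timing a few fragment counts directly.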

-Aaron