[Bioclusters] a dedicated cluster to mpiblast the nr database

Fri, 05 Dec 2003 14:42:33 -0500

Since my question seems to have sparked a considerable number of very 
useful responses, both on and off the list, I'm going to try to 
summarize the feedback I've gotten.

A RAM cache of the db will be a big help,
but a Linux process can only use 2 or 3GB [1].
So the job may need to be spread across several smaller machines,
which is what mpiBLAST is intended for.
mpiBLAST uses NCBI BLAST, so the effect of CPU choice should be 
proportional between the two.
Determining the optimal size of the database per node will be 
important, but it will be a matter of trial and error.
I'll probably need more nodes, each with less memory, than I had 
originally anticipated,
which will increase the total price :-(
A RAID0 should help reduce the time spent on disk I/O, which is 
suspected to be the next bottleneck.

[1] I've heard both 2GB and 3GB from different responders. No definitive 
answer yet.
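
To put rough numbers on the points above, here is a minimal sketch of the 
sizing arithmetic: how many nodes are needed so that each node's database 
fragment fits in the memory a single process can use. The function name 
and the 10GB database size are hypothetical, chosen only for illustration; 
the real figure has to come from the formatted nr database and from 
profiling.

```python
GIB = 1024 ** 3

def nodes_needed(db_bytes, usable_ram_per_node_bytes):
    """Smallest node count such that each fragment fits in usable RAM.

    Assumes the database is split into equal fragments, one per node.
    """
    if db_bytes <= 0 or usable_ram_per_node_bytes <= 0:
        raise ValueError("sizes must be positive")
    # Integer ceiling division, avoiding floating-point rounding.
    return -(-db_bytes // usable_ram_per_node_bytes)

if __name__ == "__main__":
    example_db = 10 * GIB  # hypothetical size, not a measurement of nr
    for limit_gib in (2, 3):  # the two per-process limits reported so far
        count = nodes_needed(example_db, limit_gib * GIB)
        print(f"{limit_gib}GB per process -> {count} nodes")
```

This ignores memory needed by the OS and by BLAST itself, so the real 
node count will be somewhat higher than the estimate.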

I'm playing email tag with NCBI in hopes of learning more about the 
2/3GB memory limit,
and what benefits a 64-bit CPU might provide.
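
One likely explanation for hearing both 2GB and 3GB, offered as an 
assumption rather than a confirmed answer: a 32-bit address space is 4GB 
in total, and the kernel reserves part of it. How much is reserved depends 
on the kernel configuration, which would account for the two figures. The 
arithmetic, as a small sketch:

```python
GIB = 1024 ** 3
ADDRESS_SPACE_32 = 2 ** 32  # bytes addressable with a 32-bit pointer
ADDRESS_SPACE_64 = 2 ** 64  # theoretical ceiling for a 64-bit pointer

def user_space_bytes(kernel_reserved_gib):
    """Address space left to one process after the kernel's share."""
    return ADDRESS_SPACE_32 - kernel_reserved_gib * GIB

if __name__ == "__main__":
    # Assumed splits: 1GB or 2GB reserved for the kernel.
    for kernel_gib in (1, 2):
        user_gib = user_space_bytes(kernel_gib) // GIB
        print(f"kernel {kernel_gib}GB -> user {user_gib}GB")
```

A 64-bit CPU with a 64-bit OS lifts this per-process ceiling, so a single 
process could in principle cache a much larger database fragment, limited 
by installed RAM rather than by pointer width.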

This cluster is intended exclusively for blast, and will not support 
on-demand queries.
At present I'm leaning toward a cluster of rackmounts, each with 4GB of 
RAM and dual 2.4GHz Xeons.
Several people have contacted me to suggest alternative suppliers, and 
I'm eager to hear more such responses.
I'm pleased to say all of those responses were made privately, not to 
the general list.

I'll start with perhaps 4 machines, and profile performance against 
truncated versions of the nr database,
keeping an eye out for a serious performance hit as the db size grows.
Then I'll establish how many additional machines might be necessary for 
the full nr collection and anticipated growth.
I'm still not sure whether there should be a master node or a cluster of 
equals.
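
The profiling plan above could be sketched roughly as follows. This is a 
minimal outline, not a finished harness: truncate_fasta and time_command 
are hypothetical helper names, and the actual search command is left as a 
parameter because the exact invocation depends on the installation.

```python
import subprocess
import time

def truncate_fasta(lines, max_records):
    """Yield the lines belonging to the first max_records FASTA records."""
    seen = 0
    for line in lines:
        if line.startswith(">"):  # a header line starts a new record
            seen += 1
            if seen > max_records:
                return
        if seen > 0:  # skip anything before the first header
            yield line

def time_command(argv):
    """Run argv to completion and return elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(argv, check=True)
    return time.perf_counter() - start
```

The idea is to write each truncated database to its own file, format and 
search it with the same query set, and record the elapsed time per 
database size. A sudden jump in time between two sizes would suggest the 
point where the fragment no longer fits in the RAM cache.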

Since there will be a certain amount of reliance on profiling and 
benchmarking, shared experiences with tools and techniques would be 
helpful.