[Bioclusters] a dedicated cluster to mpiblast the nr database
Michael Cariaso
bioclusters@bioinformatics.org
Fri, 05 Dec 2003 14:42:33 -0500
Since my question seems to have sparked a considerable number of very
useful responses, both on and off the list, I'm going to try to summarize
the feedback I've gotten.
A RAM cache of the db will be a big help,
but a Linux process can only use 2 or 3GB [1]
So the job may need to be spread across several smaller machines
which is what mpiBLAST is intended for
mpiBLAST uses NCBI BLAST, and therefore the effect of CPU speed should be
proportional between them.
Determining the optimal size of the database per node will be
important, but will come down to trial and error
I'll probably need more nodes, each with less memory, than I had
originally anticipated
which will increase the total price :-(
A RAID 0 should help with disk I/O, which is suspected to be the next
bottleneck
[1] I've heard both 2 and 3 from different responders. No definitive
answer yet.
I'm playing email tag with NCBI in hopes of learning more about the
2/3GB memory limit, and what benefits a 64-bit CPU might provide.
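
To make the sizing argument above concrete, here is a rough sketch of the
arithmetic in Python. It assumes one database fragment per worker process
and that only part of the per-process limit is available for caching the
fragment. The 10GB database size and the 0.75 usable fraction are
placeholders I made up for illustration, not measurements, and the names
are hypothetical.

```python
import math

GB = 1024 ** 3


def fragments_needed(db_bytes, process_limit_bytes, usable_fraction=0.75):
    """Smallest number of equal fragments such that each one fits in the
    part of a process's address space assumed usable for caching."""
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be in (0, 1]")
    usable = process_limit_bytes * usable_fraction
    return math.ceil(db_bytes / usable)


# Placeholder size only; substitute the real formatted size of the database.
HYPOTHETICAL_DB_BYTES = 10 * GB

# The per-process limit is still unconfirmed, so try both reported values.
for limit_gb in (2, 3):
    count = fragments_needed(HYPOTHETICAL_DB_BYTES, limit_gb * GB)
    print(f"{limit_gb}GB limit: at least {count} fragments/workers")
```

This only gives a lower bound on the number of workers. The best fragment
size still has to be found by trial and error.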
This cluster is intended exclusively for blast, and will not support
on-demand queries.
At present I'm leaning toward a cluster of rackmounts, each with 4GB and
dual 2.4GHz Xeons.
Several people have contacted me to suggest alternative suppliers, and
I'm eager to hear more such responses.
I'm pleased to say all of those responses were made privately, not to
the general list.
I'll start with perhaps 4 machines, and profile performance against
truncated versions of the nr database, keeping an eye out for a serious
performance hit as the db size grows. Then I'll establish how many
additional machines might be necessary for the full nr collection and
its anticipated growth.
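
One possible way to produce the truncated test databases is to copy whole
FASTA records until a size budget is reached. This is only a sketch: the
function name and the budget are my own, and it counts characters, which
equals bytes only for plain ASCII FASTA. The truncated file would still
need to be formatted for BLAST afterwards.

```python
import io


def truncate_fasta(src, dst, max_chars):
    """Copy complete FASTA records from src to dst, stopping before the
    first record that would push the output past max_chars.
    Returns the number of records written."""
    written = 0
    kept = 0
    record = []

    for line in src:
        # A new header line closes the record collected so far.
        if line.startswith(">") and record:
            size = sum(len(part) for part in record)
            if written + size > max_chars:
                return kept
            dst.writelines(record)
            written += size
            kept += 1
            record = []
        record.append(line)

    # The final record has no following header to close it.
    if record:
        size = sum(len(part) for part in record)
        if written + size <= max_chars:
            dst.writelines(record)
            kept += 1
    return kept


# Tiny in-memory demonstration with made-up sequences.
demo_src = io.StringIO(">a\nAAAA\n>b\nCCCC\n>c\nGGGG\n")
demo_dst = io.StringIO()
print(truncate_fasta(demo_src, demo_dst, 20), "records kept")
```

With real files, src and dst would be file objects opened in text mode.
Running it with a series of growing budgets gives the set of database
sizes to profile against.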
I'm still not sure if there should be a master node, or a cluster of equals.
Since there will be a certain amount of reliance on profiling and
benchmarking, shared experiences with tools and techniques would be helpful.