Sandy,

First the nitty gritty:
--------------------------

* I recommend using the binary downloads at ftp://ftp.ncbi.nih.gov/blast/. If computation is ever the limiting factor on your system then switch over to custom binaries.

* Follow the README with regard to .ncbirc files and the location of substitution matrices.

* I usually install binaries and BLAST targets on an NFS shared directory. This saves me the trouble of updating binaries on all the nodes if anything changes. When access to the datasets becomes the performance limiting factor (if it ever does), I rig a system as described below.

* Pull down a couple of pre-formatted targets from NCBI (ftp://ftp.ncbi.nih.gov/blast/db) to demonstrate functionality. Then schedule a conversation with your users about what target sets they actually want.

* If response time on single queries is ever the limiting factor on your system, there are many parallel BLAST solutions available. If it becomes something that people are willing to spend money on, there are also some really impressive hardware accelerators out there. Don't worry about either of these unless you have a demonstrated need for them.

More detail:
-----------------

Installing and tuning BLAST is a very broad question with lots of history and strong opinions surrounding it. Here are some general thoughts:

BLAST is I/O bound on large target sets. The very best thing you can do to improve BLAST performance is to make sure that you have sufficient RAM on each compute node to hold the index files for your target sets. Second to that, fast local disk on the nodes is a big help. I've had great luck with software RAID across two internal disks.

Once the above are met, the next bottleneck will be getting the target set from shared storage out to the nodes. Most people who are building a serious BLAST farm set up some way to synchronize the commonly used targets out to the local disk on the nodes.
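As a rough illustration of what such a synchronization step could look like (a sketch under stated assumptions, not a description of any particular site's setup), here is a small Python function that copies new or changed target files from a shared directory to a node's local disk. The name `sync_targets` and the example paths are hypothetical, and the "size and modification time" check is just one simple way to decide whether a file is current.

```python
import os
import shutil
from pathlib import Path


def sync_targets(shared_dir, local_dir):
    """Copy new or changed files from shared_dir into local_dir.

    Returns the sorted list of file names that were copied.
    Both directory arguments are hypothetical placeholders.
    """
    shared = Path(shared_dir)
    local = Path(local_dir)
    local.mkdir(parents=True, exist_ok=True)

    copied = []
    for src in sorted(shared.iterdir()):
        if not src.is_file():
            continue
        dst = local / src.name
        src_stat = src.stat()
        if dst.exists():
            dst_stat = dst.stat()
            # Assumption: a file is current when its size and
            # whole-second modification time match the shared copy.
            if (dst_stat.st_size == src_stat.st_size
                    and int(dst_stat.st_mtime) == int(src_stat.st_mtime)):
                continue
        # Copy to a temporary name, then rename into place, so a
        # reader never opens a half-written file under the real name.
        tmp = dst.with_name(dst.name + ".tmp")
        shutil.copy2(src, tmp)  # copy2 preserves the modification time
        os.replace(tmp, dst)
        copied.append(src.name)
    return copied
```

It might be called on each node as `sync_targets("/shared/blast/db", "/scratch/blast/db")`, with both paths standing in for whatever your site uses. The rename protects each file individually; it does not make an update of a multi-file target set atomic as a whole.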
This raises the question of ensuring that you do not disrupt running jobs with a data update. For small installations, this is most simply handled with sociology rather than technology.

You're unlikely to achieve super-linear speedups by parallelizing BLAST across the nodes. Parallel BLAST solutions are great for improving response time on a single query, but most users (particularly those with command line access to a cluster) are not interested in just running a single query. The most common use case is the user with thousands of independent queries all to be run as a batch. The most effective way to get this sort of job done is one-job-per-cpu. As many folks have pointed out, this is "high-throughput" computing rather than "high-performance" per se.

BLAST targets need to be freshened and updated on a regular basis. This requires some sort of agreement with the users as to their expectations. If nobody plans to use the WGS dataset, that's around 56GB of disk space and network bandwidth that can be saved. Some datasets (NR, NT, etc.) are published every few months with daily updates in between. Others have different schedules. Figure out what your users need before building a system to try to support it.

Good luck!

-Chris Dwan