Hi Xiaowu:

mpiblast makes a great deal of sense with large numbers of input
sequences, or with huge databases (nt). There are startup costs to
moving the database, and typically you will get the best performance by
amortizing those costs over a large analysis (e.g. many sequences).

Without knowing more about your situation, it is possible that the
rate-limiting factor for your analysis is the speed of moving the
database fragments to the remote machines (this is still a serial
process even in mpiblast). If you are doing many sequence comparisons,
you will benefit, as the database fragment motion needs to occur only
once. If you are doing very few (under 100) sequence comparisons, then
the database fragment motion is liable to dominate your execution time.

If you simply need a faster parallel blast, you might look into
pre-fetching the database fragments to the remote nodes, in which case
you no longer have that startup cost (though I don't remember if
mpiblast works with a prefetched set of databases). As this effectively
defeats the mpiblast scheduler (which is one of the very nice features
of the code), it is not such a good way to use mpiblast, though it works
nicely with NCBI/WU blast.

If Aaron is around, hopefully he can give you a more accurate/sound
answer, and correct any mistakes I may have made in my suppositions.

Joe

Xiaowu Gai wrote:
> Hi Everyone:
>
> We have a 16-node Xserve cluster, with 2GB memory on each node and dual
> processors. I was able to install mpiBLAST on it, along with LAM/MPI.
> However, the performance that I saw with some test runs has not been that
> good and quite confusing. Here is what I did:
>
>
> 1.) I formatted the nt database:
>
> mpiformatdb -N 16 -i nt
>
> 2.) I ran the mpiblast on one, two, five, ten, twenty, and more sequences
> (about 500bp each) and with the command:
>
> time mpirun N mpiblast -p blastn -d nt -i single.fa -o blast_results.
>
> Here are the numbers:
>
> Single: 1m39.054s
> Two: 0m11.009s
> Five: 0m16.021s
> Ten: 0m46.591s
> twenty: 3m7.541s
> ..
>
>
> I am all confused. First of all, the performance is not that impressive.
> Secondly, the numbers are very confusing to me. Why is that a single
> sequence query takes so much more time than a two (BTW, I reran the query of
> a single sequence right after the query of two and got similar results)? And
> query of five takes only 5 seconds more than the query of two and so on..
>
> I am afraid that I have done something wrong and would really appreciate any
> thoughts.
>
> Thanks
>
> Xiaowu
>
> _______________________________________________
> Bioclusters maillist - Bioclusters at bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/bioclusters

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452
cell : +1 734 612 4615
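
P.S. A rough way to picture the amortization argument in the reply: treat a run as a one-time cost for moving the database fragments plus a per-query search cost. The Python sketch below is only a toy model under that assumption. The function names and the two cost values (90 s startup, 2 s per query) are hypothetical placeholders chosen for illustration; they are not measurements from mpiblast or from the timings quoted in this thread.

```python
def total_time(n_queries, startup_s, per_query_s):
    """Estimated wall-clock time: one-time fragment copy plus search time."""
    return startup_s + n_queries * per_query_s


def startup_fraction(n_queries, startup_s, per_query_s):
    """Fraction of the whole run spent moving database fragments."""
    return startup_s / total_time(n_queries, startup_s, per_query_s)


if __name__ == "__main__":
    # Both values are made-up assumptions, not measured numbers.
    STARTUP_S = 90.0    # hypothetical cost of copying fragments to the nodes
    PER_QUERY_S = 2.0   # hypothetical search cost per query sequence

    for n in (1, 10, 100, 1000):
        total = total_time(n, STARTUP_S, PER_QUERY_S)
        frac = startup_fraction(n, STARTUP_S, PER_QUERY_S)
        print(f"{n:5d} queries: {total:8.1f} s total, "
              f"{100 * frac:5.1f}% spent in startup")
```

With any positive costs plugged in, the startup share shrinks as the number of queries grows, which is the sense in which a small test run can be dominated by fragment motion while a large one is not. Real runs will not follow a straight line this cleanly, so treat it as intuition only.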