I looked into this a number of years ago for SGI GenomeCluster. A
colleague had noticed that he got much better load balance for parallel
ClustalW with a "sort" method (making the chunks more uniform) than by
leaving the data as found. I tried this with the query sequences and
found a small benefit. I did not try it with the database.

Some of the database entries are huge, and these huge entries pose a
problem for the alignment algorithms. If you could build an approximate
function for the time needed to calculate an alignment, you might be
able to get creative with the subdivision. Even then you would need to
make sure the scheduler was aware of the huge bubble.

The idea is that load balance gets thrown badly off when one or two
database fragments dominate the run time because of excessively long
strings. The shuffle should try to preserve something like the length
distribution of the entire database in each fragment. Even better would
be a simple code to scan through the database, make approximate
segments, and indicate how "close" each one is to the full database
distribution.

Joe

On Sun, 29 Jun 2003 23:45:43 -0400, Lucas Carey wrote
> Has anyone looked into why there is such a large speedup when
> shuffling the database? Does this hold for the query as well? Are
> you just randomizing the db sequence entries?
>
> -Lucas
>
> On Wed, May 14, 2003 at 11:50:31AM -0400, Joe Landman wrote:
> > On Wed, 2003-05-14 at 11:43, Jason D. Gans wrote:
> >
> > > Also, while not a factor when blasting against the nr database, shuffling the
> > > nt database yields a substantial speed increase in blast searches (I have obtained
> > > a 28% decrease in wall clock time for certain nucleotide queries).
> >
> > I noted in 1999 and 2000 while working on GenomeCluster that a query
> > sequence "sort" or shuffle sometimes helped. I didn't do that on the db
> > side due to the time costs of the operation. Maybe worth a re-look.

-- 
Joseph Landman, Ph.D
Scalable Informatics LLC,
email: landman@scalableinformatics.com
web  : http://scalableinformatics.com
phone: +1 734 612 4615
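
Addendum: a minimal sketch of the scan-and-segment idea from the reply
above. It is an illustration under stated assumptions, not the method
used in GenomeCluster or in BLAST. The cost model (time proportional to
sequence length times query length), the greedy heaviest-first
assignment, the power-of-two length bins, and all function names are
hypothetical choices made for this example. Input is just a list of
sequence lengths; reading them out of a real database is left out.

```python
import heapq
from collections import Counter


def estimated_cost(length, query_length=1):
    """Assumed cost model: time grows with subject length times query length."""
    return length * query_length


def partition_by_cost(lengths, n_chunks):
    """Greedy split: place each entry, most expensive first, on the lightest chunk.

    Returns a list of chunks, each a list of indices into `lengths`.
    """
    order = sorted(range(len(lengths)),
                   key=lambda i: estimated_cost(lengths[i]),
                   reverse=True)
    heap = [(0, c) for c in range(n_chunks)]  # (estimated load, chunk id)
    heapq.heapify(heap)
    chunks = [[] for _ in range(n_chunks)]
    for i in order:
        load, c = heapq.heappop(heap)
        chunks[c].append(i)
        heapq.heappush(heap, (load + estimated_cost(lengths[i]), c))
    return chunks


def length_histogram(lengths):
    """Fraction of entries falling in each power-of-two length bin."""
    counts = Counter(n.bit_length() for n in lengths)
    total = len(lengths)
    return {b: k / total for b, k in counts.items()}


def distribution_distance(chunk_lengths, all_lengths):
    """Total variation distance between histograms: 0 = identical, 1 = disjoint."""
    if not chunk_lengths:
        return 1.0  # an empty chunk shares nothing with the full database
    p = length_histogram(chunk_lengths)
    q = length_histogram(all_lengths)
    return 0.5 * sum(abs(p.get(b, 0.0) - q.get(b, 0.0)) for b in set(p) | set(q))


# Made-up example: many short entries plus two huge ones.
lengths = [100] * 50 + [200] * 30 + [50000, 40000]
for chunk in partition_by_cost(lengths, 4):
    sub = [lengths[i] for i in chunk]
    # entries in chunk, estimated load, closeness to the full distribution
    print(len(sub), sum(sub), round(distribution_distance(sub, lengths), 3))
```

With data like this, the two huge entries each end up alone on a chunk
and still dominate the estimated load, which is the "bubble" the
scheduler would need to know about. Balancing estimated cost and
matching the full length distribution pull in different directions
here, so the distance figure is best read as a diagnostic rather than
something to minimize blindly.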