[Bioclusters] mpiBLAST Performance

landman bioclusters@bioinformatics.org
Mon, 30 Jun 2003 00:12:00 -0500


I looked into this a number of years ago for SGI GenomeCluster.  A
colleague had noticed that he got much better load balance for parallel
ClustalW by using a "sort" method (making the chunks more uniform) than by
leaving the data as found.

I tried this with the query sequences and found a little benefit.  I did not
try it with the database.  Some of the database entries are huge, and these
huge entries pose a problem for the alignment algorithms.  If one could build
an approximate function for the time it takes to calculate an alignment, one
might be able to get creative with the subdivision.  Even then, the scheduler
would really need to be aware of the huge bubble those entries create.
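
A rough sketch of that idea in Python, under loud assumptions: the cost
function below (query length times subject length, as for a full
dynamic-programming alignment) is only a stand-in, since BLAST is heuristic
and its real cost will differ. The names estimated_cost and partition_by_cost
are made up for illustration and are not part of mpiBLAST or any BLAST
toolkit.

```python
import heapq

def estimated_cost(query_len, subject_len):
    # Assumption: cost grows like the dynamic-programming matrix size.
    # Replace with a measured model if one is available.
    return query_len * subject_len

def partition_by_cost(subject_lengths, n_fragments, query_len=1000):
    """Greedy subdivision: place the most expensive entries first,
    each onto the fragment with the smallest estimated load so far."""
    heap = [(0, frag) for frag in range(n_fragments)]
    heapq.heapify(heap)
    fragments = [[] for _ in range(n_fragments)]
    # Indices of database entries, longest first.
    order = sorted(range(len(subject_lengths)),
                   key=lambda i: subject_lengths[i], reverse=True)
    for idx in order:
        load, frag = heapq.heappop(heap)
        fragments[frag].append(idx)
        new_load = load + estimated_cost(query_len, subject_lengths[idx])
        heapq.heappush(heap, (new_load, frag))
    loads = [0] * n_fragments
    for load, frag in heap:
        loads[frag] = load
    return fragments, loads

if __name__ == "__main__":
    # One huge entry among small ones: the "bubble" case.
    lengths = [100, 200, 300, 400, 5000]
    frags, loads = partition_by_cost(lengths, n_fragments=2, query_len=1)
    print(frags, loads)
```

With one entry far larger than the rest, the fragment holding it still
dominates no matter how the others are placed, which is why the scheduler
has to know about it.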

The idea is that the load balance gets shot all out of whack when one or two
database fragments dominate the run time because they contain excessively long
sequences.  The shuffle should try to preserve, in each fragment, something
like the length distribution of the entire database.  Even better would be a
simple program that scans through the database, makes approximate segments,
and reports how "close" each one is to the full database distribution.
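
Something along these lines might do for the scan, as a minimal sketch. It
assumes the sequence lengths have already been read into a list, and it
measures "closeness" with the largest gap between the two empirical length
distributions (a Kolmogorov-Smirnov style statistic), which is my choice of
metric and not something prescribed above. The function names are
hypothetical.

```python
import bisect
import random

def ks_distance(sample, reference):
    """Largest gap between the empirical CDFs of two lists of lengths.
    0.0 means identical distributions; values near 1.0 mean very different."""
    s, r = sorted(sample), sorted(reference)
    worst = 0.0
    for x in set(s) | set(r):
        fs = bisect.bisect_right(s, x) / len(s)
        fr = bisect.bisect_right(r, x) / len(r)
        worst = max(worst, abs(fs - fr))
    return worst

def segment_report(lengths, n_segments, shuffle=False, seed=0):
    """Cut the database into contiguous segments (optionally after a
    shuffle) and report (total residues, distance to full database)."""
    order = list(lengths)
    if shuffle:
        random.Random(seed).shuffle(order)
    size = -(-len(order) // n_segments)  # ceiling division
    segments = [order[i:i + size] for i in range(0, len(order), size)]
    return [(sum(seg), ks_distance(seg, lengths)) for seg in segments]

if __name__ == "__main__":
    # Hypothetical database stored shortest-first, the worst case for
    # contiguous segments.
    lengths = sorted(random.Random(1).randint(50, 5000) for _ in range(1000))
    print("as found:", segment_report(lengths, 4))
    print("shuffled:", segment_report(lengths, 4, shuffle=True))
```

Comparing the two reports shows whether a shuffle brings each segment's
total size and length distribution closer to that of the whole database.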

Joe

On Sun, 29 Jun 2003 23:45:43 -0400, Lucas Carey wrote
> Has anyone looked into why there is such a large speedup when 
> shuffling the database? Does this hold for the query as well? Are 
> you just randomizing the db sequence entries? 
> 
> -Lucas
> 
> On Wed, May 14, 2003 at 11:50:31AM -0400, Joe Landman wrote:
> > On Wed, 2003-05-14 at 11:43, Jason D. Gans wrote:
> > 
> > > Also, while not a factor when blasting against the nr database,
> > > shuffling the nt database yields a substantial speed increase in blast
> > > searches (I have obtained a 28% decrease in wall clock time for certain
> > > nucleotide queries).
> > 
> > I noted in 1999 and 2000 while working on GenomeCluster that a query
> > sequence "sort" or shuffle sometimes helped.  I didn't do that on the db
> > side due to the time costs of the operation.  Maybe worth a re-look.
> > 
> > 
> > -- 
> > Joseph Landman, Ph.D
> > Scalable Informatics LLC,
> > email: landman@scalableinformatics.com
> > web  : http://scalableinformatics.com
> > phone: +1 734 612 4615
> > 


--
Joseph Landman, Ph.D
Scalable Informatics LLC,
email: landman@scalableinformatics.com
web  : http://scalableinformatics.com
phone: +1 734 612 4615