[Bioclusters] free parallel versions of BLAST

Aaron Darling bioclusters@bioinformatics.org
Thu, 26 Feb 2004 14:09:09 -0600 (CST)

Because parallel BLAST is such a common problem, numerous free/open-source
implementations exist.  Obviously mpiBLAST won't work on a unix cluster
without message passing, and it's unclear to me whether MPI and Condor will
play nicely with each other on Windows (has anybody had success with this?).
If there are other reasons mpiBLAST is unsuitable for you, I'd like to hear
about them.  The software is still being actively developed and we are open
to suggestions for features!

As part of writing a grant for the mpiBLAST project I did some research on
other free, open-source parallel BLAST options.  Here's a brief overview
of what I was able to find.  If I'm missing any significant projects or
I've got the details wrong, please correct me.  Also, this overview only
covers parallelizations that use database segmentation.  Query
segmentation is easy to implement, and so many programs use it
exclusively that I'd be hard pressed to list them all.
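
To illustrate why query segmentation is the easy case, here is a minimal
Python sketch that deals the records of a multi-FASTA query set into
independent chunks.  The function names and the round-robin policy are
illustrative assumptions, not taken from any of the tools listed below.

```python
from typing import Iterator, List, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, parts = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        elif header is not None:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def split_queries(text: str, n_chunks: int) -> List[List[Tuple[str, str]]]:
    """Deal query records round-robin into n_chunks independent work units."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    chunks: List[List[Tuple[str, str]]] = [[] for _ in range(n_chunks)]
    for i, record in enumerate(parse_fasta(text)):
        chunks[i % n_chunks].append(record)
    return chunks
```

Each chunk is searched against the whole database, so the e-values need no
adjustment and the per-chunk reports can simply be concatenated.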

- designed for NxN comparisons of sequence databases, i.e. every database entry gets BLAST searched against every other database entry
- stores results in ASN.1 format
- adjusts e-values using the database length only, providing approximately correct e-values
- uses MoBiDiCK for job startup on a cluster
- there is a paper describing it here:  http://www.biomedcentral.com/1471-2105/3/13/
- uses unmodified NCBI blastall
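
The database-length adjustment mentioned above can be sketched as follows.
A BLAST e-value is proportional to the size of the search space, so an
e-value computed against one fragment can be scaled by the ratio of the
full database length to the fragment length.  This ignores the
effective-length (edge) corrections, which is why the result is only
approximate.  The function names and the hit representation are
hypothetical, chosen for illustration only.

```python
from typing import List, Tuple

Hit = Tuple[str, float]  # (subject id, e-value); hypothetical representation


def rescale_evalue(e_fragment: float, fragment_len: int, full_db_len: int) -> float:
    """Scale a per-fragment e-value up to the full database length."""
    if fragment_len <= 0 or full_db_len < fragment_len:
        raise ValueError("need 0 < fragment_len <= full_db_len")
    return e_fragment * (full_db_len / fragment_len)


def merge_hits(per_fragment: List[Tuple[int, List[Hit]]], full_db_len: int) -> List[Hit]:
    """Rescale the hits from every fragment and return them sorted by e-value."""
    merged: List[Hit] = []
    for fragment_len, hits in per_fragment:
        for subject_id, evalue in hits:
            merged.append((subject_id, rescale_evalue(evalue, fragment_len, full_db_len)))
    merged.sort(key=lambda hit: hit[1])
    return merged
```

Note that the ranking across fragments can change after rescaling, which is
why the merge sorts on the adjusted values rather than the raw ones.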

- database segmentation
- part of the mollusc package
- written in perl, works under unix
- uses rsh/ssh for job startup (need password-free login to cluster machines)
- adjusts e-values using a linear-regression model that provides approximate e-value statistics
- supports text output formats only
- uses unmodified NCBI blastall
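
As a rough illustration of database segmentation itself, the sketch below
assigns sequences to fragments so that each fragment holds a similar number
of residues.  The greedy longest-first strategy and all names here are
assumptions made for the example; the tools in this list may partition
their databases differently.

```python
import heapq
from typing import Dict, List


def partition_database(seq_lengths: Dict[str, int], n_fragments: int) -> List[List[str]]:
    """Assign sequence ids to n_fragments, balancing total residues per fragment."""
    if n_fragments < 1:
        raise ValueError("n_fragments must be at least 1")
    # Min-heap of (residues assigned so far, fragment index).
    heap = [(0, i) for i in range(n_fragments)]
    heapq.heapify(heap)
    fragments: List[List[str]] = [[] for _ in range(n_fragments)]
    # Place the longest sequences first, each into the currently smallest fragment.
    ordered = sorted(seq_lengths.items(), key=lambda kv: (-kv[1], kv[0]))
    for seq_id, length in ordered:
        total, idx = heapq.heappop(heap)
        fragments[idx].append(seq_id)
        heapq.heappush(heap, (total + length, idx))
    return fragments
```

Balanced fragments matter because the slowest fragment search determines
when the results for a query can be merged.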

dBlast
- database segmentation
- free only for non-commercial use
- written in perl, works under unix
- requires manual database distribution
- requires OpenPBS for job management
- e-value adjustments are (purportedly) accurate.  dBlast uses both
  the effective db length and the effective query length to calculate
  e-values.  Their clever method for e-value adjustment inspired us to
  make some changes for the next mpiBLAST release to give accurate
  e-value statistics.
- supports text output formats only
- requires compiling a modified NCBI blastall
- see http://www.cmbi.kun.nl/software/dBlast/ for more info
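
For readers unfamiliar with effective lengths, the sketch below shows the
standard Karlin-Altschul form of the e-value with the edge correction
applied to both the query and the database.  It is not dBlast's code, which
is not reproduced here.  The length adjustment is taken as an input rather
than computed, the clamp to 1 is a simplification, and all parameter names
are hypothetical.

```python
import math


def evalue_with_effective_lengths(
    raw_score: float,
    query_len: int,
    db_len: int,
    db_num_seqs: int,
    length_adjustment: int,
    K: float,
    lam: float,
) -> float:
    """E = K * m' * n' * exp(-lambda * S) with edge-corrected lengths.

    m' = query_len - length_adjustment
    n' = db_len - db_num_seqs * length_adjustment
    """
    eff_query = max(query_len - length_adjustment, 1)  # simplified floor
    eff_db = max(db_len - db_num_seqs * length_adjustment, 1)  # simplified floor
    return K * eff_query * eff_db * math.exp(-lam * raw_score)
```

When searching a single fragment, passing the residue and sequence counts
of the whole database (not the fragment) gives an e-value that refers to
the full search space.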

parallelblast by David Mathog
- database segmentation
- written in perl/C, works under unix
- uses PVM, and optionally SGE
- does approximate e-value adjustment using the effective db length
- supports text and html output formats
- requires compiling a modified NCBI blastall
- http://bioinformatics.oupjournals.org/cgi/content/abstract/19/14/1865?ijkey=13CoOSo3fnITz&keytype=ref

mpiBLAST
- database segmentation
- written in c++, works under unix/windows
- requires MPI, optionally PBS, SGE, LSF, or Condor
- e-value adjustments are approximate, based on db length (but as previously mentioned, the next release will include accurate e-value statistics)
- supports all of the NCBI blastall output formats (text, html, XML, ASN.1)
- requires compiling the NCBI Toolkit
- includes code to interface a wwwblast server with mpiBLAST + PBS
- more info at http://mpiblast.lanl.gov

Also, since you're considering BLAST under Windows, you may want to check
into what the Cornell Theory Center is using for parallel BLAST on their
Windows cluster.  I don't know whether their software is publicly or
freely available, however.

None of the freely available options (that I am aware of) currently
implement combined query and database segmentation.
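
To make "combined" concrete: with both kinds of segmentation, every query
chunk is searched against every database fragment, so the work units form a
grid.  The Python sketch below only enumerates that grid; the names are
hypothetical and no scheduling or result merging is shown.

```python
from itertools import product
from typing import List, Tuple


def work_units(n_query_chunks: int, n_db_fragments: int) -> List[Tuple[int, int]]:
    """List every (query chunk index, database fragment index) pair."""
    if n_query_chunks < 1 or n_db_fragments < 1:
        raise ValueError("both counts must be at least 1")
    return list(product(range(n_query_chunks), range(n_db_fragments)))
```

Results from units that share a query chunk would still need their
e-values adjusted and merged across fragments.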

I've found the lack of a comprehensive resource for information
on parallel BLAST frustrating.  Hopefully this e-mail will prove to be a
useful resource for people considering parallel BLAST options.


On Thu, 26 Feb 2004, Micha Bayer wrote:

> Hi,
> does anyone know of a non-commercial, open source/free package that
> provides a parallelisation of BLAST (apart from mpiBLAST which is not
> suitable for us).
> I am interested in something that would split input files into single
> query sequences, partition the database and collate the results (ideally
> with an adjustment of the e-values etc).
> It looks like some of the commercial packages like Paracel do all of the
> above but I really need an open source version and before I get writing
> my own I want to make sure I have tried all the available options.
> I am looking to run a service both on a Windows XP based Condor pool and
> on a cluster that uses OpenPBS but has no message passing capabilities
> to speak of.
> cheers
> Micha
> --
> --------------------------------------------------
> Dr Micha M Bayer
> Grid Developer, Bridges Project
> National eScience Centre, Glasgow Hub
> 246c Kelvin Building
> University of Glasgow
> Glasgow G12 8QQ
> Scotland, UK
> Email: michab@dcs.gla.ac.uk
> Project home page: http://www.brc.dcs.gla.ac.uk/projects/bridges/
> Personal Homepage: http://www.brc.dcs.gla.ac.uk/~michab/
> Tel.: +44 (0)141 330 2958
> _______________________________________________
> Bioclusters maillist  -  Bioclusters@bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/bioclusters