[Bioclusters] distributed blasting of genomes and WASHU blast

Tim Harsch bioclusters@bioinformatics.org
Tue, 11 Feb 2003 19:50:35 -0800


Two questions here (the quick one first):
    1)    How do you tell WASHU blast to return more than 1000 hits when
using tblastx?

    2)    If I have two large genomes that need a lengthy blast, how can I
split that up?

Just considering an SMP machine for now, perhaps SGE later.  As we know
threading is not as effective as individual blasts.  In my case, with one
genome as the database and one as the query, WASHU blast is never using more
than one thread so no parallelism is achieved.  I'm thinking that I could
take my query sequence, split it into X parts, and blast one part per CPU,
but then what about hits that span the boundaries between parts?  If I
want to assume no beforehand knowledge of the genome, I'm thinking I
could process the results from the X parts and, for each pair of adjacent
parts, find the stop base of the last hit in the earlier part, call it A,
and the start base of the first hit in the later part, call it B.  I would
then create a subsequence from A to B from the original query sequence,
repeat for all boundaries between the parts, blast these new subsequences
against the database, and union the resulting hits with the hits from the
X parts.
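
To make the bookkeeping concrete, here is a rough Python sketch of the
split-and-boundary idea described above.  It is only an illustration under
my own assumptions: the function names are hypothetical, hits are taken to
be (start, stop) pairs in 1-based coordinates local to each part with
start <= stop, and the blast runs themselves are not shown.  If a part has
no hits, the sketch falls back to the outer end of that part.

```python
from typing import List, Tuple

Hit = Tuple[int, int]  # (start, stop), 1-based inclusive, local to one part


def split_query(seq: str, parts: int) -> List[Tuple[int, str]]:
    """Split seq into `parts` nearly equal pieces.

    Returns (offset, piece) pairs, where offset is the 0-based position
    of the piece's first base in the full query.
    """
    n = len(seq)
    bounds = [i * n // parts for i in range(parts + 1)]
    return [(bounds[i], seq[bounds[i]:bounds[i + 1]]) for i in range(parts)]


def boundary_regions(
    chunks: List[Tuple[int, str]],
    hits_per_chunk: List[List[Hit]],
) -> List[Tuple[int, int]]:
    """For each pair of adjacent parts, return (A, B) in full-query,
    1-based coordinates: A is the stop of the last hit in the earlier
    part, B is the start of the first hit in the later part.
    """
    regions = []
    for i in range(len(chunks) - 1):
        left_off, _ = chunks[i]
        right_off, right_seq = chunks[i + 1]
        # Assumption: with no hits, use the first base of the earlier part
        # and the last base of the later part as the region limits.
        a = left_off + max((stop for _, stop in hits_per_chunk[i]), default=1)
        b = right_off + min(
            (start for start, _ in hits_per_chunk[i + 1]),
            default=len(right_seq),
        )
        regions.append((a, b))
    return regions


def extract(seq: str, region: Tuple[int, int]) -> str:
    """Cut the A..B subsequence (1-based, inclusive) out of the full query."""
    a, b = region
    return seq[a - 1:b]


# Toy example with made-up hits (not real blast output).
query = "ACGT" * 5                      # 20 bases
chunks = split_query(query, 2)          # two parts of 10 bases each
hits = [[(2, 7)], [(4, 9)]]             # local coordinates per part
regions = boundary_regions(chunks, hits)
boundary_seqs = [extract(query, r) for r in regions]
```

Each boundary subsequence would then be blasted against the unchanged
database, and its hit coordinates shifted by A - 1 to map them back onto
the full query before taking the union with the per-part hits.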

If I'm correct, using this method my e-values would even be the same as if
I had done a simple one-on-one comparison, because my database never
changes.

Does this sound reasonable?  Even so, if there is an easier method then I
sure would like to hear it.

Ciao,

Tim Harsch
Computer Scientist
Lawrence Livermore National Laboratory