[Bioclusters] Parallel blast

Wim Glassee bioclusters@bioinformatics.org
Fri, 7 Jun 2002 14:14:07 +0200


> -----Original Message-----
> From: bioclusters-admin@bioinformatics.org
> [mailto:bioclusters-admin@bioinformatics.org] On Behalf Of chris dagdigian
> Sent: Friday, 7 June 2002 13:56
> To: bioclusters@bioinformatics.org
> Subject: Re: [Bioclusters] Parallel blast
> 
> 
> Hi Wim,
> 
> This will be a quickie response...
> 
> With newer versions of ncbi-blast there are 2 things that have made the
> process of splitting up the target databases so that your query can be
> multiplexed across multiple searches and machines far easier:
> 
> o The "-z" option switch (used to be undocumented I think?) allows you
> to override/tell the blastall binary the effective size of the database.
> If you feed the original (large) value to the blastall binary while
> searching against the small slice you will at least get back the correct
> scores and statistics.  This is a huge time and accuracy saver, as trying
> to parse and adjust these values after the fact is a giant error-prone
> exercise in pain.
> 

I've been messing around with blast for quite a while now. I'm using the
-z flag and the XML output to eventually get a result that is identical
to the normal blast output (without partitioning). What few people know
is that the -z parameter alone is not enough. The statistics of blast
are also based on the number of sequences in the database, and blastall
has no parameter for setting that number.
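
To make the dependence on the sequence count concrete, here is a rough
Python sketch of the textbook Karlin-Altschul E-value with effective
lengths. It is not the code blastall runs: the constants LAMBDA, K and H
are assumed illustrative values, the length adjustment uses the simple
ln(K*m*n)/H approximation, and the function names are made up for this
example.

```python
import math

# Assumed, illustrative Karlin-Altschul parameters. The real values depend
# on the scoring matrix and gap costs and are printed by blast in its output.
LAMBDA = 0.267
K = 0.041
H = 0.14


def length_adjustment(query_len, db_len):
    """Approximate expected HSP length: ln(K * m * n) / H."""
    return math.log(K * query_len * db_len) / H


def expected_value(raw_score, query_len, db_len, num_seqs):
    """E = K * m' * n' * exp(-LAMBDA * S), using effective lengths m' and n'."""
    adj = length_adjustment(query_len, db_len)
    eff_query = max(query_len - adj, 1.0)
    # The number of sequences enters here: each sequence loses `adj` residues.
    eff_db = max(db_len - num_seqs * adj, 1.0)
    return K * eff_query * eff_db * math.exp(-LAMBDA * raw_score)


# Same total database length, different numbers of sequences.
many_seqs = expected_value(120, 350, 50_000_000, 150_000)
few_seqs = expected_value(120, 350, 50_000_000, 15_000)
print(many_seqs, few_seqs)
```

Under these assumptions the two calls give different E-values for the
same score and the same total database length, which is why fixing the
length alone does not reproduce the unpartitioned statistics.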

Likewise, I understand that some people divide query sequences as well.
If you want your results to be the same, you have to let blast know just
how big your 'original' query was, so it can calculate its statistics
correctly.
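
As a small illustration of the bookkeeping this implies, the sketch below
splits a query into overlapping chunks and keeps the original length and
the offset with every chunk, so a later step could correct statistics and
map hit coordinates back. The function names and the chunk sizes are
hypothetical; nothing here is passed to blast itself.

```python
def split_query(sequence, chunk_len, overlap):
    """Yield (offset, original_length, chunk) tuples covering `sequence`."""
    if chunk_len <= overlap:
        raise ValueError("chunk_len must be larger than overlap")
    step = chunk_len - overlap
    total = len(sequence)
    offset = 0
    while True:
        yield offset, total, sequence[offset:offset + chunk_len]
        if offset + chunk_len >= total:
            break
        offset += step


def to_original(offset, chunk_pos):
    """Map a 1-based position inside a chunk to a 1-based position in the full query."""
    return offset + chunk_pos


chunks = list(split_query("ACGTACGTAC", chunk_len=4, overlap=1))
for offset, total, chunk in chunks:
    print(offset, total, chunk)
```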

I'm working on a solution that does both these things and merges the
output files, but I'm afraid it's not as easy as it sounds.

I'm just wondering if there are any other parallel blast solutions, so I
can spare myself the hassle of trying to do this myself.

> o XML output of results
> 
> Having the scores and statistics correct while getting the results back
> in a way that is far easier to parse than the human readable version is
> 95% of the battle.  Everything else is fairly simple.

The XML output (although it has a pretty strange format) helps, that's
for sure!
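
For what it's worth, pulling the scores out of the XML takes only a few
lines with Python's standard library. The sketch below runs on a tiny
hand-written fragment: the element names (Hit, Hit_id, Hsp, Hsp_evalue,
Hsp_bit-score) follow the NCBI BLAST XML output as far as I know, while
the identifiers and numbers are made up and the real files carry many
more fields.

```python
import xml.etree.ElementTree as ET

# Made-up, heavily trimmed fragment in the shape of BLAST XML output.
SAMPLE = """
<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit>
          <Hit_id>example|0001</Hit_id>
          <Hit_hsps>
            <Hsp>
              <Hsp_bit-score>231.5</Hsp_bit-score>
              <Hsp_evalue>3e-61</Hsp_evalue>
            </Hsp>
          </Hit_hsps>
        </Hit>
        <Hit>
          <Hit_id>example|0002</Hit_id>
          <Hit_hsps>
            <Hsp>
              <Hsp_bit-score>48.1</Hsp_bit-score>
              <Hsp_evalue>0.002</Hsp_evalue>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""


def parse_hits(xml_text):
    """Return a list of (hit_id, evalue, bit_score), one entry per HSP."""
    root = ET.fromstring(xml_text)
    rows = []
    for hit in root.iter("Hit"):
        hit_id = hit.findtext("Hit_id")
        for hsp in hit.iter("Hsp"):
            # Collect the child elements of this HSP by tag name.
            fields = {child.tag: child.text for child in hsp}
            rows.append((hit_id,
                         float(fields["Hsp_evalue"]),
                         float(fields["Hsp_bit-score"])))
    return rows


hits = parse_hits(SAMPLE)
for row in hits:
    print(row)
```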

Wim

> 
> -Chris
> 
> 
> Wim Glassee wrote:
> 
> ><snip>
> >
> >I've noticed some people cut their databases and query sequences to
> >smaller pieces, with or without overlap, and perform separate blasts.
> >But how do you put them back together again? And are the results the
> >same?
> >
> >Wim
> >
> >
> >
> >
> 
> 
> --
> Chris Dagdigian, <dag@sonsorol.org>
> Life Science IT & Research Computing Consultant
> Office: 617-666-6454, Mobile: 617-877-5498, Fax: 425-699-0193
> Work: http://bioteam.net PGP KeyID: 83D4310E  Yahoo IM: craffi