[Bioclusters] Parallel blast

Fri, 7 Jun 2002 15:01:38 +0200

Wim et al,

I'm just completing an alpha release of an MPI-enabled BLAST. For now it
handles protein sequences only, but not for long :)
It was designed primarily for one specific task, all vs. all comparison plus
clustering, and has been used that way for about a year. I'm now generalizing
the code a bit.

It:
o interfaces to an SQL database (MySQL) and can store results directly into
  it
o can write results to a text file, one line per HSP, with comma-separated
  fields
o can operate on FASTA files, formatting them on the fly (each is then
  stored locally on the worker node)
o can filter results by score and by HSP length coverage (HSP length as a
  percentage of the total subject or query length)
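
To make the coverage filter concrete, here is a minimal Python sketch. It is
not the code from the MPI tool; the Hsp record, its field names, and the
default thresholds are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hsp:
    # Hypothetical record; field names are illustrative, not the tool's schema.
    query_len: int    # full length of the query sequence
    subject_len: int  # full length of the subject sequence
    hsp_len: int      # length of the aligned segment (the HSP)
    score: float      # HSP score


def coverage(hsp, relative_to="query"):
    """Percentage of the query (or subject) length covered by the HSP."""
    total = hsp.query_len if relative_to == "query" else hsp.subject_len
    return 100.0 * hsp.hsp_len / total


def keep(hsp, min_coverage=50.0, min_score=100.0, relative_to="query"):
    """True if the HSP passes both the coverage and the score threshold."""
    return coverage(hsp, relative_to) >= min_coverage and hsp.score >= min_score


hsps = [
    Hsp(query_len=200, subject_len=400, hsp_len=150, score=250.0),  # passes
    Hsp(query_len=200, subject_len=400, hsp_len=40, score=300.0),   # too short
    Hsp(query_len=200, subject_len=400, hsp_len=180, score=60.0),   # low score
]
kept = [h for h in hsps if keep(h)]
print(len(kept), "of", len(hsps), "HSPs kept")
```

Whether coverage is taken relative to the query or the subject changes which
hits survive, so the sketch leaves that as a parameter.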

Here are the benchmarks, on single-processor 1.3 GHz Athlon machines. A subset
of SWISSPROT (2000 sequences) was run against the entire database (~86000
entries). The number of HSPs is the total number of results from running the
2000 sequences against SWISSPROT. There was no limit on the number of HSPs per
query (usually 200 with BLAST). The E-value cutoff was 10.

A) SWISSPROT 86593 sequences, SQL database single table
196963 HSPs (2000 queries) in
	5nodes:  7m21s -> 272 queries/min -> 4.5 queries/s
	10nodes: 3m51s -> 520 queries/min -> 8.6 queries/s
	15nodes: 2m39s -> 750 queries/min -> 12.5 queries/s

B) SWISSPROT 86593 sequences, 10 worker nodes, text output
196963 HSPs (2000 queries) in
	5nodes:  7m22s -> 271 queries/min -> 4.52 queries/s
	10nodes: 3m45s -> 533 queries/min -> 8.8 queries/s
	15nodes: 2m37s -> 765 queries/min -> 12.7 queries/s
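
The throughput figures can be recomputed from the wall-clock times. The
Python sketch below uses the timings from run A; the speedup and efficiency
columns are derived here for illustration and are measured relative to the
5-node run, since no single-node time is given.

```python
QUERIES = 2000

# Wall-clock times from run A (SQL output), in seconds, keyed by node count.
runs = {5: 7 * 60 + 21, 10: 3 * 60 + 51, 15: 2 * 60 + 39}


def throughput(nodes):
    """Return (queries per minute, queries per second) for a node count."""
    secs = runs[nodes]
    return QUERIES / secs * 60, QUERIES / secs


BASE = 5  # smallest run, used as the reference point
for nodes, secs in runs.items():
    per_min, per_sec = throughput(nodes)
    speedup = runs[BASE] / secs
    efficiency = speedup / (nodes / BASE)
    print(f"{nodes:2d} nodes: {per_min:6.1f} queries/min, "
          f"{per_sec:5.2f} queries/s, speedup {speedup:.2f}x, "
          f"efficiency {efficiency:.0%}")
```

On these numbers, doubling and tripling the node count gives a little under
2x and 3x the throughput, i.e. scaling is close to linear over this range.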

Anyone interested in beta (alpha, that is :) testing can drop me an email.

Kris

> -----Original Message-----
> From: bioclusters-admin@bioinformatics.org
> [mailto:bioclusters-admin@bioinformatics.org]On Behalf Of Wim Glassee
> Sent: Friday, June 07, 2002 14:14
> To: bioclusters@bioinformatics.org
> Subject: RE: [Bioclusters] Parallel blast
>
>
>
>
> > -----Original Message-----
> > From: bioclusters-admin@bioinformatics.org
> > [mailto:bioclusters-admin@bioinformatics.org] On Behalf Of chris dagdigian
> > Sent: Friday, June 07, 2002 13:56
> > To: bioclusters@bioinformatics.org
> > Subject: Re: [Bioclusters] Parallel blast
> >
> >
> > Hi Wim,
> >
> > This will be a quickie response...
> >
> > With newer versions of ncbi-blast there are 2 things that have made the
> > process of splitting up the target databases so that your query can be
> > multiplexed across multiple searches and machines far easier:
> >
> > o The "-z" option switch (used to be undocumented I think?) allows you
> > to override/tell the blastall binary the effective size of the database.
> > If you feed the original (large) value to the blastall binary while
> > searching against the small slice you will at least get back the correct
> > scores and statistics.  This is a huge time and accuracy saver as trying
> > to parse and adjust these values after the fact is a giant error-prone
> > exercise in pain.
> >
>
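
A minimal Python sketch of the "-z" approach described above: it builds one
blastall command line per database slice and passes the size of the full
database through -z, with XML output selected by -m 7. The slice name, file
names, and the database length are hypothetical placeholders, and the
command is only printed, not executed.

```python
import shlex


def blastall_command(slice_db, query_fasta, full_db_length, out_xml, evalue=10.0):
    """Build a blastall command that searches one slice of a split database."""
    return [
        "blastall",
        "-p", "blastp",             # protein vs. protein search
        "-d", slice_db,             # formatted database slice
        "-i", query_fasta,          # query sequences
        "-e", str(evalue),          # E-value cutoff
        "-z", str(full_db_length),  # size of the ORIGINAL, unsplit database
        "-m", "7",                  # XML output
        "-o", out_xml,
    ]


# Placeholder only, not a measured value: total residues in the full database.
FULL_DB_LENGTH = 30_000_000

cmd = blastall_command("swissprot.part03", "queries.fasta",
                       FULL_DB_LENGTH, "part03.xml")
print(shlex.join(cmd))
```

Every slice gets the same -z value, so the reported statistics refer to the
full database rather than to the slice being searched.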
> I've been messing around with blast for quite a while now. I'm using the
> -z flag and the XML output to eventually get a result that is identical
> to the normal blast output (without partitioning). What few people know
> is that the -z parameter alone is not enough. The statistics of blast
> are also based on the number of sequences in the database, and blastall
> has no parameter for setting that number.
>
> Likewise, I understand that some people divide query sequences as well.
> If you want your results to be the same, you have to let blast know just
> how big your 'original' query was, so it can calculate its statistics
> correctly.
>
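
Why the sizes matter can be seen from the textbook Karlin-Altschul relation
E = m * n * 2^(-S'), where S' is the bit score, m the query length and n the
database length. The Python sketch below is a simplification under stated
assumptions: it uses raw lengths, whereas BLAST uses effective lengths that
are reduced by a length adjustment applied per sequence, which is where the
number of sequences comes in. All numbers are hypothetical.

```python
import math


def expect_value(bit_score, query_len, db_len):
    """Simplified E-value: E = m * n * 2**(-S') for bit score S'."""
    return query_len * db_len * 2.0 ** (-bit_score)


def rescale_to_full(e_slice, slice_len, full_len):
    """Scale an E-value computed against a slice up to the full database."""
    return e_slice * (full_len / slice_len)


# Hypothetical sizes: a full database and a slice one tenth of its length.
FULL, SLICE, QUERY, BITS = 30_000_000, 3_000_000, 350, 45.0

e_full = expect_value(BITS, QUERY, FULL)
e_slice = expect_value(BITS, QUERY, SLICE)

print(f"E against slice: {e_slice:.3g}")
print(f"E against full:  {e_full:.3g}")
print(f"slice rescaled:  {rescale_to_full(e_slice, SLICE, FULL):.3g}")
```

In this simplified model the E-value is linear in both m and n, so splitting
either the database or the query changes it by the same factor as the split.
The effective-length correction is what keeps a plain rescaling from exactly
reproducing the unpartitioned output.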
> I'm working on a solution that does both these things and merges the
> output files, but I'm afraid it's not as easy as it sounds.
>
> I'm just wondering if there are any other parallel blast solutions, so I
> can spare myself the hassle of doing this on my own.
>
> > o XML output of results
> >
> > Having the scores and statistics correct while getting the results back
> > in a way that is far easier to parse than the human readable version is
> > 95% of the battle.  Everything else is fairly simple.
>
> The XML output (although it has a pretty strange format) helps, that's
> for sure!
>
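
A minimal Python sketch of the merge step, assuming each slice was searched
with XML output and that the element names follow the NCBI BLAST XML format
(Hit, Hit_def, Hsp, Hsp_bit-score, Hsp_evalue). The two inline documents are
cut-down, made-up stand-ins for real per-slice output files, and the sort
order is one possible choice, not the one blastall itself uses.

```python
import xml.etree.ElementTree as ET


def sample(hit_def, bits, evalue):
    """Build a tiny stand-in for one slice's XML output (illustration only)."""
    return (
        "<BlastOutput><BlastOutput_iterations><Iteration><Iteration_hits>"
        f"<Hit><Hit_def>{hit_def}</Hit_def><Hit_hsps><Hsp>"
        f"<Hsp_bit-score>{bits}</Hsp_bit-score>"
        f"<Hsp_evalue>{evalue}</Hsp_evalue>"
        "</Hsp></Hit_hsps></Hit></Iteration_hits></Iteration>"
        "</BlastOutput_iterations></BlastOutput>"
    )


def read_hsps(xml_text):
    """Return (evalue, bit_score, hit_def) tuples for every HSP in a document."""
    root = ET.fromstring(xml_text)
    rows = []
    for hit in root.iter("Hit"):
        name = hit.findtext("Hit_def")
        for hsp in hit.iter("Hsp"):
            rows.append((float(hsp.findtext("Hsp_evalue")),
                         float(hsp.findtext("Hsp_bit-score")),
                         name))
    return rows


def merge(*slices):
    """Pool the HSPs from all slices; best E-value first, then bit score."""
    rows = [row for text in slices for row in read_hsps(text)]
    return sorted(rows, key=lambda r: (r[0], -r[1]))


merged = merge(sample("hit from slice 1", 45.1, 2e-4),
               sample("hit from slice 2", 120.0, 1e-30))
for evalue, bits, name in merged:
    print(f"{name}: E={evalue:g}, bits={bits}")
```

This only makes sense if the per-slice statistics already refer to the full
database; otherwise E-values from different slices are not comparable.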
> Wim
>
> >
> > -Chris
> >
> >
> > Wim Glassee wrote:
> >
> > ><snip>
> > >
> > >I've noticed some people cut their databases and query sequences to
> > >smaller pieces, with or without overlap, and perform separate blasts.
> > >But how do you put them back together again? And are the results the
> > >same?
> > >
> > >Wim
> > >
> > >
> > >
> > >
> >
> >
> > --
> > Chris Dagdigian, <dag@sonsorol.org>
> > Life Science IT & Research Computing Consultant
> > Office: 617-666-6454, Mobile: 617-877-5498, Fax: 425-699-0193
> > Work: http://bioteam.net PGP KeyID: 83D4310E  Yahoo IM: craffi
> >
> >
> >
> > _______________________________________________
> > Bioclusters maillist  -  Bioclusters@bioinformatics.org
> > http://bioinformatics.org/mailman/listinfo/bioclusters
>