Wim et al,

I'm just completing an alpha release of MPIsed blast. The functionality
is at the moment limited to protein sequences, but not for long :)
It was primarily designed for a specific task, all vs. all comparison +
clustering, and has been in use like that for about a year; now I'm
generalizing the code a bit.

It:

o interfaces to an SQL database (MySQL) and can store results directly
  into it
o can output results to a text file, one line per HSP, with
  comma-separated fields
o can operate on fasta files, formatting them on the fly (each is then
  stored locally on the worknode)
o can filter results based on HSP length coverage (% of HSP len / total
  subject or query len) and score

Here are the benchmarks, on single-processor Athlon 1.3G machines.
A subset of swissprot (2000 sequences) was run against the entire
database (~86000 entries). The number of HSPs is the total number of
results from running 2000 seqs vs swissprot. There is no limit on the
number of HSPs per query run (usually 200 with blast). E cutoff was 10.

A) SWISSPROT 86593 sequences, SQL database single table
   196963 HSPs (2000 queries) in
    5 nodes: 7m21s -> 272 queries/min ->  4.5 queries/s
   10 nodes: 3m51s -> 520 queries/min ->  8.6 queries/s
   15 nodes: 2m39s -> 750 queries/min -> 12.5 queries/s

B) SWISSPROT 86593 sequences, 10 worker nodes, text output
   196963 HSPs (2000 queries) in
    5 nodes: 7m22s -> 271 queries/min ->  4.52 queries/s
   10 nodes: 3m45s -> 533 queries/min ->  8.8 queries/s
   15 nodes: 2m37s -> 765 queries/min -> 12.7 queries/s

Anyone interested in beta (alpha, that is :) testing can drop me an
email...

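The coverage and score filter described above can be sketched roughly as
follows. This is an illustration only, not the MPIsed blast code: the
Hsp record, its field names, the keep() helper and the threshold values
are all assumptions made for the example, and whether the HSP length
should count gap columns is left open.

```python
from dataclasses import dataclass


@dataclass
class Hsp:
    """One high-scoring segment pair (hypothetical field names)."""
    query_id: str
    subject_id: str
    hsp_len: int      # length of the aligned segment
    query_len: int    # full length of the query sequence
    subject_len: int  # full length of the subject sequence
    score: float


def coverage(hsp: Hsp, relative_to: str = "query") -> float:
    """Percent of the query (or subject) length covered by the HSP."""
    total = hsp.query_len if relative_to == "query" else hsp.subject_len
    return 100.0 * hsp.hsp_len / total


def keep(hsp: Hsp, min_coverage: float, min_score: float,
         relative_to: str = "query") -> bool:
    """True if the HSP passes both the coverage and the score threshold."""
    return (coverage(hsp, relative_to) >= min_coverage
            and hsp.score >= min_score)


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    hsp = Hsp("q1", "s1", hsp_len=120, query_len=200,
              subject_len=400, score=85.0)
    print(coverage(hsp), coverage(hsp, "subject"))
    print(keep(hsp, min_coverage=50.0, min_score=50.0))
```

Taking the coverage relative to the query or to the subject can give
very different answers for the same HSP, which is why the sketch makes
the choice an explicit argument.
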
Kris

> -----Original Message-----
> From: bioclusters-admin@bioinformatics.org
> [mailto:bioclusters-admin@bioinformatics.org]On Behalf Of Wim Glassee
> Sent: Friday, June 07, 2002 14:14
> To: bioclusters@bioinformatics.org
> Subject: RE: [Bioclusters] Parallel blast
>
> > -----Original Message-----
> > From: bioclusters-admin@bioinformatics.org
> > [mailto:bioclusters-admin@bioinformatics.org] On Behalf Of
> > chris dagdigian
> > Sent: Friday, 7 June 2002 13:56
> > To: bioclusters@bioinformatics.org
> > Subject: Re: [Bioclusters] Parallel blast
> >
> > Hi Wim,
> >
> > This will be a quickie response...
> >
> > With newer versions of ncbi-blast there are 2 things that have made
> > the process of splitting up the target databases, so that your query
> > can be multiplexed across multiple searches and machines, far easier:
> >
> > o The "-z" option switch (used to be undocumented I think?) allows
> > you to override/tell the blastall binary the effective size of the
> > database. If you feed the original (large) value to the blastall
> > binary while searching against the small slice you will at least get
> > back the correct scores and statistics. This is a huge time and
> > accuracy saver, as trying to parse and adjust these values after the
> > fact is a giant error-prone exercise in pain.
>
> I've been messing around with blast for quite a while now. I'm using
> the -z flag and the xml output to eventually get a result that is
> identical to the normal blast output (without partitioning). What few
> people know is that the -z parameter alone is not enough. The
> statistics of blast are also based on the number of sequences in the
> database. For that value, there is no blastall parameter.
>
> Likewise, I understand that some people divide query sequences as
> well. If you want your results to be the same, you have to let blast
> know just how big your 'original' query was, so it can calculate its
> statistics correctly.
>
> I'm working on a solution that does both these things and merges the
> output files, but I'm afraid it's not as easy as it sounds.
>
> I'm just wondering if there are any other parallel blast solutions,
> so I can spare myself the hassle of doing this myself.
>
> > o XML output of results
> >
> > Having the scores and statistics correct while getting the results
> > back in a way that is far easier to parse than the human readable
> > version is 95% of the battle. Everything else is fairly simple.
>
> The XML output (although it has a pretty strange format) helps, that's
> for sure!
>
> Wim
>
> > -Chris
> >
> > Wim Glassee wrote:
> >
> > ><snip>
> > >
> > >I've noticed some people cut their databases and query sequences to
> > >smaller pieces, with or without overlap, and perform separate blasts.
> > >But how do you put them back together again? And are the results the
> > >same?
> > >
> > >Wim
> >
> > --
> > Chris Dagdigian, <dag@sonsorol.org>
> > Life Science IT & Research Computing Consultant
> > Office: 617-666-6454, Mobile: 617-877-5498, Fax: 425-699-0193
> > Work: http://bioteam.net PGP KeyID: 83D4310E Yahoo IM: craffi
> >
> > _______________________________________________
> > Bioclusters maillist - Bioclusters@bioinformatics.org
> > http://bioinformatics.org/mailman/listinfo/bioclusters
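
The database-slicing approach discussed in the quoted thread (search
each slice separately, but pass the size of the full database via -z
and ask for XML output) could be driven by something like the sketch
below. It only builds the blastall command lines and does not run them.
The slice names, the query file name and the database length are
hypothetical placeholders; -p, -d, -i, -e, -z, -m 7 and -o are the
legacy blastall switches. As Wim points out, -z alone does not correct
for the number of sequences in the database, so this should not be
expected to reproduce an unpartitioned run exactly.

```python
from typing import List


def slice_commands(query: str, slices: List[str], full_db_length: int,
                   evalue: float = 10.0,
                   program: str = "blastp") -> List[List[str]]:
    """Build one blastall command line per database slice.

    full_db_length is the total number of residues in the unsplit
    database; it is passed to every slice search through -z so that
    E-values are computed against the full search space.
    """
    commands = []
    for db in slices:
        commands.append([
            "blastall",
            "-p", program,
            "-d", db,                    # formatted slice of the database
            "-i", query,                 # query fasta file
            "-e", str(evalue),           # E-value cutoff
            "-z", str(full_db_length),   # effective length of the full db
            "-m", "7",                   # XML output, easier to merge
            "-o", f"{db}.xml",
        ])
    return commands


if __name__ == "__main__":
    # Hypothetical slice names and database length, for illustration.
    for cmd in slice_commands("queries.fasta",
                              ["swissprot.00", "swissprot.01"],
                              full_db_length=31000000):
        print(" ".join(cmd))
```

Each command list can be handed to subprocess.run() on whichever node
holds that slice; merging the per-slice XML files into one ranked
result list is a separate step not shown here.
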