One solution which springs to mind (but isn't yet 'off the shelf') is to
customize cd-hit a bit. By default cd-hit clusters sequences, but it does
this using a 'words in common' heuristic to filter out sequences which are
likely to be below a certain identity threshold.

If you need heavy-duty calculation, and you are OK with C / C++, modifying
cd-hit would be the best bet (and the cd-hit project is trying to attract
developers!).

Cheers,
Dan.

On Wed, 3 Mar 2004, Joseph Bedell wrote:

> Hi Andrew,
>
> I'm cross-posting your question to the Sequence Search Mailing List
> (SSML). This should be a good place for a discussion of your problem.
>
> https://bioinformatics.org/mailman/listinfo/ssml-general
>
> Are you looking for only 5-7bp matches with no extension? How big is
> your oligo? One parameter that would need adjustment is E which should
> probably be set outrageously high (1e-10?). Can you share the seq3.fasta
> sequence? I could try blasting against refseq too or against some
> sequence that you know it should hit.
>
> Regards,
> Joey
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Joseph A Bedell, Ph.D.
> Director, Bioinformatics
> Orion Genomics, LLC
> 4041 Forest Park Ave.
> St. Louis, MO 63108
> Office:(314)615-6979; Fax:(314)615-6975
> Mobile:(314)518-1343
> http://www.oriongenomics.com
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> >-----Original Message-----
> >From: bioperl-l-bounces at portal.open-bio.org
> >[mailto:bioperl-l-bounces at portal.open-bio.org] On Behalf Of Andrew Walsh
> >Sent: Wednesday, March 03, 2004 1:52 PM
> >To: bioperl-l at portal.open-bio.org
> >Subject: [Bioperl-l] Small word sizes with BLAST (WU, NCBI)
> >
> >Hello,
> >
> >My question is not really related to a specific Bioperl library, so I
> >apologize. If there is a specific 'BLAST' newsgroup, I will be happy to
> >post there. But I was hoping somebody on the Bioperl list had some
> >experience doing nucleic acid searches with small word sizes.
> >
> >I would like to search for small (5-7) bp matches between an oligo
> >sequence and a ~100,000 mRNA database. I've tried doing this with
> >WU-BLAST and NCBI-BLAST. NCBI-BLAST does not allow word sizes below 7,
> >so I've tried lots of different command line parameters for WU-BLAST.
> >
> >I've tried these searches with versions 2.0a19 (alpha) and 2.0 of
> >WU-BLAST.
> >
> >I get quite strange results when I start lowering the word size below
> >the default (11). For example, with the alpha version, I get more hits
> >with a word size of 10 than I do with a word size of 7. With the beta
> >version, I get the same number of hits with word sizes 10 and 7. I've
> >checked this by hand, and the 'missing' hits do in fact have stretches
> >of 7 continuous bps matching.
> >
> >Here is an example of one of the command lines I've tried running:
> >blastn human_refseq.fasta seq3.fasta W=5 S=5 M=1 V=100000 B=100000
> >
> >I've tried adjusting every parameter I thought would affect the search
> >results, but still cannot recover the 'missing' hits.
> >
> >Maybe BLAST is the wrong tool for this. I'd just like something that's
> >fast. If anyone has some advice, it would be greatly appreciated.
> >
> >Thanks a lot,
> >
> >Andrew
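
For the exact-word search Andrew describes, one BLAST-free option is to
collect every k-length word of the oligo and scan each database sequence
for those words. What follows is a minimal Python sketch of that idea
only. It is not taken from cd-hit, WU-BLAST or NCBI-BLAST. The function
names (oligo_words, find_word_hits) are made up for illustration,
sequences are assumed to be plain strings already read from the FASTA
files, and only the forward strand is checked.

```python
def oligo_words(oligo, k):
    """Map each k-length word of the oligo to the offsets where it occurs."""
    words = {}
    for j in range(len(oligo) - k + 1):
        words.setdefault(oligo[j:j + k], []).append(j)
    return words


def find_word_hits(oligo, records, k=7):
    """Yield (record_id, record_offset, oligo_offset) for every exact k-mer match.

    `records` is any iterable of (record_id, sequence) pairs, so a large
    database can be streamed instead of held in memory. Forward strand
    only; reverse-complement the oligo and call again if both strands
    matter.
    """
    words = oligo_words(oligo.upper(), k)
    for rec_id, seq in records:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            # Dictionary lookup keeps the scan linear in the database size.
            for j in words.get(seq[i:i + k], ()):
                yield rec_id, i, j
```

Counting the hits per record gives a crude 'words in common' score of
the sort Dan mentions, though cd-hit's actual filter may well differ in
detail.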