[ssml] RE: [Bioperl-l] Small word sizes with BLAST (WU, NCBI)

Dan Bolser dmb at mrc-dunn.cam.ac.uk
Wed Mar 3 17:49:11 EST 2004


One solution which springs to mind (but isn't yet 'off the shelf') is to
customize cd-hit a bit. 

By default cd-hit clusters sequences, but it does this using a 'words in
common' heuristic to filter out sequence pairs which are likely to be
below a certain identity threshold.
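
As a rough illustration of the 'words in common' idea (this is not
cd-hit's actual code; the function names, the word length k and the
min_shared cutoff are all assumptions made for this sketch), counting
shared words between two sequences might look like this in Python:

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shared_words(a, b, k):
    """Number of k-mers the two sequences have in common (multiset overlap)."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    return sum((ca & cb).values())


def passes_filter(a, b, k, min_shared):
    """Cheap pre-filter: only pairs sharing enough words go on to alignment."""
    return shared_words(a, b, k) >= min_shared
```

Pairs that fail passes_filter() are skipped without any alignment, which
is where the speed comes from. How min_shared should relate to an
identity threshold is left open here.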

If you need heavy duty calculation, and you are OK with C / C++, modifying
cd-hit would be the best bet (and the cd-hit project is trying to attract
developers!).
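
Before touching any C / C++, the kind of search Andrew describes (exact
5-7 bp matches between an oligo and a set of mRNAs) could be prototyped
in a few lines. This is only a minimal Python sketch under my own
assumptions: the sequences are already loaded into a dict, only the
given strand is scanned, and the names oligo_words and find_word_hits
are made up for illustration.

```python
def oligo_words(oligo, k):
    """Map each k-mer of the oligo to the positions where it starts."""
    oligo = oligo.upper()
    words = {}
    for i in range(len(oligo) - k + 1):
        words.setdefault(oligo[i:i + k], []).append(i)
    return words


def find_word_hits(oligo, database, k):
    """Yield (seq_id, db_pos, oligo_pos) for every exact k-mer match.

    database is a dict of {seq_id: sequence}; positions are 0-based and
    only the given strand is scanned (no reverse complement).
    """
    words = oligo_words(oligo, k)
    for seq_id, seq in database.items():
        seq = seq.upper()
        for j in range(len(seq) - k + 1):
            for i in words.get(seq[j:j + k], ()):
                yield seq_id, j, i


if __name__ == "__main__":
    # Toy data, invented for this example.
    db = {"m1": "TTTACGTAGG", "m2": "CCCCCCCC"}
    for hit in find_word_hits("ACGTA", db, 5):
        print(hit)
```

Every exact word match is reported, so nothing is dropped by a score
or E-value cutoff, at the cost of scanning each database sequence once.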

Cheers,
Dan.

On Wed, 3 Mar 2004, Joseph Bedell wrote:

> Hi Andrew,
> 
> I'm cross-posting your question to the Sequence Search Mailing List
> (SSML). This should be a good place for a discussion of your problem.
> 
> https://bioinformatics.org/mailman/listinfo/ssml-general
> 
> Are you looking for only 5-7bp matches with no extension? How big is
> your oligo? One parameter that would need adjustment is E which should
> probably be set outrageously high (1e10?). Can you share the seq3.fasta
> sequence? I could try blasting against refseq too or against some
> sequence that you know it should hit.
> 
> Regards,
> Joey
> 
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Joseph A Bedell, Ph.D.
> Director, Bioinformatics
> Orion Genomics, LLC
> 4041 Forest Park Ave.
> St. Louis, MO 63108
> Office:(314)615-6979; Fax:(314)615-6975
> Mobile:(314)518-1343
> http://www.oriongenomics.com
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>  
> 
> >-----Original Message-----
> >From: bioperl-l-bounces at portal.open-bio.org
> >[mailto:bioperl-l-bounces at portal.open-bio.org] On Behalf Of Andrew Walsh
> >Sent: Wednesday, March 03, 2004 1:52 PM
> >To: bioperl-l at portal.open-bio.org
> >Subject: [Bioperl-l] Small word sizes with BLAST (WU, NCBI)
> >
> >Hello,
> >
> >My question is not really related to a specific Bioperl library, so I
> >apologize.  If there is a specific 'BLAST' newsgroup, I will be happy
> >to post there.  But I was hoping somebody on the Bioperl list had some
> >experience doing nucleic acid searches with small word sizes.
> >
> >I would like to search for small (5-7 bp) matches between an oligo
> >sequence and a database of ~100,000 mRNAs.  I've tried doing this with
> >WU-BLAST and NCBI-BLAST.  NCBI-BLAST does not allow word sizes below 7,
> >so I've tried lots of different command line parameters for WU-BLAST.
> >
> >I've tried these searches with versions 2.0a19 (alpha) and 2.0 of
> >WU-BLAST.
> >
> >I get quite strange results when I start lowering the word size below
> >the default (11).  For example, with the alpha version, I get more hits
> >with a word size of 10 than I do with a word size of 7.  With the beta
> >version, I get the same number of hits with word sizes 10 and 7.  I've
> >checked this by hand, and the 'missing' hits do in fact have stretches
> >of 7 continuous bps matching.
> >
> >Here is an example of one of the command lines I've tried running:
> >blastn human_refseq.fasta seq3.fasta W=5 S=5 M=1 V=100000 B=100000
> >
> >I've tried adjusting every parameter I thought would affect the search
> >results, but still cannot recover the 'missing' hits.
> >
> >Maybe BLAST is the wrong tool for this.  I'd just like something that's
> >fast.  If anyone has some advice, it would be greatly appreciated.
> >
> >Thanks a lot,
> >
> >Andrew