[BiO BB] BLAST problem: limiting # of HSPs
Dan Bolser
dmb at mrc-dunn.cam.ac.uk
Sat Mar 27 07:47:15 EST 2004
On Fri, 26 Mar 2004, Kerr Wall wrote:
> On 3/26/04 12:01 PM, "Dan Bolser <dmb at mrc-dunn.cam.ac.uk>" wrote:
>
> >> In the default blast output, there are summary statistics for the overall
> >> hit, is there an option for the tab-delimited BLAST output that would give
> >> us this overall hit statistic instead of one for each HSP?
> >
> >
> > I think you can simply sum the e-values for each non-overlapping HSP (I
> > think they shouldn't overlap). Does anybody know the correct formula?
>
> I can handle non-overlapping HSPs because I would only be parsing out the
> best e-value from each hit. I'm just trying to avoid it if at all possible.
> I'm running a tblastx of ~1,000,000 cDNAs against themselves to produce a
> similarity matrix. Therefore, I'm more worried about the size of the output
> files and making sure that I don't run out of similarities between more
> distantly related genes that might get left out of the output when the
> maximum number of hits is reached (for some of the larger gene families). I
> need to make sure the matrix is as symmetrical as possible.
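For what it's worth, picking the best e-value per hit out of the tabular output can be done in one streaming pass, so the full HSP list never has to sit in memory. A minimal sketch, assuming the usual 12-column `-m 8` layout (query in column 1, subject in column 2, e-value in column 11); the function name and the sample rows are made up for illustration:

```python
import io

def best_evalue_per_hit(lines):
    """Return {(query, subject): lowest e-value} from -m 8 tabular rows."""
    best = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue = float(fields[10])  # column 11 holds the e-value
        key = (query, subject)
        # Keep only the best (lowest) e-value among the HSPs of this hit.
        if key not in best or evalue < best[key]:
            best[key] = evalue
    return best

# Invented example rows: two HSPs for A/B, one HSP for A/C.
sample = io.StringIO(
    "cdnaA\tcdnaB\t85.0\t120\t18\t0\t1\t360\t4\t363\t1e-40\t160\n"
    "cdnaA\tcdnaB\t60.0\t50\t20\t0\t400\t549\t410\t559\t2e-05\t45\n"
    "cdnaA\tcdnaC\t70.0\t80\t24\t0\t1\t240\t1\t240\t3e-12\t70\n"
)
best = best_evalue_per_hit(sample)
```

This only shrinks what you keep after the run; it does nothing about the size of the raw BLAST output itself.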
Have you seen
http://www.ebi.ac.uk/research/cgg/tribe/
and
http://micans.org/mcl/
?
They provide tools to make a symmetrical all-vs-all similarity matrix (I
think it is an interface to blastall).
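If you end up building the matrix yourself, one simple way to force symmetry is to collapse the two directions of each pair into a single score. A rough sketch, assuming you already have one best e-value per (query, subject) pair; taking the lower of the two values is my own choice here, not something I know those tools to do, and the names are hypothetical:

```python
def symmetrize(best):
    """Collapse (a, b) and (b, a) into one unordered pair.

    The pair is scored by the lower of the two e-values. A pair seen in
    only one direction (for example because the other direction fell off
    the end of a hit list) keeps the single value that was observed.
    """
    sym = {}
    for (a, b), evalue in best.items():
        if a == b:
            continue  # self hits are dropped here; keep them if you need them
        key = (a, b) if a < b else (b, a)
        if key not in sym or evalue < sym[key]:
            sym[key] = evalue
    return sym

# Invented example scores.
pairs = {
    ("cdnaA", "cdnaB"): 1e-40,
    ("cdnaB", "cdnaA"): 1e-38,
    ("cdnaA", "cdnaC"): 3e-12,  # reverse direction missing
    ("cdnaA", "cdnaA"): 0.0,    # self hit
}
sym = symmetrize(pairs)
```

Counting how many pairs show up in one direction only would also tell you how much the hit limits are costing you.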
> >> If not, is there an option to limit the number of HSPs returned in the
> >> tab-delimited output?
> >
> > I am sure there is a way to do this, but I can't find any mention of this
> > option in the
> >
> > ncbi/doc/blast.txt
>
> Yes, I know. They don't even discuss all of the options in that file. You
> would think that the documentation for blast would be complete considering
> how long it has been around.
:)
Have you tried the man pages?
ncbi/doc/man/
> > Hmm.... Not sure if these have anything to do with it...
> >
> > -K N (blastall, blastcl3, blastpgp)
> > Number of best hits from a region to keep (off by default, if
> > used a value of 100 is recommended)
> >
> > -P N (blastall, blastpgp, rpsblast)
> > Set to 1 for single-hit mode or 0 for multiple-hit
> > mode (default)
> >
> > -b N (blastall, blastcl3, blastpgp, impala, megablast, rpsblast, seedtop)
> > Number of database sequences to show alignments for (B) (default
> > is 250)
>
> Thanks. Those are the parameters I've been working with so far. I did find
> a paragraph in the documentation that might be on this same track.
> Specifically #4 in the section "Notes for 2.0.6 release":
>
>
> ############################################################################
> Notes for 2.0.6 release:
>
> Enhancements:
>
> ...
>
> 4.) BLAST has been changed to reduce the number of redundant hits that a
> user may see. This is achieved by keeping track of the number of hits
> completely contained in a certain region and eliminating those lower scoring
> hits that are redundant with others. This behavior may be controlled with
> the -K and -L options:
>
> -K Number of best hits from a region to keep [Integer]
> default = 50
> -L Length of region used to judge hits [Integer]
> default = 20
>
> Setting -K to zero turns off this feature. This is the default only on
> blastall.
> ############################################################################
Cheers.
> Of course, when you get a list of all the options 'blastall -', the L option
> is labeled as '-L Location on query sequence [String] Optional'. Not sure
> what to make of that? I wonder if they have changed parameter names from
> 2.0.6 to 2.2.8?
Typical problem!
blast.1
-L start,stop (blastall, blastcl3, megablast, rpsblast)
Location on query sequence (for rpsblast, only valid in blastp mode)
blastclust.1
-L X Length coverage threshold (default = 0.9)
?
> It looks as if setting K = 1 and using L > 100 (or much larger) would help
> me reduce the amount of output. I think also using P = 1 as you stated
> above would probably help out the most.
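To make that concrete, the run might be assembled along these lines. This is a sketch only: -K, -P and -b are as described in the man page excerpt above, -p, -i, -d, -o, -e and -m 8 (tabular output) are the usual blastall options, and the file names, the e-value cutoff and the -b value are placeholders I made up. I have left -L out because its meaning is unclear in this version. Whether K = 1 really trims the HSP list the way you want is something to check on a small test set first.

```python
import shlex

def blastall_command(query, db, out, evalue=1e-5, max_subjects=5000,
                     best_per_region=1, single_hit=True):
    """Build (but do not run) a blastall argument list for tblastx."""
    return [
        "blastall",
        "-p", "tblastx",            # program
        "-i", query,                # query FASTA file
        "-d", db,                   # formatted database name
        "-o", out,                  # output file
        "-m", "8",                  # tabular output
        "-e", str(evalue),          # expectation cutoff (placeholder value)
        "-b", str(max_subjects),    # database sequences to show (placeholder)
        "-K", str(best_per_region), # best hits from a region to keep
        "-P", "1" if single_hit else "0",  # 1 = single-hit, 0 = multiple-hit
    ]

# Hypothetical file and database names.
cmd = blastall_command("cdna.fasta", "cdna_db", "cdna_vs_cdna.tab")
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Keeping the command as a list means it can be handed to `subprocess.run(cmd, check=True)` without any shell quoting trouble once you are happy with the options.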
>
> > If you get an answer from blast-help at ncbi.nlm.nih.gov can you please post
> > it up? (these emails get archived).
>
> I will. I sent them an email yesterday afternoon so I won't be expecting
> anything back until sometime next week. I usually have solved the problem
> by the time they get back to me.
They are very busy, I guess.
Best of luck!
Dan.
>
> Thanks for the help,
>
> Kerr
>
>
> > Cheers,
> > Dan.
> >
> >>
> >> Thanks,
> >>
> >> Kerr Wall
> >>
> >> _______________________________________________
> >> BiO_Bulletin_Board maillist - BiO_Bulletin_Board at bioinformatics.org
> >> https://bioinformatics.org/mailman/listinfo/bio_bulletin_board
> >>