[BiO BB] BLAST problem: limiting # of HSPs

Dan Bolser dmb at mrc-dunn.cam.ac.uk
Sat Mar 27 07:47:15 EST 2004


On Fri, 26 Mar 2004, Kerr Wall wrote:

> On 3/26/04 12:01 PM, "Dan Bolser <dmb at mrc-dunn.cam.ac.uk>" wrote:
> 
> >> In the default BLAST output, there are summary statistics for the overall
> >> hit. Is there an option for the tab-delimited BLAST output that would give
> >> us this overall hit statistic instead of one for each HSP?
> > 
> > 
> > I think you can simply sum the e-values for each non-overlapping HSP (I
> > think they shouldn't overlap). Does anybody know the correct formula?
> 
> I can handle non-overlapping HSPs because I would only be parsing out the
> best e-value from each hit.  I'm just trying to avoid it if at all possible.
> I'm running a tblastx of ~ 1,000,000 cDNAs against themselves to produce a
> similarity matrix.  Therefore, I'm more worried about the size of the output
> files and making sure that I don't run out of similarities between more
> distantly related genes that might get left out of the output when the
> maximum number of hits is reached (for some of the larger gene families).  I
> need to make sure the matrix is as symmetrical as possible.
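
For the parsing step described above, a minimal sketch in Python, assuming
the 12-column tabular output (-m 8, or -m 9 with '#' comment lines) where
column 11 is the e-value and column 12 the bit score. The function name is
made up for illustration. It keeps only the best HSP for each
query/subject pair, which is a post-processing substitute for an overall
hit statistic, not the same thing.

```python
import csv


def best_hsp_per_hit(lines):
    """Return {(query, subject): (evalue, bitscore)} for the best HSP of each hit.

    `lines` is any iterable of tab-delimited BLAST rows (an open file works).
    """
    best = {}
    for row in csv.reader(lines, delimiter="\t"):
        # Skip blank lines and the '#' comment lines written by -m 9.
        if not row or row[0].startswith("#"):
            continue
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        key = (query, subject)
        # Lower e-value wins; ties are broken by the higher bit score.
        if key not in best or (evalue, -bitscore) < (best[key][0], -best[key][1]):
            best[key] = (evalue, bitscore)
    return best
```

Passing an open file handle streams the rows one at a time, so the whole
output never has to sit in memory; only the per-pair dictionary does.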


Have you seen 

http://www.ebi.ac.uk/research/cgg/tribe/

and

http://micans.org/mcl/

?

They provide tools to make a symmetrical all-vs-all similarity matrix (I
think it is an interface to blastall).
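
Independent of those tools (I have not checked how they do it), one simple
way to force symmetry on a set of pairwise e-values is sketched below in
Python. Taking the smaller of the two directional e-values is an
assumption made for illustration; other conventions (e.g. averaging
scores) are possible. The function name is hypothetical.

```python
def symmetrize(evalues):
    """Collapse directional e-values into one value per unordered pair.

    `evalues` maps (query, subject) to an e-value. The result maps each
    pair, stored with the smaller ID first, to the lower of the e-values
    seen for (a, b) and (b, a).
    """
    sym = {}
    for (a, b), evalue in evalues.items():
        key = (a, b) if a <= b else (b, a)
        if key not in sym or evalue < sym[key]:
            sym[key] = evalue
    return sym
```

A pair that was reported in only one direction (for example because the
other direction fell off the end of the hit list) still gets an entry.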


> >> If not, is there an option to limit the number of HSPs returned in the
> >> tab-delimited output?
> > 
> > I am sure there is a way to do this, but I can't find any mention of this
> > option in the 
> > 
> > ncbi/doc/blast.txt
> 
> Yes, I know.  They don't even discuss all of the options in that file.  You
> would think that the documentation for blast would be complete considering
> how long it has been around.

:)

Have you tried the man pages?

ncbi/doc/man/


> > Hmm.... Not sure if these have anything to do with it...
> > 
> > -K N (blastall, blastcl3, blastpgp)
> >      Number  of  best  hits from a region to keep (off by default, if
> >      used a value of 100 is recommended)
> > 
> > -P N (blastall, blastpgp, rpsblast)
> >      Set to  1  for  single-hit  mode  or  0  for  multiple-hit
> >      mode (default)
> > 
> > -b N (blastall, blastcl3, blastpgp, impala, megablast, rpsblast,
> >     seedtop)
> >      Number of database sequences to show alignments for (B) (default
> >      is 250)
> 
> Thanks.  Those are the parameters I've been working with so far.  I did find
> a paragraph in the documentation that might be on this same track.
> Specifically #4 in the section "Notes for 2.0.6 release":
> 
> 
> ############################################################################
> Notes for 2.0.6 release:
> 
> Enhancements:
> 
> ...
> 
> 4.) BLAST has been changed to reduce the number of redundant hits that a
> user may see.  This is achieved by keeping track of the number of hits
> completely contained in a certain region and eliminating those lower scoring
> hits that are redundant with others.  This behavior may be controlled with
> the -K and -L options:
> 
>   -K  Number of best hits from a region to keep [Integer]
>     default = 50
>   -L  Length of region used to judge hits [Integer]
>     default = 20
> 
> Setting -K to zero turns off this feature.  This is the default only on
> blastall.
> ############################################################################
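
To make the release note concrete, here is one possible reading of that
filter, sketched in Python. It is an illustration only, not NCBI's
implementation: the function name and the tuple layout are made up, and
the -L region length is ignored entirely. HSPs are visited from highest
to lowest score, and one is dropped when at least `keep` already-kept,
higher-scoring HSPs completely contain its query interval.

```python
def filter_contained(hsps, keep=50):
    """Drop low-scoring HSPs that are redundant with higher-scoring ones.

    `hsps` is a list of (q_start, q_end, score) tuples. Returns the kept
    HSPs in descending score order.
    """
    kept = []
    for start, end, score in sorted(hsps, key=lambda h: h[2], reverse=True):
        # Count kept HSPs whose query interval fully contains this one.
        containing = sum(1 for s, e, _ in kept if s <= start and end <= e)
        if containing < keep:
            kept.append((start, end, score))
    return kept
```

With keep=1, any HSP that lies entirely inside a better one is discarded,
which is roughly the effect being hoped for in this thread.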


Cheers.


> Of course, when you get a list of all the options 'blastall -', the L option
> is labeled as '-L  Location on query sequence [String]  Optional'.  Not sure
> what to make of that?  I wonder if they have changed parameter names from
> 2.0.6 to 2.2.8?

Typical problem!

blast.1

-L start,stop (blastall, blastcl3, megablast, rpsblast)
   Location on query sequence (for rpsblast, only valid  in blastp mode)

blastclust.1

-L X   Length coverage threshold (default = 0.9)

?
 

> It looks as if setting K = 1 and using L > 100 (or much larger) would help
> me reduce the amount of output.  I think also using P = 1 as you stated
> above would probably help out the most.
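
A sketch of how such a run could be assembled from Python, assuming the
legacy blastall flags -p, -i, -d, -o and -m 8 (tabular output) alongside
the -K and -P options discussed here. The file and database names are
placeholders, and whether -K 1 / -P 1 really cut the output down is
untested. -L is left out on purpose, since in 2.2.8 it appears to mean
"location on query sequence" rather than a region length.

```python
def blastall_command(query, database, out, best_per_region=1, single_hit=True):
    """Build an argument list for a tblastx run with tabular output.

    `best_per_region` is passed as -K (0 leaves the filter off);
    `single_hit` adds -P 1 for single-hit mode.
    """
    cmd = ["blastall", "-p", "tblastx",
           "-i", query, "-d", database, "-o", out,
           "-m", "8"]
    if best_per_region:
        cmd += ["-K", str(best_per_region)]
    if single_hit:
        cmd += ["-P", "1"]
    return cmd
```

The returned list can be handed to subprocess.run(cmd, check=True), which
avoids any shell quoting problems with the file names.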
> 
> > If you get an answer from blast-help at ncbi.nlm.nih.gov can you please post
> > it up? (these emails get archived).
> 
> I will.  I sent them an email yesterday afternoon so I won't be expecting
> anything back until sometime next week.  I usually have solved the problem
> by the time they get back to me.


They are very busy, I guess.

Best of luck!

Dan.


> 
> Thanks for the help,
> 
> Kerr
> 
> 
> > Cheers,
> > Dan.
> > 
> >> 
> >> Thanks,
> >> 
> >> Kerr Wall
> >> 
> >> _______________________________________________
> >> BiO_Bulletin_Board maillist  -  BiO_Bulletin_Board at bioinformatics.org
> >> https://bioinformatics.org/mailman/listinfo/bio_bulletin_board
> >> 



