[Bioclusters] Re: Top hits, not so top

Fri Jan 6 15:15:57 EST 2006

On Thu, 2006-01-05 at 18:16 +1100, Andrew.Mather at dpi.vic.gov.au wrote:
> 
> Hi All, 
> 
> One of my users has encountered some odd behaviour when trying to
> blast a 100MB query sequence against a human genomic sequence
> database. 
> 
> His message is below, but basically, he's finding he gets different
> results depending on how many alignments and one-line descriptors he
> asks to see.  The input sequence, database, e-values etc remain
> constant, it's only the -v and -b options that change.
> 
> We're using Blastall v2.2.11 (newer one is in testing) on Intel
> machines running RHEL3. 
> 
> Can anyone point me in an appropriate direction for things to look at
> please ? 
> 
> Thanks, 
> Andrew
> 
> Bioinformatics Advanced Scientific Computing, 
> Animal Genetics and Genomics, PIRVic Attwood
> 475 Mickleham Road, Attwood, 3049
> ph +61 3 92174342
> mob  0413 009 761

    Andrew,

        I can't be sure without looking at the alignments, but I think I
        can explain what's going on. 

        This is a VERY simplified description of what NCBI BLAST does. 
        Please don't take it literally.  We can discuss the intricate 
        details in private e-mail if you wish.


        NCBI BLAST first finds a lot of ungapped hits between the query
        and target sequences and keeps the best-scoring set of (mumble) 
        hits. (I don't remember the number.)

        Then, for each ungapped hit, it performs a gapped alignment to 
        get the final hit shown in the output. (Sort of. It's 
        complicated as I said.)

        Often, the best-scoring ungapped hit does not produce the best-
        scoring gapped hit.

        If you keep just the top 10 ungapped hits and compute the gapped
        alignments for them, you get 10 hits with scores and e-values.

        Sometimes, ungapped hit #11 will be the one that results in the
        best gapped alignment, but since you only kept the best 10 
        ungapped hits, you don't see hit #11 in the result set. 

        Sometimes, ungapped hit #1105 will be the one that results in
        the best gapped alignment.

        The easiest way to avoid this is to ask for lots of hits, but
        that obviously costs compute cycles.
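
        The effect described above can be sketched with a toy model.
        This is only an illustration under made-up assumptions: the
        hit names, the scores and the function top_gapped() are all
        invented, and nothing here is NCBI BLAST's actual code.

```python
# Toy model: (name, ungapped_score, gapped_score). All values are invented.
hits = [
    ("A", 95, 120),
    ("B", 90, 110),
    ("C", 85, 100),
    ("D", 80, 200),  # modest ungapped score, but the best gapped score
    ("E", 75, 90),
]


def top_gapped(hits, keep):
    """Keep the `keep` best hits by ungapped score, then rank by gapped score."""
    survivors = sorted(hits, key=lambda h: h[1], reverse=True)[:keep]
    ranked = sorted(survivors, key=lambda h: h[2], reverse=True)
    return [name for name, _, _ in ranked]


print(top_gapped(hits, 3))  # ['A', 'B', 'C']  -> "D" never gets a gapped alignment
print(top_gapped(hits, 5))  # ['D', 'A', 'B', 'C', 'E']  -> "D" is now the top hit
```

        Keeping more hits changes which hit comes out on top, even
        though the query, database and scoring are identical.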

        There are other things we do at TimeLogic to get around this
        problem, but they are specific to our hardware accelerated 
        search systems. I would be happy to discuss them in private 
        e-mail if you are interested so that this list can remain on the
        subject of bioclusters.

        If you run a test asking for the top 100 ungapped hits, and then
        run a test asking for the top 10 gapped hits, I think you will 
        see just what's going on here.
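
        One way to compare two such runs is sketched below. It
        assumes both runs were saved as 12-column tab-separated
        output (the layout blastall writes with -m 8, where column 2
        is the subject ID and column 12 is the bit score). The
        function names and the sample lines are made up for
        illustration.

```python
def read_best_scores(lines):
    """Map subject ID -> best bit score from 12-column tabular BLAST lines."""
    best = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        subject, bits = fields[1], float(fields[11])
        best[subject] = max(bits, best.get(subject, bits))
    return best


def missing_from_small(small_lines, large_lines):
    """Subjects reported only in the larger run, best bit score first."""
    small = read_best_scores(small_lines)
    large = read_best_scores(large_lines)
    only_large = [(s, b) for s, b in large.items() if s not in small]
    return sorted(only_large, key=lambda sb: sb[1], reverse=True)


# Invented sample lines standing in for the two output files.
small_run = ["q1\tchr1\t99.0\t100\t1\t0\t1\t100\t5\t104\t1e-50\t190"]
large_run = small_run + ["q1\tchr7\t97.0\t300\t5\t4\t1\t300\t9\t308\t1e-120\t420"]

print(missing_from_small(small_run, large_run))  # [('chr7', 420.0)]
```

        If a subject that appears only in the larger run outscores
        everything in the smaller run, that is the situation
        described above.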

        Please let us know your results one way or the other.

                                -Alan

-- 
- Alan Kilian <kilian(at)timelogic.com>
Director of Bioinformatics, TimeLogic Corporation 763-449-7622