[Bioclusters] Re: Help on BLAST

Chris Dwan (CCGB) bioclusters@bioinformatics.org
Wed, 28 Aug 2002 11:06:45 -0500 (CDT)


Great answer, Joe.  

I have observed similarly "weird" behavior in trying to build up a
comprehensive picture of alignments from supposedly common sequence
fragments at varying lengths: 

* BAC end reads (100 - 700bp)
* Expressed sequence tags / contigs (100 - 5,000bp)
* Known genes (1,000 - 15,000bp)
* Entire BAC clones (100,000bp)
* chromosomes (30,000,000bp)

Depending on how I run the searches, I get quite different results.
Simply BLASTing each of the smaller sequence types against the
assembled chromosomes would seem the logical thing to do.  In
reality, you run into Wim's observation that the results are different
from those you get by building up the hits in pairs:  (small, medium) ->
(medium, large) -> (large, huge).
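
To make the "hits in pairs" idea concrete, here is a minimal sketch of
how a (small, medium) hit and a (medium, large) hit could be composed
into an implied (small, large) placement.  It assumes ungapped,
forward-strand hits with 0-based half-open coordinates; the Hit type
and the project() helper are hypothetical names for illustration, not
part of BLAST or any library.

```python
from typing import NamedTuple, Optional


class Hit(NamedTuple):
    """Hypothetical ungapped, forward-strand hit.

    query[q_start:q_end] aligns to the subject starting at s_start.
    """
    q_start: int
    q_end: int
    s_start: int


def project(inner: Hit, outer: Hit) -> Optional[Hit]:
    """Compose small->medium (inner) with medium->large (outer).

    Returns the implied small->large hit, or None when the two hits
    cover disjoint stretches of the medium sequence.
    """
    # Footprint of the small sequence on the medium sequence.
    m_start = inner.s_start
    m_end = inner.s_start + (inner.q_end - inner.q_start)

    # Keep only the part of that footprint that also aligns to large.
    lo = max(m_start, outer.q_start)
    hi = min(m_end, outer.q_end)
    if lo >= hi:
        return None

    # Map the shared stretch back to small and forward to large.
    return Hit(
        q_start=inner.q_start + (lo - m_start),
        q_end=inner.q_start + (hi - m_start),
        s_start=outer.s_start + (lo - outer.q_start),
    )


# Example: a 500bp EST placed on a BAC, and that BAC placed on a chromosome.
est_on_bac = Hit(q_start=0, q_end=500, s_start=20_000)
bac_on_chr = Hit(q_start=0, q_end=100_000, s_start=5_000_000)
est_on_chr = project(est_on_bac, bac_on_chr)
```

Comparing a projected placement like est_on_chr with what a direct
small-vs-huge search reports is one way to see where the two
approaches disagree.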

My guess is that standard BLAST (and most of the other
identity-anchoring heuristic search algorithms out there) starts to be
susceptible to what Joe might describe as "statistical noise" when the
target sequence is chromosome sized.  A smaller scale example of this
is the "shadowing" effect that was well documented when Smith and
Waterman first came out with their algorithm.  A longer, suboptimal
match will sometimes "shadow" shorter optimal ones.

> The important question is whether or not the differences are real and
> biologically relevant, versus being artifacts of the model used for the
> alignment.  There is no guide that I am aware of for this, more of an
> intuition.

Exactly.

> E-values on individual hits vary more than an order of magnitude when
> I update BLAST reports...

> > Very interesting. Could the first case you state have anything to do
> > with the fact that the actual database is probably a lot bigger now than
> > when you first did your blast,

That's exactly what it is.
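
As a rough illustration of the database-size effect, the standard
Karlin-Altschul relation gives E = m * n * 2^(-S'), where S' is the bit
score, m the query length and n the total database length.  The sketch
below uses raw lengths and made-up numbers; real BLAST reports apply
effective-length corrections, so treat this as an approximation, and
the function name is only illustrative.

```python
def expect_value(bit_score: float, query_len: float, db_len: float) -> float:
    """Approximate E-value from a bit score: E = m * n * 2**(-S').

    Uses raw (not effective) lengths, so it only approximates what a
    real BLAST report would show.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Same hit, same bit score, but the database has grown tenfold
# between the original search and the update (illustrative sizes).
old_e = expect_value(bit_score=50.0, query_len=500, db_len=1e9)
new_e = expect_value(bit_score=50.0, query_len=500, db_len=1e10)
ratio = new_e / old_e
```

Under this relation the E-value scales linearly with database length,
so a tenfold larger database shifts the E-value of an otherwise
identical hit by an order of magnitude.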

-Chris