[Bioclusters] Re: Top hits, not so top

Botka, Christopher Christopher.Botka at joslin.harvard.edu
Tue Jan 10 12:39:44 EST 2006


Looks like this is due, as Alan wrote, to the way blast BLAST sorts the
database hits during the three blastn extension phases (ungapped,
gapped, and gapped with traceback).  Any given alignment may improve
during during each extension step, so if you throw one out because of a
-b limit you lose it.  (blast seems to look at -b at each
extension/evaluation step, though I have not found this documented
anyplace) with a larger -b you keep more alignments for each extension
phase. So, an ungapped alignment that does not make the -b1 cutoff,
might have been improved in the gapped or gapped w/traceback extension
phases thus giving a higher score.  Gapped extensions are only triggered
if the ungapped thresholds are satisfied (Thanks to Wayne Matten @ NCBI
for explaining this to me).
 
The question I have is why use BLAST here at all?  It looks like the
user expects to align a relatively long strech of the query to the DB
sequence (mapping), so something like BLAT would work better here.  I
know there are high memory requirements with BLAT, but I would imagine
this search with BLAST will use up quite a bit of memory, and also
BLASTN with lots of HSPs is computationally quite expensive (computation
is quadratic in the number of HSPs).  If you wanted to use BLAST, it
seems like you could do a few things to improve performance (rather than
lower -b/-v) like use megablast or splitting up the query or DB files or
just increasing the word size. BLAST (formatdb) does have that cool
database alias function you can use to virtually split DBs (but fasta
sequences have to have GI numbers for it to work).
 
Cheers,
 
Chris
 
 

  _____  

	From:
bioclusters-bounces+christopher.botka=joslin.harvard.edu at bioinformatics.
org
[mailto:bioclusters-bounces+christopher.botka=joslin.harvard.edu at bioinfo
rmatics.org] On Behalf Of Andrew.Mather at dpi.vic.gov.au
	Sent: Thursday, January 05, 2006 2:16 AM
	To: Bioclusters at bioinformatics.org
	Subject: [Bioclusters] Re: Top hits, not so top
	
	

	Hi All, 
	
	One of my users has encountered some odd behaviour when trying
to blast a 100MB query sequence against a human genomic sequence
database. 
	
	His message is below, but basically, he's finding he gets
different results depending on how many alignments andone-line
descriptors he asks to see.  The input sequence, database, e-values etc
remain constant, it's only the -v and -b options that change.
	
	We're using Blastall v2.2.11 (newer one is in testing) on Intel
machines running RHEL3. 
	
	Can anyone point me in an appropriate direction for things to
look at please ? 
	
	Thanks, 
	Andrew
	
	Bioinformatics Advanced Scientific Computing, 
	Animal Genetics and Genomics, PIRVic Attwood
	475 Mickleham Road, Attwood, 3049
	ph +61 3 92174342
	mob  0413 009 761
	
	
	----------------
	There are 10 kinds of people...those who understand binary and
those who don't.
	
	----- Forwarded by Andrew Mather/NRE on 05/01/06 06:08 PM ----- 
	
	
	The following problem was discovered with BLAST jobs where you
ask for 
	different numbers of hits to be presented. 
	
	A 100 MB segement of bovine Chromosome fasta was 
	BLASTED onto Human with different -v and -b options to print the
"Top" 
	hits. 
	
	The following were all run on -intel ques 
	
	Consider this where we ask for the top hit only 
	
	/usr/local/bin/blastall -p blastn -i test.out -d
"/blastdb/human_genomic" -e 1e-25 -v 1 -b 1 -a4 -o one.b 
	lastn 
	
	This is what I get 
	
Score    E 
	                            Sequences producing significant
alignments:                      (bits) Value 
	
	ref|NC_000001.8|NC_000001 Homo sapiens chromosome 1, complete
se...   224   8e-54 
	
	Now if I ask for 5 top hits ..... 
	
	
	/usr/local/bin/blastall -p blastn -i test.out -d
"/blastdb/human_genomic" -e 1e-25 -v 5 -b 5 -a4 -o five.blastn 
	
	This is what I get 
	
	
Score    E 
	                             Sequences producing significant
alignments:                      (bits) Value 
	
	ref|NC_000004.9|NC_000004 Homo sapiens chromosome 4, complete
se...   281   4e-71 
	ref|NC_000001.8|NC_000001 Homo sapiens chromosome 1, complete
se...   224   8e-54 
	ref|NC_000013.9|NC_000013 Homo sapiens chromosome 13, complete
s...   222   3e-53 
	ref|NC_000005.8|NC_000005 Homo sapiens chromosome 5, complete
se...   214   8e-51 
	ref|NC_000009.9|NC_000009 Homo sapiens chromosome 9, complete
se...   206   2e-48 
	
	Note, there is now a hit on Chromosome 4 that is above the top
hit on 
	Chromosome 1 that was seen previously.  The hit on chromosome 1
is the 
	same as was seen previously though. 
	
	It gets worse ...  what if we ask for the top 7 hits? 
	
	/usr/local/bin/blastall -p blastn -i test.out -d 
	"/blastdb/human_genomic" -e 1e-25 -v 7 -b 7 -a4 -o seven.blastn 
	
	This is what we get 
	
	
Score    E 
	Sequences producing significant alignments:
(bits) Value 
	
	ref|NC_000004.9|NC_000004 Homo sapiens chromosome 4, complete
se...   281   4e-71 
	ref|NC_000011.8|NC_000011 Homo sapiens chromosome 11, complete
s...   230   1e-55 
	ref|NC_000001.8|NC_000001 Homo sapiens chromosome 1, complete
se...   224   8e-54 
	ref|NC_000013.9|NC_000013 Homo sapiens chromosome 13, complete
s...   222   3e-53 
	ref|NC_000005.8|NC_000005 Homo sapiens chromosome 5, complete
se...   214   8e-51 
	ref|NC_000009.9|NC_000009 Homo sapiens chromosome 9, complete
se...   206   2e-48 
	ref|NC_000023.8|NC_000023 Homo sapiens chromosome X, complete
se...   198   5e-46 
	
	
	Now we have a hit on Chromosome 11 that pops up between the top
hit on 
	Chromosome 4 and our original hit on Chromosome 1. 
	
	Does it get any worse? I tried the top 10 hits ... 
	
	/usr/local/bin/blastall -p blastn -i test.out -d
"/blastdb/human_genomic" -e 1e-25 -v 10 -b 10 -a4 -o ten 
	.blastn 
	
	Here are the results 
	
	
Score    E 
	Sequences producing significant alignments:
(bits) Value 
	
	ref|NC_000004.9|NC_000004 Homo sapiens chromosome 4, complete
se...   281   4e-71 
	ref|NC_000011.8|NC_000011 Homo sapiens chromosome 11, complete
s...   230   1e-55 
	ref|NC_000001.8|NC_000001 Homo sapiens chromosome 1, complete
se...   224   8e-54 
	ref|NC_000013.9|NC_000013 Homo sapiens chromosome 13, complete
s...   222   3e-53 
	ref|NC_000005.8|NC_000005 Homo sapiens chromosome 5, complete
se...   214   8e-51 
	ref|NC_000009.9|NC_000009 Homo sapiens chromosome 9, complete
se...   206   2e-48 
	ref|NC_000008.9|NC_000008 Homo sapiens chromosome 8, complete
se...   202   3e-47 
	ref|NC_000023.8|NC_000023 Homo sapiens chromosome X, complete
se...   198   5e-46 
	ref|NC_000002.9|NC_000002 Homo sapiens chromosome 2, complete
se...   198   5e-46 
	ref|NC_000014.7|NC_000014 Homo sapiens chromosome 14, complete
s...   194   7e-45 
	
	
	No more hits jumping in near the top... but we do have a hit
that pops 
	in between 9 and X (8). 
	
	So what is going on?  The same BLAST job on the same machine ...
just 
	changing the -v and -b option to pick the number of hits to
display. 
	
	

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://bioinformatics.org/pipermail/bioclusters/attachments/20060110/9f741cc0/attachment.html


More information about the Bioclusters mailing list