[Bioclusters] x86 port of Apple-Genentech blast

Thu, 26 Jun 2003 18:25:42 +0100

There is nothing changed in this port except:-
i) comment out the Altivec prefetch code
ii) remove Apple version of Altivec'd BlastNtFindWords and restore the 
original 2001/12/20 version

I have tested it on RedHat 9.  You may have to fiddle with the make system 
on whatever platform you use to convince it not to mess around with MOTIF.

It is still based on the 2001/12/20 version of NCBI blast.  It should 
relatively trivial to move the changes into the latest version.

Get it at: http://gadfly.gen.cam.ac.uk/~davidh/ncbi-ag-x86.tgz

I no longer have access to the G4 I used for verification to check that the 
version I have left here gives identical outputs to the G4 version - it 
should.  The results will not be identical to the NCBI 2001/12/20 version 
as the order in which hits are found is different in the Apple code.  I 
would appreciate that someone checks that it still does.  And that someone 
attempts to get performance numbers across a range of platforms.

The reason for improved performance is simple: the NCBI version redundantly 
expands hits it has already logged earlier other than at a default wordsize 
of 11.

NCBI Blastn uses an initial comparison to determine if it should attempt to 
extend the match to the required wordsize.  A match here will cause it to 
attempt to look up and extend against all positions in its hash table to 
reach the target wordsize.  Success on that triggers a final extension 
attempt to get as far as it can.  The extension proceeds on either a 
byte-vs-byte manner and terminates with a base-vs-base manner.  base-base 
extensions are made 3-symbols to left and right.

The problem is at low wordlength (< 11), NCBI Blastn will use a 4-base 
initial comparison (1 byte).  Evidently this will generate a humongous 
number of almost certainly unsuccessful extension attempts against the hash 
table positions.

At wordlength>10, Blastn uses an 8-base initial comparison.  For 
wordlength=11, this will result in that being followed by 3-base extensions 
to the left and right.  Should the range achieved on summing the left and 
right extensions reach the amount required to get 11, it will then do the 
full extension attempt.  After doing this for all matching entries in its 
hash table, it will advance the search position by 4 bases (1 byte).

For wordlengths > 11, the initial match is still done with 8-bases but the 
subsequent extension would be performed by adding additional bytes (4-base) 
searches in the forward direction. The final 3-base extension to left and 
right is still performed.  The advance is still by a single byte.

If I have analysed their code correctly, the flaw lies in the advance not 
being changed appropriately with large wordsizes.  It is easily 
demonstrated that with increasing wordsizes, the probability of redundantly 
exploring hits that already have been explored from the previous advance 
cycle increases sharply.  I think this is why its performance gets trashed 
so badly by AG Blast in large wordsizes.  The logic of optimal stepsize is 
simple although the need to quantise to byte/word/etc boundaries 
complicates it slightly.  Nevertheless, the strategy is optimal at 
wordsize=11 (which makes me wonder which was chosen first, the 
implementation or the optimal wordsize ;-) ).

The Apple code is capable of tuning its initial matchsize and stepsize to 
match the wordsize requirements.  This has benefits at wordsizes lower and 
higher than 11.  At lower wordsizes, it means you can avoid using the 
initial matchsize of 4.  At higher wordsizes, you can now step thru at a 
larger stepsize.

So this is why I think the AG code works better and on all platforms.  I am 
disappointed that they didn't just say so.  As Ian Korf correctly pointed 
out, wordsizes < 11 are useful.

Regards,
David Huen