[Bioclusters] x86 port of Apple-Genentech blast
David Huen
bioclusters@bioinformatics.org
Thu, 26 Jun 2003 18:25:42 +0100
There is nothing changed in this port except:-
i) comment out the Altivec prefetch code
ii) remove Apple version of Altivec'd BlastNtFindWords and restore the
original 2001/12/20 version
I have tested it on RedHat 9. You may have to fiddle with the make system
on whatever platform you use to convince it not to mess around with MOTIF.
It is still based on the 2001/12/20 version of NCBI blast. It should
relatively trivial to move the changes into the latest version.
Get it at: http://gadfly.gen.cam.ac.uk/~davidh/ncbi-ag-x86.tgz
I no longer have access to the G4 I used for verification to check that the
version I have left here gives identical outputs to the G4 version - it
should. The results will not be identical to the NCBI 2001/12/20 version
as the order in which hits are found is different in the Apple code. I
would appreciate that someone checks that it still does. And that someone
attempts to get performance numbers across a range of platforms.
The reason for improved performance is simple: the NCBI version redundantly
expands hits it has already logged earlier other than at a default wordsize
of 11.
NCBI Blastn uses an initial comparison to determine if it should attempt to
extend the match to the required wordsize. A match here will cause it to
attempt to look up and extend against all positions in its hash table to
reach the target wordsize. Success on that triggers a final extension
attempt to get as far as it can. The extension proceeds on either a
byte-vs-byte manner and terminates with a base-vs-base manner. base-base
extensions are made 3-symbols to left and right.
The problem is at low wordlength (< 11), NCBI Blastn will use a 4-base
initial comparison (1 byte). Evidently this will generate a humongous
number of almost certainly unsuccessful extension attempts against the hash
table positions.
At wordlength>10, Blastn uses an 8-base initial comparison. For
wordlength=11, this will result in that being followed by 3-base extensions
to the left and right. Should the range achieved on summing the left and
right extensions reach the amount required to get 11, it will then do the
full extension attempt. After doing this for all matching entries in its
hash table, it will advance the search position by 4 bases (1 byte).
For wordlengths > 11, the initial match is still done with 8-bases but the
subsequent extension would be performed by adding additional bytes (4-base)
searches in the forward direction. The final 3-base extension to left and
right is still performed. The advance is still by a single byte.
If I have analysed their code correctly, the flaw lies in the advance not
being changed appropriately with large wordsizes. It is easily
demonstrated that with increasing wordsizes, the probability of redundantly
exploring hits that already have been explored from the previous advance
cycle increases sharply. I think this is why its performance gets trashed
so badly by AG Blast in large wordsizes. The logic of optimal stepsize is
simple although the need to quantise to byte/word/etc boundaries
complicates it slightly. Nevertheless, the strategy is optimal at
wordsize=11 (which makes me wonder which was chosen first, the
implementation or the optimal wordsize ;-) ).
The Apple code is capable of tuning its initial matchsize and stepsize to
match the wordsize requirements. This has benefits at wordsizes lower and
higher than 11. At lower wordsizes, it means you can avoid using the
initial matchsize of 4. At higher wordsizes, you can now step thru at a
larger stepsize.
So this is why I think the AG code works better and on all platforms. I am
disappointed that they didn't just say so. As Ian Korf correctly pointed
out, wordsizes < 11 are useful.
Regards,
David Huen