[BiO BB] Papers/techniques on early culling of Smith-Waterman operation?

Thu Jun 15 19:01:29 EDT 2006

Hi people,

Do you know of any technique to end a Smith-Waterman operation early,  
by saying that if the operation cannot find a match of a certain  
desired minimum strength, then we won't do anymore? We'll end the  
operation early? We would at every row, check the "Best score we can  
possibly get based upon the current best score". If the "Best score  
we could possibly get" is below our desired score, we end the SW  
operation.

Or do you know of any papers on this subject? Or even can you think  
of good keywords for me to search with? Perhaps the way I'm  
describing it isn't the best way.

So, let's say you have a standard SW operation. You have a square  
Matrix, and you run from left to right across the matrix, line by  
line down the matrix. Let's say, the matrix has 90 rows, as we are  
trying to match a protein sequence of 90 letters.

Now, is there anyway, some way, by logical analysis or guesswork, to  
tell perhaps by the time we get to row 70, we know that we'll not be  
able to get any better score matches than the one we got, and thus  
cull the SW operation? So that we don't search through rows 71 to 90?  
The numbers 70 or 90 don't matter, as long as we can check at every  
row, whether the "best match score" gets better or not.

I have something like this for a global alignment producer, and it  
slices the time down to a small fraction!! Even though it only kills  
the last third or so, of the alignment, it still cuts down the time  
required down to less than 66%, perhaps even down to 20%.

Basically, I've spent a few months, as a hobby, putting my inventive  
capacity into making a SmithWaterman-based, BLAST-like searcher. I do  
things like this because I can, although I think this is probably the  
most complex thing I'll ever work on in my entire life.

My BLAST-like searcher, stores all of it's data in a tree (a "trie"  
actually). Protein sequences that are shared across many proteins,  
are stored in the root or earlier branches of the tree. So any work  
done on the early parts of the tree is reused across many many  
proteins. This is good, it saves CPU-time. Most of the work then, is  
done processing the leaves of the tree.

This is why the last few rows of the SW Matrix are so important,  
because most of the work is done there to process them! This is why I  
want to shave the last few rows off, when possible. But of course,  
the idea is to process the entire leaf, if the "best score match" is  
going to get better. For that reason some kind of quick hueristic  
predictive trick would be desired, so that we don't cull good leaves.

Any ideas anyone? This would mean so much to me if I can find a  
paper, or if anyone even knows what I am talking about and can give  
me some help.

If anyone doesn't know what I am talking about but feels like helping  
should you understand me, please let me know that I need to explain  
this in a different better way?

I've spent many months on this algorithm. I have a nearly identical  
algorithm that already works to do global alignmnents, it just  
doesn't do local alignments, and the global aligner is pretty quick  
actually. It's just the CPU-saving tricks I've done with global  
alignment, don't seem to cross over to local alignment, and I'm  
scratching my brain wondering if these past few months were a waste  
or not.

Because of this "loss of tricks" (that worked with global alignment)  
situation, I am asking for new tricks, here!

Thankfully, I was smart enough to do this work part-time, as a  
hobby :) I also have a proper normal job that doesn't rely on the  
success of proposed inventions to make me money. I decided it wasn't  
a good idea to starve myself just for an idea that might never  
produce anything worthwhile.

Help anyone! Thanks a lot :)

--
http://elfdata.com/plugin/