[BiO BB] Papers/techniques on early culling of Smith-Waterman operation?
Theodore H. Smith
delete at elfdata.com
Thu Jun 15 19:01:29 EDT 2006
Hi people,
Do you know of any technique to end a Smith-Waterman operation early,
by saying that if the operation cannot find a match of a certain
desired minimum strength, then we won't do anymore? We'll end the
operation early? We would at every row, check the "Best score we can
possibly get based upon the current best score". If the "Best score
we could possibly get" is below our desired score, we end the SW
operation.
Or do you know of any papers on this subject? Or even can you think
of good keywords for me to search with? Perhaps the way I'm
describing it isn't the best way.
So, let's say you have a standard SW operation. You have a square
Matrix, and you run from left to right across the matrix, line by
line down the matrix. Let's say, the matrix has 90 rows, as we are
trying to match a protein sequence of 90 letters.
Now, is there anyway, some way, by logical analysis or guesswork, to
tell perhaps by the time we get to row 70, we know that we'll not be
able to get any better score matches than the one we got, and thus
cull the SW operation? So that we don't search through rows 71 to 90?
The numbers 70 or 90 don't matter, as long as we can check at every
row, whether the "best match score" gets better or not.
I have something like this for a global alignment producer, and it
slices the time down to a small fraction!! Even though it only kills
the last third or so, of the alignment, it still cuts down the time
required down to less than 66%, perhaps even down to 20%.
Basically, I've spent a few months, as a hobby, putting my inventive
capacity into making a SmithWaterman-based, BLAST-like searcher. I do
things like this because I can, although I think this is probably the
most complex thing I'll ever work on in my entire life.
My BLAST-like searcher, stores all of it's data in a tree (a "trie"
actually). Protein sequences that are shared across many proteins,
are stored in the root or earlier branches of the tree. So any work
done on the early parts of the tree is reused across many many
proteins. This is good, it saves CPU-time. Most of the work then, is
done processing the leaves of the tree.
This is why the last few rows of the SW Matrix are so important,
because most of the work is done there to process them! This is why I
want to shave the last few rows off, when possible. But of course,
the idea is to process the entire leaf, if the "best score match" is
going to get better. For that reason some kind of quick hueristic
predictive trick would be desired, so that we don't cull good leaves.
Any ideas anyone? This would mean so much to me if I can find a
paper, or if anyone even knows what I am talking about and can give
me some help.
If anyone doesn't know what I am talking about but feels like helping
should you understand me, please let me know that I need to explain
this in a different better way?
I've spent many months on this algorithm. I have a nearly identical
algorithm that already works to do global alignmnents, it just
doesn't do local alignments, and the global aligner is pretty quick
actually. It's just the CPU-saving tricks I've done with global
alignment, don't seem to cross over to local alignment, and I'm
scratching my brain wondering if these past few months were a waste
or not.
Because of this "loss of tricks" (that worked with global alignment)
situation, I am asking for new tricks, here!
Thankfully, I was smart enough to do this work part-time, as a
hobby :) I also have a proper normal job that doesn't rely on the
success of proposed inventions to make me money. I decided it wasn't
a good idea to starve myself just for an idea that might never
produce anything worthwhile.
Help anyone! Thanks a lot :)
--
http://elfdata.com/plugin/
More information about the BBB
mailing list