[Biodevelopers] Could banded search help a general purpose Smith-Waterman based algorithm?

Theodore H. Smith delete at elfdata.com
Fri Apr 6 10:01:44 EDT 2007


Hi people,

As a hobbyist (although hopefully I'll make something real people  
will use in the real world), I am working on some bio-informatics  
code involving Smith-Waterman. Sort of like SSEARCH, but faster.

Now, I am trying to make my algorithm fast AND accurate for general  
purpose use. I do not want to lose any accuracy over the full Smith- 
Waterman implementation, so any bandedness might do, will have to be  
banded in such a way that it only ignores cells that would never  
contribute to the final answer anyhow.

That is tricky to implement, making the bandedness accurate. I've got  
some potential solutions which are based around making a second  
matrix, which would be a "score recovery matrix". That would contain  
the most amount of score that any cell could recover, assuming  
everything went perfect!

Or I could just skip bandedness altogether, which would save me a lot  
of brain time and theoretical complication in inventing solutions to  
minimise CPU time taken to process the vast amount of data needed to  
process.

...

So my question is... is banding helpful to a general purpose Smith- 
Waterman based searcher?

I can imagine a few situations. But I'm not sure if I am imagining  
them properly. Maybe someone can correct on this? This is what I  
think right now:

1) Searching for a short protein sequence (30-ish?) within a database  
of long sequences (about 300?). I don't think bandedness will help me  
more than 3% speed increase here overall.

2) Searching for a short protein sequence (30-ish) within a database  
of short sequences (30-ish). Bandedness might help here. Not sure how  
much. Maybe 3x speed increase.

3) Searching for a large sequence in a database of short sequences,  
allowing for loose matches... I think bandnedness wouldn't help,  
maybe 3% gain at best, possibly the overhead would hurt.

4) Searching for large sequences in a database of large sequences,  
and we want tight matches. Bandedness would DEFINITELY help here! We  
may get 20x or more speed up.

I guess from my guesswork... it seems like bandedness would help me  
sometimes a lot. And othertimes hurt a bit or just be useless.

How often do bio-informatists do large sequence searching against  
databases containing many large sequences?

My leaning is towards implementing the bandedness... After all it's  
supposed to be general purpose.


--
http://elfdata.com/plugin/
"String processing, done right"




More information about the Biodevelopers mailing list