[BiO BB] Looking for researcher, to assist on blast-like invention

Tue Feb 12 11:55:30 EST 2008

Hi,
We search protein databases a lot, looking for distant homologs.
If we send you protein sequences, can you search a protein database (NR)?

Chris
On Feb 11, 2008, at 3:56 PM, Theodore H. Smith wrote:

>
> On 11 Feb 2008, at 22:28, Ryan Golhar wrote:
>
>> Why don't you write up a paper describing the algorithm in detail and
>> submit it to a bioinformatics journal?  And why not make the executable
>> available with documentation so that people can download it and try it
>> out for themselves?
>>
>> Do you have any test cases that show it runs faster/better than BLAST?
>> Describe them and make them available.
>
> The first thing I'd need to do is make a good test. I'm not sure what
> constitutes "a good test", in this case.
>
> How big should the databanks be to make the test reasonable? Is
> randomly generated data good enough, or is a randomly selected sample
> better? If a sample is better, how large a dataset must I gather to do
> the test?
>
> Perhaps certain settings make my algorithm work better or worse
> relative to BLAST. But then how do I know which settings are more
> likely to be used and which aren't?
>
> I think someone who uses BLAST frequently, and knows it well from a
> user's perspective... might have a better feel for creating a test
> than I might.
>
> The worst thing that could happen is that I make a test which is unfairly
> biased toward my algorithm :) The next thing that would happen is that
> people would see my test has "suspiciously good" results, and... be
> annoyed about that, and lose interest, even if it were an innocent
> mistake on my end. I'd rather avoid that sort of mistake by getting a
> knowledgeable eye involved in designing the test!
>
> Like I said, I haven't gotten all the code into C++ yet. I already have
> a framework in C++, and I know how to write C++. And I know what to do,
> as I've already written it in a prototyping language.
>
> The C++ version will come soon, though.
>
>> Theodore H. Smith wrote:
>>> Hi everyone,
>>>
>>> So I've been working, on and off, on this algorithm for quite a while
>>> now. It's basically an invention of mine. It is a "blast-like"
>>> algorithm, in that it does "fuzzy lookup" operations across a database
>>> of letters. I am designing this algorithm to be useful for
>>> bioinformatics, which is the main field I am initially targeting.
>>>
>>> The database will be filled with protein sequences, and the query
>>> searched against the database will be another protein sequence. The
>>> algorithm has a "scoring matrix", which can accept different protein
>>> replacement scores. The cost of inserting letters (protein letters)
>>> can be configured also.
>>>
>>> In this sense, it's no different to Smith-Waterman. The same input,
>>> the same output!
>>>
>>> The real difference from Smith-Waterman is its speed. My algorithm
>>> will be hugely faster. This is because I use many techniques to avoid
>>> processing unnecessary parts of the Smith-Waterman matrix.
>>>
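For reference, a minimal sketch of the baseline being discussed: the standard Smith-Waterman local alignment score, with a configurable substitution score and a linear gap cost. This is the textbook recurrence, not the ElfDataFuzzy code; the function names, the toy simple_score function, and the gap value of -2 are assumptions made for illustration.

```python
def simple_score(a, b):
    """Toy substitution score (an assumption; real searches use e.g. BLOSUM62)."""
    return 2 if a == b else -1


def smith_waterman_score(query, target, score=simple_score, gap=-2):
    """Return the best local alignment score between query and target.

    Rows follow the target letters and columns follow the query letters,
    so each row depends only on the previous row and the current letter.
    """
    prev = [0] * (len(query) + 1)
    best = 0
    for t in target:
        cur = [0]
        for j, q in enumerate(query, start=1):
            cell = max(
                0,                          # start a new local alignment
                prev[j - 1] + score(t, q),  # align t with q
                prev[j] + gap,              # t aligned against a gap
                cur[j - 1] + gap,           # q aligned against a gap
            )
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best
```

Every cell of the matrix is filled in here, which is the cost that the techniques described in the message aim to avoid.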
>>> I also use many tricks to reuse computations across various proteins.
>>> For example, the matrix for protein "ABCDE" is identical, at first
>>> anyhow, to the matrix for "ABCDEFG". This means if I have both
>>> proteins "ABCDE" and "ABCDEFG" in my protein database, I can test
>>> both of them against the search query in almost half the time. My
>>> algorithm also runs in logarithmic time with respect to the size of
>>> the database. Basically, bigger databases run disproportionately
>>> faster.
>>>
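The prefix-sharing idea described above can be sketched with a trie. Each row of the Smith-Waterman matrix depends only on the database letters seen so far, so sequences with a common prefix can share those rows. This only illustrates that one idea, under assumed names (next_row, build_trie, search_shared_prefixes) and a toy scoring function; it is not the author's algorithm and says nothing about its other optimisations or the logarithmic scaling mentioned.

```python
def simple_score(a, b):
    """Toy substitution score (an assumption, not a real protein matrix)."""
    return 2 if a == b else -1


def next_row(prev, letter, query, score, gap):
    """Compute one Smith-Waterman row for a database letter from the previous row."""
    cur = [0]
    for j, q in enumerate(query, start=1):
        cur.append(max(0,
                       prev[j - 1] + score(letter, q),
                       prev[j] + gap,
                       cur[j - 1] + gap))
    return cur


def build_trie(sequences):
    """Store the database as a prefix tree of nested dicts."""
    root = {"children": {}, "ends": []}
    for seq in sequences:
        node = root
        for letter in seq:
            node = node["children"].setdefault(
                letter, {"children": {}, "ends": []})
        node["ends"].append(seq)
    return root


def search_shared_prefixes(sequences, query, score=simple_score, gap=-2):
    """Return ({sequence: best local score}, number of matrix rows computed)."""
    results = {}
    rows_computed = 0
    # Each stack entry: (trie node, matrix row at that node, best score so far).
    stack = [(build_trie(sequences), [0] * (len(query) + 1), 0)]
    while stack:
        node, row, best = stack.pop()
        for seq in node["ends"]:
            results[seq] = best
        for letter, child in node["children"].items():
            new_row = next_row(row, letter, query, score, gap)
            rows_computed += 1  # one row per trie node, however many sequences share it
            stack.append((child, new_row, max(best, max(new_row))))
    return results, rows_computed
```

For the two sequences from the example, "ABCDE" and "ABCDEFG", this computes 7 rows rather than the 12 needed when each sequence is aligned separately, which matches the "almost half the time" estimate.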
>>> I want to turn this algorithm into something useful for people. My
>>> first challenge here is to answer the question "is this algorithm
>>> faster, or better, than BLAST?". If it is not faster, my algorithm
>>> basically has little use. But I have good hopes it will be faster! I
>>> am very good with these sorts of things, you see :) Speed is my
>>> strong point.
>>>
>>> Currently, I do not know about the speed, because I haven't
>>> implemented a C++ version of my algorithm, or a good speed testing
>>> framework.
>>>
>>> I do however know that my algorithm is more accurate than BLAST,
>>> because it is just as accurate as SSEARCH, as mine uses the
>>> Smith-Waterman algorithm. BLAST, by contrast, uses a heuristic:
>>> intelligent guesswork, basically. A fine heuristic, but still a
>>> heuristic. Mine is methodical, not heuristic-based.
>>>
>>> So here is what I am looking for!
>>>
>>> I am hoping that someone in the field will be able to offer me
>>> guidance, interest, enthusiasm, suggestions and maybe even do some
>>> testing for me.
>>>
>>> Perhaps a student doing a bioinformatics-related degree, who would
>>> like to write a paper on an alternative way of processing protein
>>> databases. My invention could be an interesting subject for a paper.
>>>
>>> Or perhaps a researcher who just has an interest in these sorts of
>>> things! Perhaps a researcher who feels there must be a better way of
>>> doing these things. Or anyone really in this field with the time and
>>> interest, who feels that helping me could help him (or her) too in
>>> some way.
>>>
>>> I'd like someone I can ask a lot of questions, show my software to,
>>> and explain what I hope to achieve with it.
>>>
>>> Basically, my first questions to you would be "how would I set this up
>>> to be useful for someone?" and "how would I test its usefulness, and
>>> what would you need to know about my algorithm to decide to use it
>>> over BLAST?"
>>>
>>> It's sort of a vague question from me, like "what do you need me to
>>> do", but... well that's where I am right now. Sort of a bit on the
>>> outside hoping someone on the inside will show me something.
>>>
>>> So it's an opportunity to tell me what you want, basically!! Tell me,
>>> and I might just make it.
>>>
>>> Who knows? Maybe one day in a few years' time, everyone will be using
>>> this "ElfDataFuzzy" algorithm that I invented, instead of BLAST! You
>>> might be part of something.
>>>
>>> Thanks to anyone who replies!
>>>
>>> --
>>> http://elfdata.com/plugin/
>>> "String processing, done right"
>>>
>>>
>>>
>>> _______________________________________________
>>> BBB mailing list
>>> BBB at bioinformatics.org
>>> http://www.bioinformatics.org/mailman/listinfo/bbb
>>>
>>>
>>
>

Chris Upton Ph.D.                                   Associate Professor
Biochemistry and Microbiology             Tel. 250-721-6507
University of Victoria                                Fax  250-721-8855
P.O. Box 3055 STN CSC
Victoria, BC  V8W 3P6
Canada

web.uvic.ca/~cupton
www.virology.ca
www.biodirectory.com/uptons_blog.html