[Biococoa-dev] Weighted sequence score

Alexander Griekspoor a.griekspoor at nki.nl
Tue Mar 15 15:03:49 EST 2005


On 15-Mar-05, at 16:54, John Timmer wrote:

> One of the things the alignment work has gotten me thinking about
> implementing is a weighted sequence score.  This is for situations like
> splice sites or transcription factor binding sites, where you don't tend
> to have absolute sequences, but often have situations like "80% of the
> time, the first base is an A, and when it's not, 15% of the time it's a
> G".  The best you can do is evaluate how close a given sequence is to the
> ideal sequence - ie, the best score you can get at position 1 in the
> example above is only 80%, not 100%.
Nice idea indeed, perfect for finding consensus sites in a sequence.
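
To make the idea concrete, here is a minimal Python sketch of such a score. It is only an illustration under stated assumptions, not BioCocoa code: the table layout (one dict of percentages per position), the names SITE, score_site and best_possible, and the choice to average the per-position weights are all hypothetical.

```python
# Hypothetical weight table: one dict per position, values are percentages.
# The numbers are the ones from John's example strings.
SITE = [
    {"A": 80, "G": 15, "C": 5},
    {"T": 60, "C": 40},
]


def score_site(table, seq):
    """Mean per-position weight (0-100) of seq against table.

    Bases that a position does not list score 0 here (an assumption).
    """
    if len(seq) != len(table):
        raise ValueError("sequence and table must have the same length")
    total = sum(weights.get(base, 0) for weights, base in zip(table, seq.upper()))
    return total / len(table)


def best_possible(table):
    """Score of the ideal sequence: the top weight at every position."""
    return sum(max(weights.values()) for weights in table) / len(table)
```

Dividing score_site by best_possible would rescale the result so that the ideal sequence scores 1.0, which is one way to express "how close to the ideal" a candidate is.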
>
> The actual implementation of this doesn't seem that hard, but the
> details are driving me nuts.  Three in particular:
>
> How to provide the user a way to set up the scoring table.  My best
> idea would be to require a formatted string, like this:
> A:80,G:15,C:5
> T:60,C:40
> Etc.
> Does this sound good?
Hmm, not really, but I don't have a good alternative either, perhaps 
some "consensus site object".
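
For reference, parsing the proposed string format is straightforward. The sketch below is a hypothetical helper (parse_table is not an existing BioCocoa or BioJava function) and assumes one position per line, comma-separated SYMBOL:VALUE fields, and numeric values.

```python
def parse_table(text):
    """Parse lines like 'A:80,G:15,C:5' into a list of {symbol: weight} dicts."""
    table = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue  # tolerate blank lines between positions
        weights = {}
        for field in line.split(","):
            symbol, sep, value = field.partition(":")
            if not sep:
                raise ValueError(f"expected SYMBOL:VALUE, got {field!r}")
            weights[symbol.strip().upper()] = float(value)
        table.append(weights)
    return table
```

The resulting list of dicts could just as well be wrapped in a small class, which would be one reading of the "consensus site object" idea.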
>
> The second is ambiguity.  I could just require that the queried
> sequence be strict, but that seems pretty limiting.
Absolutely, because ambiguity is the whole idea of the thing, right? If
I'm not allowed to input W:100, I will just input A:50,T:50 instead ;-)
In fact, that is how you might solve the problem: treat an ambiguity
code as equal shares of the bases it stands for.
I'll think about the other problems, John...
Alex
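
The W:100 to A:50,T:50 rewrite can be sketched as follows. This is an illustration only: expand_position is a hypothetical name, and splitting the weight equally among the bases is an assumption. The symbol table is the standard IUPAC nucleotide code.

```python
# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_position(weights):
    """Rewrite ambiguity codes in one table position as equal shares of plain bases."""
    expanded = {}
    for symbol, weight in weights.items():
        bases = IUPAC[symbol.upper()]
        for base in bases:
            # Shares from several symbols covering the same base are summed.
            expanded[base] = expanded.get(base, 0.0) + weight / len(bases)
    return expanded
```

With this, a user entry of W:100 is stored as A:50,T:50, so the rest of the scoring code only ever has to deal with the four plain bases.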


> The question then becomes how to evaluate a situation where the first
> base in the example above is compared to a purine?  It shouldn't score
> as well as matching A, but it shouldn't be penalized as much as matching
> to an N.  I could just require the user to supply a value for purines,
> but that may become a real pain for fairly ambiguous sequences.
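
One possible rule that gives this ordering without asking the user for extra values is to score an ambiguous query symbol as the average of the weights of the bases it can stand for. The sketch below assumes that rule; score_symbol is a hypothetical name and the table position is the one from the example above.

```python
# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def score_symbol(weights, symbol):
    """Average weight over the plain bases that a query symbol may represent."""
    bases = IUPAC[symbol.upper()]
    return sum(weights.get(base, 0.0) for base in bases) / len(bases)


position_1 = {"A": 80, "G": 15, "C": 5}
print(score_symbol(position_1, "A"))  # 80.0
print(score_symbol(position_1, "R"))  # 47.5, purine: (80 + 15) / 2
print(score_symbol(position_1, "N"))  # 25.0, any base: (80 + 15 + 5 + 0) / 4
```

Under this rule a purine lands between A and N, as requested. Averaging treats the possible bases as equally likely, which is itself an assumption.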
>
> Non-100% value totals.  What if the user, for base 1, doesn't supply a
> C value, meaning that 5% of the time it could be anything?  I could just
> score it as 5%.  The problem with that is how to score a position where
> there's 100% defined symbols, but it's compared with an N?  My gut
> response there would be to give a 25% score, but then that's penalized
> less than a known base that gets the 5% score, which seems odd.
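
One option, offered only as a sketch, is to spread the unassigned remainder evenly over the bases the user did not list, so every position totals 100 before scoring. The function name fill_remainder and the even split are assumptions, and the table is assumed to hold only plain A/C/G/T entries at this point.

```python
BASES = "ACGT"


def fill_remainder(weights, total=100.0):
    """Give bases missing from a position an equal share of the unassigned weight."""
    remainder = total - sum(weights.values())
    if remainder < 0:
        raise ValueError("weights add up to more than the total")
    missing = [base for base in BASES if base not in weights]
    filled = dict(weights)
    for base in missing:
        filled[base] = remainder / len(missing)
    return filled


print(fill_remainder({"A": 80, "G": 15}))
# {'A': 80, 'G': 15, 'C': 2.5, 'T': 2.5}
```

Once every position totals 100, an N scored by averaging over the four bases always comes out at 25. That is the score expected from a randomly chosen base, so it still beats a known low-weight base. This sketch does not settle whether that is acceptable.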
>
> Anyway, ideas or suggestions would be welcome.  In the meantime, I'm
> probably going to try to dig through BioJava and see what they do.
>
> JT
>
> _______________________________________________
> This mind intentionally left blank
>
*********************************************************
                     ** Alexander Griekspoor **
*********************************************************
              The Netherlands Cancer Institute
              Department of Tumorbiology (H4)
         Plesmanlaan 121, 1066 CX, Amsterdam
                    Tel:  + 31 20 - 512 2023
                    Fax:  + 31 20 - 512 2029
                   AIM: mekentosj at mac.com
                    E-mail: a.griekspoor at nki.nl
                Web: http://www.mekentosj.com

	Claiming that the Macintosh is inferior to Windows
	because most people use Windows, is like saying
	that all other restaurants serve food that is
	inferior to McDonalds

*********************************************************





