[Biococoa-dev] Weighted sequence score
Alexander Griekspoor
a.griekspoor at nki.nl
Tue Mar 15 15:03:49 EST 2005
On 15 Mar 2005, at 16:54, John Timmer wrote:
> One of the things the alignment work has gotten me thinking about
> implementing is a weighted sequence score. This is for situations like
> splice sites or transcription factor binding sites, where you don't tend to
> have absolute sequences, but often have situations like "80% of the time,
> the first base is an A, and when it's not, 15% of the time it's a G". The
> best you can do is evaluate how close a given sequence is to the ideal
> sequence - ie, the best score you can get at position 1 in the example
> above is only 80%, not 100%.
Nice idea indeed, perfect for finding consensus sequences in your sequence.
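
To make the idea concrete, here is a minimal Python sketch of such a score. It is only an illustration under stated assumptions, not BioCocoa code: the names WEIGHTS, weighted_score and best_possible are hypothetical, the table holds the two example positions from John's message, and the mean of the per-position percentages is just one possible way to combine them.

```python
# Hypothetical weight table: one dict per position, values are percentages.
# Both positions are taken from the example strings in John's message.
WEIGHTS = [
    {"A": 80, "G": 15, "C": 5},
    {"T": 60, "C": 40},
]

def weighted_score(seq, weights):
    """Return the mean per-position weight (0-100) of seq against weights."""
    if len(seq) != len(weights):
        raise ValueError("sequence and weight table differ in length")
    # A base that is not listed at a position contributes 0 (an assumption).
    total = sum(column.get(base, 0) for base, column in zip(seq.upper(), weights))
    return total / len(weights)

def best_possible(weights):
    """Score of the ideal sequence: the top weight at every position."""
    return sum(max(column.values()) for column in weights) / len(weights)

print(weighted_score("AT", WEIGHTS))  # ideal sequence for this table
print(best_possible(WEIGHTS))         # same value: the ceiling is below 100
print(weighted_score("GC", WEIGHTS))  # a weaker match
```

Dividing a sequence's score by best_possible would give a "fraction of ideal" value, which may be easier to compare between sites of different strength.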
>
> The actual implementation of this doesn't seem that hard, but the details
> are driving me nuts. Three in particular:
>
> How to provide the user a way to set up the scoring table. My best idea
> would be to require a formatted string, like this:
> A:80,G:15,C:5
> T:60,C:40
> Etc.
> Does this sound good?
Hmm, not really, but I don't have a good alternative either, perhaps
some "consensus site object".
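
For reference, parsing the proposed string format is straightforward. The following Python sketch is an assumption-laden illustration (parse_scoring_table is a hypothetical name, and treating each line as one position is my reading of the example); the resulting list of dicts could just as well be wrapped in a "consensus site object".

```python
def parse_scoring_table(text):
    """Parse lines like 'A:80,G:15,C:5' into a list of {symbol: weight} dicts.

    Each non-blank line is taken to describe one position (an assumption).
    """
    table = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        column = {}
        for entry in line.split(","):
            symbol, sep, value = entry.partition(":")
            symbol = symbol.strip().upper()
            if not sep or not symbol:
                raise ValueError(f"line {line_no}: bad entry {entry!r}")
            if symbol in column:
                raise ValueError(f"line {line_no}: duplicate symbol {symbol!r}")
            column[symbol] = float(value)  # raises ValueError on non-numbers
        table.append(column)
    return table

table = parse_scoring_table("A:80,G:15,C:5\nT:60,C:40")
print(table)
```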
>
> The second is ambiguity. I could just require that the queried sequence be
> strict, but that seems pretty limiting.
Absolutely, because that's the idea of the thing, right? If I'm not
allowed to input W:100, I will just input A:50,T:50, right? ;-) In fact,
that is how you might solve the problem...
I'll think about the other problems John...
Alex
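
The W:100 to A:50,T:50 trick can be sketched in a few lines of Python. This is only an illustration: expand_column is a hypothetical name, and splitting an ambiguous symbol's weight evenly over the bases it covers is an assumption (the thread does not settle on a rule). The symbol table is the standard IUPAC nucleotide code.

```python
# IUPAC nucleotide ambiguity codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def expand_column(column):
    """Split each ambiguous symbol's weight evenly over the bases it covers."""
    expanded = {}
    for symbol, weight in column.items():
        bases = IUPAC[symbol.upper()]
        for base in bases:
            # Weights for the same base add up, e.g. A:60 plus half of R:40.
            expanded[base] = expanded.get(base, 0) + weight / len(bases)
    return expanded

print(expand_column({"W": 100}))          # the W:100 example from above
print(expand_column({"A": 60, "R": 40}))  # strict and ambiguous entries mixed
```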
> The question then becomes how to
> evaluate a situation where the first base in the example above is compared
> to a purine? It shouldn't score as well as matching A, but it shouldn't be
> penalized as much as matching to an N. I could just require the user to
> supply a value for purines, but that may become a real pain for fairly
> ambiguous sequences.
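
One way to get that ordering without asking the user for extra values is to score an ambiguous query symbol as the average of the weights of the bases it could stand for. The Python sketch below is an assumption, not an agreed design; score_symbol is a hypothetical name and the column is position 1 from the example.

```python
# IUPAC nucleotide ambiguity codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def score_symbol(symbol, column):
    """Mean weight over every base the (possibly ambiguous) symbol covers."""
    bases = IUPAC[symbol.upper()]
    # Bases missing from the column count as 0 (an assumption).
    return sum(column.get(base, 0) for base in bases) / len(bases)

position_1 = {"A": 80, "G": 15, "C": 5}
for symbol in "ARN":
    print(symbol, score_symbol(symbol, position_1))
```

With these numbers a purine (R) lands between a strict A and an N, which is the behaviour asked for above.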
>
> Non-100% value totals. What if the user, for base 1, doesn't supply a C
> value, meaning that 5% of the time it could be anything? I could just score
> it as 5%. The problem with that is how to score a position where there's
> 100% defined symbols, but it's compared with an N? My gut response there
> would be to give a 25% score, but then that's penalized less than a known
> base that gets the 5% score, which seems odd.
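
To put numbers on this, here is a small Python sketch of one possible convention: spread the unassigned percentage evenly over the bases the user did not list, and score an N as the mean over all four bases. Both rules are assumptions for illustration (fill_column and score_n are hypothetical names, and the column is assumed to contain only strict bases), not something decided in this thread.

```python
BASES = "ACGT"

def fill_column(column):
    """Spread the unassigned percentage evenly over the unlisted bases."""
    total = sum(column.values())
    if total > 100:
        raise ValueError("weights exceed 100%")
    missing = [b for b in BASES if b not in column]
    filled = dict(column)
    if missing:
        share = (100 - total) / len(missing)
        for base in missing:
            filled[base] = share
    return filled

def score_n(column):
    """Score an N as the mean weight over all four bases."""
    filled = fill_column(column)
    return sum(filled[b] for b in BASES) / len(BASES)

position_1 = {"A": 80, "G": 15}   # no C value supplied, 5% left over
print(fill_column(position_1))    # C and T each receive half of the 5%
print(score_n(position_1))        # N against this position
```

Under this convention an N still scores higher than a known but unlisted base, so it reproduces the oddity described above rather than resolving it; whether that ordering is acceptable remains the open question.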
>
> Anyway, ideas or suggestions would be welcome. In the meantime, I'm
> probably going to try to dig through BioJava and see what they do.
>
> JT
>
> _______________________________________________
> This mind intentionally left blank
*********************************************************
** Alexander Griekspoor **
*********************************************************
The Netherlands Cancer Institute
Department of Tumorbiology (H4)
Plesmanlaan 121, 1066 CX, Amsterdam
Tel: + 31 20 - 512 2023
Fax: + 31 20 - 512 2029
AIM: mekentosj at mac.com
E-mail: a.griekspoor at nki.nl
Web: http://www.mekentosj.com
Claiming that the Macintosh is inferior to Windows
because most people use Windows, is like saying
that all other restaurants serve food that is
inferior to McDonalds
*********************************************************
4Peaks - For Peaks, Four Peaks.
2004 Winner of the Apple Design Awards
Best Mac OS X Student Product
http://www.mekentosj.com/4peaks
*********************************************************