[Biococoa-dev] Weighted sequence score

Tue Mar 15 10:54:44 EST 2005

One of the things the alignment work has gotten me thinking about
implementing is a weighted sequence score.  This is for situations like
splice sites or transcription factor binding sites, where you don't tend to
have absolute sequences, but often have situations like "80% of the time,
the first base is an A, and 15% of the time it's a G".  The
best you can do is evaluate how close a given sequence is to the ideal
sequence; i.e., the best score you can get at position 1 in the example
above is only 80%, not 100%.
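
To make the idea concrete, here is a minimal sketch in Python of
per-position scoring against a weight table.  The table contents and the
names TABLE, position_scores and best_scores are made up for illustration;
nothing here is existing BioCocoa code, and how the per-position values
should be combined into a single score is deliberately left open.

```python
# Hypothetical weight table for two positions, as fractions rather than percent.
TABLE = [
    {"A": 0.80, "G": 0.15, "C": 0.05},
    {"T": 0.60, "C": 0.40},
]

def position_scores(sequence, table):
    """Return the weight of each base in `sequence` at its position."""
    if len(sequence) != len(table):
        raise ValueError("sequence and table must have the same length")
    # A base that is not listed for a position scores 0 in this sketch.
    return [weights.get(base, 0.0) for base, weights in zip(sequence, table)]

def best_scores(table):
    """Return the highest weight available at each position."""
    return [max(weights.values()) for weights in table]
```

Comparing position_scores("AT", TABLE) with best_scores(TABLE) shows that
even the ideal sequence only reaches 0.80 at position 1.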

The actual implementation of this doesn't seem that hard, but the details
are driving me nuts.  Three in particular:

The first is how to let the user set up the scoring table.  My best idea
would be to require a formatted string, one line per position, like this:
A:80,G:15,C:5
T:60,C:40
Etc.
Does this sound good?
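
For what it's worth, a string in that format is easy to parse.  A rough
sketch in Python, assuming one line per position, comma-separated
symbol:percent pairs, and a hypothetical function name parse_scoring_table:

```python
def parse_scoring_table(text):
    """Parse lines like 'A:80,G:15,C:5' into a list of {symbol: fraction} dicts."""
    table = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between positions
        position = {}
        for entry in line.split(","):
            symbol, percent = entry.split(":")
            # Store fractions (0.80) rather than percentages (80).
            position[symbol.strip().upper()] = float(percent) / 100.0
        table.append(position)
    return table

table = parse_scoring_table("A:80,G:15,C:5\nT:60,C:40")
```

Each position becomes its own dictionary, so positions that list different
symbols (or fewer of them) need no special handling.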

The second is ambiguity.  I could just require that the queried sequence be
strict, but that seems pretty limiting.  The question then becomes how to
evaluate a situation where the first base in the example above is compared
to a purine?  It shouldn't score as well as matching A, but it shouldn't be
penalized as much as matching to an N.  I could just require the user to
supply a value for purines, but that may become a real pain for fairly
ambiguous sequences.
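
One option that avoids asking the user for extra values is to expand the
ambiguity code into the bases it stands for and average their weights.
This is only a sketch under the assumption that each base covered by the
code is equally likely; the name ambiguous_score is hypothetical, and the
mapping is the standard IUPAC nucleotide code.

```python
# IUPAC nucleotide ambiguity codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def ambiguous_score(symbol, weights):
    """Average the position weights over every base the symbol can represent."""
    bases = IUPAC[symbol.upper()]
    # Assumption: all bases covered by the code are equally likely.
    return sum(weights.get(base, 0.0) for base in bases) / len(bases)

# First position from the example: A 80%, G 15%, C 5%.
first = {"A": 0.80, "G": 0.15, "C": 0.05}
```

With these weights a purine (R) comes out at roughly 0.475, below the 0.80
for A but above the 0.25 for N, which is the ordering described above.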

The third is non-100% value totals.  What if the user, for base 1, doesn't
supply a C value, meaning that 5% of the time it could be anything?  I could
just score it as 5%.  The problem with that is how to score a position where
100% of the symbols are defined, but it's compared with an N.  My gut
response there would be to give a 25% score, but then that's penalized less
than a known base that gets the 5% score, which seems odd.
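
A different way to treat the leftover, sketched here in Python, is to
spread it evenly over the bases the user did not list, so every position
adds up to 100% before any scoring happens.  This is an assumption about
what "could be anything" should mean, not a settled answer, and the name
fill_unspecified is hypothetical.

```python
BASES = "ACGT"

def fill_unspecified(weights, bases=BASES):
    """Spread any unassigned fraction evenly over the bases that were not listed."""
    total = sum(weights.values())
    if total > 1.0 + 1e-9:
        raise ValueError("weights add up to more than 100%")
    missing = [base for base in bases if base not in weights]
    filled = dict(weights)
    if missing:
        # Assumption: the leftover is shared equally among unlisted bases.
        share = max(0.0, 1.0 - total) / len(missing)
        for base in missing:
            filled[base] = share
    return filled
```

For A:80,G:15 this gives C and T about 2.5% each, while a fully specified
position such as T:60,C:40 gives the unlisted bases 0.  It does not by
itself settle how an N should rank against a low-scoring known base.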

Anyway, ideas or suggestions would be welcome.  In the meantime, I'm
probably going to try to dig through BioJava and see what they do.

JT

_______________________________________________
This mind intentionally left blank