[Biococoa-dev] Weighted sequence score

John Timmer jtimmer at bellatlantic.net
Tue Mar 15 10:54:44 EST 2005

One of the things the alignment work has gotten me thinking about
implementing is a weighted sequence score.  This is for situations like
splice sites or transcription factor binding sites, where you don't tend to
have absolute sequences, but often have situations like "80% of the time,
the first base is an A, and when it's not, 15% of the time it's a G".  The
best you can do is evaluate how close a given sequence is to the ideal
sequence - i.e., the best score you can get at position 1 in the example
above is only 80%, not 100%.
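
To make the idea concrete, here is a minimal Python sketch of scoring a
strict sequence against a weight table.  The table layout (one dict per
position), the name score_sequence, and the choice to average the
per-position scores are assumptions made for illustration only.  Position 1
uses the 80% A / 15% G figures from the example above; position 2 is
invented.

```python
# Hypothetical weight table: one dict per position, base -> observed fraction.
# Position 1 comes from the example above; position 2 is invented.
WEIGHTS = [
    {"A": 0.80, "G": 0.15},
    {"T": 0.60, "C": 0.40},
]


def score_sequence(seq, weights):
    """Return (score, best possible score), each averaged over positions."""
    if len(seq) != len(weights):
        raise ValueError("sequence and table must be the same length")
    # A base missing from a column scores 0 at that position.
    actual = sum(col.get(base, 0.0) for base, col in zip(seq, weights))
    # The ideal sequence takes the highest-weighted base at every position.
    best = sum(max(col.values()) for col in weights)
    return actual / len(weights), best / len(weights)
```

Returning the best possible score alongside the actual one lets the caller
report how close a sequence is to the ideal, rather than to 100%.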

The actual implementation of this doesn't seem that hard, but the details
are driving me nuts.  Three in particular:

The first is how to give the user a way to set up the scoring table.  My best idea
would be to require a formatted string, like this:
Does this sound good?

The second is ambiguity.  I could just require that the queried sequence be
strict, but that seems pretty limiting.  The question then becomes how to
evaluate a situation where the first base in the example above is compared
to a purine?  It shouldn't score as well as matching A, but it shouldn't be
penalized as much as matching to an N.  I could just require the user to
supply a value for purines, but that may become a real pain for fairly
ambiguous sequences.
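
One possible way around asking the user for purine values, sketched here in
Python as an assumption rather than a settled design, is to average the
column's weights over every base an ambiguity symbol can stand for.  The
names IUPAC and score_symbol are hypothetical; the symbol expansions are the
standard IUPAC nucleotide codes.

```python
# IUPAC nucleotide codes expanded to the bases they can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def score_symbol(symbol, column):
    """Average the column's weights over every base the symbol could be."""
    bases = IUPAC[symbol.upper()]
    # Bases the column does not list contribute 0 to the average.
    return sum(column.get(b, 0.0) for b in bases) / len(bases)


# Position 1 from the example above: 80% A, 15% G.
column = {"A": 0.80, "G": 0.15}
```

With that column, a purine (R) scores about 0.475, which falls between A at
0.80 and N at about 0.24, so it behaves the way described above without any
extra input from the user.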

The third is value totals that don't reach 100%.  What if the user, for base
1, doesn't supply a C value, meaning that 5% of the time it could be
anything?  I could just score it as 5%.  The problem with that is how to
score a position where 100% of the symbols are defined, but it's compared
with an N?  My gut response there
would be to give a 25% score, but then that's penalized less than a known
base that gets the 5% score, which seems odd.
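
As an alternative to scoring every unlisted base at the full leftover value,
the remainder could be spread evenly over the bases the user did not list.
This is only a sketch of that one option; fill_column and BASES are
hypothetical names, and the even split is an assumption.

```python
BASES = "ACGT"


def fill_column(column):
    """Spread whatever is missing from 100% evenly over the unlisted bases."""
    missing = [b for b in BASES if b not in column]
    remainder = 1.0 - sum(column.values())
    if remainder < -1e-9:
        raise ValueError("column weights exceed 100%")
    filled = dict(column)  # leave the caller's table untouched
    for b in missing:
        filled[b] = max(remainder, 0.0) / len(missing)
    return filled
```

For the 80% A / 15% G column this gives C and T 2.5% each.  It does not by
itself settle the N question: if N is scored as the average over all four
bases, any column that totals 100% gives N a score of 25%, which is still
higher than a known base with a small weight.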

Anyway, ideas or suggestions would be welcome.  In the meantime, I'm
probably going to try to dig through BioJava and see what they do.


This mind intentionally left blank
