[Biococoa-dev] Weighted sequence score

Philipp Seibel biococoa at bioworxx.com
Tue Mar 15 11:16:37 EST 2005

On 15.03.2005 at 16:54, John Timmer wrote:

> One of the things the alignment work has gotten me thinking about
> implementing is a weighted sequence score.  This is for situations like
> splice sites or transcription factor binding sites, where you don't tend
> to have absolute sequences, but often have situations like "80% of the
> time, the first base is an A, and when it's not, 15% of the time it's a
> G".  The best you can do is evaluate how close a given sequence is to the
> ideal sequence, i.e., the best score you can get at position 1 in the
> example above is only 80%, not 100%.

This sounds like sequence profiles, am I right?
You want to know how well a sequence fits a profile of other sequences,
built for example from an alignment?
That's a very good thing, I'd like to have this as well. It could be used
for sequence searching or phylogenetics.
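
To make the profile idea concrete, here is a minimal Python sketch, not
BioCocoa code. It assumes a gap-free alignment of equal-length sequences;
the names build_profile and profile_score and the example sites are made up
for illustration. The score is simply the mean per-position frequency of the
observed bases, which is one of several possible choices.

```python
from collections import Counter

def build_profile(aligned_seqs):
    """Return one {base: frequency} dict per alignment column."""
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("aligned sequences must have equal length")
    profile = []
    for column in zip(*aligned_seqs):
        counts = Counter(column)
        profile.append({base: n / len(column) for base, n in counts.items()})
    return profile

def profile_score(profile, seq):
    """Mean frequency of the observed base at each position (0.0 to 1.0)."""
    if len(seq) != len(profile):
        raise ValueError("sequence and profile lengths differ")
    total = sum(col.get(base, 0.0) for col, base in zip(profile, seq))
    return total / len(profile)

# Hypothetical two-base sites: A in 4 of 5 at position 1, T in 3 of 5 at position 2.
sites = ["AT", "AT", "AC", "GT", "AC"]
profile = build_profile(sites)
best = profile_score(profile, "AT")   # consensus sequence
worst = profile_score(profile, "CG")  # bases never seen in the alignment
```

Even the consensus "AT" scores below 1.0 here, which is the point made in
the quoted text: the best achievable score is bounded by the profile.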


> The actual implementation of this doesn't seem that hard, but the details
> are driving me nuts.  Three in particular:
> The first is how to give the user a way to set up the scoring table.  My
> best idea would be to require a formatted string, like this:
> A:80,G:15,C:5
> T:60,C:40
> Etc.
> Does this sound good?
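
A string in that format is easy to parse. The following Python sketch is
only an illustration of the idea: the function name parse_scoring_table is
hypothetical, and the rules it enforces (one line per position, no base
listed twice, totals of at most 100) are assumptions, not something settled
in this thread.

```python
def parse_scoring_table(text):
    """Parse lines like 'A:80,G:15,C:5' into one {base: percent} dict per position."""
    table = []
    for line_no, line in enumerate(text.strip().splitlines(), start=1):
        position = {}
        for entry in line.split(","):
            base, _, percent = entry.strip().partition(":")
            base = base.strip().upper()
            if not base or base in position:
                raise ValueError(f"line {line_no}: bad or repeated base {base!r}")
            # float() raises ValueError on a missing or malformed number
            position[base] = float(percent)
        if sum(position.values()) > 100:
            raise ValueError(f"line {line_no}: weights exceed 100%")
        table.append(position)
    return table

table = parse_scoring_table("A:80,G:15,C:5\nT:60,C:40")
```

The result is a list indexed by position, so table[0]["A"] is the weight of
an A at the first base.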
> The second is ambiguity.  I could just require that the queried sequence
> be strict, but that seems pretty limiting.  The question then becomes how
> to evaluate a situation where the first base in the example above is
> compared to a purine.  It shouldn't score as well as matching A, but it
> shouldn't be penalized as much as matching to an N.  I could just require
> the user to supply a value for purines, but that may become a real pain
> for fairly ambiguous sequences.
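
One possible convention, sketched below in Python, is to score an ambiguous
symbol as the average of the weights of the bases it can stand for, using
the standard IUPAC nucleotide codes. This is an assumption for illustration,
not a decision made in this thread; the function name position_score is
hypothetical. It needs no extra input from the user and gives the ordering
asked for above.

```python
# Standard IUPAC nucleotide codes mapped to the bases they represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def position_score(weights, symbol):
    """Average the position's weights over every base the symbol may represent."""
    bases = IUPAC[symbol.upper()]
    return sum(weights.get(b, 0.0) for b in bases) / len(bases)

first = {"A": 80.0, "G": 15.0, "C": 5.0}  # position 1 from the example
score_a = position_score(first, "A")  # exact match
score_r = position_score(first, "R")  # purine: mean of A and G
score_n = position_score(first, "N")  # any base: mean of all four
```

With the example weights, A scores 80, a purine 47.5 and N 25, so a purine
falls between an exact match and a fully ambiguous base.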
> The third is non-100% value totals.  What if the user, for base 1, doesn't
> supply a C value, meaning that 5% of the time it could be anything?  I
> could just score it as 5%.  The problem with that is how to score a
> position where 100% of the symbols are defined, but it's compared with an
> N.  My gut response there would be to give a 25% score, but then that's
> penalized less than a known base that gets the 5% score, which seems odd.
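
One way to handle incomplete totals, sketched in Python below, is to spread
the unspecified remainder evenly over the bases the user did not list. This
is an assumed convention, not something agreed here, and fill_unspecified is
a hypothetical name. Once every position sums to 100, scoring N as the mean
over the four bases always gives 25, which can be read as the expected score
of a random base rather than as a match.

```python
BASES = "ACGT"

def fill_unspecified(weights, total=100.0):
    """Spread the unassigned percentage evenly over the bases not listed."""
    remainder = total - sum(weights.values())
    if remainder < 0:
        raise ValueError("weights exceed the total")
    missing = [b for b in BASES if b not in weights]
    filled = dict(weights)
    for b in missing:
        filled[b] = remainder / len(missing)
    return filled

# Base 1 from the example, with the C value left out: 5% is unassigned.
filled = fill_unspecified({"A": 80.0, "G": 15.0})
n_score = sum(filled[b] for b in BASES) / len(BASES)
```

Here C and T each receive 2.5, and N scores 25. An unlisted known base still
scores lower than N under this scheme, which matches the concern raised
above; whether that is acceptable is a design decision.
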
> Anyway, ideas or suggestions would be welcome.  In the meantime, I'm
> probably going to try to dig through BioJava and see what they do.
> JT
> _______________________________________________
> This mind intentionally left blank