[Biococoa-dev] Weighted sequence score
Philipp Seibel
biococoa at bioworxx.com
Tue Mar 15 17:06:02 EST 2005
On 15.03.2005 at 21:03, Alexander Griekspoor wrote:
> On 15-mrt-05, at 16:54, John Timmer wrote:
>
>> One of the things the alignment work has gotten me thinking about
>> implementing is a weighted sequence score.
It's more of a weighted base score, isn't it?
>> This is for situations like splice sites or transcription factor
>> binding sites, where you don't tend to have absolute sequences, but
>> often have situations like "80% of the time, the first base is an A,
>> and when it's not, 15% of the time it's a G". The best you can do is
>> evaluate how close a given sequence is to the ideal sequence, i.e.,
>> the best score you can get at position 1 in the example above is
>> only 80%, not 100%.
> Nice idea indeed, perfect to find consensus sequences in your
> sequence.
A profile is nothing more than a set of sequences represented by a
"probabilistic" model. If you look at it that way (80% of my sequences
have an A at a specific position and 15% of them have a G), it leads
you to a convenience method like:
+ (BCSequenceProfile *)profileWithSequenceArray:(NSArray *)array;
>>
>> The actual implementation of this doesn't seem that hard, but the
>> details are driving me nuts. Three in particular:
>>
>> How to provide the user a way to set up the scoring table. My best
>> idea would be to require a formatted string, like this:
>> A:80,G:15,C:5
>> T:60,C:40
>> Etc.
>> Does this sound good?
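To make the proposed string format concrete, here is a minimal sketch in Python (not BioCocoa code). The function names `parse_scoring_table` and `score_sequence` are hypothetical, and dividing by the best achievable total is an assumption about what "how close to the ideal sequence" could mean, not something settled in this thread.

```python
def parse_scoring_table(text):
    """Parse lines like 'A:80,G:15,C:5' into one {base: percent} dict per position."""
    table = []
    for line in text.strip().splitlines():
        position = {}
        for entry in line.strip().split(","):
            base, percent = entry.split(":")
            position[base.strip().upper()] = float(percent)
        table.append(position)
    return table


def score_sequence(table, sequence):
    """Sum the per-position percentages, then divide by the best achievable sum.

    Assumption: a sequence matching the most likely base everywhere scores 1.0.
    Bases missing from a position's entry contribute 0.
    """
    if len(sequence) != len(table):
        raise ValueError("sequence and table must have the same length")
    achieved = sum(pos.get(base, 0.0) for pos, base in zip(table, sequence.upper()))
    best = sum(max(pos.values()) for pos in table)
    return achieved / best


table = parse_scoring_table("A:80,G:15,C:5\nT:60,C:40")
print(score_sequence(table, "AT"))  # the ideal sequence for this table
print(score_sequence(table, "GC"))  # a weaker match
```

With this normalization the ideal sequence "AT" scores 1.0 even though its raw per-position scores are only 80 and 60.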
> Hmm, not really, but I don't have a good alternative either, perhaps
> some "consensus site object".
I don't think we will need it, because you can construct a profile from
sequences like these:
sequenceA: AAAATATAGC
sequenceB: AAATATATAT
sequenceC: AAATTATATT
With the previously described method, this gives:
A: 100
A: 100
A: 100
A: 33 T: 66
A: 33 T: 66
....
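The column counts above can be reproduced with a short Python sketch (again, not BioCocoa code). The helper name `profile_from_sequences` is hypothetical, and the sketch assumes the sequences are already aligned and of equal length.

```python
from collections import Counter


def profile_from_sequences(sequences):
    """Return one {symbol: percent} dict per column of equal-length sequences."""
    if not sequences:
        return []
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")
    profile = []
    for column in zip(*sequences):  # iterate column by column
        counts = Counter(column)
        profile.append({sym: 100.0 * n / len(column) for sym, n in counts.items()})
    return profile


sequences = ["AAAATATAGC", "AAATATATAT", "AAATTATATT"]
profile = profile_from_sequences(sequences)
for i, column in enumerate(profile[:5], start=1):
    print(i, {sym: round(pct, 1) for sym, pct in sorted(column.items())})
```

Positions 1 to 3 come out as 100% A, and positions 4 and 5 as one third A and two thirds T, which matches the table above up to rounding.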
Of course, profiles could also have a convenience method like this:
+ (BCSequenceProfile *)profileWithAlignment:(BCSequenceAlignment *)alignment;
Phil
>>
>> The second is ambiguity. I could just require that the queried
>> sequence be strict, but that seems pretty limiting.
> Absolutely, because that's the idea of the thing, right! If I'm not
> allowed to input W:100, I will just input A:50, T:50, right ;-) In fact,
> that is how you might solve the problem...
> I'll think about the other problems John...
> Alex
>
>
>> The question then becomes how to evaluate a situation where the first
>> base in the example above is compared to a purine? It shouldn't score
>> as well as matching A, but it shouldn't be penalized as much as
>> matching to an N. I could just require the user to supply a value for
>> purines, but that may become a real pain for fairly ambiguous
>> sequences.
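One possible way to handle this without asking the user for extra values is to score an ambiguous symbol as the average of the bases it stands for, in the same spirit as replacing W:100 with A:50, T:50. The Python sketch below is an illustration under that assumption; `position_score` and `expand_weights` are hypothetical names, and only the standard IUPAC nucleotide codes are used.

```python
# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def position_score(weights, symbol):
    """Average the weights of every base the query symbol can represent."""
    bases = IUPAC[symbol.upper()]
    return sum(weights.get(b, 0.0) for b in bases) / len(bases)


def expand_weights(weights):
    """Split ambiguous table entries evenly, e.g. W:100 becomes A:50, T:50."""
    expanded = {}
    for symbol, percent in weights.items():
        bases = IUPAC[symbol.upper()]
        for b in bases:
            expanded[b] = expanded.get(b, 0.0) + percent / len(bases)
    return expanded


first = {"A": 80.0, "G": 15.0, "C": 5.0}
for symbol in "ARN":
    print(symbol, position_score(first, symbol))
print(expand_weights({"W": 100.0}))
```

For the first position of the example, A scores 80, a purine (R) scores 47.5 and N scores 25, so the purine lands between the exact match and the fully ambiguous case.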
>>
>> Non-100% value totals. What if the user, for base 1, doesn't supply a
>> C value, meaning that 5% of the time it could be anything? I could
>> just score it as 5%. The problem with that is how to score a position
>> where there are 100% defined symbols, but it's compared with an N? My
>> gut response there would be to give a 25% score, but then that's
>> penalized less than a known base that gets the 5% score, which seems
>> odd.
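The oddity described here can be shown with a small Python sketch. It assumes one possible convention, which nobody in this thread has agreed on: any percentage the user leaves out is spread evenly over the bases that were not listed, and N is scored as the average over all four bases. The names `fill_unspecified` and `score_n` are hypothetical.

```python
BASES = "ACGT"


def fill_unspecified(weights):
    """Spread the percentage not covered by the user over the unlisted bases."""
    leftover = 100.0 - sum(weights.values())
    if leftover < 0:
        raise ValueError("weights sum to more than 100")
    missing = [b for b in BASES if b not in weights]
    filled = dict(weights)
    for b in missing:
        filled[b] = leftover / len(missing)
    return filled


def score_n(weights):
    """Score an N as the average weight over all four bases."""
    return sum(weights.get(b, 0.0) for b in BASES) / len(BASES)


partial = fill_unspecified({"A": 80.0, "G": 15.0})  # C and T not supplied
print(partial)           # C and T each receive half of the remaining 5%
print(score_n(partial))  # score of N at this position
print(partial["C"])      # score of a known but unlisted base
```

Under this convention N scores 25 at every position whose weights total 100, while a known but unlisted base scores only 2.5 here, so the sketch reproduces the concern rather than resolving it.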
>>
>> Anyway, ideas or suggestions would be welcome. In the meantime, I'm
>> probably going to try to dig through BioJava and see what they do.
>>
>> JT
>>
>> _______________________________________________
>> This mind intentionally left blank
>>
>>
>> _______________________________________________
>> Biococoa-dev mailing list
>> Biococoa-dev at bioinformatics.org
>> https://bioinformatics.org/mailman/listinfo/biococoa-dev
>>
>>
> *********************************************************
> ** Alexander Griekspoor **
> *********************************************************
> The Netherlands Cancer Institute
> Department of Tumorbiology (H4)
> Plesmanlaan 121, 1066 CX, Amsterdam
> Tel: + 31 20 - 512 2023
> Fax: + 31 20 - 512 2029
> AIM: mekentosj at mac.com
> E-mail: a.griekspoor at nki.nl
> Web: http://www.mekentosj.com
>
> Claiming that the Macintosh is inferior to Windows
> because most people use Windows, is like saying
> that all other restaurants serve food that is
> inferior to McDonalds
>
> *********************************************************