[Biococoa-dev] BCSymbolMapping

Alexander Griekspoor a.griekspoor at nki.nl
Sun Mar 13 17:33:27 EST 2005


You asked for a debate, you got one ;-) Well, a discussion rather than a 
debate perhaps; the others might still finish me off ;-)
Cheers,
Alex

On 13 Mar 2005, at 23:18, Philipp Seibel wrote:

> ok you won ;-). I just want to finish one version ;-)
>
> Phil
>
> On 13.03.2005 at 23:11, Alexander Griekspoor wrote:
>
>>
>>> On 13.03.2005 at 22:41, Alexander Griekspoor wrote:
>>>
>>>> Hmmm, somehow I totally miss the reason for the remapping. Why would 
>>>> it be leaner/faster?
>>>
>>> It would not be faster, but more flexible, because we map the 
>>> symbols to the minimal set of ints.
>> An int is typically 4 times the size of a char, so there goes part of 
>> the optimization. And I don't see why it would be more flexible: a 
>> basic 128x128 array covers all the ASCII characters that you want.
>>
>>> Not only for performance or memory optimization.
>> All that trouble to avoid using 16 KB (a 128x128 char matrix)?! And 
>> it's only allocated once!
>> As far as a 500-nucleotide char array goes, it will be just as big in 
>> memory if it is 'ACGT' as if it is '0x00 0x01 0x02 0x03'. And which 
>> code is easier to read? Also, the remapping will come at a cost (not 
>> much, but hey, more code is more time and more errors).
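
The size argument above can be checked directly. This is only a minimal sketch in C, assuming a table indexed by 7-bit ASCII codes; the helper names ascii_matrix_bytes and sequence_bytes are made up for illustration and are not part of BioCocoa.

```c
#include <stddef.h>

/* Hypothetical helper: bytes used by a 128x128 table of char,
   one row and one column per 7-bit ASCII code. */
size_t ascii_matrix_bytes(void)
{
    return sizeof(char[128][128]);   /* 128 * 128 * 1 = 16384 bytes */
}

/* Hypothetical helper: bytes used by an n-symbol sequence stored as char.
   Whether the stored values are 'A','C','G','T' or 0x00..0x03 makes no
   difference to the size. */
size_t sequence_bytes(size_t n)
{
    return n * sizeof(char);         /* sizeof(char) is 1 by definition */
}
```

On most current platforms sizeof(int) is 4, so the same sequence stored as int would take roughly four times as much, although the C standard does not guarantee that exact ratio.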
>>
>>> The next problem is handling everything only through the char 
>>> method, because we need to check for uppercase or other semantic 
>>> things which the algorithm is not really responsible for.
>> No, we do not have to, because we know what char each symbol will 
>> return. The symbol templates dictate that (currently uppercase)!
>> The proposed -charArrayRepresentation (or something similar) method in 
>> the BCSequence superclass will simply iterate over the symbols and 
>> ask each one for its symbol via the - (unichar) symbol; method. For 
>> the other way around we should just add an initFromCharArray or 
>> something to BCSequence.
>>
>>> These things should be handled by the BCSymbol stuff. For example, an 
>>> 'a' and an 'A' should be mapped to the same int in the DNA symbol 
>>> class.
>> Well, you can store 4 variants of a char in the space of one int ;-) 
>> But again that's not an issue, see above.
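
For comparison, the remapping idea under discussion could look roughly like the following. This is a sketch under assumptions, not anything from the BioCocoa sources: the names dna_code and dna_remap are hypothetical, and the choice of five codes (four bases plus a catch-all) is an assumption made only for this example.

```c
#include <stddef.h>

/* Hypothetical minimal code set: four bases plus a catch-all. */
enum { DNA_A, DNA_C, DNA_G, DNA_T, DNA_OTHER, DNA_CODES };

/* Fold case and map one character to its small integer code. */
unsigned char dna_code(char ch)
{
    switch (ch) {
    case 'A': case 'a': return DNA_A;
    case 'C': case 'c': return DNA_C;
    case 'G': case 'g': return DNA_G;
    case 'T': case 't': return DNA_T;
    default:            return DNA_OTHER;
    }
}

/* Remap a whole sequence of n characters; out must have room for n codes. */
void dna_remap(const char *seq, size_t n, unsigned char *out)
{
    size_t i;
    for (i = 0; i < n; i++)
        out[i] = dna_code(seq[i]);
}
```

With such a remap a substitution matrix shrinks to DNA_CODES x DNA_CODES entries, at the price of one extra pass over every sequence and of output that is no longer readable as text.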
>>>
>>>
>>>> What's the difference between:
>>>> char c = ('a' == 'a') ? 'I' : 'X';
>>>> and:
>>>> char c = (0x00 == 0x00) ? 'I' : 'X';
>>>> So in the example I borrowed from the sample code I used 
>>>> previously, the substitution matrix is a simple 128x128 char array 
>>>> and the characters are placed at their own spot.
>>>>
>>>>>     match = 1;
>>>>>     mismh = -1;
>>>>>     /* set match and mismatch weights */
>>>>>     for ( i = 0; i < 128 ; i++ )
>>>>>       for ( j = 0; j < 128 ; j++ )
>>>>>         if (i == j ) v[i][j] = match;
>>>>>         else v[i][j] = mismh;
>>>>>
>>>>>     v['N']['N'] = mismh;
>>>>>     v['n']['n'] = mismh;
>>>>>     v['A']['a'] = v['a']['A'] = match;
>>>>>     v['C']['c'] = v['c']['C'] = match;
>>>>>     v['G']['g'] = v['g']['G'] = match;
>>>>>     v['T']['t'] = v['t']['T'] = match;
>>>>>
>>>>> So, you simply build a 128x128 char matrix using the fact that 
>>>>> chars are ints.
>>>>> Next, to calculate the score:
>>>>>
>>>>>  char a = A[++i];	// character i in sequence A
>>>>>  char b = B[++j];	// character j in sequence B
>>>>>  *c++ = (a == b || (isdna && v[a][b] == MATCHSC)) ? '|' : ' ';
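
Put together, the two fragments above could be turned into something compilable along these lines. This is only a sketch: the function names build_matrix and marker_line are hypothetical, MISMATCH is a name chosen here for the mismatch weight, input is assumed to be 7-bit ASCII, and the table uses signed char rather than plain char because plain char may be unsigned on some platforms and could then not hold -1.

```c
#include <stddef.h>

#define MATCHSC    1
#define MISMATCH (-1)

/* Fill a 128x128 table that is indexed directly by ASCII code. */
void build_matrix(signed char v[128][128])
{
    int i, j;
    const char *bases = "ACGT";

    for (i = 0; i < 128; i++)
        for (j = 0; j < 128; j++)
            v[i][j] = (i == j) ? MATCHSC : MISMATCH;

    /* N scores as a mismatch, even against itself. */
    v['N']['N'] = MISMATCH;
    v['n']['n'] = MISMATCH;

    /* Upper and lower case of the same base score as a match. */
    for (i = 0; bases[i] != '\0'; i++) {
        int up = bases[i];
        int lo = up + ('a' - 'A');      /* assumes ASCII */
        v[up][lo] = v[lo][up] = MATCHSC;
    }
}

/* Write '|' where columns agree and ' ' elsewhere, using the same test
   as the fragment above. a and b hold n characters; out needs n + 1 bytes. */
void marker_line(signed char v[128][128], const char *a, const char *b,
                 size_t n, int isdna, char *out)
{
    size_t k;
    for (k = 0; k < n; k++) {
        /* Mask to 7 bits so the index always stays inside the table. */
        unsigned char x = (unsigned char)a[k] & 0x7F;
        unsigned char y = (unsigned char)b[k] & 0x7F;
        out[k] = (x == y || (isdna && v[x][y] == MATCHSC)) ? '|' : ' ';
    }
    out[n] = '\0';
}
```

Note that, as in the original expression, two identical characters always get a '|' because of the plain equality test, so 'N' against 'N' is still marked even though its score in the table is a mismatch.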
>>>>
>>>> So again, if we convert the sequences to char arrays, why the remap? 
>>>> In the sample code above this 128x128 matrix is instantiated only 
>>>> once, takes up hardly any memory and saves the time needed for 
>>>> the remap! So why the hassle for the few unused spots in the 
>>>> matrix? Is it really worth all the trouble going from a 128x128 
>>>> array (we're talking about 16 KB of RAM!) to a 16x16 array or so?
>>>> I understand the conversion from BCSequence to char array, but that 
>>>> can still be done with the normal chars, right? Or is the idea that 
>>>> when we do the conversion we can do the remap along with it? I'm 
>>>> just worried that the code won't be easier to understand and will be 
>>>> much more error-prone if we have to remap everything all the time.
>>>> And Koen has a point: can't we just add the method 
>>>> charRepresentation in BCSequence, for instance, which does the 
>>>> translation job (and sequenceFromCharArray or something)? No need 
>>>> for a translation object, right?
>>>> Again, perhaps I'm taking too many steps in the wrong direction at 
>>>> once...
*********************************************************
                     ** Alexander Griekspoor **
*********************************************************
               The Netherlands Cancer Institute
               Department of Tumorbiology (H4)
          Plesmanlaan 121, 1066 CX, Amsterdam
                   Tel:  + 31 20 - 512 2023
                   Fax:  + 31 20 - 512 2029
                   AIM: mekentosj at mac.com
                   E-mail: a.griekspoor at nki.nl
               Web: http://www.mekentosj.com

                             iRNAi, do you?
              http://www.mekentosj.com/irnai

*********************************************************