[Biococoa-dev] BCSymbolMapping
Alexander Griekspoor
a.griekspoor at nki.nl
Sun Mar 13 17:11:25 EST 2005
> On 13.03.2005 at 22:41, Alexander Griekspoor wrote:
>
>> Hmmm, somehow I totally miss the reason for the remapping. Why would it
>> be leaner/faster?
>
> It would not be faster, but more flexible, because we map the symbols
> to the minimal set of ints.
An int is 4 times the size of a char, so there goes part of the
optimization. And I don't see why it would be more flexible; a basic
128x128 array covers all the ASCII characters that you want.
> Not only for performance or memory optimization.
All that trouble to save 16 KB (a 128x128 char matrix)?! And
it's only allocated once!
As for a 500-nucleotide char array, it will be just as big in
memory whether it holds 'ACGT' or '0x00 0x01 0x02 0x03'. And which code
is easier to read? Also, the remapping will come at a cost (not much, but
hey, more code is more time and more errors).
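To put numbers on the memory argument, here is a minimal sketch in plain C. The helper names table_bytes and sequence_bytes are made up for illustration and are not part of BioCocoa; the factor of 4 for int assumes a platform where sizeof(int) is 4.

```c
#include <stddef.h>

/* Bytes needed for a square score table with `side` rows and columns,
   where every cell occupies `cell_size` bytes. */
size_t table_bytes(size_t side, size_t cell_size)
{
    return side * side * cell_size;
}

/* Bytes needed for a sequence of `length` elements. The result depends
   only on the element type, not on which values are stored, so 'A' and
   0x00 cost exactly the same. */
size_t sequence_bytes(size_t length, size_t element_size)
{
    return length * element_size;
}
```

With these, table_bytes(128, sizeof(char)) gives 16384 bytes (16 KB), a remapped 16x16 char table gives 256 bytes, and a 500-symbol sequence costs 500 bytes as chars but 500 * sizeof(int) bytes as ints.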
> The next problem is handling everything only through the char method, because
> we need to check for uppercase or other semantic things, which the
> algorithm is not really responsible for.
No, we do not have to, because we know which char each symbol will
return. The symbol templates dictate that (currently uppercase)!
The proposed -charArrayRepresentation (or something like it) method in
the BCSequence superclass will simply iterate over the symbols and ask
each one for its symbol via the - (unichar) symbol; method. For the
other way around we should just add an initFromCharArray or something to
BCSequence.
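Roughly what I have in mind for that conversion, sketched in plain C rather than Objective-C. The Symbol struct and all function names below are hypothetical stand-ins for BCSymbol and the proposed BCSequence methods, and char is used where the real method would return unichar.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a BCSymbol: only the character it
   reports matters for this sketch. */
typedef struct {
    char symbol;
} Symbol;

/* Plays the role of the symbol's own accessor method. */
char symbol_char(const Symbol *s)
{
    return s->symbol;
}

/* Builds a NUL-terminated char array by asking each symbol for its
   character. Returns NULL if allocation fails; the caller frees it. */
char *char_array_from_symbols(const Symbol *symbols, size_t count)
{
    char *out = malloc(count + 1);
    if (out == NULL)
        return NULL;
    for (size_t i = 0; i < count; i++)
        out[i] = symbol_char(&symbols[i]);
    out[count] = '\0';
    return out;
}

/* The other direction: fill `out` (room for `count` symbols) from chars. */
void symbols_from_char_array(const char *chars, size_t count, Symbol *out)
{
    for (size_t i = 0; i < count; i++)
        out[i].symbol = chars[i];
}
```

Both directions are a single loop and no remapping table is involved.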
> These things should be handled by the BCSymbol stuff. For example a
> 'a' and 'A' should be mapped to the same int in the DNA symbol class.
Well, you can store 4 variants of a char in the space of one int ;-)
But again that's not an issue, see above.
>
>
>> What's the difference between:
>> char c = ('a' == 'a') ? 'I' : 'X';
>> and:
>> char c = (0x00 == 0x00) ? 'I' : 'X';
>> So in the example I borrowed from the sample code I used previously,
>> the substitution matrix is a simple 128x128 char array and
>> the characters are placed at their own spot.
>>
>>> match = 1;
>>> mismh = -1;
>>> /* set match and mismatch weights */
>>> for ( i = 0; i < 128 ; i++ )
>>> for ( j = 0; j < 128 ; j++ )
>>> if (i == j ) v[i][j] = match;
>>> else v[i][j] = mismh;
>>>
>>> v['N']['N'] = mismh;
>>> v['n']['n'] = mismh;
>>> v['A']['a'] = v['a']['A'] = match;
>>> v['C']['c'] = v['c']['C'] = match;
>>> v['G']['g'] = v['g']['G'] = match;
>>> v['T']['t'] = v['t']['T'] = match;
>>>
>>> So, you simply build a 128x128 char matrix using the fact that chars
>>> are ints.
>>> Next, to calculate the score:
>>>
>>> char a = A[++i]; // character i in sequence A
>>> char b = B[++j]; // character j in sequence B
>>> *c++ = (a == b || (isdna && v[a][b] == MATCHSC)) ? '|' : ' ';
>>
>> So again, if we convert the sequences to char arrays why the remap?
>> In the sample code above this 128x128 matrix is instantiated only
>> once, takes up hardly any memory and prevents the time needed for the
>> remap! So why the hassle for the few unused spots in the matrix? Is
>> it really worth all the trouble going from a 128x128 array (we're
>> talking about 16 KB of RAM!) to a 16x16 array or so?
>> I understand the conversion from BCSequence to char-array, but that
>> can still be done with the normal chars right? Or is the idea that
>> when we do the conversion we can do the remap along with it? I'm just
>> worried that the code won't be easier to understand and will be much
>> more error prone if we have to remap everything all the time.
>> And Koen has a point: can't we just add a charRepresentation method
>> to BCSequence, for instance, which does the translation job (and
>> sequenceFromCharArray or something)? No need for a translation object,
>> right?
>> Again, perhaps I'm taking too many steps in the wrong direction at
>> once...
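For completeness, here is the quoted fragment worked out as a self-contained C sketch, indexed directly by ASCII code with no remapping. The names init_dna_scores and match_line are made up for illustration. Unlike the quoted expression, the match line here is decided by the table alone, so N against N shows as a mismatch.

```c
#include <ctype.h>
#include <stddef.h>

enum { ASCII_SIZE = 128, MATCH = 1, MISMATCH = -1 };

/* 128 x 128 one-byte cells = 16384 bytes, set up once. */
signed char v[ASCII_SIZE][ASCII_SIZE];

/* Identity scoring, then the DNA special cases from the quoted code. */
void init_dna_scores(void)
{
    for (int i = 0; i < ASCII_SIZE; i++)
        for (int j = 0; j < ASCII_SIZE; j++)
            v[i][j] = (i == j) ? MATCH : MISMATCH;

    /* An unknown base never counts as a match, not even with itself. */
    v['N']['N'] = v['n']['n'] = MISMATCH;

    /* Upper and lower case of the same base count as a match. */
    const char bases[] = "ACGT";
    for (size_t k = 0; bases[k] != '\0'; k++) {
        int upper = bases[k];
        int lower = tolower(upper);
        v[upper][lower] = v[lower][upper] = MATCH;
    }
}

/* Writes '|' where a[i] and b[i] score as a match and ' ' elsewhere.
   `out` must have room for len + 1 chars. */
void match_line(const char *a, const char *b, size_t len, char *out)
{
    for (size_t i = 0; i < len; i++) {
        unsigned char x = (unsigned char)a[i];
        unsigned char y = (unsigned char)b[i];
        /* Characters outside 7-bit ASCII are treated as mismatches. */
        int ok = x < ASCII_SIZE && y < ASCII_SIZE && v[x][y] == MATCH;
        out[i] = ok ? '|' : ' ';
    }
    out[len] = '\0';
}
```

After calling init_dna_scores() once, match_line("ACGN", "acTN", 4, out) leaves "||  " in out: the case difference is absorbed by the table, and no uppercase check is needed in the algorithm itself.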
>
>
>
> Phil
**************************************************************
** Alexander Griekspoor **
**************************************************************
The Netherlands Cancer Institute
Department of Tumorbiology (H4)
Plesmanlaan 121, 1066 CX, Amsterdam
Tel: + 31 20 - 512 2023
Fax: + 31 20 - 512 2029
AIM: mekentosj at mac.com
E-mail: a.griekspoor at nki.nl
Web: http://www.mekentosj.com
MacOS X: The power of UNIX with the simplicity of the Mac
***************************************************************