[Biococoa-dev] BCSymbolMapping

Philipp Seibel biococoa at bioworxx.com
Sun Mar 13 16:55:02 EST 2005


On 13.03.2005 at 22:41, Alexander Griekspoor wrote:

> Hmmm, somehow I totally miss the reason for the remapping. Why would 
> it be leaner/faster?

It would not be faster, but it would be more flexible, because we map 
the symbols to the minimal set of ints. The point is not only 
performance or memory optimization.
The next problem is handling everything through the char method alone: 
the algorithm would then have to check for uppercase and other semantic 
details that it is not really responsible for.
These things should be handled by the BCSymbol classes. For example, 
'a' and 'A' should be mapped to the same int in the DNA symbol class.
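
To make the idea concrete, here is a minimal sketch in plain C. All 
names in it (dna_index, fill_scores, score_chars, the SYM_ constants) 
are hypothetical and not existing BioCocoa API; it only assumes a DNA 
alphabet of A, C, G, T and N. The case folding happens once in the 
mapping, so the scoring matrix shrinks to 5x5 and the algorithm never 
has to think about upper or lower case:

```c
#include <assert.h>

/* Hypothetical compact symbol indices for a DNA alphabet. */
enum { SYM_A, SYM_C, SYM_G, SYM_T, SYM_N, SYM_COUNT, SYM_INVALID = -1 };

/* Map a character to its compact index. Upper and lower case are
   folded here, so the alignment code never has to check for it. */
static int dna_index(char ch)
{
    switch (ch) {
    case 'A': case 'a': return SYM_A;
    case 'C': case 'c': return SYM_C;
    case 'G': case 'g': return SYM_G;
    case 'T': case 't': return SYM_T;
    case 'N': case 'n': return SYM_N;
    default:            return SYM_INVALID;
    }
}

/* Fill the SYM_COUNT x SYM_COUNT matrix with match/mismatch weights.
   N never counts as a match, as in the quoted sample code. */
static void fill_scores(int v[SYM_COUNT][SYM_COUNT], int match, int mismatch)
{
    for (int i = 0; i < SYM_COUNT; i++)
        for (int j = 0; j < SYM_COUNT; j++)
            v[i][j] = (i == j && i != SYM_N) ? match : mismatch;
}

/* Score two characters by looking up their compact indices. */
static int score_chars(int v[SYM_COUNT][SYM_COUNT], char a, char b)
{
    int i = dna_index(a);
    int j = dna_index(b);
    assert(i != SYM_INVALID && j != SYM_INVALID);
    return v[i][j];
}
```

With this layout the special cases for 'A'/'a' and so on disappear 
from the matrix setup, because both characters land on the same index 
before the matrix is ever consulted.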




> What's the difference between:
> char c = ('a' == 'a') ? 'I' : 'X';
> and:
> char c = ('0x00' == '0x00') ? 'I' : 'X';
> So in the example I borrowed from the sample code I used previously, 
> the substitution matrix is a simple 128x128 char array and 
> the characters are placed at their own spot.
>
>>     match = 1;
>>     mismh = -1;
>>     /* set match and mismatch weights */
>>     for ( i = 0; i < 128 ; i++ )
>>       for ( j = 0; j < 128 ; j++ )
>>         if (i == j ) v[i][j] = match;
>>         else v[i][j] = mismh;
>>
>>     v['N']['N'] = mismh;
>>     v['n']['n'] = mismh;
>>     v['A']['a'] = v['a']['A'] = match;
>>     v['C']['c'] = v['c']['C'] = match;
>>     v['G']['g'] = v['g']['G'] = match;
>>     v['T']['t'] = v['t']['T'] = match;
>>
>> So, you simply build a 128x128 char matrix using the fact that chars 
>> are ints
>> Next to calculate the score:
>>
>>  char a = A[++i];	// character i in sequence A
>>  char b = B[++j];	// character j in sequence B
>>  *c++ = (a == b || (isdna && v[a][b] == MATCHSC)) ? '|' : ' ';
>
> So again, if we convert the sequences to char arrays why the remap? In 
> the sample code above this 128x128 matrix is instantiated only once, 
> takes up hardly any memory and prevents the time needed for the remap! 
> So why the hassle for the few unused spots in the matrix? Is it really 
> worth all the trouble going from a 128x128 array (we're talking about 
> 16 KB of RAM!) to a 16x16 array or so?
> I understand the conversion from BCSequence to char-array, but that 
> can still be done with the normal chars right? Or is the idea that 
> when we do the conversion we can do the remap along with it? I'm just 
> worried that the code won't be easier to understand and will be much 
> more error prone if we have to remap everything all the time.
> And Koen has a point: can we just add a method charRepresentation to 
> BCSequence, for instance, which does the translation job (and 
> sequenceFromCharArray) or something? No need for a translation object, 
> right?
> Again, perhaps I'm taking too many steps in the wrong direction at 
> once...



Phil