[Biococoa-dev] Symbol mapping for optimization

Sat Mar 12 03:18:36 EST 2005

>@Charles: Perhaps we could discuss your symbol to int mapping in more detail, i didn't get the idea. ;-)
>
>Phil

Thanks for giving me a second chance! I reread my email, which I realized is quite obfuscated... I got carried away. I think what I am trying to say is basically what everybody has in mind, but I am trying to make it more formal, and I want to propose a specific design to handle the task.

So I will start with the only clear lines in my previous email...

We want something like this:

     objects ------> C ------> algorithm -------> C -------> Objects

The first and last arrow are the 'translation' steps. The translation takes the data structure in the objects and make C arrays, that will be well adapted for performance. In what we are doing with BioCocoa, we will mostly need to provide a translation for symbols, and decide on the type to use to replace BCSymbol. That type could be int, but maybe char would be more efficient with 1 byte instead of 4. So I changed my mind: let's map BCSymbols to chars, which I will sometimes write BCSymbol<-->char

I think we need to keep the following pieces REALLY separate:

(1) the BCSymbol classes and their cousins, BCAbstractSequence et al., BCScoreMatrix, BCSymbolSet... These are objects that only know about Symbols as objects, not as chars (at least they don't know about the char used for mapping).

(2) the translator class: this is the only class that knows how to map a symbol into a char.

(3) the algorithm class: the algorithm code has no idea about the biology. It takes arrays of chars and does its job. Note that this will probably apply to alignments, but could apply to other situations where we need to increase the performance, and that can manipulate arrays of chars and not care about their exact meaning.

I think what is really not obvious at first, and that can be confusing, is the separation between (1) and (2). It seems obvious that a char should be the char corresponding to the BCSymbol, for instance base 'A' should be mapped to char 'A'. Maybe we will do that initially but we want to be able to modify that in the future, or even to have more dynamic mapping depending on the context. For instance, we might find later that mapping the bases ATGC to teh chars '0x00-0x01-0x02-0x03' is much better than mapping to the 'ATGC' chars, because we don't have useless chars in between each used char. We then just have to modify the code in (2), and probably only one or two lines of code, to propagate whatever optimization we make in the translation to the whole framework.

If we don't keep code in (2) separate, and instead spread it in the different classes of (1) and (3), we will have some problems in maintenance and will slow down future evolutions. For instamce, if we let the algorithm (3) decide, then we have to rewrite the same code for a different algorithm, and then modify everything if we change our mind.

Finally, we could offer different mappings, BCSymbol<-->char or BCSymbol<-->int, depending which is best for a given algorithm.

At this point, I hope you see why having a separate translator class could make sense. Now, next step: its implementation. In the implementation of the translator, I can see how BCSymbolSet would be very useful. I think each BCSymbolSet could define a different mapping. For instance, a symbol set with ATGC would result in a certain mapping, where A is mapped to a certain char XXXX. But if the symbol set is ATGCBVHD, then symbol A could well be mapped to a different char, e.g. not XXXX but YYYY. Thus instead of having a fixed mapping BCSymbol <--> char, we could have a more dynamic mapping only dependent on a symbol set. This way, for instance, we could decide to always use the smaller possible matrix for scores, e.g. 4x4 for a symbol set of 4 symbols.

Symbol set are easy to define before starting an alignement, and should be easy to define before any algorithm where BCSymbol<-->char mapping makes sense. In the case of alignement, we would do the following:
* Define a BCSymbolSet that covers the sequences to align, e.g. union of the symbol sets of the sequences
* Use that symbol set to instantiate a new translator, e.g. 'translatorWithSymbolSet:'
* Call the translator to translate the BCSequences --> *char
* Call the translator to translate the BCScoreMatrix --> **int (the indexes will be chars cast to ints)
* Run the algorithm using only the chars
* Call the translator to translate back the chars into sequences et al.

(note about int**: you are right, Phil, that *int are faster to access, but you can have both **int and *int at the same time, because if you create a matrix a[][] as one block in memory, then you can use a[0][] = an *int with single index access, when needed).

does this email make more sense??

Thanks for reading it all :-)
These were my 4 cents.

Charles

NB: we may use the name 'mapper' instead of 'translator'...
-- 
Help science go fast forward:
http://cmgm.stanford.edu/~cparnot/xgrid-stanford/

Charles Parnot
charles.parnot at stanford.edu

Room  B157 in Beckman Center
279, Campus Drive
Stanford University
Stanford, CA 94305 (USA)

Tel +1 650 725 7754
Fax +1 650 725 8021