[Biococoa-dev] starting BCAlignment

Charles PARNOT charles.parnot at stanford.edu
Fri Mar 11 01:21:18 EST 2005

>I think it's a quite good approach, but we have to decide wheter we want to "ask" the matrix with two BCSymbols or just with chars. Take a look at my recent implementation of the scoring matrix. It's perhaps slower than this one, but more comfortable. I think we just have to test the performance, when we've done the first algorithm.

I was thinking about the alignement implementation while driving to the day care (!), and after looking at your code, I am so delighted to see that I had something very similar in mind. What you are doing is mapping symbols to int in your scoring matrix. My thought about it was the same, but using... symbol sets, of course. Which you probably had in mind, in fact, given the question you asked about 'allObjects'.

The current implementation is still very much OO, which is good. Of course, as a result, it might be slow, with the overhead from the substituteSymbol:forSymbol, that scans the NSArray, and accessing the symbols through the sequence objects, but Shark will tell.

Then, if we need to optimize, there is an obvious(?) path, and here is how we could use symbol sets:

* The sequences you need to align define a SymbolSet, probaby the union of the symbol sets of the sequences

* That instance of the BCSymbolSet classmight be then able to provide a perfect and reproducible bijection between that set of symbols and int values
    --> e.g. '(int)equivalentIntValueForSymbol:(BCSymbol *)aSymbol'
And ONLY the BCSymbolSet class can decide on that bijection.
One way could be to simply sort the symbols alphabetically.
So again,   one symbol in one SymbolSet = one int
(very similar to what you did in BCScoreMatrix)

* That bijection between symbols and int can be used to:
	- translate sequences into int array
	- translate the dictionary in the score matrix into int**

* Then the alignment algorithm manipulates only int, and is completely sequence-agnostic

* After alignement, everything is translated back to symbols to generate a BCAlignement object

Nothing really original:
objects ------> C ------> algorithm -------> C -------> Objects

The first and last arrow are the 'translation' steps. To avoid problems, that translation should be all in one place, which means all in one class. For instance, BCSymbolSet (and not BCScoreMatrix). And then, BCSymbolSet becomes really important in the framework. In the end, also, a user could create exotic symbols, exotic sequences, and exotic score matrices, and still use the same algorithms.

A final comment about the scrore matrix in that design: because BCSymbolSet is in charge of the int<-->BCSymbol translation, the score matrix has to be defined as a dictionary, like John suggested. Such a dictionary could use the symbols as the key, for instance to get the score of substitution of symbolA for symbolB:
   NSNUmber *score = [[scoreDictionary objectForKey:symbolA] objectForKey:symbolB];
(key being copied for dictionary, BCSymbol has to be immutable with that design... or we could use the string representation)
That makes matrices difficult to define programatically, but easier through plist.
OK, maybe something else, but at something fully OO and very readable.

Final question: how do gaps fit in the matrix score thing? Is there a score for a gap/symbol? Maybe gaps sould be excluded from the symbol <--> int conversion? They would be a special case, with some special scoring schemes, like gap-open, gap-extension,...?

Well, these were my 2 cents ;-)

Help science go fast forward:

Charles Parnot
charles.parnot at stanford.edu

Room  B157 in Beckman Center
279, Campus Drive
Stanford University
Stanford, CA 94305 (USA)

Tel +1 650 725 7754
Fax +1 650 725 8021

More information about the Biococoa-dev mailing list