[Biococoa-dev] starting BCAlignment

Philipp Seibel biococoa at bioworxx.com
Fri Mar 11 03:15:01 EST 2005

On 11.03.2005 at 07:21, Charles PARNOT wrote:

>> I think it's quite a good approach, but we have to decide whether we 
>> want to "ask" the matrix with two BCSymbols or just with chars. Take 
>> a look at my recent implementation of the scoring matrix. It's 
>> perhaps slower than this one, but more comfortable. I think we just 
>> have to test the performance, when we've done the first algorithm.
> I was thinking about the alignment implementation while driving to 
> the day care (!), and after looking at your code, I am so delighted to 
> see that I had something very similar in mind. What you are doing is 
> mapping symbols to int in your scoring matrix. My thought about it was 
> the same, but using... symbol sets, of course. Which you probably had 
> in mind, in fact, given the question you asked about 'allObjects'.
> The current implementation is still very much OO, which is good. Of 
> course, as a result, it might be slow, with the overhead from the 
> substituteSymbol:forSymbol, that scans the NSArray, and accessing the 
> symbols through the sequence objects, but Shark will tell.
> Then, if we need to optimize, there is an obvious(?) path, and here is 
> how we could use symbol sets:
> * The sequences you need to align define a SymbolSet, probably the 
> union of the symbol sets of the sequences
> * That instance of the BCSymbolSet class might then be able to provide 
> a perfect and reproducible bijection between that set of symbols and 
> int values
>     --> e.g. '(int)equivalentIntValueForSymbol:(BCSymbol *)aSymbol'
> And ONLY the BCSymbolSet class can decide on that bijection.
> One way could be to simply sort the symbols alphabetically.
> So again, one symbol in one SymbolSet = one int
> (very similar to what you did in BCScoreMatrix)
> * That bijection between symbols and int can be used to:
> 	- translate sequences into int array
> 	- translate the dictionary in the score matrix into int**

or int*, as I mentioned before ;-).
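
A minimal sketch of that symbol <--> int bijection in plain C, assuming the "sort the symbols alphabetically" rule suggested above. Everything here is hypothetical: symbols are plain chars rather than BCSymbol objects, and SymbolSet, symbol_set_from_sequences, int_for_symbol, symbol_for_int and encode_sequence are illustrative names, not existing BioCocoa API.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_SYMBOLS 64  /* arbitrary capacity chosen for this sketch */

/* Hypothetical stand-in for BCSymbolSet: unique symbols kept sorted,
   so the index of a symbol is a reproducible int value. */
typedef struct {
    char symbols[MAX_SYMBOLS];
    int  count;
} SymbolSet;

int compare_chars(const void *a, const void *b) {
    return (int)*(const unsigned char *)a - (int)*(const unsigned char *)b;
}

/* Build the set as the union of the symbols found in two sequences.
   Symbols beyond MAX_SYMBOLS are silently ignored in this sketch. */
SymbolSet symbol_set_from_sequences(const char *s1, const char *s2) {
    SymbolSet set = { {0}, 0 };
    const char *seqs[2] = { s1, s2 };
    for (int k = 0; k < 2; k++) {
        for (const char *p = seqs[k]; *p; p++) {
            if (set.count < MAX_SYMBOLS &&
                memchr(set.symbols, *p, (size_t)set.count) == NULL)
                set.symbols[set.count++] = *p;
        }
    }
    /* Sorting makes the mapping independent of the order of appearance. */
    qsort(set.symbols, (size_t)set.count, sizeof(char), compare_chars);
    return set;
}

/* Symbol -> int; returns -1 for a symbol that is not in the set. */
int int_for_symbol(const SymbolSet *set, char symbol) {
    const char *hit = memchr(set->symbols, symbol, (size_t)set->count);
    return hit ? (int)(hit - set->symbols) : -1;
}

/* int -> symbol; returns '?' for a value outside the set. */
char symbol_for_int(const SymbolSet *set, int value) {
    return (value >= 0 && value < set->count) ? set->symbols[value] : '?';
}

/* Translate a whole sequence into an int array of the same length. */
void encode_sequence(const SymbolSet *set, const char *seq, int *out) {
    for (size_t i = 0; seq[i]; i++)
        out[i] = int_for_symbol(set, seq[i]);
}
```

With the sequences "GATTACA" and "GCC" the sorted set is A, C, G, T, so A maps to 0 and T to 3. Keeping both directions of the mapping in one place is the point of the design: the score matrix and the algorithm never have to agree on the numbering by themselves.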

> * Then the alignment algorithm manipulates only int, and is completely 
> sequence-agnostic

That's what I want it to be.

> * After alignment, everything is translated back to symbols to 
> generate a BCAlignment object
> Nothing really original:
> objects ------> C ------> algorithm -------> C -------> Objects

I think that will be the approach to make it fast.
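
To illustrate the middle step of that pipeline, here is a rough sketch of an algorithm that only ever sees ints. It uses Needleman-Wunsch global alignment scoring with a single linear gap penalty as an example; the thread has not settled on an algorithm, and the function name, the flat scores layout (scores[a * n + b]) and the linear gap scheme are all assumptions made for this sketch.

```c
#include <stdlib.h>

/* Global alignment score over int-encoded sequences.
   a, b    : sequences already translated to ints in the range 0..n-1
   scores  : flat n*n substitution matrix, scores[x * n + y]
   gap     : score added for each gap position (usually negative)
   Returns the best score, or 0 if memory allocation fails. */
int global_alignment_score(const int *a, int len_a,
                           const int *b, int len_b,
                           const int *scores, int n, int gap) {
    int *prev = malloc((size_t)(len_b + 1) * sizeof(int));
    int *curr = malloc((size_t)(len_b + 1) * sizeof(int));
    if (prev == NULL || curr == NULL) {
        free(prev);
        free(curr);
        return 0;
    }

    /* First row: b aligned against nothing but gaps. */
    for (int j = 0; j <= len_b; j++)
        prev[j] = j * gap;

    for (int i = 1; i <= len_a; i++) {
        curr[0] = i * gap;  /* first column: a aligned against gaps */
        for (int j = 1; j <= len_b; j++) {
            int diag = prev[j - 1] + scores[a[i - 1] * n + b[j - 1]];
            int up   = prev[j] + gap;      /* gap in b */
            int left = curr[j - 1] + gap;  /* gap in a */
            int best = diag;
            if (up > best)   best = up;
            if (left > best) best = left;
            curr[j] = best;
        }
        /* Only two rows are needed to compute the score. */
        int *tmp = prev;
        prev = curr;
        curr = tmp;
    }

    int result = prev[len_b];
    free(prev);
    free(curr);
    return result;
}
```

Nothing in this function knows whether the ints stand for nucleotides, amino acids or some exotic symbol, which is what makes it sequence-agnostic. Recovering the alignment itself, and not just its score, would need the full table plus a traceback, which is left out here.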

> The first and last arrow are the 'translation' steps. To avoid 
> problems, that translation should be all in one place, which means all 
> in one class. For instance, BCSymbolSet (and not BCScoreMatrix). And 
> then, BCSymbolSet becomes really important in the framework. In the 
> end, also, a user could create exotic symbols, exotic sequences, and 
> exotic score matrices, and still use the same algorithms.
> A final comment about the score matrix in that design: because 
> BCSymbolSet is in charge of the int<-->BCSymbol translation, the score 
> matrix has to be defined as a dictionary, like John suggested. Such a 
> dictionary could use the symbols as the key, for instance to get the 
> score of substitution of symbolA for symbolB:
>    NSNumber *score = [[scoreDictionary objectForKey:symbolA] 
> objectForKey:symbolB];
> (since dictionary keys are copied, BCSymbol has to be immutable with 
> that design... or we could use the string representation)
> That makes matrices difficult to define programmatically, but easier 
> through plist.
> through plist.

What about storing them in .plists and representing them as int* in the object?
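
As a sketch of the int* representation only (loading from a .plist is not shown), the matrix could be one flat allocation indexed by the two int values. FlatScoreMatrix and its functions are hypothetical names, and the match/mismatch defaults are placeholders that a real loader would overwrite with the stored scores.

```c
#include <stdlib.h>

/* Hypothetical flat score matrix: n*n ints in a single allocation. */
typedef struct {
    int  n;       /* number of symbols in the symbol set */
    int *scores;  /* scores[a * n + b] = score for substituting a with b */
} FlatScoreMatrix;

/* Allocate the matrix and fill it with default match/mismatch scores.
   On allocation failure, scores is NULL and n is 0. */
FlatScoreMatrix flat_matrix_create(int n, int match, int mismatch) {
    FlatScoreMatrix m;
    m.n = n;
    m.scores = malloc((size_t)n * (size_t)n * sizeof(int));
    if (m.scores == NULL) {
        m.n = 0;
        return m;
    }
    for (int a = 0; a < n; a++)
        for (int b = 0; b < n; b++)
            m.scores[a * n + b] = (a == b) ? match : mismatch;
    return m;
}

/* Overwrite one entry, e.g. while reading values from a stored matrix. */
void flat_matrix_set(FlatScoreMatrix *m, int a, int b, int score) {
    m->scores[a * m->n + b] = score;
}

/* Look up one entry: a single multiply-add, no object messaging. */
int flat_matrix_score(const FlatScoreMatrix *m, int a, int b) {
    return m->scores[a * m->n + b];
}

void flat_matrix_free(FlatScoreMatrix *m) {
    free(m->scores);
    m->scores = NULL;
    m->n = 0;
}
```

Compared with int**, the flat layout needs only one allocation and keeps each row contiguous in memory. The indices a and b are the int values handed out by the symbol set, so the matrix has to be built against the same symbol set as the encoded sequences.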

> OK, maybe something else, but at least something fully OO and very readable.
> Final question: how do gaps fit in the matrix score thing? Is there a 
> score for a gap/symbol?

No, there is not. That is a separate option for the algorithm.

> Maybe gaps should be excluded from the symbol <--> int conversion? They 
> would be a special case, with some special scoring schemes, like 
> gap-open, gap-extension,...?
> Well, these were my 2 cents ;-)

Very helpful, thx
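
One way gaps could stay outside the matrix is a small options struct passed to the algorithm. This is only a sketch of an affine gap-open/gap-extension scheme: GapOptions and gap_penalty are hypothetical names, and the convention that the first gap position costs gap_open alone is an assumption (some tools also charge an extension for it).

```c
/* Hypothetical gap options, kept separate from the score matrix. */
typedef struct {
    int gap_open;    /* penalty charged once when a gap starts */
    int gap_extend;  /* penalty charged for each further gap position */
} GapOptions;

/* Total penalty for a run of `length` consecutive gap positions.
   Assumed convention: the first position costs gap_open only. */
int gap_penalty(const GapOptions *opts, int length) {
    if (length <= 0)
        return 0;
    return opts->gap_open + (length - 1) * opts->gap_extend;
}
```

With gap_open = 10 and gap_extend = 1, a gap of three positions costs 10 + 2 * 1 = 12. Since gaps never go through the symbol <--> int conversion, the score matrix stays n by n for the n real symbols.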

