[Biococoa-dev] starting BCAlignment
Philipp Seibel
biococoa at bioworxx.com
Fri Mar 11 03:15:01 EST 2005
Am 11.03.2005 um 07:21 schrieb Charles PARNOT:
>> I think it's a quite good approach, but we have to decide wheter we
>> want to "ask" the matrix with two BCSymbols or just with chars. Take
>> a look at my recent implementation of the scoring matrix. It's
>> perhaps slower than this one, but more comfortable. I think we just
>> have to test the performance, when we've done the first algorithm.
>
> I was thinking about the alignement implementation while driving to
> the day care (!), and after looking at your code, I am so delighted to
> see that I had something very similar in mind. What you are doing is
> mapping symbols to int in your scoring matrix. My thought about it was
> the same, but using... symbol sets, of course. Which you probably had
> in mind, in fact, given the question you asked about 'allObjects'.
>
> The current implementation is still very much OO, which is good. Of
> course, as a result, it might be slow, with the overhead from the
> substituteSymbol:forSymbol, that scans the NSArray, and accessing the
> symbols through the sequence objects, but Shark will tell.
>
>
> Then, if we need to optimize, there is an obvious(?) path, and here is
> how we could use symbol sets:
>
> * The sequences you need to align define a SymbolSet, probaby the
> union of the symbol sets of the sequences
>
> * That instance of the BCSymbolSet classmight be then able to provide
> a perfect and reproducible bijection between that set of symbols and
> int values
> --> e.g. '(int)equivalentIntValueForSymbol:(BCSymbol *)aSymbol'
> And ONLY the BCSymbolSet class can decide on that bijection.
> One way could be to simply sort the symbols alphabetically.
> So again, one symbol in one SymbolSet = one int
> (very similar to what you did in BCScoreMatrix)
>
> * That bijection between symbols and int can be used to:
> - translate sequences into int array
> - translate the dictionary in the score matrix into int**
or int* as i mentioned before ;-).
>
> * Then the alignment algorithm manipulates only int, and is completely
> sequence-agnostic
Thats what i want it to be.
> * After alignement, everything is translated back to symbols to
> generate a BCAlignement object
>
>
> Nothing really original:
> objects ------> C ------> algorithm -------> C -------> Objects
I think that will be the approach to make it fast.
>
> The first and last arrow are the 'translation' steps. To avoid
> problems, that translation should be all in one place, which means all
> in one class. For instance, BCSymbolSet (and not BCScoreMatrix). And
> then, BCSymbolSet becomes really important in the framework. In the
> end, also, a user could create exotic symbols, exotic sequences, and
> exotic score matrices, and still use the same algorithms.
>
> A final comment about the scrore matrix in that design: because
> BCSymbolSet is in charge of the int<-->BCSymbol translation, the score
> matrix has to be defined as a dictionary, like John suggested. Such a
> dictionary could use the symbols as the key, for instance to get the
> score of substitution of symbolA for symbolB:
> NSNUmber *score = [[scoreDictionary objectForKey:symbolA]
> objectForKey:symbolB];
> (key being copied for dictionary, BCSymbol has to be immutable with
> that design... or we could use the string representation)
> That makes matrices difficult to define programatically, but easier
> through plist.
What about storing in .plists and representing as int* in the Object.
> OK, maybe something else, but at something fully OO and very readable.
>
> Final question: how do gaps fit in the matrix score thing? Is there a
> score for a gap/symbol?
No there is not. That is a seperate option for the algorithm.
> Maybe gaps sould be excluded from the symbol <--> int conversion? They
> would be a special case, with some special scoring schemes, like
> gap-open, gap-extension,...?
>
> Well, these were my 2 cents ;-)
Very helpful, thx
Phil
More information about the Biococoa-dev
mailing list