[Biococoa-dev] BCSymbolMapping (was: no subject)

Thu Mar 17 00:15:58 EST 2005

Sorry I could not reply earlier...

About optimizing later: this is a very true statement, and I actually brought it up several times. The other important thing to keep in mind is that you still want to keep your code 'optimizable' for later, when you feel there is a chance something could be done, and not lock yourself into an implementation that is difficult to change. What you propose is not that bad, I have to say, but it still won't allow us to test different options. Specifically, we would probably want to test other mapping options if we find that the algorithm spends more than 20% of its time retrieving scores from the score matrix. And I am quite confident that will happen... But there is a very good chance that I am wrong, so we would have to ask Shark, and then test different mappings if necessary. To test different mappings, we would need BCSymbolMapping. So here is what I propose:
* we try a little test program to see how much time is spent on score retrieval in the alignment algorithm, say, to align two sequences of 1000 bases and two sequences of 10000 bases?
* if the algorithm spends a lot of time there, then we implement BCSymbolMapping
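
The test program proposed in the list above could be sketched roughly as follows, in plain C. This is only a sketch under assumptions: `score_fn`, `flat_matrix`, `flat_lookup`, `global_score` and `time_global_score` are hypothetical names, not BioCocoa API, and the alignment is a generic global alignment score with a linear gap penalty, not the framework's own code. The score lookup goes through a function pointer so that two lookups can be timed with everything else equal. It only gives total CPU time per run; the breakdown inside the loop is still a question for Shark.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical lookup signature: score for a pair of already-mapped symbols. */
typedef int (*score_fn)(unsigned char a, unsigned char b, const void *ctx);

/* One possible lookup: a flat size x size table indexed by compact indices. */
typedef struct {
    const int *table;
    size_t size;
} flat_matrix;

static int flat_lookup(unsigned char a, unsigned char b, const void *ctx)
{
    const flat_matrix *fm = ctx;
    return fm->table[a * fm->size + b];
}

/* Global alignment score with a linear gap penalty, keeping two DP rows.
   Returns 0 on success, -1 if memory could not be allocated. */
static int global_score(const unsigned char *s, size_t n,
                        const unsigned char *t, size_t m,
                        score_fn score, const void *ctx,
                        int gap, int *result)
{
    assert(score != NULL && result != NULL);
    int *prev = malloc((m + 1) * sizeof *prev);
    int *curr = malloc((m + 1) * sizeof *curr);
    if (prev == NULL || curr == NULL) {
        free(prev);
        free(curr);
        return -1;
    }
    for (size_t j = 0; j <= m; j++)
        prev[j] = (int)j * gap;
    for (size_t i = 1; i <= n; i++) {
        curr[0] = (int)i * gap;
        for (size_t j = 1; j <= m; j++) {
            /* The only place where the score matrix is consulted. */
            int best = prev[j - 1] + score(s[i - 1], t[j - 1], ctx);
            int up = prev[j] + gap;
            int left = curr[j - 1] + gap;
            if (up > best) best = up;
            if (left > best) best = left;
            curr[j] = best;
        }
        int *tmp = prev;
        prev = curr;
        curr = tmp;
    }
    *result = prev[m];
    free(prev);
    free(curr);
    return 0;
}

/* CPU seconds spent in one run, or -1.0 if the run failed. */
static double time_global_score(const unsigned char *s, size_t n,
                                const unsigned char *t, size_t m,
                                score_fn score, const void *ctx,
                                int gap, int *result)
{
    clock_t start = clock();
    int rc = global_score(s, n, t, m, score, ctx, gap, result);
    clock_t end = clock();
    return rc == 0 ? (double)(end - start) / CLOCKS_PER_SEC : -1.0;
}
```

Calling `time_global_score` once per candidate lookup, on the same pair of 1000-base or 10000-base sequences, would give a first rough comparison before anything is implemented in the framework.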

I actually started writing the program by copying and pasting the code, but the alignment does not work. I will post the code in a separate message for Phil and you to have a look, because I have no idea how the algorithm works (Koen, you are not alone!).

Now, I still need to answer some of your points and continue the battle ;-)

>Caching would be nice, but again, why not let the BCSequence do the job itself (no hassle with helper objects), it's also THE place to store the cache IMHO...
If BCSequence takes care of the mapping, yes, sure.
But if the mapping is dependent on the BCSymbolSet used for it, then no, because the symbol set may be different from the sequence symbol set.

>>The whole idea of this class, again, would be to have a separate class that takes care of the mapping, and only of the mapping:
>>
>>     objects ------> C ------> algorithm -------> C -------> Objects
>>
>>The algorithm should not know anything about the biology. I would not want to see anything like -whatevermatrix['A']['G']- in the middle of the algorithm. Having the mapping done in a separate class allows to write the algorithm like this:
>Well, perhaps I'm more humanoid, but I like it better than whatevermatrix['0x00']['0x03'];
Sorry it was not clear. My point was more that the algorithm should not know what an 'A' or a 'G' is. This is why you should not see whatevermatrix['A']['G'], and you should not see whatevermatrix['0x00']['0x03']. The algorithm could well be aligning the Bible with the BioCocoa framework code, and do the job and not care.
If the mapping is not known to the algorithm, then there is no risk that some assumptions are made. This is what I really meant, just separating code.
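
To make the separation concrete, here is a minimal sketch in plain C, assuming a compact mapping where each symbol of an alphabet gets an index from 0 to n-1. All names (`symbol_mapping`, `mapping_init`, `mapping_encode`, `ungapped_score`) are hypothetical and not part of BioCocoa, and the scoring loop is a trivial ungapped stand-in for the real algorithm. The point is only that the scoring code sees indices and a table, never an 'A' or a 'G'.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define UNMAPPED 255

/* Hypothetical compact mapping from symbol bytes to indices 0..size-1. */
typedef struct {
    unsigned char index_of[256]; /* symbol byte -> index, UNMAPPED if absent */
    size_t size;                 /* number of symbols in the alphabet */
} symbol_mapping;

/* Build the mapping from an alphabet string such as "ACGT". */
static void mapping_init(symbol_mapping *m, const char *alphabet)
{
    size_t n = strlen(alphabet);
    assert(n > 0 && n < UNMAPPED);
    memset(m->index_of, UNMAPPED, sizeof m->index_of);
    for (size_t i = 0; i < n; i++)
        m->index_of[(unsigned char)alphabet[i]] = (unsigned char)i;
    m->size = n;
}

/* Translate symbols into indices. Returns 0 on success,
   -1 if a symbol is not part of the alphabet. */
static int mapping_encode(const symbol_mapping *m, const char *symbols,
                          unsigned char *out, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        unsigned char idx = m->index_of[(unsigned char)symbols[i]];
        if (idx == UNMAPPED)
            return -1;
        out[i] = idx;
    }
    return 0;
}

/* The algorithm side: knows only indices and a flat n x n score table. */
static int ungapped_score(const unsigned char *a, const unsigned char *b,
                          size_t len, const int *scores, size_t n)
{
    int total = 0;
    for (size_t i = 0; i < len; i++)
        total += scores[a[i] * n + b[i]];
    return total;
}
```

With this split, swapping in another mapping (a different alphabet, a different index width) only touches the mapping functions, and the scoring loop stays as it is.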

>Also it would change the code dramatically as well:
>BCSymbolSet *set=....union of the symbol sets of seq 1 and 2...    -> not necessary (unless we make the matrix creation dependent on the symbolset (see below)
>BCSymbolMapping *mapping=[BCSymbolMapping mappingWithSymbolSet:set];  
>-> not necessary
>char *seq1=[mapping charMappingForSequence:sequenceObject1];  -> same
>char *seq2=[mapping charMappingForSequence:sequenceObject2];  -> same
>int **scores=[mapping charMappingForScoreMatrix:matrix];  -> int **scores = [BCAlignment matrixForSymbolSet: set];
>// .... run the algorithm...
>BCSequenceAlignment *result=[BCAlignment alignementForSequences(int)count length:(int)length charBuffer:(char*)seqs]; -> Why make BCSymbolmapping the mother of alignments?!
BCSymbolMapping would not be the mother of anybody! It should be able to map any of the BioCocoa objects into C arrays. Maybe not BCSequenceAlignment, as they can be reconstructed from an array of sequences, I suppose.

>>* if a score is an int or a float, the matrix is actually 128 x 128 x 4 = 64 kilobytes
>That's right, but come on, 64kb that's nothing.

This is bigger than the L1 cache of most Macs out there. I believe this is the size of the L1 cache on the most recent G5. This is also 1/8 of the L2 cache. This means that the chip might even go back to RAM every time it tries to access the score matrix. And it will access the score matrix a lot, every time it compares two symbols. Like I said, Shark will tell. I just wanted to make my point about the size of that array, not in terms of RAM, but in terms of cache.
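
The size arithmetic can be checked with a tiny C helper. This is a sketch with a hypothetical name (`score_matrix_bytes`), assuming a square table and a fixed number of bytes per score (4 for a typical int or float).

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed for a square score table with one entry per symbol pair. */
static size_t score_matrix_bytes(size_t symbols, size_t bytes_per_score)
{
    assert(symbols > 0 && bytes_per_score > 0);
    return symbols * symbols * bytes_per_score;
}
```

`score_matrix_bytes(128, 4)` gives 65536 bytes, the 64 kilobytes quoted above. With a compact mapping of a 4-symbol alphabet, `score_matrix_bytes(4, 4)` is 64 bytes, and for 20 symbols `score_matrix_bytes(20, 4)` is 1600 bytes.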

>>* it is possible that int will be better than char because of the cast step? I know it is a big issue for float to int, but I don't know about char --> int; so maybe we will use int?
>Same thing, let's make the thing and Shark will tell us.
How will Shark tell us if we cannot easily change the mapping and compare implementations with everything else equal?

>>Phil, hang in there. Let's not let these guys take us down ;-)
>GRRRRR!!!! LOL,
>Cheers mates!
>Alex

Now I am going to add a little nerve playing part...ah,ah,ah... We had a nice barbecue yesterday evening after the swimming-pool. It was so warm outside it was really a relief to get in the water. This is why I did not answer the email earlier... Or the days before.

cheers :-)

charles

-- 
Help science go fast forward:
http://cmgm.stanford.edu/~cparnot/xgrid-stanford/

Charles Parnot
charles.parnot at stanford.edu

Room  B157 in Beckman Center
279, Campus Drive
Stanford University
Stanford, CA 94305 (USA)

Tel +1 650 725 7754
Fax +1 650 725 8021