[Biococoa-dev] (no subject)
charles.parnot at stanford.edu
Mon Mar 14 03:13:04 EST 2005
At 9:28 AM +0100 3/12/05, Alexander Griekspoor wrote:
>Sounds awesome Charles, great ideas. I guess many algorithms can benefit from this approach. It's indeed very wise to "standardize" this conversion path and provide some "legal" way to go from symbols to C structures and vice versa for performance reasons, in cases where native use of BCSequences is not possible or does not suffice.
At 10:41 PM +0100 3/13/05, Alexander Griekspoor wrote:
>Hmmm, somehow I totally miss the reason for the remapping. Why would it be leaner/faster?
Somehow, you have to explain that in more detail ;-)
The BCSymbolMapping class proposed by Phil is exactly what I had in mind. I would add the following methods:
- (char *)charMappingForSequence:(BCAbstractSequence *)sequence;
- (char **)charMappingForScoreMatrix:yadayada..;
... and the same backwards...
The BCSymbolMapping can even take care of the malloc, as in the snippet above (with automatic autorelease; I can give more details on how). It could implement some caching in the future if needed (@Phil: BTW, I would rather have BCSymbolMapping do the caching than BCScoreMatrix; ref: a previous email from you, see what I mean?).
The whole idea of this class, again, would be to have a separate class that takes care of the mapping, and only of the mapping:
objects ------> C ------> algorithm -------> C -------> Objects
The algorithm should not know anything about the biology. I would not want to see anything like whatevermatrix['A']['G'] in the middle of the algorithm. Having the mapping done in a separate class allows the algorithm to be written like this:
BCSymbolSet *set = ...union of the symbol sets of seq 1 and 2...;
BCSymbolMapping *mapping = [BCSymbolMapping mappingWithSymbolSet:set];
char *seq1 = [mapping charMappingForSequence:sequenceObject1];
char *seq2 = [mapping charMappingForSequence:sequenceObject2];
char **scores = [mapping charMappingForScoreMatrix:matrix];
// ... run the algorithm ...
BCSequenceAlignment *result = [mapping alignmentForSequences:(int)count length:(int)length charBuffer:(char *)seqs];
Again, I do think that mapping to the representing char of a symbol makes sense and might do the job (and will be VERY convenient for debugging), so I agree with you, Koen and Alex. But separating the mapping step allows for easier modifications in the future:
* it is possible that a 16-byte score matrix will use the caches more efficiently than a 16-kilobyte one; it is not just a RAM issue: the L2 cache is 512 KB on a dual G5 (not sure about L1); it may even fit in registers (?)
* if a score is an int or a float, the full matrix is actually 128 x 128 x 4 = 64 kilobytes
* it is possible that int will be better than char because of the cast step; I know the cast is a big issue for float to int, but I don't know about char --> int, so maybe we will use int?
The most important point is: we don't know any of that yet, and we will only know later, after running Shark on real cases. If we have everything in place to easily test and choose the best mapping, it will be easier. Also, the mapping could be useful for other purposes (like saving as binary and compressing, though that is not the best example!). Finally, if we find that we need to improve the mapping step, at least mostly one class will have to be modified. The mapping class may evolve to take more parameters and implement different approaches depending on the symbol set (at which point it would become a class cluster, but don't get me started).
Sorry this whole email comes a bit after the discussion, but my main point is to make a case in favor of a separate class for mapping. I think it will help rather than obfuscate things; it will actually separate things better and make them clearer!
Phil, hang in there. Let's not let these guys take us down ;-)
Help science go fast forward:
charles.parnot at stanford.edu
Room B157 in Beckman Center
279, Campus Drive
Stanford, CA 94305 (USA)
Tel +1 650 725 7754
Fax +1 650 725 8021