[Biococoa-dev] (no subject)

Mon Mar 14 03:58:28 EST 2005

Wow, this seems to have become a very hot topic.

> Well, perhaps I'm more humanoid, but I like it better than 
> whatevermatrix['0x00']['0x03'];

Fast algorithms may not be human-readable ;-)

>>
> Also it would change the code dramatically as well:
> BCSymbolSet *set=....union of the symbol sets of seq 1 and 2...    -> 
> not necessary (unless we make the matrix creation dependent on the 
> symbolset, see below)
> BCSymbolMapping *mapping=[BCSymbolMapping mappingWithSymbolSet:set];   
> -> not necessary
> char *seq1=[mapping charMappingForSequence:sequenceObject1];  -> same
> char *seq2=[mapping charMappingForSequence:sequenceObject2];  -> same
> int **scores=[mapping charMappingForScoreMatrix:matrix];  -> int 
> **scores = [BCAlignment matrixForSymbolSet: set];

I can't agree with this: we need to make the scoringMatrix 
customizable, so caching and converting have to happen outside the 
BCAlignment class.
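
A minimal sketch of that separation, in plain C (which also compiles as Objective-C): the caller supplies its own scoring function, and a table indexed directly by the ASCII codes of the symbols is built once, outside any alignment class. The names ScoreFunction, simpleDNAScore, buildScoreTable and lookupScore are hypothetical and not part of BioCocoa, and the +5/-4 scheme is only an assumed example.

```c
#include <assert.h>
#include <stdlib.h>

#define ALPHABET 128  /* 7-bit ASCII codes are used directly as indices */

/* Hypothetical callback: lets the caller plug in any scoring scheme. */
typedef int (*ScoreFunction)(char a, char b);

/* Example scheme (an assumption): +5 for a match, -4 for a mismatch. */
int simpleDNAScore(char a, char b)
{
    return (a == b) ? 5 : -4;
}

/* Build a 128 x 128 table for the given symbols; the caller keeps
   (caches) it and frees it. Entries for other characters stay 0. */
int *buildScoreTable(const char *symbols, ScoreFunction score)
{
    int *table = calloc((size_t)ALPHABET * ALPHABET, sizeof *table);
    if (table == NULL)
        return NULL;
    for (const char *a = symbols; *a != '\0'; a++) {
        for (const char *b = symbols; *b != '\0'; b++) {
            unsigned char i = (unsigned char)*a;
            unsigned char j = (unsigned char)*b;
            assert(i < ALPHABET && j < ALPHABET);  /* ASCII symbols only */
            table[i * ALPHABET + j] = score(*a, *b);
        }
    }
    return table;
}

/* What the inner loop of an alignment would do: one array lookup. */
int lookupScore(const int *table, char a, char b)
{
    return table[(unsigned char)a * ALPHABET + (unsigned char)b];
}
```

Whoever owns the scoring matrix would call something like buildScoreTable("ACGT", simpleDNAScore) once and hand the resulting table to the alignment code, which then only does lookups.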

Phil

> // .... run the algorithm...
> BCSequenceAlignment *result=[BCAlignment 
> alignementForSequences(int)count length:(int)length 
> charBuffer:(char*)seqs]; -> Why make BCSymbolMapping the mother of 
> alignments?!
>
>>
>> Again, I do think that mapping to the representing char of a symbol 
>> will make sense and might do the job (and will be VERY convenient for 
>> debugging), so I agree with you Koen and Alex.
>
>> But separating the mapping step allows for easier modifications in 
>> the future:
>> * it is possible that a 16-byte score matrix will use the caches 
>> more efficiently than a 16-kilobyte one; it is not just a RAM issue; L2 
>> cache is 512 KB on the dual G5, not sure about L1; it may even fit in 
>> registers (?)
> Yes could be, but I really doubt if this is the bottleneck in the 
> algorithm, this would be a typical example of doing lots of tuning 
> before we even know where the problem is! Let's first make the thing 
> in the SIMPLE way and then optimize it. We can always implement the 
> remapping IF indeed there's lots to win in this area.
>
>> * if a score is an int or a float, the matrix is actually 128 x 128 x 
>> 4 = 64 kilobytes
> That's right, but come on, 64 KB, that's nothing.
>
>> * it is possible that int will be better than char because of the 
>> cast step? I know it is a big issue for float to int, but I don't 
>> know about char --> int; so maybe we will use int?
> Same thing, let's make the thing and Shark will tell us.
>>
>> The most important is: we don't know yet any of that and we will know 
>> only later, after running Shark on real cases.
> Aha, too early again ;-)
>> If we have everything in place to easily test and choose the best 
>> mapping, it will be easier.
> No mapping at all ;-)
>> Also, the mapping could be useful for other purposes (like saving as 
>> binary and compress, but not the best example!). Finally, if we find 
>> that we need to improve the mapping step, at least there will be 
>> mostly one class that will have to be modified.
> Or none, well, you get the point. Sorry for that, couldn't resist.
>> The mapping class may evolve to take more parameters and implement 
>> different approaches depending on the symbol set (at which point it 
>> would become a class cluster, but don't get me there).
>>
>> Sorry this whole email comes a bit after the discussion, but my main 
>> point is to make a case in favor of a separate class for mapping. I 
>> think it will help, and not obfuscate things, but actually separate 
>> things better, and make them clearer!
> That I have no problem with, I believe there might be a need in the 
> future for this thing, but I don't see why we would need it in 
> alignments before we start to optimize things, and thus I don't see 
> why we would implement it now if there's not yet a purpose. We would 
> do better to focus on writing a damn fast BCSequence to char array 
> converter ;-)
>>
>> Phil, hang in there. Let's not let these guys take us down ;-)
> GRRRRR!!!! LOL,
> Cheers mates!
> Alex
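
The size trade-off discussed in the quoted thread can be made concrete with a small C sketch. It assumes a symbol set of plain ASCII characters; the names scoreTableBytes, buildCompactMapping and remapSequence are hypothetical illustrations, not BioCocoa API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ASCII_SIZE 128

/* Bytes needed for a square score table over symbolCount symbols. */
size_t scoreTableBytes(size_t symbolCount, size_t bytesPerScore)
{
    return symbolCount * symbolCount * bytesPerScore;
}

/* Fill map[] so that each symbol maps to its position in symbols and
   every other ASCII code maps to -1. Returns the number of symbols. */
size_t buildCompactMapping(const char *symbols, signed char map[ASCII_SIZE])
{
    size_t n = strlen(symbols);
    assert(n < ASCII_SIZE);               /* indices must fit a signed char */
    memset(map, -1, ASCII_SIZE);
    for (size_t i = 0; i < n; i++) {
        unsigned char c = (unsigned char)symbols[i];
        assert(c < ASCII_SIZE);           /* ASCII symbols only */
        map[c] = (signed char)i;
    }
    return n;
}

/* Rewrite a sequence in place as compact indices. Assumes every
   character of seq belongs to the symbol set used to build map. */
void remapSequence(char *seq, size_t length, const signed char map[ASCII_SIZE])
{
    for (size_t i = 0; i < length; i++)
        seq[i] = (char)map[(unsigned char)seq[i]];
}
```

With a four-symbol set and one-byte scores, scoreTableBytes(4, 1) gives 16 bytes; indexing directly by ASCII code gives scoreTableBytes(128, 1) = 16384 bytes, or scoreTableBytes(128, 4) = 65536 bytes with 4-byte scores, which matches the figures in the thread. Whether the smaller table is actually faster is the kind of question only profiling on real cases would settle.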