[Biococoa-dev] BCSymbolMapping (was: no subject)
Alexander Griekspoor
a.griekspoor at nki.nl
Thu Mar 17 16:36:35 EST 2005
On 17-mrt-05, at 6:15, Charles PARNOT wrote:
> Sorry I could not reply earlier...
>
> About optimizing later: this is a very true statement, I actually
> brought it up several times, but the other important thing to keep in
> mind is you still want to keep your code 'optimizable' later when you
> feel there is a chance something could be done, and not lock you up in
> a difficult to change implementation.
True.
> What you propose is not that bad, I have to say, but it still won't
> allow us to test different options. Specifically, we would probably
> want to test other mapping options if we find that the algorithm spends
> more than 20% of the time retrieving scores from the score matrix. And
> I am quite confident that will happen... But there is a very good
> chance that I am wrong, so we would have to ask Shark. And then test
> different mappings if necessary. To test different mappings, we would
> need BCSymbolMapping. So here is what I propose:
> * we try a little test program to see how much time is spent on the
> score retrieval for the alignment algorithm, say to align two
> sequences of 1000 bases and 2 sequences of 10000 bases?
> * if the algorithm spends a lot of time there, then we implement
> BCSymbolMapping
Yes, exactly the idea, only start such implementations once you have
actually seen that the problem is there.
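
As a rough complement to a Shark profile, the test program proposed above
could count how often the score table is consulted. The sketch below is
only an illustration: a plain Needleman-Wunsch scoring loop with a linear
gap penalty stands in for the real BioCocoa alignment code, and the names
lookup, lookup_count and align_score are hypothetical, not part of the
framework.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical instrumentation: number of score-table reads so far. */
static unsigned long lookup_count = 0;

/* Read one score from a flat count x count table and count the access. */
static int lookup(const int *scores, int count, uint8_t a, uint8_t b) {
    lookup_count++;
    return scores[a * count + b];
}

/* Global alignment score (Needleman-Wunsch, linear gap penalty), using two
   rolling rows. Sequences are arrays of symbol indices, not characters.
   Returns 0 if memory allocation fails (a sketch-level simplification). */
static int align_score(const uint8_t *s, size_t n, const uint8_t *t, size_t m,
                       const int *scores, int count, int gap) {
    int *prev = malloc((m + 1) * sizeof *prev);
    int *curr = malloc((m + 1) * sizeof *curr);
    if (!prev || !curr) { free(prev); free(curr); return 0; }

    for (size_t j = 0; j <= m; j++) prev[j] = (int)j * gap;
    for (size_t i = 1; i <= n; i++) {
        curr[0] = (int)i * gap;
        for (size_t j = 1; j <= m; j++) {
            /* One table read per cell of the n x m grid. */
            int best = prev[j - 1] + lookup(scores, count, s[i - 1], t[j - 1]);
            if (prev[j] + gap > best) best = prev[j] + gap;
            if (curr[j - 1] + gap > best) best = curr[j - 1] + gap;
            curr[j] = best;
        }
        int *tmp = prev; prev = curr; curr = tmp;
    }
    int result = prev[m];
    free(prev);
    free(curr);
    return result;
}
```

For two sequences of lengths n and m this loop reads the table n x m
times, so two sequences of 1000 bases mean a million reads. Whether those
reads dominate the run time is exactly what the profiler would have to
show.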
>
> I actually started writing the program by copying and pasting the
> code, but the alignment does not work. I will post the code in a
> separate message for Phil and you to have a look, because I have no
> idea how the algorithm works (Koen, you are not alone!).
Did my comments in the .m file help?
>
> Now, I still need to answer some of your points and continue the
> battle ;-)
Oh oh...
>
>> Caching would be nice, but again, why not let the BCSequence do the
>> job itself (no hassle with helper objects), it's also THE place to
>> store the cache IMHO...
> If BCSequence takes care of the mapping, yes, sure.
> But if the mapping is dependent on the BCSymbolSet used for it, then
> no, because the symbol set may be different from the sequence symbol
> set.
True, if we implement the mapping the situation is different indeed.
>
>>> The whole idea of this class, again, would be to have a separate
>>> class that takes care of the mapping, and only of the mapping:
>>>
>>> objects ------> C ------> algorithm -------> C -------> Objects
>>>
>>> The algorithm should not know anything about the biology. I would
>>> not want to see anything like -whatevermatrix['A']['G']- in the
>>> middle of the algorithm. Having the mapping done in a separate class
>>> allows to write the algorithm like this:
>> Well, perhaps I'm more humanoid, but I like it better than
>> whatevermatrix['0x00']['0x03'];
> Sorry it was not clear. My point was more that the algorithm should
> not know what an 'A' or a 'G' is. This is why you should not see
> whatevermatrix['A']['G'], and you should not see
> whatevermatrix['0x00']['0x03']. The algorithm could well be aligning
> the Bible with the BioCocoa framework code, and do the job and not
> care.
> If the mapping is not known to the algorithm, then there is no risk
> that some assumptions are made. This is what I really meant, just
> separating code.
Yep that's a good point, again the same thing applies as above, IF we
go for the mapping you're absolutely right.
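
The separation discussed here (objects -> C -> algorithm -> C -> objects)
can be sketched in plain C. This is a minimal illustration under stated
assumptions, not the BCSymbolMapping design: the struct symbol_mapping,
the functions mapping_init, mapping_encode and pair_score, and the limit
MAX_SYMBOLS are all hypothetical names made up for the example.

```c
#include <stdint.h>
#include <string.h>

#define MAX_SYMBOLS 16  /* hypothetical upper bound for this sketch */

/* Hypothetical stand-in for a symbol mapping: character <-> dense index. */
typedef struct {
    int8_t index_of[128];           /* -1 for characters outside the set */
    char   symbol_at[MAX_SYMBOLS];  /* reverse mapping, index -> character */
    int    count;                   /* number of symbols in the set */
} symbol_mapping;

/* Build a mapping from a string listing the symbols, e.g. "ACGT". */
static void mapping_init(symbol_mapping *m, const char *symbols) {
    memset(m->index_of, -1, sizeof m->index_of);
    m->count = 0;
    for (const char *p = symbols; *p != '\0' && m->count < MAX_SYMBOLS; p++) {
        unsigned char c = (unsigned char)*p;
        if (c < 128 && m->index_of[c] < 0) {
            m->index_of[c] = (int8_t)m->count;
            m->symbol_at[m->count] = *p;
            m->count++;
        }
    }
}

/* Encode a sequence into dense indices.
   Returns 0 on success, -1 if a symbol is not in the set. */
static int mapping_encode(const symbol_mapping *m, const char *seq,
                          uint8_t *out, size_t len) {
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)seq[i];
        if (c >= 128 || m->index_of[c] < 0) return -1;
        out[i] = (uint8_t)m->index_of[c];
    }
    return 0;
}

/* The algorithm side: it sees only indices and a flat count x count
   score table, and knows nothing about what the symbols mean. */
static int pair_score(const int *scores, int count, uint8_t a, uint8_t b) {
    return scores[a * count + b];
}
```

With a four-symbol set the score table shrinks to 4 x 4 entries, and the
same pair_score call would work unchanged for any other alphabet, which
is the point made above about the algorithm not caring what it aligns.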
>
>> Also it would change the code dramatically as well:
>> BCSymbolSet *set=....union of the symbol sets of seq 1 and 2... ->
>> not necessary (unless we make the matrix creation dependent on the
>> symbolset (see below)
>> BCSymbolMapping *mapping=[BCSymbolMapping mappingWithSymbolSet:set];
>> -> not necessary
>> char *seq1=[mapping charMappingForSequence:sequenceObject1]; -> same
>> char *seq2=[mapping charMappingForSequence:sequenceObject2]; -> same
>> int **scores=[mapping charMappingForScoreMatrix:matrix]; -> int
>> **scores = [BCAlignment matrixForSymbolSet: set];
>> // .... run the algorithm...
>> BCSequenceAlignment *result=[BCAlignment
>> alignementForSequences(int)count length:(int)length
>> charBuffer:(char*)seqs]; -> Why make BCSymbolMapping the mother of
>> alignments?!
> BCSymbolMapping would not be the mother of anybody! It should be able
> to map any of the BioCocoa objects into C arrays.
Hmm, here I am again, I would then still vote to have convenience
methods as well (that work via BCSymbolMapping objects). I just like to
call [myBCSequence mapping] (which uses the BCSequence's symbol set by
default)
and
[myBCSequence mappingWithSymbolSet:set] (which allows you to use a
different set)
instead of having to go through the helper object explicitly.
Don't get me wrong, I have no problem with the fact that it can be done
as above; doing things manually can be handy to cache the object, for
instance (like the example above where you use the mapping a few times
in a row), but in general I hate BioJava's exorbitant use of factories
and helper objects. Take NSString as an example: imagine having to
create a helper object any time you want its fileSystemRepresentation...
>>> * if a score is an int or a float, the matrix is actually 128 x 128
>>> x 4 = 64 kilobytes
>> That's right, but come on, 64kb that's nothing.
>
> This is bigger than the L1 cache of most Macs out there. I believe
> this is the size of the L1 cache on the most recent G5. This is also
> 1/8 of the L2 cache. This means that the chip might even go back to
> RAM every time it tries to access the score matrix. And it will access
> the score matrix a lot, every time it compares two symbols. Like I
> said, Shark will tell. I just wanted to make my point about the size
> of that array, not in terms of RAM, but in terms of cache.
Ok, well I'm definitely off my terrain here, so stupid things may now
follow. But if you know the position in the matrix you want the value
for, and you have the pointer to the memory location, do you really
have to feed the whole matrix into the processor's cache? [ignorant
fool's talk] Why can't it just read an int from that memory position?
[/ignorant fool's talk]. It's embarrassing to know so little about how
these things work...
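
On the question above: a processor does not load a whole array at once,
it fetches memory in fixed-size cache lines, so only the lines that are
actually read need to stay in the cache. The sketch below estimates how
many distinct lines a set of lookups touches. It is an illustration
under loud assumptions: the 128-byte line size is a parameter chosen for
the example, the table is assumed to start on a line boundary with
entries that never straddle two lines, and the function names are
hypothetical. It ignores everything else competing for the cache, so it
is no substitute for a Shark measurement.

```c
#include <stddef.h>
#include <string.h>

/* Bytes occupied by a square table of side x side entries. */
static size_t table_bytes(size_t side, size_t entry_size) {
    return side * side * entry_size;
}

/* Number of distinct cache lines touched when every pair of the given
   indices is looked up in a side x side table.
   Returns 0 if the table has more lines than this sketch can track. */
static size_t lines_touched(const unsigned char *idx, size_t n, size_t side,
                            size_t entry_size, size_t line_size) {
    unsigned char seen[4096];
    size_t total_lines =
        (table_bytes(side, entry_size) + line_size - 1) / line_size;
    if (total_lines > sizeof seen) return 0;
    memset(seen, 0, sizeof seen);

    size_t touched = 0;
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            /* Byte offset of entry (idx[i], idx[j]), then its line number. */
            size_t line = ((idx[i] * side + idx[j]) * entry_size) / line_size;
            if (!seen[line]) { seen[line] = 1; touched++; }
        }
    }
    return touched;
}
```

Under these assumptions, a 128 x 128 table of 4-byte scores occupies
65536 bytes, but lookups restricted to 'A', 'C', 'G' and 'T' touch only
4 of its lines, while a dense 4 x 4 table fits in a single line. Larger
alphabets spread over many more lines, which is where a compact mapping
would be expected to matter.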
>
>>> * it is possible that int will be better than char because of the
>>> cast step? I know it is a big issue for float to int, but I don't
>>> know about char --> int; so maybe we will use int?
>> Same thing, let's make the thing and Shark will tell us.
> How will Shark tell us if we cannot easily change the mapping and
> compare implementations with everything else equal?
I thought the conclusion was to ask Shark if there's a problem at all
centered around this step and then implement the mapping and see if it
helps ;-)
>
> Now I am going to add a little nerve playing part...ah,ah,ah... We had
> a nice barbecue yesterday evening after the swimming-pool. It was so
> warm outside it was really a relief to get in the water. This is why I
> did not answer the email earlier... Or the days before.
That's playing unfair, shitty Dutch weather...
Cheers,
Alex
>
**************************************************************
** Alexander Griekspoor **
**************************************************************
The Netherlands Cancer Institute
Department of Tumorbiology (H4)
Plesmanlaan 121, 1066 CX, Amsterdam
Tel: + 31 20 - 512 2023
Fax: + 31 20 - 512 2029
AIM: mekentosj at mac.com
E-mail: a.griekspoor at nki.nl
Web: http://www.mekentosj.com
MacOS X: The power of UNIX with the simplicity of the Mac
***************************************************************