Fwd: Re: [Biococoa-dev] More on BCSymbolSets

Mon Feb 28 18:34:17 EST 2005

>>> I've got a couple of ideas on the implementation.  One idea that
>>> suggests itself is to have a BCSequenceType variable for the symbol
>>> set - that way the sequence being initialized could pick up its
>>> sequence type from the set it gets passed during initialization.
>>
>>  Then we might have to extend the different BCSequenceTypes to include
>>  strict, ambiguous, etc.
>Well, there are clearly going to be cases where all we care about is the
>type of sequence we're looking at, such as deciding whether we can
>complement it - quite likely these are the majority of cases.  So if we're
>trying to treat everything as a BCAbstractSequence where possible, then
>asking the sequence type makes sense, as does keeping that type fairly
>broad.  I'd view the SymbolSet as providing a more detailed description than
>that provided by sequence type:  you check for it when you need more
>details, but can ignore it when you don't.

I agree a symbolSet could just be a way of getting more details about 
the type of sequence you are dealing with. The sequence type would be 
to give a broader scope: DNA, RNA, protein.
Using symbolSet to also better control the sequence behavior at 
creation and when adding symbols could make sense, though. You
outline a number of problems below. Like you say, we could still try
to address them, as they are not insurmountable.

Anyway, the question at this point is: what do we want to do with 
symbolSet? If it is just a way to provide a refinement on the
sequenceType, then we may not need a full class, but just an enum.
And if we don't enforce the sequence contents to be consistent with 
the symbolSet, then it is useless.
Maybe also as a first step, we don't have to make symbolSet a public 
class, then. It would just allow us more flexibility. Using a 
symbolSet instead of a bunch of BOOLs like 'skipUnknownSymbols',
'allowAmbiguousSymbols', 'allowGaps',... will be much more readable 
and will allow for more extensions and more flexibility. In other 
words, it could be a tool available to us, developers, but not to the 
users of the framework.

So, what do you think symbolSet should be used for? The way I see it 
now is as a filter to restrict the symbols used in a given sequence. 
In fact, the more I think about it, the more 'filtering' seems like 
what it should do. And if we don't want any restriction, then one can 
always create very broad symbol sets.
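
To make the filtering idea concrete, here is a rough sketch in plain C (which also compiles as Objective-C). The SymbolSet bitmask, the function names and the one-bit-per-letter encoding are all made up for illustration; none of this is existing BioCocoa code, and a real version would work on symbol objects rather than chars.

```c
#include <stddef.h>

/* Hypothetical representation: one bit per letter A-Z, bit 26 for the gap '-'. */
typedef unsigned int SymbolSet;

/* Bit for one character, or 0 if the character can never be a symbol. */
static SymbolSet symbol_bit(char c)
{
    if (c >= 'a' && c <= 'z') c = (char)(c - 'a' + 'A');  /* case-insensitive */
    if (c >= 'A' && c <= 'Z') return 1u << (c - 'A');
    if (c == '-') return 1u << 26;
    return 0;
}

/* Build a set from a string listing the allowed symbol characters. */
static SymbolSet symbol_set_from_string(const char *symbols)
{
    SymbolSet set = 0;
    for (; *symbols; symbols++) set |= symbol_bit(*symbols);
    return set;
}

/* Copy into 'out' only the characters of 'in' that the set allows.
   'out' must have room for strlen(in) + 1 bytes. Returns the number kept. */
static size_t filter_symbols(SymbolSet set, const char *in, char *out)
{
    size_t kept = 0;
    for (; *in; in++)
        if (symbol_bit(*in) & set) out[kept++] = *in;
    out[kept] = '\0';
    return kept;
}
```

With a strict set built from "ACGT", filtering "ACGNT-X" would keep only "ACGT". A very broad set (say, every letter plus the gap) gives the "no restriction" case without any extra BOOL.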

>So, let me refine my suggestion.  We have a group of immutable sets that
>represent all the commonly used symbol sets.  We initialize these out of a
>plist file and make them singletons, as we do with the other base classes.
>We have a subclass that allows mutability, and that can obtain copies of the
>base sets for manipulation, but cannot go back and mutate the singleton
>sets.

This is perfect. I love immutable classes!
In fact, you may not even need or want a mutable class: what
happens if such a mutable object is used in a sequence and is then
changed? What do we do with the existing symbols?
Providing a 'symbolSetWithSymbols:(id)aSymbol,...' method and some
basic intersect/union operations should be enough to create any set
in 2-3 lines of code max.
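
As a rough illustration of how little is needed, here is a sketch in plain C (valid Objective-C as well). The bitmask type and all function names are hypothetical stand-ins: the variadic constructor mimics a nil-terminated 'symbolSetWithSymbols:' call, and because sets are plain values, union and intersection always return a new set and never touch their arguments.

```c
#include <stdarg.h>

typedef unsigned int SymbolSet;  /* hypothetical: bit i stands for letter 'A' + i */

/* Variadic constructor: pass uppercase letters, terminated by 0
   (the counterpart of a nil-terminated argument list). */
static SymbolSet symbol_set_with_symbols(int first, ...)
{
    SymbolSet set = 0;
    va_list args;
    va_start(args, first);
    for (int c = first; c != 0; c = va_arg(args, int))
        if (c >= 'A' && c <= 'Z') set |= 1u << (c - 'A');
    va_end(args);
    return set;
}

/* Both operations return a new value; the operands are never modified. */
static SymbolSet symbol_set_union(SymbolSet a, SymbolSet b)     { return a | b; }
static SymbolSet symbol_set_intersect(SymbolSet a, SymbolSet b) { return a & b; }

/* Non-zero if the set contains the given uppercase letter. */
static int symbol_set_contains(SymbolSet set, char c)
{
    return c >= 'A' && c <= 'Z' && (set & (1u << (c - 'A'))) != 0;
}
```

For example, a purine set built from 'A', 'G' and a pyrimidine set built from 'C', 'T' can be combined with one union call to get the strict DNA set, which is the "2-3 lines" case.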

>> Even better, 'symbolForCharacter:' could actually be a method for the
>> symbolSet. That would be a very good idea to allow for that code:
>I agree that it should be an option, but I'm not certain it should be the
>default use.  One of the problems I had with BioJava is that you called
>through so many classes and methods to get something simple done, it made it
>very difficult to debug problems (I kept thinking "wait, what class am I in
>now?").  My guess is that it also slowed the code down. This would also
>cause problems when it runs into characters not in the set:  would it return
>nil?  If so, you'd have to test for nil, etc.  By getting a symbol and then
>testing whether the symbol is in the set, you are always working with actual
>objects, which is safer and easier to read.
Yes, it would return nil. This is the behavior of NSDictionary,
NSArray and NSSet. Getting the symbol and then testing if it exists in
the symbol set, like you suggest, works fine too but is not
necessarily easier to read. Safer, yes, maybe ;-)  but there are a few
occasions where nil is useful, like I said with NSDictionary,...
Anyway, this is not a critical design problem.
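
Just to show what the nil-returning variant would look like to the caller, here is a small sketch in plain C, with NULL playing the role of nil. The Symbol struct, the kStrictDNA table and the function name are invented for the example and are not the actual BCSymbol API.

```c
#include <stddef.h>

/* Hypothetical stand-in for a symbol object. */
typedef struct { char code; const char *name; } Symbol;

/* A tiny fixed set, standing in for a strict DNA symbol set. */
static const Symbol kStrictDNA[] = {
    {'A', "adenine"}, {'C', "cytosine"}, {'G', "guanine"}, {'T', "thymine"},
};

/* Counterpart of 'symbolForCharacter:' living on the set:
   returns the matching symbol, or NULL if the character is not in the set. */
static const Symbol *symbol_for_character(const Symbol *set, size_t count, char c)
{
    for (size_t i = 0; i < count; i++)
        if (set[i].code == c) return &set[i];
    return NULL;
}
```

The caller then needs a single NULL check to skip characters outside the set, instead of first getting a symbol and then asking the set whether it contains it.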

Your concern about a bloated framework is much more important, and I 
share it. I actually was scared by BioPerl because of that (and it 
does indeed result in somewhat slow code). This is one of the reasons
I did not like the factory design in BioCocoa, when the same job can
simply be done in the class itself with a few lines of code.
However, I do think that the symbolSet class is quite
self-explanatory, and the code I submitted is very short and thus
very readable, I believe (??). Whichever way we do it, it should
be quite readable because the concept is simple. Anyway, we all talk
about using the symbolSet, so we end up having three classes:
sequence, symbol and symbol set.

>
>I'm not fond of the following code:
>+ (BCSequenceType)sequenceType
>     { return BCDNASequence; }
>
>You're typically going to be calling this method on a BCAbstractSequence to
>find out what kind of sequence it actually is.  Making it a class method
>defeats this purpose.  If you meant this to provide what the default is for
>the class as a whole, then the method should be named to reflect that.

I am not sure I understand what you are saying.
I declared this class method, because it seems that all instances of 
a given class will have the same sequence type. Is that right? In
fact, is there really a need for an ivar? An instance method could be
used instead that always returns the same value in a given class. Do
you mean we should use an instance method instead, or do you mean we
should name this method 'defaultSequenceType'?

The reason for this method and the 'defaultSymbolset' method is so 
that the init method can be factored out to the superclass. When you 
look at the code for the sequence subclass, it is always the same 
except for the symbol class being checked. This would be the same 
problem if we used symbol sets instead of symbol class to check the 
validity of symbols or strings. To be able to put the common code in 
the initializer, where no ivar has been set yet, it makes sense to 
then use methods that will provide that information at runtime. There 
are a number of Cocoa classes that use this design, like NSDocument 
with the nib name or the window controller. The initialization of the 
document calls methods in the subclass to get the names of these 
elements. I agree they are instance methods, so maybe we could use 
instance methods instead.
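
Here is the pattern reduced to a sketch in plain C, using function pointers in place of overridden methods. Everything in it (SequenceClass, sequence_init, the fixed 64-byte buffer, a string of characters as the symbol set) is a simplification invented for the example. The point is only that the shared initializer contains the common loop and asks the "subclass" for its type and default symbols at run time.

```c
#include <string.h>

/* Stand-in for BCSequenceType. */
typedef enum { SequenceTypeDNA, SequenceTypeProtein } SequenceType;

/* What each "subclass" supplies; the shared initializer calls these. */
typedef struct {
    SequenceType (*sequence_type)(void);
    const char  *(*default_symbols)(void);  /* symbol set as a string of chars */
} SequenceClass;

typedef struct {
    const SequenceClass *cls;
    char   symbols[64];
    size_t length;
} Sequence;

/* Shared "superclass" initializer: identical for every sequence class. */
static void sequence_init(Sequence *seq, const SequenceClass *cls, const char *string)
{
    const char *allowed = cls->default_symbols();
    seq->cls = cls;
    seq->length = 0;
    for (; *string && seq->length < sizeof seq->symbols - 1; string++)
        if (strchr(allowed, *string)) seq->symbols[seq->length++] = *string;
    seq->symbols[seq->length] = '\0';
}

/* The only code a DNA "subclass" has to provide. */
static SequenceType dna_type(void)    { return SequenceTypeDNA; }
static const char  *dna_symbols(void) { return "ACGT"; }
static const SequenceClass DNASequenceClass = { dna_type, dna_symbols };
```

Asking the sequence for its type then goes through the instance (seq.cls->sequence_type()), which is the equivalent of using an instance method rather than a class method.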

>Finally, I just want to sound a couple of notes of caution:
>
>By having a specific symbol set associated with a sequence, it means we have
>to do error testing whenever new symbols are added to the sequence, two
>sequences are combined, etc.  We will ALWAYS have to make sure the symbol
>set and sequenceArrays are rationalized.  I actually like this, but got
>scolded for being excessively cautious when I implemented something like
>this a while back ;).

Let's see what would be logical:
* adding a string or an array of symbols: filter using the symbol set
* appending a sequence with 'appendSequence': filter the appended 
sequence passed as argument
* concatenating sequences with a class method like
'sequenceByConcatenatingSequences:' should first create the union of
the symbol sets of the sequences passed as arguments
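
The three rules above can be sketched in plain C as follows. This is only an illustration under simplifying assumptions: the symbol set is stored as a string of allowed characters, buffers have fixed sizes, and all names are hypothetical rather than existing BioCocoa methods.

```c
#include <string.h>

typedef struct {
    char allowed[32];   /* symbol set: the characters this sequence accepts */
    char symbols[128];  /* sequence contents */
} Sequence;

/* Rule 1: appending a string filters it through the receiver's symbol set. */
static void sequence_append_string(Sequence *seq, const char *string)
{
    size_t n = strlen(seq->symbols);
    for (; *string && n < sizeof seq->symbols - 1; string++)
        if (strchr(seq->allowed, *string)) seq->symbols[n++] = *string;
    seq->symbols[n] = '\0';
}

/* Rule 2: appending a sequence filters the appended sequence the same way. */
static void sequence_append_sequence(Sequence *seq, const Sequence *other)
{
    sequence_append_string(seq, other->symbols);
}

/* Add to 'allowed' every character of 'more' that it does not contain yet. */
static void set_add_all(char *allowed, size_t cap, const char *more)
{
    size_t n = strlen(allowed);
    for (; *more && n < cap - 1; more++)
        if (!strchr(allowed, *more)) { allowed[n++] = *more; allowed[n] = '\0'; }
}

/* Rule 3: concatenation first builds the union of both symbol sets,
   so nothing from either operand is filtered out. */
static Sequence sequence_by_concatenating(const Sequence *a, const Sequence *b)
{
    Sequence result = { "", "" };
    set_add_all(result.allowed, sizeof result.allowed, a->allowed);
    set_add_all(result.allowed, sizeof result.allowed, b->allowed);
    sequence_append_string(&result, a->symbols);
    sequence_append_string(&result, b->symbols);
    return result;
}
```

Note the asymmetry: appending an ambiguous sequence to a strict one silently drops the ambiguous symbols, while concatenating the same two sequences keeps them.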

>
>Another potential complication:  what happens if the complement of a base
>isn't in the same symbol set?  Say someone makes an "all pyrimidine" symbol
>set, then asks for its complement - we'd have to have some code to figure
>out what symbol set is appropriate when creating the complement, so that we
>could assign it to the resulting sequence...
>
I don't know what Koen had in mind when creating the symbol set 
class, because I see a 'complementSet' method there. Or we could
require that a set always be its own complement (this reminds me of
some of my math classes, about groups and operations).
But yes, the complement sequence would have to be defined using the 
complement symbol set.
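
For what it is worth, deriving the complement set from a set is simple if each symbol knows its complement. Here is a sketch in plain C, assuming a bitmask representation and a DNA-only complement table. Both are my own simplifications and say nothing about how the existing 'complementSet' method works.

```c
typedef unsigned int SymbolSet;                 /* hypothetical: bit i is 'A' + i */
#define SYMBOL_BIT(c) (1u << ((c) - 'A'))

/* Complement of one strict DNA base; anything else maps to itself. */
static char complement_base(char c)
{
    switch (c) {
    case 'A': return 'T';
    case 'T': return 'A';
    case 'C': return 'G';
    case 'G': return 'C';
    default:  return c;   /* e.g. N stays N */
    }
}

/* The complement set is the set of complements of every member. */
static SymbolSet complement_set(SymbolSet set)
{
    SymbolSet result = 0;
    for (char c = 'A'; c <= 'Z'; c++)
        if (set & SYMBOL_BIT(c)) result |= SYMBOL_BIT(complement_base(c));
    return result;
}
```

With this, the "all pyrimidine" set {C, T} maps to the purine set {A, G}, while the strict DNA set {A, C, G, T} is its own complement, so the resulting sequence can always be given a matching set.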

>I still think this is a good idea, but I think implementing it in a safe
>manner is going to be very challenging, and we should think through all the
>consequences and where we'd have to modify code before we even start to try
>it.
>

You know, I did not realize all these implications until now, and
there may indeed be other issues we don't foresee yet. The user
should not expect too much from the framework when she starts doing
crazy things...

Anyway, the bottom line is: should we use symbol sets or not? And what for?

charles

-- 
Help science go fast forward:
http://cmgm.stanford.edu/~cparnot/xgrid-stanford/

Charles Parnot
charles.parnot at stanford.edu

Room  B157 in Beckman Center
279, Campus Drive
Stanford University
Stanford, CA 94305 (USA)

Tel +1 650 725 7754
Fax +1 650 725 8021