Fwd: Re: [Biococoa-dev] More on BCSymbolSets
Charles PARNOT
charles.parnot at stanford.edu
Mon Feb 28 18:34:17 EST 2005
>>> I've got a couple of ideas on the implementation. One idea that suggests
>>> itself is to have a BCSequenceType variable for the symbol set - that way the
>>> sequence being initialized could pick up its sequence type from the set it
>>> gets passed during initialization.
>>
>> Then we might have to extend the different BCSequenceTypes to include
>> strict, ambiguous, etc.
>Well, there are clearly going to be cases where all we care about is the
>type of sequence we're looking at, such as deciding whether we can
>complement it - quite likely these are the majority of cases. So if we're
>trying to treat everything as a BCAbstractSequence where possible, then
>asking the sequence type makes sense, as does keeping that type fairly
>broad. I'd view the SymbolSet as providing a more detailed description than
>that provided by sequence type: you check for it when you need more
>details, but can ignore it when you don't.
I agree a symbolSet could just be a way of getting more details about
the type of sequence you are dealing with. The sequence type would be
to give a broader scope: DNA, RNA, protein.
Using symbolSet to also better control the sequence behavior at
creation and when adding symbols could make sense, though. You
outline below a number of problems. Like you say, we could still try
to address them, as they are not insurmountable.
Anyway, the question at this point is: what do we want to do with
symbolSet? If they are just a way to provide a refinement on the
sequenceType, then we may not need a full class, but just an enum.
And if we don't enforce the sequence contents to be consistent with
the symbolSet, then it is useless.
Maybe also as a first step, we don't have to make symbolSet a public
class, then. It would just allow us more flexibility. Using a
symbolSet instead of a bunch of BOOLs like 'skipUnknownSymbols',
'allowAmbiguousSymbols', 'allowGaps',... will be much more readable
and will allow for more extensions and more flexibility. In other
words, it could be a tool available to us, developers, but not to the
users of the framework.
So, what do you think symbolSet should be used for? The way I see it
now is as a filter to restrict the symbols used in a given sequence.
In fact, the more I think about it, the more 'filtering' seems like
what it should do. And if we don't want any restriction, then one can
always create very broad symbol sets.
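To make the 'filter' idea concrete, here is a minimal sketch, assuming a modern compiler with ARC. It uses NSCharacterSet as a stand-in for a symbol set and plain characters as stand-ins for symbols; 'SketchFilterString' is a hypothetical helper, not existing BioCocoa code.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper, not part of BioCocoa: copy `input`, keeping only the
// characters that belong to `allowed`. The set replaces a handful of BOOL
// options such as 'allowAmbiguousSymbols' or 'allowGaps'.
static NSString *SketchFilterString(NSString *input, NSCharacterSet *allowed)
{
    NSMutableString *kept = [NSMutableString stringWithCapacity:[input length]];
    NSUInteger i;
    for (i = 0; i < [input length]; i++) {
        unichar c = [input characterAtIndex:i];
        if ([allowed characterIsMember:c]) {
            [kept appendFormat:@"%C", c];
        }
    }
    return kept;
}
```

With a strict set built from @"ACGT", the string @"ACNGT-X" is reduced to @"ACGT". With a very broad set such as [NSCharacterSet letterCharacterSet], only the gap character is dropped, which is the 'no restriction' case mentioned above.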
>So, let me refine my suggestion. We have a group of immutable sets that
>represent all the commonly used symbol sets. We initialize these out of a
>plist file and make them singletons, as we do with the other base classes.
>We have a subclass that allows mutability, and that can obtain copies of the
>base sets for manipulation, but cannot go back and mutate the singleton
>sets.
This is perfect. I love immutable classes!
In fact, you may not even need a mutable class, or even want one: what
happens if such a mutable object is used in a sequence and is then
changed? What do we do with the existing symbols?
By providing a 'symbolSetWithSymbols:(id)aSymbol,...' and some basic
intersect/union, that should be enough to create any set in 2-3
lines of code max.
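As a rough illustration of how little API that requires, here is a sketch of an immutable set with a variadic constructor plus union and intersection, assuming ARC. The class name 'SketchSymbolSet' and all method names other than 'symbolSetWithSymbols:' are made up for the example, and NSStrings stand in for symbol objects.

```objc
#import <Foundation/Foundation.h>
#include <stdarg.h>

// Hypothetical immutable symbol set; nothing here mutates `symbols` after init.
@interface SketchSymbolSet : NSObject {
    NSSet *symbols;
}
+ (SketchSymbolSet *)symbolSetWithSymbols:(id)firstSymbol, ... NS_REQUIRES_NIL_TERMINATION;
- (id)initWithSet:(NSSet *)aSet;
- (NSSet *)allSymbols;
- (NSUInteger)count;
- (BOOL)containsSymbol:(id)aSymbol;
- (SketchSymbolSet *)symbolSetByFormingUnionWithSymbolSet:(SketchSymbolSet *)other;
- (SketchSymbolSet *)symbolSetByFormingIntersectionWithSymbolSet:(SketchSymbolSet *)other;
@end

@implementation SketchSymbolSet

- (id)initWithSet:(NSSet *)aSet
{
    self = [super init];
    if (self != nil) {
        symbols = [[NSSet alloc] initWithSet:aSet];  // private immutable copy
    }
    return self;
}

+ (SketchSymbolSet *)symbolSetWithSymbols:(id)firstSymbol, ...
{
    // Collect the nil-terminated argument list into a temporary set.
    NSMutableSet *collected = [NSMutableSet set];
    va_list args;
    id symbol;
    va_start(args, firstSymbol);
    for (symbol = firstSymbol; symbol != nil; symbol = va_arg(args, id)) {
        [collected addObject:symbol];
    }
    va_end(args);
    return [[self alloc] initWithSet:collected];
}

- (NSSet *)allSymbols { return symbols; }
- (NSUInteger)count { return [symbols count]; }
- (BOOL)containsSymbol:(id)aSymbol { return [symbols containsObject:aSymbol]; }

// Both operations return a new set and leave the receiver untouched.
- (SketchSymbolSet *)symbolSetByFormingUnionWithSymbolSet:(SketchSymbolSet *)other
{
    NSMutableSet *merged = [NSMutableSet setWithSet:symbols];
    [merged unionSet:[other allSymbols]];
    return [[SketchSymbolSet alloc] initWithSet:merged];
}

- (SketchSymbolSet *)symbolSetByFormingIntersectionWithSymbolSet:(SketchSymbolSet *)other
{
    NSMutableSet *shared = [NSMutableSet setWithSet:symbols];
    [shared intersectSet:[other allSymbols]];
    return [[SketchSymbolSet alloc] initWithSet:shared];
}

@end
```

Building a custom set is then two or three lines: create two sets with 'symbolSetWithSymbols:' and combine them. Because every operation returns a new object, a set already attached to a sequence can never change underneath it.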
>> Even better, 'symbolForCharacter:' could actually be a method for the
>> symbolSet. That would be a very good idea to allow for that code:
>I agree that it should be an option, but I'm not certain it should be the
>default use. One of the problems I had with BioJava is that you called
>through so many classes and methods to get something simple done, it made it
>very difficult to debug problems (I kept thinking "wait, what class am I in
>now?"). My guess is that it also slowed the code down. This would also
>cause problems when it runs into characters not in the set: would it return
>nil? If so, you'd have to test for nil, etc. By getting a symbol and then
>testing whether the symbol is in the set, you are always working with actual
>objects, which is safer and easier to read.
Yes, it would return nil. This is the behavior of NSDictionary,
NSArray and NSSet. Getting the symbol and then testing if it exists in
the symbol set, like you suggest, works fine too but is not
necessarily easier to read. Safer, yes, maybe ;-) but there are a few
occasions where nil is useful, like I said with NSDictionary,...
Anyway, this is not a critical design problem.
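For comparison, the two calling styles might look roughly like this. It is only a sketch, assuming ARC: the function names and the lookup table are hypothetical, and NSStrings stand in for symbol objects.

```objc
#import <Foundation/Foundation.h>

// Hypothetical table mapping one-letter codes to symbol objects; plain
// NSStrings stand in for real symbol instances.
static NSDictionary *SketchSymbolTable(void)
{
    return [NSDictionary dictionaryWithObjectsAndKeys:
        @"adenine", @"A", @"cytosine", @"C",
        @"guanine", @"G", @"thymine", @"T",
        @"any base", @"N", nil];
}

// Style 1: lookup scoped to the set. Returns nil when the code is unknown
// or not allowed by the set, so the caller has to test for nil.
static id SketchSymbolForCodeInSet(NSString *code, NSSet *allowedCodes)
{
    if (![allowedCodes containsObject:code]) {
        return nil;
    }
    return [SketchSymbolTable() objectForKey:code];
}

// Style 2: look the symbol up first, then ask the set about it. The caller
// gets a plain YES/NO answer instead of a possibly nil object.
static BOOL SketchSetAllowsCode(NSString *code, NSSet *allowedCodes)
{
    id symbol = [SketchSymbolTable() objectForKey:code];
    return symbol != nil && [allowedCodes containsObject:code];
}
```

Style 1 mirrors -[NSDictionary objectForKey:], which returns nil for a missing key. Style 2 costs one extra call at each site but never hands a nil symbol back to the caller.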
Your concern about a bloated framework is much more important, and I
share it. I actually was scared by BioPerl because of that (and it
does indeed result in somewhat slow code). This is one of the reasons
I did not like the factory design in BioCocoa, when the same thing can
simply be done in the class with a few lines of code.
However, I do think that the symbolSet class is quite
self-explanatory, and the code I submitted is very short and thus
very readable I believe (??), and whichever way we do it, it should
be quite readable because the concept is simple. Anyway, we all talk
about using the symbolSet, so we end up having three classes:
sequence, symbol and symbol set.
>
>I'm not fond of the following code:
>+ (BCSequenceType)sequenceType
> { return BCDNASequence; }
>
>You're typically going to be calling this method on a BCAbstractSequence to
>find out what kind of sequence it actually is. Making it a class method
>defeats this purpose. If you meant this to provide what the default is for
>the class as a whole, then the method should be named to reflect that.
I am not sure I understand what you are saying.
I declared this class method, because it seems that all instances of
a given class will have the same sequence type. Is that right? In
fact, is there really a need for an ivar? An instance method could be
used instead that always returns the same value in a given class. Do
you mean we should use an instance method instead, or do you mean we
should name this method 'defaultSequenceType'?
The reason for this method and the 'defaultSymbolset' method is so
that the init method can be factored out to the superclass. When you
look at the code for the sequence subclass, it is always the same
except for the symbol class being checked. This would be the same
problem if we used symbol sets instead of symbol class to check the
validity of symbols or strings. To be able to put the common code in
the initializer, where no ivar has been set yet, it makes sense to
then use methods that will provide that information at runtime. There
are a number of Cocoa classes that use this design, like NSDocument
with the nib name or the window controller. The initialization of the
document calls methods in the subclass to get the names of these
elements. I agree they are instance methods, so maybe we could use
instance methods instead.
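Here is a sketch of that factoring, assuming ARC. The class names are invented for the example, and NSCharacterSet again stands in for a symbol set; only the overridden method differs between subclasses.

```objc
#import <Foundation/Foundation.h>

// Hypothetical superclass: owns the one shared initializer.
@interface SketchSequence : NSObject {
    NSString *sequenceString;
}
+ (NSCharacterSet *)defaultSymbolSet;   // subclasses override this
- (id)initWithString:(NSString *)aString;
- (NSString *)sequenceString;
@end

@implementation SketchSequence

+ (NSCharacterSet *)defaultSymbolSet
{
    return [NSCharacterSet letterCharacterSet];  // broad fallback
}

- (id)initWithString:(NSString *)aString
{
    self = [super init];
    if (self != nil) {
        // No ivar is set yet, so ask the (sub)class at runtime which
        // symbols it accepts.
        NSCharacterSet *allowed = [[self class] defaultSymbolSet];
        NSMutableString *kept = [NSMutableString string];
        NSUInteger i;
        for (i = 0; i < [aString length]; i++) {
            unichar c = [aString characterAtIndex:i];
            if ([allowed characterIsMember:c]) {
                [kept appendFormat:@"%C", c];
            }
        }
        sequenceString = [kept copy];
    }
    return self;
}

- (NSString *)sequenceString { return sequenceString; }

@end

// Subclasses carry no init code at all, just their defaults.
@interface SketchDNASequence : SketchSequence
@end
@implementation SketchDNASequence
+ (NSCharacterSet *)defaultSymbolSet
{
    return [NSCharacterSet characterSetWithCharactersInString:@"ACGT"];
}
@end

@interface SketchRNASequence : SketchSequence
@end
@implementation SketchRNASequence
+ (NSCharacterSet *)defaultSymbolSet
{
    return [NSCharacterSet characterSetWithCharactersInString:@"ACGU"];
}
@end
```

Initializing both subclasses with @"ACGUT" gives @"ACGT" for the DNA class and @"ACGU" for the RNA class. Switching to an instance method would only change the call in the initializer from [[self class] defaultSymbolSet] to [self defaultSymbolSet].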
>Finally, I just want to sound a couple of notes of caution:
>
>By having a specific symbol set associated with a sequence, it means we have
>to do error testing whenever new symbols are added to the sequence, two
>sequences are combined, etc. We will ALWAYS have to make sure the symbol
>set and sequenceArrays are rationalized. I actually like this, but got
>scolded for being excessively cautious when I implemented something like
>this a while back ;).
Let's see what would be logical:
* adding a string or an array of symbols: filter using the symbol set
* appending a sequence with 'appendSequence': filter the appended
sequence passed as argument
* concatenating sequences with a class method like
'sequenceByConcatenatingSequences:' should first create the union of
the symbolsets of the sequences passed as argument
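The three rules above could be sketched as follows, assuming ARC. 'SketchFilteredSequence' is a hypothetical class, characters stand in for symbols, and NSCharacterSet stands in for the symbol set.

```objc
#import <Foundation/Foundation.h>

@interface SketchFilteredSequence : NSObject {
    NSCharacterSet *symbolSet;
    NSMutableString *contents;
}
- (id)initWithSymbolSet:(NSCharacterSet *)aSet;
- (NSCharacterSet *)symbolSet;
- (NSString *)stringValue;
- (void)appendString:(NSString *)aString;
- (void)appendSequence:(SketchFilteredSequence *)other;
+ (SketchFilteredSequence *)sequenceByConcatenatingSequences:(NSArray *)sequences;
@end

@implementation SketchFilteredSequence

- (id)initWithSymbolSet:(NSCharacterSet *)aSet
{
    self = [super init];
    if (self != nil) {
        symbolSet = [aSet copy];   // immutable copy, fixed for the sequence's lifetime
        contents = [[NSMutableString alloc] init];
    }
    return self;
}

- (NSCharacterSet *)symbolSet { return symbolSet; }
- (NSString *)stringValue { return [NSString stringWithString:contents]; }

// Rule 1: added strings are filtered through the receiver's symbol set.
- (void)appendString:(NSString *)aString
{
    NSUInteger i;
    for (i = 0; i < [aString length]; i++) {
        unichar c = [aString characterAtIndex:i];
        if ([symbolSet characterIsMember:c]) {
            [contents appendFormat:@"%C", c];
        }
    }
}

// Rule 2: an appended sequence is filtered the same way.
- (void)appendSequence:(SketchFilteredSequence *)other
{
    [self appendString:[other stringValue]];
}

// Rule 3: concatenation first forms the union of all symbol sets, so no
// symbol from any input is lost.
+ (SketchFilteredSequence *)sequenceByConcatenatingSequences:(NSArray *)sequences
{
    NSMutableCharacterSet *combined = [[NSMutableCharacterSet alloc] init];
    NSUInteger i;
    for (i = 0; i < [sequences count]; i++) {
        SketchFilteredSequence *each = [sequences objectAtIndex:i];
        [combined formUnionWithCharacterSet:[each symbolSet]];
    }
    SketchFilteredSequence *result = [[self alloc] initWithSymbolSet:combined];
    for (i = 0; i < [sequences count]; i++) {
        [result appendSequence:[sequences objectAtIndex:i]];
    }
    return result;
}

@end
```

The asymmetry is deliberate: appending to an existing sequence may silently drop symbols, while concatenating into a new sequence keeps everything because the new sequence gets the combined set.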
>
>Another potential complication: what happens if the complement of a base
>isn't in the same symbol set? Say someone makes an "all pyrimidine" symbol
>set, then asks for its complement - we'd have to have some code to figure
>out what symbol set is appropriate when creating the complement, so that we
>could assign it to the resulting sequence...
>
I don't know what Koen had in mind when creating the symbol set
class, because I see a 'complementSet' method there. Or we could
impose that a set always is its own complement (this reminds me of
some of my math classes, about groups and operations).
But yes, the complement sequence would have to be defined using the
complement symbol set.
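A sketch of what deriving the complement set might involve, assuming ARC. This is not Koen's 'complementSet' method, whose implementation I have not described here; the function names are hypothetical and only the four unambiguous DNA bases are mapped.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: map every code in `codes` to its complement and
// collect the results. Codes with no known complement are skipped.
static NSSet *SketchComplementSet(NSSet *codes)
{
    // dictionaryWithObjectsAndKeys: takes value, key pairs: key A maps to T, etc.
    NSDictionary *complements = [NSDictionary dictionaryWithObjectsAndKeys:
        @"T", @"A", @"A", @"T", @"G", @"C", @"C", @"G", nil];
    NSMutableSet *result = [NSMutableSet set];
    NSEnumerator *enumerator = [codes objectEnumerator];
    NSString *code;
    while ((code = [enumerator nextObject]) != nil) {
        NSString *partner = [complements objectForKey:code];
        if (partner != nil) {
            [result addObject:partner];
        }
    }
    return result;
}

// The stricter alternative: only accept sets that are their own complement.
static BOOL SketchSetIsSelfComplementary(NSSet *codes)
{
    return [SketchComplementSet(codes) isEqualToSet:codes];
}
```

For the "all pyrimidine" set {C, T} the derived set is {G, A}, so the complemented sequence needs a different symbol set than the original. The strict DNA set {A, C, G, T} passes the self-complementary test, and the pyrimidine set does not.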
>I still think this is a good idea, but I think implementing it in a safe
>manner is going to be very challenging, and we should think through all the
>consequences and where we'd have to modify code before we even start to try
>it.
>
You know, I did not realize all these implications until now, and
there may indeed be other issues we don't foresee yet. The user
should not expect too much from the framework when she starts doing
crazy things...
Anyway, the bottom line is: should we use symbol sets or not? What for?
charles
--
Help science go fast forward:
http://cmgm.stanford.edu/~cparnot/xgrid-stanford/
Charles Parnot
charles.parnot at stanford.edu
Room B157 in Beckman Center
279, Campus Drive
Stanford University
Stanford, CA 94305 (USA)
Tel +1 650 725 7754
Fax +1 650 725 8021