[Biococoa-dev] More on BCSymbolSets

John Timmer jtimmer at bellatlantic.net
Mon Feb 28 12:04:36 EST 2005


I'm replying to Koen's mail, but I'm going to cut and paste in some of
Charles's email as well, since it's all on related stuff.
 
>> I think the idea of using a symbol set to limit the possible options for
>> initializing a sequence is a good one, provided we make the process very
>> streamlined.  The class itself looks to be good in that regard.
> 
> Not only in the init methods. They can be used all throughout the
> framework.
I wasn't implying they couldn't, just that this was an obvious case for
their use.  I could also see how implementing this would turn the
"containsAmbiguousSymbols" method into about a four-liner instead of a big
loop.
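
As a rough sketch of that four-liner, assuming symbols are modeled as
one-letter NSStrings and a symbol set as a plain NSSet (both are stand-ins;
the function names are hypothetical and the real BCSymbol/BCSymbolSet
interfaces will differ):

```objc
#import <Foundation/Foundation.h>

// Hypothetical stand-in for the strict (non-ambiguous) DNA symbol set.
static NSSet *StrictDNASymbols(void) {
    return [NSSet setWithObjects:@"A", @"C", @"G", @"T", nil];
}

// Returns YES if any symbol in the sequence falls outside the strict set.
// No explicit loop: collapse the sequence to a set and do a subset test.
static BOOL ContainsAmbiguousSymbols(NSArray *symbols) {
    NSSet *used = [NSSet setWithArray:symbols];
    return ![used isSubsetOfSet:StrictDNASymbols()];
}
```

Note that this version treats anything outside the strict set as ambiguous,
including symbols that are simply invalid; a real implementation would
probably want to distinguish the two.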

 
>> I've got a couple of ideas on the implementation.  One idea that suggests
>> itself is to have a BCSequenceType variable for the symbol set - that way
>> the sequence being initialized could pick up its sequence type from the
>> set it gets passed during initialization.
> 
> Then we might have to extend the different BCSequenceTypes to include
> strict, ambiguous, etc.
Well, there are clearly going to be cases where all we care about is the
type of sequence we're looking at, such as deciding whether we can
complement it - quite likely these are the majority of cases.  So if we're
trying to treat everything as a BCAbstractSequence where possible, then
asking the sequence type makes sense, as does keeping that type fairly
broad.  I'd view the SymbolSet as providing a more detailed description than
that provided by sequence type:  you check for it when you need more
details, but can ignore it when you don't.


>> If we're making symbol sets this central to sequence creation, though, I'd
>> make a lot of combinations, rather than the two we have for each type.  We
>> don't want any of the commonly used sets more than a single call away.
> 
> Sets can always be combined, using formUnionWithSymbolSet and
> formIntersectionWithSymbolSet.
That's true.  Which could actually be a problem.  If we have a singleton
instance for all non-ambiguous amino acids, what happens if somebody deletes
the alanine?  You've just screwed the entire runtime.

So, let me refine my suggestion.  We have a group of immutable sets that
represent all the commonly used symbol sets.  We initialize these out of a
plist file and make them singletons, as we do with the other base classes.
We have a subclass that allows mutability, and that can obtain copies of the
base sets for manipulation, but cannot go back and mutate the singleton
sets.
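
A minimal sketch of that arrangement, assuming modern Objective-C with ARC.
All class and method names here (SketchSymbolSet, SketchMutableSymbolSet,
strictDNASet, removeSymbol:) are hypothetical, and the set contents are
hard-coded where the proposal above would read them from a plist:

```objc
#import <Foundation/Foundation.h>

// Immutable base class: exposes no mutating methods.
@interface SketchSymbolSet : NSObject <NSMutableCopying> {
@protected
    NSMutableSet *_symbols;
}
+ (SketchSymbolSet *)strictDNASet;   // shared instance, never mutated
- (instancetype)initWithSymbols:(NSSet *)symbols;
- (BOOL)containsSymbol:(NSString *)symbol;
- (NSUInteger)count;
@end

// Mutable subclass: the only place where symbols can be removed.
@interface SketchMutableSymbolSet : SketchSymbolSet
- (void)removeSymbol:(NSString *)symbol;
@end

@implementation SketchSymbolSet
+ (SketchSymbolSet *)strictDNASet {
    static SketchSymbolSet *shared = nil;
    @synchronized (self) {
        if (shared == nil) {
            shared = [[SketchSymbolSet alloc] initWithSymbols:
                      [NSSet setWithObjects:@"A", @"C", @"G", @"T", nil]];
        }
    }
    return shared;
}
- (instancetype)initWithSymbols:(NSSet *)symbols {
    if ((self = [super init])) {
        _symbols = [symbols mutableCopy];   // private storage, not shared
    }
    return self;
}
- (BOOL)containsSymbol:(NSString *)symbol {
    return [_symbols containsObject:symbol];
}
- (NSUInteger)count {
    return [_symbols count];
}
// A mutable copy gets its own storage, so the singleton is never touched.
- (id)mutableCopyWithZone:(NSZone *)zone {
    return [[SketchMutableSymbolSet allocWithZone:zone]
            initWithSymbols:_symbols];
}
@end

@implementation SketchMutableSymbolSet
- (void)removeSymbol:(NSString *)symbol {
    [_symbols removeObject:symbol];
}
@end
```

With this shape, calling mutableCopy on the shared set and then removing a
symbol from the copy leaves the shared set with all four symbols. The same
would hold for deleting the alanine from a copy of an amino acid set.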

> Even better, 'symbolForCharacter:' could actually be a method for the
> symbolSet. That would be a very good idea to allow for that code:
I agree that it should be an option, but I'm not certain it should be the
default use.  One of the problems I had with BioJava is that you called
through so many classes and methods to get something simple done, it made it
very difficult to debug problems (I kept thinking "wait, what class am I in
now?").  My guess is that it also slowed the code down.  A set-level
symbolForCharacter: would also cause problems when it runs into characters
not in the set:  would it return nil?  If so, you'd have to test for nil,
etc.  By getting a symbol and then testing whether the symbol is in the set,
you are always working with actual objects, which is safer and easier to
read.
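
The lookup-then-test pattern might look roughly like this.  Everything here
is an assumption for illustration: symbols are one-letter NSStrings, the two
function names are made up, and the "?" placeholder returned for unknown
characters is a guess at how a lookup could avoid returning nil:

```objc
#import <Foundation/Foundation.h>

// Hypothetical global lookup: always returns an object, never nil.
// Unrecognized characters map to a "?" placeholder symbol (assumption).
static NSString *SymbolForCharacter(unichar c) {
    NSString *s = [[NSString stringWithCharacters:&c length:1]
                   uppercaseString];
    NSSet *known = [NSSet setWithObjects:@"A", @"C", @"G", @"T", @"N", nil];
    return [known containsObject:s] ? s : @"?";
}

// Builds a symbol array from text, keeping only members of the given set.
// The membership test always operates on a real object, so no nil checks.
static NSArray *SymbolsFromString(NSString *text, NSSet *allowed) {
    NSMutableArray *result = [NSMutableArray array];
    for (NSUInteger i = 0; i < [text length]; i++) {
        NSString *symbol = SymbolForCharacter([text characterAtIndex:i]);
        if ([allowed containsObject:symbol]) {
            [result addObject:symbol];
        }
    }
    return result;
}
```

Whether rejected characters should be dropped silently, as here, or reported
to the caller is a separate design question.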


I'm not fond of the following code:
+ (BCSequenceType)sequenceType
    { return BCDNASequence; }

You're typically going to be calling this method on a BCAbstractSequence to
find out what kind of sequence it actually is.  Making it a class method
defeats this purpose.  If you meant this to provide what the default is for
the class as a whole, then the method should be named to reflect that.
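
One way to keep both meanings apart, sketched with hypothetical names (the
Sketch-prefixed classes, the enum and defaultSequenceType are not from the
framework): a class method named for what it is, the class-wide default, and
an instance method that can be called on anything typed as the abstract
superclass.

```objc
#import <Foundation/Foundation.h>

typedef enum {
    SketchSequenceTypeOther,
    SketchSequenceTypeDNA
} SketchSequenceType;

@interface SketchAbstractSequence : NSObject
+ (SketchSequenceType)defaultSequenceType;  // default for the class
- (SketchSequenceType)sequenceType;         // what this instance is
@end

@interface SketchDNASequence : SketchAbstractSequence
@end

@implementation SketchAbstractSequence
+ (SketchSequenceType)defaultSequenceType {
    return SketchSequenceTypeOther;
}
// Dispatches on the runtime class, so it works through a superclass pointer.
- (SketchSequenceType)sequenceType {
    return [[self class] defaultSequenceType];
}
@end

@implementation SketchDNASequence
+ (SketchSequenceType)defaultSequenceType {
    return SketchSequenceTypeDNA;
}
@end
```

Given a variable declared as SketchAbstractSequence * that actually holds a
SketchDNASequence, sending it sequenceType returns SketchSequenceTypeDNA.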


Finally, I just want to sound a couple of notes of caution:

Having a specific symbol set associated with a sequence means we have to do
error testing whenever new symbols are added to the sequence, two sequences
are combined, etc.  We will ALWAYS have to make sure the symbol set and
sequenceArrays stay consistent with each other.  I actually like this, but
got scolded for being excessively cautious when I implemented something like
this a while back ;).
 
Another potential complication:  what happens if the complement of a base
isn't in the same symbol set?  Say someone makes an "all pyrimidine" symbol
set, then asks for its complement - we'd have to have some code to figure
out what symbol set is appropriate when creating the complement, so that we
could assign it to the resulting sequence...
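
To make the pyrimidine case concrete, here is a small sketch that derives
the symbol set of a complement by mapping each symbol to its partner.  It
assumes one-letter NSString symbols, a plain NSSet as the symbol set, and
only the four unambiguous DNA bases; the function name is hypothetical:

```objc
#import <Foundation/Foundation.h>

// Maps every symbol in the set to its complement and collects the results.
// Symbols without a known complement are skipped in this sketch.
static NSSet *ComplementOfSymbolSet(NSSet *symbols) {
    NSDictionary *complements = @{ @"A": @"T", @"T": @"A",
                                   @"C": @"G", @"G": @"C" };
    NSMutableSet *result = [NSMutableSet set];
    for (NSString *symbol in symbols) {
        NSString *partner = complements[symbol];
        if (partner != nil) {
            [result addObject:partner];
        }
    }
    return result;
}
```

Feeding it the pyrimidine set (C, T) produces the purine set (G, A), which
shares no symbols with the input, so the complemented sequence cannot simply
reuse the original sequence's symbol set.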

I still think this is a good idea, but I think implementing it in a safe
manner is going to be very challenging, and we should think through all the
consequences and where we'd have to modify code before we even start to try
it.


 
Think that's everything for now...

JT

_______________________________________________
This mind intentionally left blank




