[Biococoa-dev] more ramblings

Tue Nov 16 20:30:47 EST 2004

Hi,

In two recent code cleanups (rangeOfSubsequence and initializing 
the symbols) I found that code that originally lived in each subclass 
could be moved either to the superclass or to an external wrapper. I 
hope you agree that the code became more transparent and easier to 
maintain. For example, while coding BCFindSequence, I found an error 
in the rangeOfSubsequence code (see my post of October 30th). Once I 
found the problem, it was easy to fix in BCFindSequence, because the 
code lives in just one place, instead of in each variation of 
rangeOfSubsequence in all the subclasses (which I haven't fixed 
yet ;).

I would appreciate it if you could check and try out the code in 
BCFindSequence. I already put some test code in the translation demo. 
Here are the relevant lines in the demo:

	BCFindSequence *sequenceFinder = [BCFindSequence sequenceFinderWithSequence: theSequence];
	[sequenceFinder setStrict: NO];
	[sequenceFinder setFirstOnly: NO];

	NSArray *foundIt = [sequenceFinder findSequence:
		[BCSequenceDNA DNASequenceWithString: @"AAT" skippingNonBases: YES]];

	NSLog ( @"the found-array is %@", foundIt );

Try changing the setStrict and setFirstOnly values, and the @"AAT" 
search string, and see if the results that NSLog displays in the 
console are what you expect. Note that the results in 'foundIt' are 
stored as NSRanges wrapped in NSValue objects; we may have to change 
that. Maybe you can try putting an ambiguous symbol in the search 
string, or feeding it a protein or RNA. If I have done everything 
right, BCFindSequence should behave like all the variations of 
rangeOfSubsequence in BCSequence and its subclasses. If not, let me 
know what went wrong and I will see if I can fix it.
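
For anyone who wants to do more with the results than log them, here 
is a minimal sketch of unpacking NSRanges from an array of NSValues. 
The helper name StartPositionsFromRangeValues is hypothetical and not 
part of BioCocoa; it only assumes the array holds NSValue-wrapped 
NSRanges, as described above, and it uses current Objective-C syntax 
(fast enumeration).

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not BioCocoa API): given an array of NSValue
// objects that each wrap an NSRange, return the start position of
// every range as an NSNumber, in the same order.
static NSArray *StartPositionsFromRangeValues(NSArray *wrappedRanges) {
    NSMutableArray *positions =
        [NSMutableArray arrayWithCapacity: [wrappedRanges count]];
    for (NSValue *value in wrappedRanges) {
        NSRange range = [value rangeValue];   // unwrap the NSRange
        [positions addObject:
            [NSNumber numberWithUnsignedInteger: range.location]];
    }
    return positions;
}
```

Passing 'foundIt' to this function would give one start position per 
match; range.length is available in the same way if it is needed.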

By introducing BCFindSequence, I hope I have shown that we don't need 
all the variations of rangeOfSubsequence in multiple locations. I am 
confident that the same applies to other sequence manipulations. For 
instance, code to calculate a complement or reverse complement could 
also go into a wrapper class. Code to translate a sequence is already 
in a wrapper class.
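
To make the complement idea a bit more concrete, here is a rough 
sketch of what such a wrapper could look like. Everything in it is an 
assumption for illustration: the class name DNAComplementer and its 
methods are hypothetical, it works on plain NSStrings rather than on 
BCSequence symbols, and it only handles the unambiguous upper-case 
bases A, C, G and T (anything else is passed through unchanged).

```objc
#import <Foundation/Foundation.h>

// Hypothetical wrapper class (not BioCocoa API) that keeps the
// complement logic in one place, outside any sequence class.
@interface DNAComplementer : NSObject
+ (NSString *)complementOfString:(NSString *)dna;
+ (NSString *)reverseComplementOfString:(NSString *)dna;
@end

// Map a single base to its complement; unknown characters are
// returned unchanged.
static unichar ComplementOfBase(unichar base) {
    switch (base) {
        case 'A': return 'T';
        case 'T': return 'A';
        case 'C': return 'G';
        case 'G': return 'C';
        default:  return base;
    }
}

@implementation DNAComplementer

+ (NSString *)complementOfString:(NSString *)dna {
    NSMutableString *result =
        [NSMutableString stringWithCapacity: [dna length]];
    NSUInteger i;
    for (i = 0; i < [dna length]; i++) {
        unichar c = ComplementOfBase([dna characterAtIndex: i]);
        [result appendString:
            [NSString stringWithCharacters: &c length: 1]];
    }
    return result;
}

+ (NSString *)reverseComplementOfString:(NSString *)dna {
    NSMutableString *result =
        [NSMutableString stringWithCapacity: [dna length]];
    NSUInteger i = [dna length];
    while (i > 0) {
        i--;   // walk the string from the last base to the first
        unichar c = ComplementOfBase([dna characterAtIndex: i]);
        [result appendString:
            [NSString stringWithCharacters: &c length: 1]];
    }
    return result;
}

@end
```

A real version would of course work on the symbols of a BCSequence and 
deal with ambiguous bases, but the point is the same: the sequence 
object itself does not need to know how to complement.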

You probably can guess where I am going next :-)

Having said all that, again I want to make a case that we don't have to 
subclass BCSequence. A sequence object IMO should only take care of 
maintaining the array of symbols, and maybe store additional 
information about the sequence, such as annotations and features. I 
don't think this is distorting biology, because in real life, DNA and 
proteins also use additional proteins to extend their behaviour 
(translate, get the complement, look for an epitope, digest, transport 
through the membrane, etc).

Another advantage is the following. Last week I asked for a way to 
determine whether a FASTA file contains DNA or protein. We don't know 
in advance, so what should the readFasta method return, 
BCSequenceProtein or BCSequenceDNA? If we just have readFasta return a 
BCSequence, the read method doesn't have to worry about that! Of 
course, when actually creating the sequence, we could either set 
BCSequenceType or introduce a symbolset/alphabet, so that at least we 
and the user know what we are dealing with. But this is not the 
responsibility of readFasta, which only extracts the relevant 
information from a file and passes it on to the code that creates a 
sequence.
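
As an illustration of that split in responsibilities, here is a 
minimal sketch of a reader that only extracts information and makes no 
decision about the sequence type. The function name 
ParseSingleFastaRecord and the dictionary keys are hypothetical, not 
the actual readFasta code, and the sketch assumes the text holds a 
single FASTA record.

```objc
#import <Foundation/Foundation.h>

// Hypothetical sketch (not the BioCocoa readFasta method): pull the
// header and the raw residues out of single-record FASTA text,
// without deciding whether it is DNA, RNA or protein.
static NSDictionary *ParseSingleFastaRecord(NSString *fastaText) {
    NSString *header = @"";
    NSMutableString *residues = [NSMutableString string];
    NSCharacterSet *whitespace =
        [NSCharacterSet whitespaceAndNewlineCharacterSet];

    for (NSString *rawLine in
             [fastaText componentsSeparatedByString: @"\n"]) {
        NSString *line =
            [rawLine stringByTrimmingCharactersInSet: whitespace];
        if ([line length] == 0) {
            continue;                               // skip blank lines
        }
        if ([line hasPrefix: @">"]) {
            header = [line substringFromIndex: 1];  // description line
        } else {
            [residues appendString: line];          // wrapped sequence data
        }
    }
    return [NSDictionary dictionaryWithObjectsAndKeys:
               header, @"header", residues, @"residues", nil];
}
```

The code that creates the sequence can then take the @"residues" 
string and decide, by whatever means we agree on, which sequence type 
or symbol set to use.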

I hope that by showing some concrete examples I can convince you guys 
this time that we don't have to subclass BCSequence, or at least that 
we should use wrappers for all additional functionality.

Now please go ahead and shoot me ;-)

cheers,

- Koen.