[Biococoa-dev] more ramblings

Fri Nov 26 08:39:18 EST 2004

On Nov 25, 2004, at 5:13 PM, Alexander Griekspoor wrote:

>> Right. See my other post, IMO a readXXX file should only parse the 
>> input, and then pass on the requested info to a class that creates a 
>> sequence.
>
> Yes, but again, I don't see why that would require different classes. 
> The readXXXFile METHOD should indeed only parse the input, but ANOTHER 
> METHOD in the same class could help return the proper sequence. An 
> idea of methods to implement:
> - a general method to determine (guess) the filetype
> - a general method to determine (guess) the sequence type (protein, 
> dna, rna)
> - methods that do the parsing
> 	- based on the sequence type determination
> 	- based on the type as a (user set) argument

Yes, it definitely should be a separate method. My point is that there 
are (or will be) more places in BioCocoa where a sequence is created. 
So then we would need the guess methods in those classes too. If we use 
an intermediate factory class, *all* sequence creation code goes 
through one central location. Again, this is easier to maintain, lets 
us add different file/sequence types, etc. An alternative to 
the factory could be to have a separate guess-the-type class.
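
As a rough illustration of the factory idea, here is a minimal sketch in 
Python (not Objective-C, and not BioCocoa code). Every name in it 
(SequenceFactory, Sequence, SequenceType, read_fasta) is hypothetical, and 
the guessing rule is a deliberately crude placeholder. The point is only 
the shape: the reader parses, and one central place builds the sequence.

```python
from dataclasses import dataclass
from enum import Enum


class SequenceType(Enum):
    DNA = "dna"
    RNA = "rna"
    PROTEIN = "protein"


@dataclass
class Sequence:
    residues: str
    seq_type: SequenceType


class SequenceFactory:
    """Hypothetical single entry point for building sequences."""

    @staticmethod
    def guess_type(residues):
        # Placeholder heuristic: try the smallest alphabets first.
        # A string with only A, C, G fits both DNA and RNA and is
        # reported as DNA here purely because DNA is checked first.
        letters = set(residues.upper())
        if letters <= set("ACGTN"):
            return SequenceType.DNA
        if letters <= set("ACGUN"):
            return SequenceType.RNA
        return SequenceType.PROTEIN

    @classmethod
    def make(cls, residues, seq_type=None):
        # A caller-supplied type always wins over the guess.
        if seq_type is None:
            seq_type = cls.guess_type(residues)
        return Sequence(residues, seq_type)


def read_fasta(text, seq_type=None):
    """Parse FASTA text only; sequence creation is left to the factory."""
    sequences = []
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if it had residues.
            if chunks:
                sequences.append(SequenceFactory.make("".join(chunks), seq_type))
            chunks = []
        elif line:
            chunks.append(line)
    if chunks:
        sequences.append(SequenceFactory.make("".join(chunks), seq_type))
    return sequences
```

Any other code path that creates sequences would call SequenceFactory.make 
as well, so the guessing logic lives in exactly one place.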

*** (merging two emails here, for convenience ;) ***

>> You are absolutely right that it is a problem to create an untyped 
>> BCSequence, that's not what I was trying to say. My point was that 
>> readFasta cannot always know if it is a protein or nucleotide 
>> sequence, so we let it just create a BCSequence.
> Right, but what I try to make clear is that that is only shifting 
> ahead the problem... The question is whether we want to make all 
> methods compatible with untyped sequences as a consequence. I don't 
> think so, but perhaps you guys think differently.

Again, I don't want to create untyped sequences, sorry if I was unclear 
about that. My point was to have readFasta (and all the others) return 
a BCSequence so we don't have to hardcode the return type. But we 
should *always* add an identifier saying whether it is dna, protein, etc. 
This is where the guess-the-type code comes into play. Come to think of it, 
symbolsets could be really useful here, and will cover almost any 
situation, except for the hypothetical AAAAAAAA or CCCCCCCC sequences. 
That is never solvable, and needs input from the user.
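
A minimal sketch of guessing by symbol set, again in Python and with 
hypothetical names (SYMBOL_SETS, candidate_types, guess_type). Two things 
in it are assumptions of the sketch, not anything decided here: the most 
specific alphabet is preferred when several fit (every ACGT string is also 
a legal protein string), and a sequence made of a single repeated letter 
that fits several alphabets is treated as undecidable and returns None so 
the caller can ask the user.

```python
# Unambiguous one-letter alphabets only; ambiguity codes are left out.
# Listed from most specific to least specific.
SYMBOL_SETS = {
    "dna": frozenset("ACGT"),
    "rna": frozenset("ACGU"),
    "protein": frozenset("ACDEFGHIKLMNPQRSTVWY"),
}


def candidate_types(residues):
    """Return every type whose symbol set contains all letters used."""
    letters = set(residues.upper())
    return [name for name, symbols in SYMBOL_SETS.items() if letters <= symbols]


def guess_type(residues):
    """Return the preferred type, or None when user input is needed."""
    candidates = candidate_types(residues)
    if not candidates:
        return None  # letters outside every known symbol set
    if len(candidates) > 1 and len(set(residues.upper())) == 1:
        return None  # e.g. AAAAAAAA: no basis for choosing
    return candidates[0]
```

With these rules, "ACGTACGT" is reported as dna and "MKVLA" as protein, 
while "AAAAAAAA" and "CCCCCCCC" come back as None.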

>> If we just create a BCSequence, the readFasta method will always work.
> Sure, but I still haven't heard a solution of the most important 
> problem. Those characters that have an equivalent BCSymbol in multiple 
> types, like A (Alanine and Adenosine). You can only solve this problem 
> if you also introduce untyped BCSymbols, but as you can't add MW's and 
> other properties (because you don't know what it represents) to them, 
> they are merely replacements for characters. Also, what in the world 
> would you return if you feed such a thing to an object that calculates 
> its molecular weight? Get the problems we will get ourselves into?

I do, don't worry ;-) I still vote to use either a symbolset or 
BCSequenceType to differentiate. Once that is known, we know which 
subclass of BCSymbol to use, and the untyped BCSymbol problem you 
describe above does not arise.
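
To make that concrete, a small Python sketch with hypothetical names 
(SequenceType, SYMBOL_NAMES, resolve_symbol), standing in for whatever 
BCSequenceType and the BCSymbol subclasses end up looking like. Once the 
type is known, the same character resolves to exactly one symbol, so 
there is never an untyped "A" to compute properties for.

```python
from enum import Enum


class SequenceType(Enum):
    DNA = "dna"
    PROTEIN = "protein"


# One lookup table per sequence type; only four letters shown.
SYMBOL_NAMES = {
    SequenceType.DNA: {
        "A": "adenine", "C": "cytosine", "G": "guanine", "T": "thymine",
    },
    SequenceType.PROTEIN: {
        "A": "alanine", "C": "cysteine", "G": "glycine", "T": "threonine",
    },
}


def resolve_symbol(char, seq_type):
    """Map a character to its typed symbol name for the given type."""
    try:
        return SYMBOL_NAMES[seq_type][char.upper()]
    except KeyError:
        raise ValueError(
            f"{char!r} is not a valid {seq_type.value} symbol"
        ) from None
```

Properties such as molecular weight would hang off the resolved symbol in 
the same way, keyed by type first and character second.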

>> Its only task IMO should be to parse the file (which should have a 
>> constant structure, independent of the sequence type, so it works 
>> always), extract the requested data, and pass it on to the class that 
>> actually creates a new BCSequence object.
> Hmm, ok, if you see it that way that's a possibility yes. Still it 
> sounds more complicated than necessary. If you read a fasta file, you 
> want a BCSequence (or a group of them) right? Why do it in two steps? 
> I think there's plenty to distill in general METHODS within the 
> sequenceIO class that all readXXX methods can use. It would keep 
> things limited to one class though.

See my argument above: readFile is not the only place that creates 
sequences.

>
>> I think it is the responsibility of the user/caller to ask for either 
>> protein or dna or rna, by passing the right sequence type or symbol 
>> set.
> So, then you can just as well ask him to tell us right away, and 
> instantiate the right BCSequence type immediately!

Yes, that would be the first choice (asking the user, I mean), but I 
would still use either BCSequenceType or the appropriate symbol set 
rather than subclassing BCSequence.

cheers,

- Koen.