[Biophp-dev] My brain hurts.

biophp-dev@bioinformatics.org biophp-dev@bioinformatics.org
Sat, 03 May 2003 15:51:05 PST


First, the tone of my previous mail was a little too harsh (sorry!), but
my gut feeling is that some harsh discussion now can save a whole lot of
trouble later on.


> I have to admit, all the extra little layers and objects that seem
> to result from OO design is one of the reasons I avoided it for so long, 
> but I have been starting to see the benefits as I play with more complex
> systems - right now, if we change the seq object (for example, changing
> the variable names to make them "private" and instituting interface
> methods instead) NOW, there's only one or two other objects we'd have to
> modify the code for to make them work.  

I realized that too (after sending my previous mail).  This seems
especially important as Serge apparently already changed the Seq object!  

> Down the road, when there are 50+ objects that all optionally or
> mandatorily generate seq objects as output, editing them all will
> be a huge chore.  On the other hand, if they're all going through
> the more abstracted seq_factory object, then THAT'S all we have to 
> change, no matter whether there are 5, 50, or 500 other objects that
> want to offer seq objects.

The only real difficulty is to have a really good seq_factory object whose
interface is not going to change (additions are OK, but we don't want any
name changes, since those would have to be fixed in many places).

 
> In the interest of more harmonious development, however, I propose
> then that we go ahead and have the upper-level Parse object link
> directly to the seq object and comment out seq_factory references for now.

Just when you started to convince me! 

Why not extend the seq_factory object so that it can take different types 
of input and (potentially) even different types of output?  You can
default to seq_factory->Seq object, but could also allow other types of
translations.   
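
Roughly what I have in mind, as a minimal sketch (written in Python only
to keep it short; the class and method names here are made up for
illustration and are not what is in the code now):

```python
class Seq:
    """Hypothetical stand-in for the Seq object."""

    def __init__(self, seq_id, sequence):
        self.seq_id = seq_id
        self.sequence = sequence


class SeqFactory:
    """Hypothetical factory: other objects ask this for sequences
    instead of constructing Seq themselves."""

    def __init__(self, output_class=Seq):
        # Default output type is Seq; a different class can be swapped in.
        self.output_class = output_class

    def create(self, source):
        # Accept a few different input types and normalise them.
        if isinstance(source, dict):
            seq_id, sequence = source["id"], source["sequence"]
        elif isinstance(source, (tuple, list)):
            seq_id, sequence = source
        elif isinstance(source, str):
            seq_id, sequence = "", source
        else:
            raise TypeError("unsupported input type: %r" % type(source))
        return self.output_class(seq_id, sequence)
```

If Seq changes, only create() has to follow; the callers keep asking the
factory and never touch the Seq constructor.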


> the initial design too much from your original (autodetection relies on
> loading the entire set of data into memory).  I'll revert back to

So far, autodetection relied only on the first line.  Even for a stream,
reading a single line is not a high price to pay.  Simply rewrite the
autodetect code to read a single line instead of the whole file.
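
A minimal sketch of first-line autodetection (Python for brevity; the
function name and the format labels are my own invention, the prefixes
are the usual ones: ">" for fasta, "LOCUS" for genbank, "ID" for
swissprot):

```python
import io


def autodetect_format(stream):
    """Guess the format from the first line only.

    Returns (format_name, first_line).  The line is handed back so that
    a one-way stream does not lose it.
    """
    first_line = stream.readline()
    if first_line.startswith(">"):
        fmt = "fasta"
    elif first_line.startswith("LOCUS"):
        fmt = "genbank"
    elif first_line.startswith("ID "):
        fmt = "swissprot"
    else:
        fmt = None  # unknown: let the caller decide what to do
    return fmt, first_line


# Works the same for an open file or any other object with readline().
fmt, line = autodetect_format(io.StringIO(">seq1 test\nACGT\n"))
```

The parser then has to start from the returned line rather than re-read
it, which is the only extra bookkeeping this needs.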


> line-by-line
> parsing.  It'll take a little bit of extra code but not much (a couple
> of extra functions) and I agree with you that the standardization of
> the interface will be well worth it.

Will it be possible to do the actual parsing from an array (in memory)?
For a fasta parser it is not much work to write two parsers (one for
in-memory parsing, the other for streams), but for genbank, swissprot and
others it is going to be a pain. Having the parsers read a record into
memory and then parse the whole thing would make it easier.
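
Something along these lines, as a rough sketch (Python for brevity;
read_records is a made-up name, and I am assuming the "//" line that
ends genbank and swissprot records):

```python
import io


def read_records(lines, terminator="//"):
    """Yield one record at a time as a list of lines held in memory.

    `lines` can be an open file, a stream, or a plain list of strings;
    only a single record is buffered at any moment.
    """
    record = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.strip() == terminator:
            yield record
            record = []
        else:
            record.append(line)
    if record:
        # Last record without a terminator line.
        yield record


# The same call works for a stream and for an in-memory array.
from_stream = list(read_records(io.StringIO("LOCUS A\nORIGIN\n//\nLOCUS B\n//\n")))
from_array = list(read_records(["LOCUS A", "ORIGIN", "//", "LOCUS B", "//"]))
```

The format-specific parser then only ever sees a list of lines for one
record, so it needs to be written once.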


> That won't work for "non-seekable" streams (e.g. data pulled from
> a webpage, from FTP, from a network socket, etc), though - the
> "history" idea is the best way I've been able to come up with to
> feasibly handle "one-way" streams as well as files and text in memory
> the same way.

Why not check the file size: if it is small enough, parse in memory; if
it is too big or unknown, parse the stream?  For streams, movePrevious()
simply returns false.  That would allow for simple parsers that don't
even have to know whether they deal with a stream or an (array of)
string(s).  In my code the parser rewinds the pointer to the beginning of
the record.  That could be avoided by buffering just a single record.

Hmm.  I am still unsure myself; maybe buffering is a better/nicer idea.
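
To make the movePrevious() idea concrete, a small sketch (Python for
brevity; RecordSource and its method names are hypothetical, and the
"stream" here is just any one-way iterator of records):

```python
class RecordSource:
    """Same interface for in-memory records and one-way streams."""

    def __init__(self, records):
        # A list can be stepped through in both directions;
        # anything else is treated as a one-way stream.
        self.seekable = isinstance(records, list)
        self._records = records if self.seekable else iter(records)
        self._index = -1
        self.current = None  # the single buffered record

    def move_next(self):
        if self.seekable:
            if self._index + 1 >= len(self._records):
                return False
            self._index += 1
            self.current = self._records[self._index]
            return True
        try:
            self.current = next(self._records)
        except StopIteration:
            return False
        return True

    def move_previous(self):
        # For streams this simply returns False.
        if not self.seekable or self._index <= 0:
            return False
        self._index -= 1
        self.current = self._records[self._index]
        return True
```

A parser written against this never needs to ask what kind of source it
has; it just has to cope with move_previous() saying no.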


 
> See above regarding non-seekable streams.  That's one accomplishment.

I don't know yet how you want to do that.


> It sounds like we're hitting a deeper problem again, though - 
> Either my approach is not proper object-oriented design (quite possible,
> as I've only begun investigating it "seriously" recently) or 
> object-oriented design is not the approach you want to use.  
> To prevent more "completely useless" code, perhaps we ought to
> work out some design documentation of some sort?

I would formulate this a little differently.  There are always many ways to
design a system (be it OO or otherwise), and we are simply trying to find
the best possible way. Since there are several of us, we have more
viewpoints, which will make the discussion phase longer, but also -
hopefully - lead to a better implementation.  I do think that it will be
easier to use objects (but don't care much either way).

Why don't you go ahead and put the code in cvs and show how you want to
deal with streams versus strings?


In the meantime, Serge is developing a complete system of objects that
we haven't spent a word on....

What is the name of this class going to be?  IOseq_read?  Or should
there be an IOseq object that has 'read()' as a method?


Nico