[Biophp-dev] My brain hurts.

S Clark biophp-dev@bioinformatics.org
Sat, 3 May 2003 11:51:53 -0600


On Friday 02 May 2003 12:30 pm, Nico Stuurman wrote:

> OK, that makes the seq_factory class completely useless.  Why not let
> the parsers return the very neutral seq object directly?  Why introduce
> another layer that does the exact same thing as the seq object?  Where
> is the (complete) design of the 'neutral, intermediary' format?

"completely useless"?

I'm not metaphorically married to abstracting the seq object creation, but
my gut feeling is that abstracting creation of the specialized seq object
is a useful thing, especially as we begin adding more and more modules
that may or may not want to create seq objects.

> If this is really your idea, I vote against it.  It simply builds an
> extra layer that does exactly the same thing the seq object is supposed
> to be doing.  If you do not agree with the design of the seq object,
> that is another story, and something that should be discussed.

It's not the design of the seq object - the seq_factory (analogous to
BioPerl's seqfactory
[http://doc.bioperl.org/releases/bioperl-1.2/Bio/Seq/SeqFactory.html ]
and BioJava's "SequenceFactory"
[http://www.biojava.org/docs/api/org/biojava/bio/seq/SequenceFactory.html ])
isn't designed to REPLACE the actual seq object, but to provide
a central interface for creating instances of the seq object.

I have to admit, all the extra little layers and objects that seem
to result from OO design are one of the reasons I avoided it for so long,
but I have been starting to see the benefits as I play with more complex
systems.  Right now, if we change the seq object (for example, changing
the variable names to make them "private" and instituting interface
methods instead), there are only one or two other objects whose code
we'd have to modify to make them work.

Down the road, when there are 50+ objects that all optionally or
mandatorily generate seq objects as output, editing them all will
be a huge chore.  On the other hand, if they're all going through
the more abstracted seq_factory object, then THAT'S all we have to 
change, no matter whether there are 5, 50, or 500 other objects that
want to offer seq objects.
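
To make that concrete, here is a minimal sketch of the arrangement.
The class, method, and field names (Seq, SeqFactory, FastaParser,
create, parseRecord, 'id', 'sequence') are made up for illustration
and are not the actual BioPHP code; the point is only that the parser
never constructs the seq object itself.

```php
<?php
// Hypothetical stand-in for the real seq object; field names are assumptions.
class Seq
{
    private $id;
    private $sequence;

    public function __construct($id, $sequence)
    {
        $this->id = $id;
        $this->sequence = $sequence;
    }

    public function getId() { return $this->id; }
    public function getSequence() { return $this->sequence; }
}

// The single place that knows how a Seq is constructed.
class SeqFactory
{
    public function create(array $fields)
    {
        return new Seq($fields['id'], $fields['sequence']);
    }
}

// A hypothetical parser: it hands plain fields to the factory
// and never calls "new Seq" directly.
class FastaParser
{
    private $factory;

    public function __construct(SeqFactory $factory)
    {
        $this->factory = $factory;
    }

    public function parseRecord($text)
    {
        $lines = preg_split('/\r?\n/', trim($text));
        // The whole header line (minus the leading ">") is used as the id.
        $id = ltrim(array_shift($lines), '>');
        return $this->factory->create(array(
            'id' => $id,
            'sequence' => implode('', array_map('trim', $lines)),
        ));
    }
}

$parser = new FastaParser(new SeqFactory());
$seq = $parser->parseRecord(">seq1\nACGT\nTTGA\n");
```

If the Seq constructor changes, only SeqFactory::create() needs to
follow it; the parsers stay as they are.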

In the interest of more harmonious development, however, I propose
then that we go ahead and have the upper-level Parse object link
directly to the seq object and comment out seq_factory references for now.

> Yes, but it puts the complete file in an array of strings (just like my
> previous version did).  So, it can not deal with streams until it read
> the whole thing into memory.

Yeah, you're right there.  At the time I thought that stream and memory
parsers would have to be separate, so I went to the extreme of taking
full advantage of the memory-based model, which also kept me from changing
the initial design too much from your original (autodetection relies on
loading the entire set of data into memory).  I'll go back to line-by-line
parsing.  It'll take a little bit of extra code but not much (a couple
of extra functions), and I agree with you that the standardization of
the interface will be well worth it.

> As I wrote now a couple of times, an elegant way would be to make
> functions 'next','current','previous','each', that not only work on
> arrays (simply pass the call on the php function), but also on streams
> (do an fgets, and keep track of the file pointer).

That won't work for "non-seekable" streams (e.g. data pulled from
a webpage, from FTP, from a network socket, etc), though - the
"history" idea is the best way I've been able to come up with to
feasibly handle "one-way" streams as well as files and text in memory
the same way.
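
A rough sketch of the "history" idea, assuming a hypothetical
HistoryReader class (the name, the method names, and the default limit
of 10 lines are all made up for illustration).  It only ever calls
fgets() on the handle and keeps the last few lines in a buffer, so
stepping back works without seeking:

```php
<?php
// Hypothetical reader: wraps any readable stream and remembers the last few lines.
class HistoryReader
{
    private $handle;
    private $history = array();   // most recent lines, oldest first
    private $limit;               // how many lines to remember
    private $offset = 0;          // 0 = at the newest line, 1 = one step back, ...

    public function __construct($handle, $limit = 10)
    {
        $this->handle = $handle;
        $this->limit = $limit;
    }

    // Returns the next line, or false at end of stream.
    public function next()
    {
        if ($this->offset > 0) {
            // We stepped back earlier: replay from the buffer, no seeking.
            $this->offset--;
            return $this->current();
        }
        $line = fgets($this->handle);
        if ($line === false) {
            return false;
        }
        $this->history[] = rtrim($line, "\r\n");
        if (count($this->history) > $this->limit) {
            array_shift($this->history);   // forget the oldest line
        }
        return $this->current();
    }

    // Line at the current position, or false if nothing has been read yet.
    public function current()
    {
        $index = count($this->history) - 1 - $this->offset;
        return $index >= 0 ? $this->history[$index] : false;
    }

    // Steps back one line if it is still in the history, else returns false.
    public function previous()
    {
        if ($this->offset + 1 >= count($this->history)) {
            return false;
        }
        $this->offset++;
        return $this->current();
    }
}

// php://memory is used only to keep the example self-contained;
// the reader itself never seeks on the handle.
$handle = fopen('php://memory', 'w+');
fwrite($handle, ">seq1\nACGT\nTTGA\n");
rewind($handle);

$reader = new HistoryReader($handle, 2);
$a = $reader->next();       // first line
$b = $reader->next();       // second line
$c = $reader->previous();   // back to the first line, from the buffer
$d = $reader->next();       // second line again, replayed from the buffer
$e = $reader->next();       // third line, read from the stream
```

The same object could be handed a file, a socket, or text already in
memory; how far back previous() can go is bounded by the history limit.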

> I am lost now about what you want to accomplish.  You don't like the
> parsers to return an object, but you do want them to return an array
> that has to adhere to very specific descriptions?  This looks more like
> a psychological hangup over classes and objects rather than anything
> meaningful.  OR seq_factory knows what file format it is dealing with
> and translates from the parsers own format to the seq class, OR we have
> the parsers return a seq object directly.  Just making yet another
> format to store the data and have that as an intermediate is plain out
> silly.

The idea is to go for a more object-oriented approach, making each object
dependent on as few other objects as possible, so as to minimize maintenance
and maximize code reuse down the road.

As I mentioned before, I WOULD like to add some "translation" to the
seq_factory so as to make it more "lenient" in terms of what it can 
understand, but "knowing the file format" seems like it should be
contained ENTIRELY in the objects that are dedicated to that purpose, e.g.
the filetype parsers.  
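
As an illustration of the kind of "translation" I have in mind, here
is a small sketch.  The function name normalize_seq_fields and the
alias table are hypothetical, not an agreed format; the idea is that
the factory accepts a few spellings of each field name without knowing
anything about which file format the data came from.

```php
<?php
// Hypothetical alias table: canonical field name => keys a parser might use.
function normalize_seq_fields(array $raw)
{
    $aliases = array(
        'id'       => array('id', 'accession', 'name'),
        'sequence' => array('sequence', 'seq', 'residues'),
    );

    // Compare keys case-insensitively.
    $raw = array_change_key_case($raw, CASE_LOWER);

    $fields = array();
    foreach ($aliases as $canonical => $candidates) {
        foreach ($candidates as $key) {
            if (isset($raw[$key])) {
                $fields[$canonical] = $raw[$key];
                break;   // first matching alias wins
            }
        }
    }
    return $fields;
}

$fields = normalize_seq_fields(array('Accession' => 'X01234', 'SEQ' => 'acgt'));
```

Here $fields ends up with the canonical keys 'id' and 'sequence',
whatever the parser happened to call them.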

> I am now a little unsure what you accomplished.  You replaced the seq
> object by an 'intermediate' array that is basically an object without
> functions (but otherwise exactly the same as what the seq object
> intends to be), and you replaced the approach of keeping track of
> positions by tracking the index (pointer) to the data, by buffering the
> data in an array (causing much more data being moved in memory than
> otherwise would be needed).  I don't think progress has been made in
> dealing with streams.  Am I missing something?

See above regarding non-seekable streams.  That's one accomplishment.

Another is the added ease of maintenance gained by separating
things into smaller, more self-contained objects.

The added memory used in my initial rearrangement of your original
design is now partly offset by freeing the memory taken by the original
text data once it is passed to the filetype parser.  (The original text
is needed for the autodetection routines - I didn't think my messing
with your original routines there would accomplish anything but
probably messing them up :-) ).  Once the filetype parsers go back to
a line-by-line approach, we can have them free the individual lines
as they use them (when the data is in memory, that is; streams will
only need to be stored one record at a time in memory), so that will
take care of the extra memory problem.
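
For the in-memory case, freeing lines as they are used could look
something like this.  It is only a sketch: consume_lines and its
callback argument are hypothetical names, and the callback stands in
for whatever a filetype parser does with each line.

```php
<?php
// Hypothetical helper: walk in-memory lines one at a time,
// releasing each one after the parser has handled it.
function consume_lines(array &$lines, callable $handleLine)
{
    foreach (array_keys($lines) as $i) {
        $handleLine($lines[$i]);
        unset($lines[$i]);   // drop the line so PHP can reclaim it
    }
}

$lines = array('>seq1', 'ACGT', 'TTGA');
$seen = array();
consume_lines($lines, function ($line) use (&$seen) {
    $seen[] = $line;
});
```

After the call, $lines is empty and the callback has seen every line
once, in order.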

It sounds like we're hitting a deeper problem again, though - 
Either my approach is not proper object-oriented design (quite possible,
as I've only begun investigating it "seriously" recently) or 
object-oriented design is not the approach you want to use.  
To prevent more "completely useless" code, perhaps we ought to
work out some design documentation of some sort?