[Biophp-dev] Egad, this is getting long :-)

S Clark biophp-dev@bioinformatics.org
Tue, 29 Apr 2003 16:40:02 -0600

On Tuesday 29 April 2003 14:28, Nico Stuurman wrote:
> find the Biophp site, find no code,
Okay, HERE I'll have to mention that there has been SOME code
there from day one, accessed from the menu option that says
"module code".  It is, however, a rather limited selection (the modules
mentioned in my last email, minus the new EUtils modules).
> spend (to the dismay of my wife) part of the weekend to abstract them and
> make them useful as independent units
> - get long rants about how the new parser class signifies the evil
> tendencies of GenePHP to have Window-like properties...???
> Come on Sean!  I think you are getting at a good point, but please work
> on your communication skills and think a bit before you write!

Alright, one last time, where did I EVER say that the idea was EVIL or even 

Unless you, too, think that "windows" automatically equals "EVIL"?
Though as a Mac user, you might I suppose :-) 

Not that, as a Linux user myself, I don't have some thoughts in that
direction as well, but nonetheless, I don't recall ever injecting any
presuppositions of "goodness" or "badness" into the discussion, and
in fact have gone out of my way to point out that I consider the windows
approach (collecting everything together into one place, tightly bound
together) COMPLEMENTARY to (not "worse than" nor "more evil" nor even
"opposite to") the unix-ish "spread everything out into lots of little
places" approach.

I apologize if I sounded like I was insulting your code; that wasn't
my intention at all.

In fact, I think your approach to the "user side" is EXACTLY correct - it's
very easy to use and I don't advocate any real change to that at all.  

It is just that in attaining this, the current parser's "backend" has been
"un-abstracted" to the extent that everything from opening the file, through
reading the file and separating out the sequences, to generating a
GenePHP sequence object is enclosed in a single class, in a form from which
the parsing cannot be "abstracted back out" for other uses (possibly
putting a great deal of metaphorical weight on the Parser object maintainer's
shoulders further down the road...).

> To summarize what I think (after endless reading) are Sean's ideas:
> 1. At the heart of the parsing class should be parser functions (one
> for each file type), that can be asked to return the current sequence,
> the next sequence, the previous sequence, the first and last sequence.
> They return the sequence in a dataformat (array?) that is different for
> each parser.  The parsers can take filenames, filepointers and strings
> as arguments.  They can do the parsing in memory or disk based.  They
> can use index files - if available - to work with large datasets.
> 2. At the next level is a class that 'translates' the return value of
> the parsers into a 'Seq' object.
> 3. Somewhere (at the top?) is a Parsers class that 'sniffs' the
> argument passed to it and sends it to the right parser.
> Not a bad idea.  The current design has some of the functionality of 1
> moved to 3 (which avoids having to rewrite it to fit every parser, but
> might lead to situations where a parser cannot do what it is supposed to
> be doing).  It has point 2 integrated in 1.
> Is this about it?
> Sean, can you please try to be a little bit more to the point?

Okay, tell you what.  I will PRETEND that I am qualified to throw around
OO terminology.  I will probably abuse it terribly, and I am quite prepared
to get a lecture on proper OO as a result - preferably as soon as possible,
so I spend a minimum of time sounding like a clueless wonder trying to be
an OO guru...but at least I won't have to be so verbose.

I am advocating that the direct handling of the files (streams/strings) be
abstracted out into dedicated classes, which concern themselves only with
reading the data and separating it into sequence data.  Each would return a
"standard" (by which I mean "agreed upon") ordinary array, containing at
a minimum a "label" (short name) and the actual sequence as the
first two elements.  (The returned data would NOT be different for
every parser, only not dependent on externally-declared data structures.)
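
A rough sketch of what such a dedicated reader could look like, written in
Python just to keep it short.  The name read_fasta_records and the choice of
a third "description" element are illustrative assumptions, not existing
GenePHP code; the only thing taken from the proposal is that each record is
an ordinary array with the label first and the sequence second.

```python
def read_fasta_records(text):
    """Split FASTA text into plain records: [label, sequence, description].

    Hypothetical helper: it knows the file format and nothing else, so it
    does not depend on any externally declared sequence class.
    """
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: first word is the label, the rest is free text.
            parts = line[1:].split(None, 1)
            label = parts[0] if parts else ""
            description = parts[1] if len(parts) > 1 else ""
            records.append([label, "", description])
        elif records:
            # Sequence line: append to the record opened by the last header.
            records[-1][1] += line
    return records


sample = ">seq1 first test\nACGT\nTTGA\n>seq2\nGGCC\n"
records = read_fasta_records(sample)
```

Because the result is only nested lists of strings, any other code (not just
the Parser object) could consume it without loading a sequence class.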

The "Parser" class stays exactly as it is:

It handles detection of the data type if it is not specified.

It instantiates the appropriate parser object and tells it to go.

It hands out the sequence objects created from the parsers' output.

with only two minor changes:

1) The actual "moving back and forth in the file" (the move_Next() and
move_Previous() methods), EOF detection and so on end up down in the
individual parsing classes - in some file types this may require special
handling (e.g. "streams" can't really be rewound - in those cases a call to
the "move_Previous()" method would simply return false).
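
One way to picture point 1, again as a Python sketch under assumptions: the
class names InMemoryRecords and StreamLines, the current() method, and the
one-record-per-line stream format are all made up for illustration.  What it
tries to show is only that each parsing class owns its own navigation, and
that a forward-only source answers move_previous() with false.

```python
import io


class InMemoryRecords:
    """Hypothetical parser over records already in memory; fully rewindable."""

    def __init__(self, records):
        self.records = records
        self.pos = 0

    def current(self):
        return self.records[self.pos] if self.records else None

    def move_next(self):
        if self.pos + 1 >= len(self.records):
            return False          # already at the last record
        self.pos += 1
        return True

    def move_previous(self):
        if self.pos == 0:
            return False          # already at the first record
        self.pos -= 1
        return True


class StreamLines:
    """Hypothetical forward-only parser: one 'label sequence' pair per line."""

    def __init__(self, stream):
        self.stream = stream
        self.record = None
        self.move_next()          # load the first record, if there is one

    def current(self):
        return self.record

    def move_next(self):
        line = self.stream.readline()
        if not line:
            return False          # EOF: keep the last record as current
        self.record = line.split()
        return True

    def move_previous(self):
        return False              # a stream cannot be rewound


mem = InMemoryRecords([["a", "ACGT"], ["b", "GGCC"]])
stream = StreamLines(io.StringIO("a ACGT\nb GGCC\n"))
```

The calling code sees the same three methods on both objects and never has
to know which kind of source it was handed.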

2) The CREATION of the sequence object be abstracted out to a "seq_factory"
class.  This isn't really necessary to the goal of just abstracting the
data parsers, but I think it would be very handy (and useful outside of the
Parser object, for example in the event that someone wants a set of seq
objects for fragments from the resten class instead of only strings).  If the
structure of the sequence object is ever changed, the abstraction provided by
the "factory" object protects the objects further out from the "center" of the
GenePHP structure from having to worry about it.  It also means that, for
example, if we ever decide we want to add support for some OTHER sequence
format (perhaps to let the BioJava guys make use of some of our PHP routines),
all that would be needed is an additional "factory" object, and all of
the classes that generate sequence data can immediately support it without
any real changes.
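
A minimal sketch of the factory idea in Python.  The Seq class here is only a
stand-in (the real GenePHP sequence object's fields are not shown in this
thread), and SeqFactory, DictFactory, make() and build_all() are hypothetical
names chosen for the example.

```python
class Seq:
    """Stand-in for the sequence object; its real structure is an assumption."""

    def __init__(self, label, sequence):
        self.label = label
        self.sequence = sequence


class SeqFactory:
    """Hypothetical factory: the single place that knows how to build a Seq."""

    def make(self, record):
        # Only the agreed-upon first two elements are relied on.
        return Seq(record[0], record[1])


class DictFactory:
    """Hypothetical second factory producing some OTHER representation."""

    def make(self, record):
        return {"id": record[0], "residues": record[1]}


def build_all(records, factory):
    """Turn plain [label, sequence, ...] records into whatever the factory makes."""
    return [factory.make(record) for record in records]


# Records could come from a file parser or, say, from a digest routine.
fragments = [["frag1", "GAATTC"], ["frag2", "AAGCTT"]]
as_seqs = build_all(fragments, SeqFactory())
as_dicts = build_all(fragments, DictFactory())
```

If the Seq constructor ever changes, only SeqFactory.make() has to follow;
supporting another output format means adding one more factory and leaving
the code that produces the records alone.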

That's all.  The "end-usage" of the Parser object doesn't change at all (like
I said, I think the approach you've developed for the Parser object's
interaction with the end-user is optimal.)
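
To illustrate how the end-user side could stay the same while the pieces
underneath are separated, here is a compact Python sketch.  Everything in it
is an assumption made for the example: the two toy formats, the names
read_fasta, read_tab, sniff_format and READERS, and the make_seq hook.  It is
not the current Parser class, only the shape of the layering described above.

```python
def read_fasta(text):
    """Hypothetical FASTA reader returning plain [label, sequence] records."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            label = (line[1:].split() or [""])[0]
            records.append([label, ""])
        elif line and records:
            records[-1][1] += line
    return records


def read_tab(text):
    """Hypothetical reader for a toy 'label<TAB>sequence' format."""
    return [line.split("\t")[:2] for line in text.splitlines() if "\t" in line]


READERS = {"fasta": read_fasta, "tab": read_tab}


def sniff_format(text):
    """Guess the format from the first line; raise if nothing matches."""
    first_line = text.lstrip().split("\n", 1)[0]
    if first_line.startswith(">"):
        return "fasta"
    if "\t" in first_line:
        return "tab"
    raise ValueError("unrecognized sequence format")


class Parser:
    """Facade: detects the format, picks a reader, hands out sequence objects."""

    def __init__(self, text, fmt=None, make_seq=tuple):
        self.fmt = fmt or sniff_format(text)   # detection only if not specified
        self._records = READERS[self.fmt](text)
        self._make_seq = make_seq              # stands in for a seq factory

    def __iter__(self):
        for record in self._records:
            yield self._make_seq(record)


fasta_seqs = list(Parser(">a some description\nAC\nGT\n"))
tab_parser = Parser("b\tGGCC\n")
```

The caller still just hands data to Parser and iterates over the results;
the readers underneath remain usable on their own.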

NOW am I making sense, or have I badly mangled the terminology even worse than