[Biophp-dev] Egad, this is getting long :-)

S Clark biophp-dev@bioinformatics.org
Tue, 29 Apr 2003 13:37:51 -0600


However, HOPEFULLY this message will manage to articulate what
I've not been articulating very well up to this point...

On Tuesday 29 April 2003 02:20, Serge Gregorio wrote:
> Hello all!
>
> >SC: It just seems we have two different approaches going on at the same
> > time - with GenePHP it has more of a "Windows" type approach of tying
> > things together into larger sets of easy-to-use segments ... while I have
> > been going for more of a "Unix" type of approach of lots of small,
> > independent modules ...
>
> I think the analogy is not accurate as Windows is more than just large sets
> of easy to use segments.  It is a CLOSED system.  GenePHP and its code is
> very much "open".

Obviously true, I was referring only to the "design style" not the
moral philosophy :-)

> Having said that, I suggest we avoid making broad,
> sweeping comments like the above, and be more specific like below.  Using
> labels like "monolithic" is acceptable, but controversial labels like
> "Windows" (synonymous to "Nazi" for some) must be avoided. We don't want to
> drive away Windows developers from the project, do we?

Sheesh!  I didn't realize even WINDOWS users thought "windows" was a bad word!
Normally I'd make a joke at this point about how "of course we don't
want any stinky old windows users" (pretending that I don't know that YOU are
a windows user), but given my communication success record the last couple of
days someone would probably think I was serious, so I won't :-)

I was referring not at all to anything specifically BAD about windows design,
only that windows programmers seem to always tend towards putting everything
inside of one application, while unix programmers tend towards chaining
lots of little applications together (somewhat analogous to my rudimentary
understanding of "abstraction" in OO design).

And, as I said in the last message, the approaches COMPLEMENT each other.

> Finally, "easy-to-use" shouldn't be a bad thing.  PHP, Mandrake Linux, and
> a host of other open source products take pride in being "easy-to-use".

Ah, that's where my complaint is here - as designed, the parser is ONLY easy
to use for END-USERS, but it's NOT so easy to use for DEVELOPERS (particularly
CONTRIBUTING developers)...

There's nothing at ALL wrong with the GenePHP design as applies to end-users.
I think in fact that it's going in exactly the right direction there,
abstracting the complicated stuff away so end-users don't have to worry about
it.

> My guess is you're getting at the issue of "granularity" of modules.  You
> find GenePHP to have a "large granularity" (like rocks) instead of a
> "small/fine granularity" (like sand).  Did I read you right?

Essentially, yes.  Bearing in mind that also, at this time, I'm speaking
specifically only of the parser design (with the notion that the parser
design may have been indicative of future design of other areas.)

More specifically, though, I'm talking not so much about "granularity"
as "dependency" - right now, as designed, you cannot add a parser to GenePHP
without making it in such a way that it ONLY works in GenePHP.  This means
anyone who wants to contribute a parser (as an example) has to
learn GenePHP structure, and write it from scratch to fit into the
GenePHP-specific form (it's not even an independent class, it's a function
that is "engulfed" into the parser class.  This makes it extremely messy,
if it's possible at all, to implement stream-based parsers or parsers for
"interleaved" data formats [e.g. clustal].)

> From what I understand, Sean seems to be suggesting this:
>
>       outputs                          outputs
>  parser ---> string, array ---> importer ---> Seq object,
>                                               other GenePHP objects
>
> There is value to that scheme from a team development point of view (i.e.
> Sean need not understand or use GenePHP objects).
>
> However, the alternative scheme (shown below) which Sean labels as
> "tight-coupling" is not inherently bad or flawed.
>
>       outputs
>  parser ---> object

Look at it this way - here's the way the design currently is:
(hopefully everyone's using a fixed-width font in their text email):

From the end user perspective, it's nice and easy to use.  In "pseudo-code":

$parser = new parser("Some_File.fasta");
$seq = $parser->next_sequence(); //making up names here...just illustrating.

But from the perspective of a contributing developer:

seq Object  ---------------------------
----------                            |
     ^                                |
    returns                           |
     ----Parser object <---         requires
               |          |         seq obj
               |          |           |
               |    returns seq obj   |-------------
               |          |           V            |
               fasta_memory_parser() function      |
               |                                   |
               genbank_memory_parser() function <--|
               |                                   |
               various_other_memory_parsers()<------


Now...what I am advocating means the end user does something like this:

$parser = new parser("Some_File.fasta");
$seq = $parser->next_sequence(); //making up names here...just illustrating.

Hmmm...exactly the same, isn't it.  Still easy to use.  But to the
contributing developer, it would be simply:

seq Object < --------
-----------         |
     ^             requires
  returns           |
     |          seq_object_factory Object <------
     |              |                           |
     |        passes seq obj back through       |
     |              |                           |
     |              |                   sends data to
     |              V                           |
     -------------------------------Parser Object
                                        ^
                                        |
                                 Instantiates and
                                   Gets data from
                                        |
                                        V
                                Fasta_Parse Object
                      (or Clustal_Parse Object or whatever)
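
To make the layout above a bit more concrete, here is a minimal sketch in
PHP (assuming PHP 7.1 or later for the type hints).  Every name in it (Seq,
SeqObjectFactory, FastaParse, Parser, nextRecord, nextSequence) is made up
for illustration and is not existing GenePHP or BioPHP code.  It also reads
from a string rather than a file, only to keep the example self-contained.
The point is just that the class doing the parsing hands back plain arrays,
and only the factory knows what a seq object looks like:

```php
<?php
// Hypothetical sketch: none of these classes exist in GenePHP or BioPHP.

// "Center" of the web: the sequence object itself.
class Seq
{
    public $id;
    public $sequence;

    public function __construct(string $id, string $sequence)
    {
        $this->id = $id;
        $this->sequence = $sequence;
    }
}

// The only place that knows how a Seq is put together.
class SeqObjectFactory
{
    public function create(array $record): Seq
    {
        return new Seq($record['id'], $record['sequence']);
    }
}

// "Edge" of the web: understands FASTA text, knows nothing about Seq.
class FastaParse
{
    private $lines;
    private $pos = 0;

    public function __construct(string $text)
    {
        $this->lines = preg_split('/\R/', trim($text));
    }

    private function atHeader(): bool
    {
        return strncmp($this->lines[$this->pos], '>', 1) === 0;
    }

    // Returns ['id' => ..., 'sequence' => ...], or null when nothing is left.
    public function nextRecord(): ?array
    {
        $count = count($this->lines);
        while ($this->pos < $count && !$this->atHeader()) {
            $this->pos++; // skip anything before the next ">" header
        }
        if ($this->pos >= $count) {
            return null;
        }
        $header = substr($this->lines[$this->pos++], 1);
        $id = preg_split('/\s+/', trim($header))[0];
        $sequence = '';
        while ($this->pos < $count && !$this->atHeader()) {
            $sequence .= trim($this->lines[$this->pos++]);
        }
        return ['id' => $id, 'sequence' => $sequence];
    }
}

// What the end user sees: wires an edge parser to the factory.
class Parser
{
    private $source;
    private $factory;

    public function __construct(string $text, string $format)
    {
        if ($format !== 'fasta') {
            throw new InvalidArgumentException("No parser for format: $format");
        }
        $this->source = new FastaParse($text);
        $this->factory = new SeqObjectFactory();
    }

    public function nextSequence(): ?Seq
    {
        $record = $this->source->nextRecord();
        return $record === null ? null : $this->factory->create($record);
    }
}

$parser = new Parser(">seq1 demo\nACGT\nTTAA\n>seq2\nGGCC\n", 'fasta');
while (($seq = $parser->nextSequence()) !== null) {
    echo $seq->id, ': ', $seq->sequence, "\n";
}
```

In this sketch FastaParse could be handed to someone who has never seen
the seq object, which is the whole idea.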


NOW...say that later we decide the structure of the seq object needs
a complete overhaul.  All that needs to change is the seq_object_factory
to match, and everything works again, because the parsers at the
"edge" of the web of the GenePHP system don't need to encapsulate
the seq object from the "center" of the web, and can focus entirely on
"getting the data out".  That is said to be one of the primary benefits of OO
design...

Note that the object NAMED "Parser" is STILL returning seq objects,
but the "Object doing the parsing" (out on the "edge", the interface
to the actual file/stream/string) is properly separated from the
task of "putting the data into the GenePHP format", which is handled
by another dedicated object.

This design ALSO means that if we later want to add support for some OTHER
object format, we need only add an "other_object_factory" object and tell
the Parser uber-object that it exists, rather than making a whole parallel
parser structure, or having to edit all of the parse-next-record functions.
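
As a rough sketch of that "tell the Parser about another factory" idea
(again assuming PHP 7.1+, and again with purely hypothetical names:
RecordFactory, SeqFactory, OtherObject, OtherObjectFactory and
registerFactory are not part of any existing code), the uber-object could
keep a list of factories and pick one per request.  To keep it short, the
plain records are passed in directly instead of coming from a real edge
parser:

```php
<?php
// Hypothetical sketch; all names are illustrative, not GenePHP code.

interface RecordFactory
{
    // Turn one plain parsed record into some object format.
    public function create(array $record);
}

class Seq
{
    public $id;
    public $sequence;

    public function __construct(string $id, string $sequence)
    {
        $this->id = $id;
        $this->sequence = $sequence;
    }
}

class SeqFactory implements RecordFactory
{
    public function create(array $record)
    {
        return new Seq($record['id'], $record['sequence']);
    }
}

// Some OTHER object format, here just a label plus a length.
class OtherObject
{
    public $label;
    public $length;

    public function __construct(string $label, int $length)
    {
        $this->label = $label;
        $this->length = $length;
    }
}

class OtherObjectFactory implements RecordFactory
{
    public function create(array $record)
    {
        return new OtherObject($record['id'], strlen($record['sequence']));
    }
}

class Parser
{
    private $records;        // plain arrays, as an edge parser would produce
    private $factories = [];

    public function __construct(array $records)
    {
        $this->records = $records;
    }

    public function registerFactory(string $name, RecordFactory $factory): void
    {
        $this->factories[$name] = $factory;
    }

    // Returns the next record built by the named factory, or null at the end.
    public function next(string $as)
    {
        if (!isset($this->factories[$as])) {
            throw new InvalidArgumentException("Unknown object format: $as");
        }
        $record = array_shift($this->records);
        return $record === null ? null : $this->factories[$as]->create($record);
    }
}

$parser = new Parser([
    ['id' => 'seq1', 'sequence' => 'ACGT'],
    ['id' => 'seq2', 'sequence' => 'GG'],
]);
$parser->registerFactory('seq', new SeqFactory());
$parser->registerFactory('other', new OtherObjectFactory());

$first = $parser->next('seq');    // a Seq
$second = $parser->next('other'); // an OtherObject
```

Nothing on the parsing side changes when the second factory is added.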

At a far extreme, if some proprietary company offers to the public a
pre-compiled binary to extract data from their proprietary data format (but
doesn't want to reveal their magic secret data format), we CAN then support it
if we want, by making the "ProprietaryFormat_Parser Object" be a simple
frontend, which can be developed INDEPENDENT of the rest of the project (so
that Proprietary Data Incorporated could even have one of their own developers
write the PHP wrapper and give it out to everyone, rather than having to write
it specially for GenePHP.)

> If it were, then BioPerl and other OO BioXXX are flawed because their
> parsers return objects, instead of strings or arrays.  For instance,
> BioPerl's SeqIO parser/writer returns a data stream object.

That's because the SeqIO object is next to the "center" of the BioPerl
"web".  Note the reference within the genbank parsing part of the
system to "seqfactory", however - the actual object at the "edge" doing
the parsing isn't the part that generates the seq object:

http://doc.bioperl.org/releases/bioperl-1.2/Bio/SeqIO/genbank.html

It LOOKS like Bioperl turns out to be doing something like I'm advocating...
(I just looked this morning, so I wasn't sure until now :-) )

> If, for argument's sake, we were to follow Sean's logic to the extreme, and
> limit our parsers (and other functions) to return ONLY strings or arrays,
> then we would end up with a procedural and not an object-oriented BioPHP.
> That is not bad by itself, but such a construct should not claim to be OO.

Not at all, in fact I'm finding myself arguing that what I'm proposing (but
have probably been badly describing, due to lack of experience) is BETTER
OO design...If I actually had more experience at OO design I'd probably not
sound so indecisive about it...

> > paragraph.  Aren't 'standard arrays' and Genephp objects the same?  I
> > mean the classes that Serge proposes are open to discussion (I hope)
>
> Yes, they are, you have my word on that.  =)

Well...no, but that's partly because "standard arrays" is probably
a really bad way for me to describe it.

Objects have a more "rigid" and custom structure (and incorporate
methods).  What I've been (badly) trying to get across is that
I don't think that every single part of the "web" of GenePHP
should be required to work with only GenePHP objects, until they
get nearer the "center" of the web where the objects are "dedicated" to
GenePHP object manipulation...(hopefully the illustrations I put
above better explain what I'm trying to get across).

> >SC: I'm certain a "middle ground" can be worked out, but there isn't
> >much point in immediately "throwing together" both sets of code as they
> > are now - the two sets of code are CURRENTLY approaching from different
> > (somewhat incompatible) philosophies.
>
> I guess no one is opposed to the idea of merging the code base.  The
> disagreement lies in WHEN this should take place.  Nico wants it early,
> Sean wants it later. I have no problems with either, though I'm slightly
> leaning towards one camp.

Not even so much WHEN as WHETHER IT CAN.  Right this instant, it just
wouldn't accomplish much.  Most of what I have gets either "overwritten"
(because GenePHP already has the same type of object) or "made irrelevant"
(e.g., the parsers, because they cannot be made to fit).  Actually, Nico is
right, the ESearch and EFetch code is a new area, so IT can be dropped in with
GenePHP whenever we want to, but as for the rest, I have:

a sequence object for nucleotide sequences.
Overwritten.  This isn't a bad thing at all - my code isn't any "better
written" than the GenePHP seq object, and the GenePHP object has more
functionality already.

a "sequence list" class (for manipulating groups of sequences, including
alignments).
Irrelevant.  The existing SeqAlign class in GenePHP can easily be extended
to add the handful of functions for e.g. exporting the sequences to different
file formats or collecting sequences from multiple places into a single
"list".  Again, not a bad thing.

Note that I am not the least bit bothered by those two objects being
"irrelevanted" - they are genuinely redundant, and the GenePHP versions
are currently more functional, so we should use them instead of my versions.
(NOTE: I WOULD, however, for my own education and future coding, be interested
in any comments people might have on how I was approaching those two objects,
though...)


a fasta_stream parser
disappears - it is not (and cannot "reasonably" be) written as a single
function that returns GenePHP seq objects.

a clustal parser
disappears - same as above, but more so (dealing with interleaved formats
seems particularly problematic in the current design).

It's not a matter of "you big meanies are picking on my work", but that
what I've got is, literally, mostly irrelevant to the current GenePHP
design and not mergeable in any way, so to me, the call AT PRESENT to merge
the codebases is nonsensical.

It's like the old joke about what you get when you combine IBM and Apple
(Answer: "IBM" :-) )

> Well, you shouldn't be reluctant to ask.  And if nobody listens, you can
> always rewrite the code yourself.  That's why it's called open source. =)

Well, okay, consider me asking as described above then :-)

I really don't like the idea of unnecessarily "forking" (or, rather,
"keeping forked") the work - we're all working on the same general
goal, and, it sounds like, even the same general philosophy (good OO design,
modules properly abstracted so as to be easy to work with, etc.).  Also,
as I look through the REST of GenePHP, it does NOT appear that the
rest is so glaringly "interdependent" - for example, the restriction enzyme
object returns normal strings from the cutting action (which means someone
can use it to just get data to paste into another program or document,
for example, without having to know about the seq object).  I would
advocate abstracting the seq object from it a bit more (i.e. have the seq
object "give" the sequence to the resten object as a string, rather than
the resten object having to "engulf" the seq object and pry the
sequence out of it as a string...but in the case of the resten object
someone who comes along and wants to add support for a new restriction enzyme,
for example, really doesn't need to know about the seq object themselves, so
it's less of an "ease of contributor use" issue than just a "where the proper
abstraction line is" philosophical issue, which is less important.).
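
A minimal sketch of what "the seq object gives the sequence to the resten
object as a string" could look like (PHP 7.1+ assumed).  The RestEn class,
its constructor arguments and the Seq::sequence() accessor are hypothetical
and are not the actual GenePHP restriction enzyme code.  The only outside
fact used is that EcoRI recognises GAATTC and cuts after the G:

```php
<?php
// Hypothetical sketch; not the real GenePHP restriction enzyme object.

class Seq
{
    private $sequence;

    public function __construct(string $sequence)
    {
        $this->sequence = $sequence;
    }

    // The seq object "gives" its sequence away as a plain string.
    public function sequence(): string
    {
        return $this->sequence;
    }
}

class RestEn
{
    private $site;
    private $cutOffset;

    // $cutOffset: number of bases into the recognition site where the cut falls.
    public function __construct(string $site, int $cutOffset)
    {
        $this->site = $site;
        $this->cutOffset = $cutOffset;
    }

    // Takes a plain string, returns plain strings: no seq object required.
    public function cut(string $sequence): array
    {
        $fragments = [];
        $start = 0;
        $from = 0;
        while (($hit = strpos($sequence, $this->site, $from)) !== false) {
            $cutAt = $hit + $this->cutOffset;
            $fragments[] = substr($sequence, $start, $cutAt - $start);
            $start = $cutAt;
            $from = $hit + 1;
        }
        $fragments[] = substr($sequence, $start);
        return $fragments;
    }
}

$seq = new Seq('AAGAATTCTTGAATTCGG');
$ecoRI = new RestEn('GAATTC', 1);
$fragments = $ecoRI->cut($seq->sequence());
// $fragments is ['AAG', 'AATTCTTG', 'AATTCGG']
```

Someone who only has a string pasted from another program can call cut()
directly, without ever building a seq object.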

Also given that right now, even between both of the codebases, we're only
just getting started and there's really only three areas with anything
in them (parsers at one 'edge', EUtils interface at another "edge", and custom
seq objects and related seq object manipulators at the "center"), if my
notions of how the parser section might be better designed seem valid, then
there's not much work at all to get things merged Real Soon.

> But if "custom" means "it can't be used by other bioinformatics developers
> or applications", well, that certainly isn't the intention.

That's SORT OF what I meant.

I think my main problem is that I'm trying to get across that the parser
design currently "feels" very "closed in on itself" (not in the sense of
'closed source' but in the sense that it only works "within itself") and
that it doesn't seem like it's very well abstracted, while I'm trying to
learn and use (or, more accurately, learn BY using) "good OO design"...but
I really don't want to say that "it's not good OO design that way" since,
after all, if I KNEW for sure what good OO design was, I wouldn't be worrying
about learning it :-)  As it is, if I were to try to make statements on
what is and is not good OO design, I'd feel like an illiterate trying
to tell you all what makes a good book...

I was basically assuming that the "enclosed" style of the parser was on
purpose and indicative of the way the whole GenePHP system was intended
to be, which was not compatible with my own design goals.  It's really
starting to look/sound like this isn't really the case, and I'm just
being an overstressed idiot :-)

> Oh alright, just so we're not confused.  Here's my proposal:
>
> When referring to code, I'll use the term BiogenePHP or BgPHP to refer to
> the whatever common code base we can come up with in the future.  In this
> context, BioPHP and GenePHP are "branches" or "flavors" or "distributions"
> of BgPHP.
>
> As an analogy, BiogenePHP is Linux, BioPHP is RedHat Linux, while GenePHP
> is Mandrake Linux. The success of one should be the success of all. (See
> Fig. 1 way below.)

Well, this was sort of what I was suggesting with the "dual projects" comment,
but that was predicated on the notion that "GenePHP" and the code that I
was working on had different design goals.  It appears that this may
not be the case...

Presuming that it DOES turn out that we're both really going for the
same design goals, there's no other reason to keep them separate.

> I've downloaded your codes "manually" as I'm
> not registered as a developer here yet.  Sean, I think I've
> registered as dgregorio or d.gregorio but never mind this.
> I'll scrap it and re-enter as flipmozart, okay?)

I can do a search for dgregorio and d.gregorio if you've already got that
registered on bioinformatics.org if you'd prefer to use that, but it
doesn't matter to me.  Either way, I can get it into the system once
it's there.

> As projects, they are two separate projects with two separate websites,
> hosts (SF and bioinfo.org), organization, etc. and may be run in whatever
> manner as their members see fit.
>
> (One can be a "democratic state", another can be a "fascist state", etc.
> LOL)
[...]
> Figure 3.
>
>           BiogenePHP or BioPHP **
>
>   ** With headquarters located at One BiogenePHP Way,
>      Shuttle, Worseshington.  =)
>
> Again, this is *STILL* a proposal I've come up with to reflect current
> realities.
>
> Having said that, coming up with some common code base is a shared
> objective, and should be a priority.

ASSUMING that there actually isn't any conflict of design philosophy here, the
only thing that stands in the way of merging the codebase is the parser
design.  If you and Nico are willing to adjust the design of the parser to
abstract it more, then it becomes useful to merge in my two parsers as well.
(I am, of course, quite willing to help on coding that as well...)

The ESearch/ESummary modules, on the other hand, are a completely "new" area
and don't touch any of the other GenePHP sections, so we can import THAT
as is if you'd like, regardless of what happens to the parser design.

If we agree on the design philosophy, then I propose we go ahead and merge the
codebases as soon as the parser can allow it.  I propose to use "BioPHP" as an
umbrella term to refer to the entirety of "biological data handling with PHP",
and BioGenePHP (or just GenePHP for short) for the specific portion that deals
with sequence-related/detailed genetic information (which I see as being the
most important part, though not the ONLY part).  Later on, as we add
"BioGISPHP" (for environmental and population data and such), "BioChromaPHP"
(for dealing with LC Chromatograms), "BioMSPHP" (for dealing with mass
spectra), BioImagePHP (for dealing with E.G. images of "gene chips" and
gels) etc. etc. etc. they'll all be "BioPHP"...

As far as the websites, etc., we keep both - The GenePHP site stays as it is,
keeping charge of the details of the (Bio)GenePHP project, while I modify
my "bioPHP" site to reflect that it is an "umbrella" or "portal" site that
encompasses references to all of the related BioPHP projects (and points to
GenePHP as the current focus of development).  We can keep the
CVS repository there as well as the mailing list (i.e. "biophp-dev" is for
all "biological data in PHP" issues).  If traffic grows large and varied
enough, we can set up additional "specific" mailing lists for the individual
parts of the overall project (which can be EITHER off of bioinformatics.org OR
off of whatever site is hosting the "specific part of the project" page, i.e.
Sourceforge for GenePHP), though unless the concept becomes even more
wildly popular than I expect, I don't think the mailing list traffic will
get THAT heavy...

Thoughts?