[Biophp-dev] Egad, this is getting long :-)

Nico Stuurman biophp-dev@bioinformatics.org
Tue, 29 Apr 2003 13:28:47 -0700


OK.  Here I am, interested in using PHP for bioinformatics projects,
and especially in getting code to parse the different sequence file
formats that are going around in the lab.  I find the BioPHP site, find
no code, ask on the mailing list, get code from another project called
GenePHP, find that the parsers are too encapsulated to be useful
outside of the class they live in, spend (to the dismay of my
wife) part of the weekend abstracting them to make them useful as
independent units, and then...
- get long rants about how the new parser class signifies the evil
tendencies of GenePHP to have Windows-like properties...???

Come on Sean!  I think you are getting at a good point, but please work 
on your communication skills and think a bit before you write!

To summarize what I think (after endless reading) are Sean's ideas:
1. At the heart of the parsing class should be parser functions (one
for each file type) that can be asked to return the current sequence,
the next sequence, the previous sequence, the first and last sequence.
They return the sequence in a data format (array?) that is different for
each parser.  The parsers can take filenames, filepointers and strings
as arguments.  They can do the parsing in memory or on disk.  They
can use index files - if available - to work with large datasets.
2. At the next level is a class that 'translates' the return value of 
the parsers into a 'Seq' object.
3. Somewhere (at the top?) is a Parsers class that 'sniffs' the 
argument passed to it and sends it to the right parser.

Not a bad idea.  The current design has some of the functionality of 1
moved to 3 (which avoids having to rewrite it to fit every parser, but
might lead to situations where a parser cannot do what it is supposed
to do).  It has point 2 integrated into 1.
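
To check that I understand the three layers, here is a minimal sketch in
present-day PHP (7.4 or later).  Every name in it (FastaRecordParser, Seq,
seq_from_record, sniff_format, Parsers) is made up for illustration and is
not the GenePHP or BioPHP API.  It only handles FASTA from an in-memory
string and leaves out the previous/first/last and index-file parts of
point 1.

```php
<?php
// Layer 1: a format-specific parser. It knows nothing about Seq objects
// and hands back plain arrays. (Hypothetical name; in-memory only.)
class FastaRecordParser
{
    private array $records = [];
    private int $pos = 0;

    public function __construct(string $text)
    {
        $current = null;
        foreach (preg_split('/\R/', $text) as $line) {
            $line = trim($line);
            if ($line === '') {
                continue;
            }
            if ($line[0] === '>') {
                // A header line closes the previous record and starts a new one.
                if ($current !== null) {
                    $this->records[] = $current;
                }
                $current = ['id' => substr($line, 1), 'sequence' => ''];
            } elseif ($current !== null) {
                $current['sequence'] .= $line;
            }
        }
        if ($current !== null) {
            $this->records[] = $current;
        }
    }

    // Returns the next record as an array, or null when none are left.
    public function next(): ?array
    {
        if ($this->pos >= count($this->records)) {
            return null;
        }
        return $this->records[$this->pos++];
    }
}

// Layer 2: translates a parser's array into a Seq object.
class Seq
{
    public string $id;
    public string $sequence;

    public function __construct(string $id, string $sequence)
    {
        $this->id = $id;
        $this->sequence = $sequence;
    }
}

function seq_from_record(array $record): Seq
{
    return new Seq($record['id'], $record['sequence']);
}

// Layer 3: sniffs the input and sends it to the right parser.
// Assumption for this sketch: only FASTA (leading '>') is recognized.
function sniff_format(string $text): string
{
    return strncmp(ltrim($text), '>', 1) === 0 ? 'fasta' : 'unknown';
}

class Parsers
{
    private FastaRecordParser $parser;

    public function __construct(string $text)
    {
        if (sniff_format($text) !== 'fasta') {
            throw new InvalidArgumentException('Unrecognized sequence format');
        }
        $this->parser = new FastaRecordParser($text);
    }

    public function nextSequence(): ?Seq
    {
        $record = $this->parser->next();
        return $record === null ? null : seq_from_record($record);
    }
}

// Usage: the end user only ever sees the Parsers object and Seq objects.
$parsers = new Parsers(">seq1\nACGT\nTTGA\n>seq2\nGGCC\n");
while (($seq = $parsers->nextSequence()) !== null) {
    echo $seq->id, ': ', $seq->sequence, "\n";
}
```

In this arrangement only seq_from_record() would need to change if the Seq
class were redesigned; the FASTA parser itself never touches it.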

Is this about it?

Sean, can you please try to be a little bit more to the point?


best,

Nico



On Tuesday, April 29, 2003, at 12:37 PM, S Clark wrote:

> However, HOPEFULLY this message will manage to articulate what
> I've not been articulating very well up to this point...
>
> On Tuesday 29 April 2003 02:20, Serge Gregorio wrote:
>> Hello all!
>>
>>> SC: It just seems we have two different approaches going on at the
>>> same time - with GenePHP it has more of a "Windows" type approach of
>>> tying things together into larger sets of easy-to-use segments ...
>>> while I have been going for more of a "Unix" type of approach of lots
>>> of small, independent modules ...
>>
>> I think the analogy is not accurate as Windows is more than just large
>> sets of easy to use segments.  It is a CLOSED system.  GenePHP and its
>> code are very much "open".
>
> Obviously true, I was referring only to the "design style" not the
> moral philosophy :-)
>
>> Having said that, I suggest we avoid making broad, sweeping comments
>> like the above, and be more specific like below.  Using labels like
>> "monolithic" is acceptable, but controversial labels like "Windows"
>> (synonymous to "Nazi" for some) must be avoided. We don't want to
>> drive away Windows developers from the project, do we?
>
> Sheesh!  I didn't realize even WINDOWS users thought "windows" was a
> bad word!  Normally I'd make a joke at this point about how "of course
> we don't want any stinky old windows users" (pretending that I don't
> know that YOU are a windows user), but given my communication success
> record the last couple of days someone would probably think I was
> serious, so I won't :-)
>
> I was referring not at all to anything specifically BAD about windows
> design, only that windows programmers seem to always tend towards
> putting everything inside of one application, while unix programmers
> tend towards chaining lots of little applications together (somewhat
> analogous to my rudimentary understanding of "abstraction" in OO
> design).
>
> And, as I said in the last message, the approaches COMPLEMENT each other.
>
>> Finally, "easy-to-use" shouldn't be a bad thing.  PHP, Mandrake Linux,
>> and a host of other open source products take pride in being
>> "easy-to-use".
>
> Ah, that's where my complaint is here - as designed, the parser is ONLY
> easy to use for END-USERS, but it's NOT so easy to use for DEVELOPERS
> (particularly CONTRIBUTING developers)...
>
> There's nothing at ALL wrong with the GenePHP design as applies to
> end-users.  I think in fact that it's going in exactly the right
> direction there, abstracting the complicated stuff away so end-users
> don't have to worry about it.
>
>> My guess is you're getting at the issue of "granularity" of modules.
>> You find GenePHP to have a "large granularity" (like rocks) instead of
>> a "small/fine granularity" (like sand).  Did I read you right?
>
> Essentially, yes.  Bearing in mind that also, at this time, I'm speaking
> specifically only of the parser design (with the notion that the parser
> design may have been indicative of future design of other areas.)
>
> More specifically, though, I'm talking not so much about "granularity"
> as "dependency" - right now, as designed, you cannot add a parser to
> GenePHP without making it in such a way that it ONLY works in GenePHP.
> This means anyone who wants to contribute a parser (as an example) has
> to learn GenePHP structure, and write it from scratch to fit into the
> GenePHP-specific form (it's not even an independent class, it's a
> function that is "engulfed" into the parser class.  This makes it
> extremely messy, if it's possible at all, to implement stream-based
> parsers or parsers for "interleaved" data formats [e.g. clustal].)
>
>> From what I understand, Sean seems to be suggesting this:
>>
>>       outputs                          outputs
>>  parser ---> string, array ---> importer ---> Seq object,
>>                                               other GenePHP objects
>>
>> There is value to that scheme from a team development point of view
>> (i.e. Sean need not understand or use GenePHP objects).
>>
>> However, the alternative scheme (shown below) which Sean labels as
>> "tight-coupling" is not inherently bad or flawed.
>>
>>       outputs
>>  parser ---> object
>
> Look at it this way - here's the way the design currently is:
> (hopefully everyone's using a fixed-width font in their text email):
>
> From the end user perspective, it's nice and easy to use.  In
> "pseudo-code":
>
> $parser = new parser("Some_File.fasta");
> $seq = $parser->next_sequence(); // making up names here...just illustrating.
>
> But from the perspective of a contributing developer:
>
> seq Object  ---------------------------
> ----------                            |
>      ^                                |
>     returns                           |
>      ----Parser object <---         requires
>                |          |         seq obj
>                |          |           |
>                |    returns seq obj   |-------------
>                |          |           V            |
>                fasta_memory_parser() function      |
>                |                                   |
>                genbank_memory_parser() function <--|
>                |                                   |
>                various_other_memory_parsers()<------
>
>
> Now...what I am advocating means the end user does something like this:
>
> $parser = new parser("Some_File.fasta");
> $seq = $parser->next_sequence(); // making up names here...just illustrating.
>
> Hmmm...exactly the same, isn't it.  Still easy to use.  But to the
> contributing developer, it would be simply:
>
> seq Object < --------
> -----------         |
>      ^             requires
>   returns           |
>      |          seq_object_factory Object <------
>      |              |                           |
>      |        passes seq obj back through       |
>      |              |                           |
>      |              |                   sends data to
>      |              V                           |
>      -------------------------------Parser Object
>                                         ^
>                                         |
>                                  Instantiates and
>                                    Gets data from
>                                         |
>                                         V
>                                 Fasta_Parse Object
>                       (or Clustal_Parse Object or whatever)
>
>
> NOW...say that later we decide the structure of the seq object needs
> a complete overhaul.  All that needs to change is the
> seq_object_factory to match, and everything works again, because the
> parsers at the "edge" of the web of the GenePHP system don't need to
> encapsulate the seq object from the "center" of the web, and can focus
> entirely on "getting the data out".  That is said to be one of the
> primary benefits of OO design...
>
> Note that the object NAMED "Parser" is STILL returning seq objects,
> but the "Object doing the parsing" (out on the "edge", the interface
> to the actual file/stream/string) is properly separated from the
> task of "putting the data into the GenePHP format", which is handled
> by another dedicated object.
>
> This design ALSO means that if we later want to add support for some
> OTHER object format, we need only add an "other_object_factory" object
> and tell the Parser uber-object that it exists, rather than making a
> whole parallel parser structure, or having to edit all of the
> parse-next-record functions.
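
(Inline, to check I follow the factory idea: a minimal sketch in present-day
PHP, 7.4 or later.  The parser object is handed a factory, so supporting a
second output format means writing one more factory and nothing else.  All
names here (RecordFactory, SeqObjectFactory, OtherObjectFactory, Parser) are
hypothetical, and the hard-coded record array stands in for whatever a real
Fasta_Parse object would deliver.)

```php
<?php
// Hypothetical interface: turns a plain record array into some object.
interface RecordFactory
{
    public function make(array $record): object;
}

class Seq
{
    public string $id;
    public string $sequence;

    public function __construct(string $id, string $sequence)
    {
        $this->id = $id;
        $this->sequence = $sequence;
    }
}

// Builds the project's own Seq objects.
class SeqObjectFactory implements RecordFactory
{
    public function make(array $record): object
    {
        return new Seq($record['id'], $record['sequence']);
    }
}

// A second output format: only this class is new, the parser is untouched.
class OtherObjectFactory implements RecordFactory
{
    public function make(array $record): object
    {
        $other = new stdClass();
        $other->label = $record['id'];
        $other->length = strlen($record['sequence']);
        return $other;
    }
}

// The "uber-object": walks the records and delegates object creation.
class Parser
{
    private array $records;
    private RecordFactory $factory;
    private int $pos = 0;

    public function __construct(array $records, RecordFactory $factory)
    {
        $this->records = $records;
        $this->factory = $factory;
    }

    public function nextSequence(): ?object
    {
        if ($this->pos >= count($this->records)) {
            return null;
        }
        return $this->factory->make($this->records[$this->pos++]);
    }
}

// Usage: same records, two different object formats.
$records = [['id' => 'seq1', 'sequence' => 'ACGT']];
$asSeq = (new Parser($records, new SeqObjectFactory()))->nextSequence();
$asOther = (new Parser($records, new OtherObjectFactory()))->nextSequence();
echo get_class($asSeq), ' ', $asOther->length, "\n";
```
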
>
> At a far extreme, if some proprietary company offers to the public a
> pre-compiled binary to extract data from their proprietary data format
> (but doesn't want to reveal their magic secret data format), we CAN
> then support it if we want, by making the "ProprietaryFormat_Parser
> Object" be a simple frontend, which can be developed INDEPENDENT of the
> rest of the project (so that Proprietary Data Incorporated could even
> have one of their own developers write the PHP wrapper and give it out
> to everyone, rather than having to write it specially for GenePHP.)
>
>> If it were, then BioPerl and other OO BioXXX are flawed because their
>> parsers return objects, instead of strings or arrays.  For instance,
>> BioPerl's SeqIO parser/writer returns a data stream object.
>
> That's because the SeqIO object is next to the "center" of the BioPerl
> "web".  Note the reference within the genbank parsing part of the
> system to "seqfactory", however - the actual object at the "edge" doing
> the parsing isn't the part that generates the seq object:
>
> http://doc.bioperl.org/releases/bioperl-1.2/Bio/SeqIO/genbank.html
>
> It LOOKS like Bioperl turns out to be doing something like I'm
> advocating... (I just looked this morning, so I wasn't sure until now :-) )
>
>> If, for argument's sake, we were to follow Sean's logic to the extreme,
>> and limit our parsers (and other functions) to return ONLY strings or
>> arrays, then we would end up with a procedural and not an
>> object-oriented BioPHP.  That is not bad by itself, but such a
>> construct should not claim to be OO.
>
> Not at all, in fact I'm finding myself arguing that what I'm proposing
> (but have probably been badly describing, due to lack of experience) is
> BETTER OO design...If I actually had more experience at OO design I'd
> probably not sound so indecisive about it...
>
>>> paragraph.  Aren't 'standard arrays' and Genephp objects the same?  I
>>> mean the classes that Serge proposes are open to discussion (I hope)
>>
>> Yes, they are, you have my word on that.  =)
>
> Well...no, but that's partly because "standard arrays" is probably
> a really bad way for me to describe it.
>
> Objects have a more "rigid" and custom structure (and incorporate
> methods).  What I've been (badly) trying to get across is that
> I don't think that every single part of the "web" of GenePHP
> should be required to work with only GenePHP objects, until they
> get nearer the "center" of the web where the objects are "dedicated" to
> GenePHP object manipulation...(hopefully the illustrations I put
> above better explain what I'm trying to get across).
>
>>> SC: I'm certain a "middle ground" can be worked out, but there isn't
>>> much point in immediately "throwing together" both sets of code as
>>> they are now - the two sets of code are CURRENTLY approaching from
>>> different (somewhat incompatible) philosophies.
>>
>> I guess no one is opposed to the idea of merging the code base.  The
>> disagreement lies in WHEN this should take place.  Nico wants it early,
>> Sean wants it later. I have no problems with either, though I'm
>> slightly leaning towards one camp.
>
> Not even so much WHEN as WHETHER IT CAN.  Right this instant, it just
> wouldn't accomplish much.  Most of what I have gets either
> "overwritten" (because GenePHP already has the same type of object) or
> "made irrelevant" (e.g., the parsers, because they cannot be made to
> fit).  Actually, Nico is right, the ESearch and EFetch code is a new
> area, so IT can be dropped in with GenePHP whenever we want to, but as
> for the rest, I have:
>
> a sequence object for nucleotide sequences.
> Overwritten.  This isn't a bad thing at all - my code isn't any "better
> written" than the GenePHP seq object, and the GenePHP object has more
> functionality already.
>
> a "sequence list" class (for manipulating groups of sequences,
> including alignments).
> Irrelevant.  The existing SeqAlign class in GenePHP can easily be
> extended to add the handful of functions for e.g. exporting the
> sequences to different file formats or collecting sequences from
> multiple places into a single "list".  Again, not a bad thing.
>
> Note that I am not the least bit bothered by those two objects being
> "irrelevanted" - they are genuinely redundant, and the GenePHP versions
> are currently more functional, so we should use them instead of my
> versions.  (NOTE: I WOULD, however, for my own education and future
> coding, be interested in any comments people might have on how I was
> approaching those two objects, though...)
>
>
> a fasta_stream parser
> disappears - it is not (and cannot "reasonably" be) written as a single
> function that returns GenePHP seq objects.
>
> a clustal parser
> disappears - same as above, but more so (dealing with interleaved
> formats seems particularly problematic in the current design).
>
> It's not a matter of "you big meanies are picking on my work", but that
> what I've got is, literally, mostly irrelevant to the current GenePHP
> design and not mergeable in any way, so to me, the call AT PRESENT to
> merge the codebases is nonsensical.
>
> It's like the old joke about what you get when you combine IBM and Apple
> (Answer: "IBM" :-) )
>
>> Well, you shouldn't be reluctant to ask.  And if nobody listens, you
>> can always rewrite the code yourself.  That's why it's called open
>> source. =)
>
> Well, okay, consider me asking as described above then :-)
>
> I really don't like the idea of unnecessarily "forking" (or, rather,
> "keeping forked") the work - we're all working on the same general
> goal, and, it sounds like, even the same general philosophy (good OO
> design, modules properly abstracted so as to be easy to work with,
> etc.).  Also, as I look through the REST of GenePHP, it does NOT appear
> that the rest is so glaringly "interdependent" - for example, the
> restriction enzyme object returns normal strings from the cutting
> action (which means someone can use it to just get data to paste into
> another program or document, for example, without having to know about
> the seq object).  I would advocate abstracting the seq object from it a
> bit more (i.e. have the seq object "give" the sequence to the resten
> object as a string, rather than the resten object having to "engulf"
> the seq object and pry the sequence out of it as a string).  But in the
> case of the resten object, someone who comes along and wants to add
> support for a new restriction enzyme, for example, really doesn't need
> to know about the seq object themselves, so it's less of an "ease of
> contributor use" issue than just a "where the proper abstraction line
> is" philosophical issue, which is less important.
>
> Also given that right now, even between both of the codebases, we're
> only just getting started and there's really only three areas with
> anything in them (parsers at one "edge", EUtils interface at another
> "edge", and custom seq objects and related seq object manipulators at
> the "center"), if my notions of how the parser section might be better
> designed seem valid, then there's not much work at all to get things
> merged Real Soon.
>
>> But if "custom" means "it can't be used by other bioinformatics
>> developers or applications", well, that certainly isn't the intention.
>
> That's SORT OF what I meant.
>
> I think my main problem is that I'm trying to get across that the
> parser design currently "feels" very "closed in on itself" (not in the
> sense of 'closed source' but in the sense that it only works "within
> itself") and that it doesn't seem like it's very well abstracted, while
> I'm trying to learn and use (or, more accurately, learn BY using) "good
> OO design"...but I really don't want to say that "it's not good OO
> design that way" since, after all, if I KNEW for sure what good OO
> design was, I wouldn't be worrying about learning it :-)  As it is, if
> I were to try to make statements on what is and is not good OO design,
> I'd feel like an illiterate trying to tell you all what makes a good
> book...
>
> I was basically assuming that the "enclosed" style of the parser was on
> purpose and indicative of the way the whole GenePHP system was intended
> to be, which was not compatible with my own design goals.  It's really
> starting to look/sound like this isn't really the case, and I'm just
> being an overstressed idiot :-)
>
>> Oh alright, just so we're not confused.  Here's my proposal:
>>
>> When referring to code, I'll use the term BiogenePHP or BgPHP to refer
>> to whatever common code base we can come up with in the future.  In
>> this context, BioPHP and GenePHP are "branches" or "flavors" or
>> "distributions" of BgPHP.
>>
>> As an analogy, BiogenePHP is Linux, BioPHP is RedHat Linux, while
>> GenePHP is Mandrake Linux. The success of one should be the success of
>> all. (See Fig. 1 way below.)
>
> Well, this was sort of what I was suggesting with the "dual projects"
> comment, but that was predicated on the notion that "GenePHP" and the
> code that I was working on had different design goals.  It appears that
> this may not be the case...
>
> Presuming that it DOES turn out that we're both really going for the
> same design goals, there's no other reason to keep them separate.
>
>> I've downloaded your codes "manually" as I'm
>> not registered as a developer here yet.  Sean, I think I've
>> registered as dgregorio or d.gregorio but never mind this.
>> I'll scrap it and re-enter as flipmozart, okay?)
>
> I can do a search for dgregorio and d.gregorio if you've already got
> that registered on bioinformatics.org if you'd prefer to use that, but
> it doesn't matter to me.  Either way, I can get it into the system once
> it's there.
>
>> As projects, they are two separate projects with two separate websites,
>> hosts (SF and bioinfo.org), organization, etc. and may be run in
>> whatever manner as their members see fit.
>>
>> (One can be a "democratic state", another can be a "fascist state",
>> etc. LOL)
> [...]
>> Figure 3.
>>
>>           BiogenePHP or BioPHP **
>>
>>   ** With headquarters located at One BiogenePHP Way,
>>      Shuttle, Worseshington.  =)
>>
>> Again, this is *STILL* a proposal I've come up with to reflect current
>> realities.
>>
>> Having said that, coming up with some common code base is a shared
>> objective, and should be a priority.
>
> ASSUMING that there actually isn't any conflict of design philosophy
> here, the only thing that stands in the way of merging the codebase is
> the parser design.  If you and Nico are willing to adjust the design of
> the parser to abstract it more, then it becomes useful to merge in my
> two parsers as well.  (I am, of course, quite willing to help on coding
> that as well...)
>
> The ESearch/ESummary modules, on the other hand, are a completely "new"
> area and don't touch any of the other GenePHP sections, so we can
> import THAT as is if you'd like, regardless of what happens to the
> parser design.
>
> If we agree on the design philosophy, then I propose we go ahead and
> merge the codebases as soon as the parser can allow it.  I propose to
> use "BioPHP" as an umbrella term to refer to the entirety of
> "biological data handling with PHP", and BioGenePHP (or just GenePHP
> for short) for the specific portion that deals with
> sequence-related/detailed genetic information (which I see as being the
> most important part, though not the ONLY part).  Later on, as we add
> "BioGISPHP" (for environmental and population data and such),
> "BioChromaPHP" (for dealing with LC Chromatograms), "BioMSPHP" (for
> dealing with mass spectra), "BioImagePHP" (for dealing with e.g. images
> of "gene chips" and gels) etc. etc. etc. they'll all be "BioPHP"...
>
> As far as the websites, etc., we keep both - The GenePHP site stays as
> it is, keeping charge of the details of the (Bio)GenePHP project, while
> I modify my "bioPHP" site to reflect that it is an "umbrella" or
> "portal" site that encompasses references to all of the related BioPHP
> projects (and points to GenePHP as the current focus of development).
> We can keep the CVS repository there as well as the mailing list (i.e.
> "biophp-dev" is for all "biological data in PHP" issues).  If traffic
> grows large and varied enough, we can set up additional "specific"
> mailing lists for the individual parts of the overall project (which
> can be EITHER off of bioinformatics.org OR off of whatever site is
> hosting the "specific part of the project" page, i.e. Sourceforge for
> GenePHP), though unless the concept becomes even more wildly popular
> than I expect, I don't think the mailing list traffic will get THAT
> heavy...
>
> Thoughts?