[Biophp-dev] Genephp and Biophp (parser issues, and "roadmap?")

Mon, 28 Apr 2003 22:15:31 -0600

On Monday 28 April 2003 07:02 pm, Nico Stuurman wrote:
>  However, I don't
> completely understand the idea of modules that return arrays and
> strings.  I am talking from the point of view of an application
> programmer (and I do think that both you and Serge target application
> programmers).  For parsers, for instance, I would want them to return
> the same data structure independent of the format I put into them.
> Also, the parsers should have the same API independent of the data they
> are parsing.  Otherwise, it will become a pretty confusing bunch of
> scripts.  To me, it seems quite logical to use a class as the output of
> the parsers, but the only thing that is important is that every parser
> returns the same datastructure and that they all have the same
> behavior.  The parse class that is now in the biophp cvs (under
> genephp) does just that (and you can easily use that class - with the
> seq class - without the rest of the genephp classes).

What I meant by "standard arrays and strings" is that, instead of
depending on an additional class just to parse a file (the seq class, for
example), we design the components to export to and import from each
other.  As an example, there are really only two pieces of information for
each sequence in a FASTA file - a "label" and the sequence itself.  If,
instead of automatically forcing the data directly into a seq object, the
parser returned an array ("label"=>"Some_Sequence","sequence"=>"AAGGCCTT" -
or whatever "standard" keys we decide on), and the seq object added a
method to IMPORT its data from this array format, there wouldn't need to
be much change, but a parser could STILL stand all on its own, rather than
also needing to be designed to know about the Parser object and the Seq
object.
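
To make the array idea concrete, here is a minimal sketch in PHP.  The
names parse_fasta_string(), SimpleSeq and importArray() are made up for
illustration and are NOT the existing GenePHP API; the "label" and
"sequence" keys are just the placeholder keys from the example above.

```php
<?php
// Hypothetical standalone parser: it knows nothing about any seq class and
// simply returns a list of ['label' => ..., 'sequence' => ...] arrays.
function parse_fasta_string(string $text): array
{
    $records = [];
    $current = null;
    foreach (preg_split('/\r\n|\r|\n/', $text) as $line) {
        $line = trim($line);
        if ($line === '') {
            continue;
        }
        if ($line[0] === '>') {
            // A new header line closes the previous record, if any.
            if ($current !== null) {
                $records[] = $current;
            }
            $current = ['label' => (string) substr($line, 1), 'sequence' => ''];
        } elseif ($current !== null) {
            $current['sequence'] .= $line;
        }
    }
    if ($current !== null) {
        $records[] = $current;
    }
    return $records;
}

// Stand-in for a seq class (assumption: the real GenePHP class differs).
// The only coupling to the parser is the agreed-upon key names.
class SimpleSeq
{
    public $label = '';
    public $sequence = '';

    public function importArray(array $record): void
    {
        $this->label = $record['label'];
        $this->sequence = strtoupper($record['sequence']);
    }
}

$records = parse_fasta_string(">Some_Sequence\nAAGG\nCCTT\n");
$seq = new SimpleSeq();
$seq->importArray($records[0]);
```

The parser file could be copied out and used alone, and anyone who prefers
their own class only has to read the same two keys.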

(If Joe Schmoe of Blurgle, Incorporated popped onto the list one day
and said "Hey, I wrote a PHP class for parsing our proprietary BioBlurgle
package's data files, and the boss said I could give it to you", it'd then
likely take only a slight adjustment to the existing code to make it
"GenePHP compatible".  Rather than having to say "That's nice, please
re-write it to fit into the Parser object as a single function and export
Seq objects" or "Thanks, when I get done with what I'm working on I'll
re-write it to comply with the Parser object as a single function and make
it return Seq objects", we could "drop it in" with almost no additional
effort.)

Conversely, someone who just wants to slap together a "combine several
clustal alignment files into one file" script (no, don't ask me why, it's
just an example :-) ) need only grab the "Parser_Clustal_class.php" file
and go, rather than either figuring out which sets of files they need (3?
The seq class, the parser class, and the clustal parser function file?) or
grabbing the entire GenePHP/BioPHP tree and installing it to make sure they
get them all....

Bear in mind also that my impression comes mostly from the current
design of the GenePHP parser system, which is CURRENTLY very co-dependent
on three different sets of "custom" files (seq class, parser class, and the 
sets of single-function parsers) - I looked at these for comparison
with what I was doing, since parsing was one of the two areas where I had
written something that GenePHP also covered.  If this sort of
interdependency isn't as prevalent in the rest of the current design
intentions, then it's not such an issue.

In the specific area of the parser: at present all of the actual parsers
need to be single functions that get "pulled" into the GenePHP parser
class (and they still depend on the seq class).  Perhaps instead the
individual parsers might be classes of their own, and the GenePHP parser
class calls them and imports data from them.  IT would generate seq
objects rather than the individual parser modules, while on the other
hand the file/stream/data I/O would be handled by the parser module
instead of the parser "uber-class".  That would better "modularize" the
parsers (AND make it easier to support both memory-based and stream-based
parsers through the same GenePHP interface) without losing either the
"integration" with GenePHP or the ability to separate a parser back out
to use it alone (the parser needs only to know what to "name" the data in
the array that it passes back).
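
A rough sketch of that arrangement, under stated assumptions: TabSeqParser,
MiniSeq and ParserHub are hypothetical names, and the tab-separated input is
a toy format chosen only to keep the example short.  The module returns
plain arrays; only the hub creates seq objects.

```php
<?php
// Toy stand-alone module (hypothetical format: "label<TAB>sequence" lines).
// It has no knowledge of the hub or of any seq class.
class TabSeqParser
{
    public function parse(string $text): array
    {
        $records = [];
        foreach (explode("\n", trim($text)) as $line) {
            $parts = explode("\t", $line, 2);
            if (count($parts) === 2) {
                $records[] = ['label' => $parts[0], 'sequence' => $parts[1]];
            }
        }
        return $records;
    }
}

// Minimal stand-in for a seq class (not the real GenePHP one).
class MiniSeq
{
    public $label;
    public $sequence;

    public function __construct(string $label, string $sequence)
    {
        $this->label = $label;
        $this->sequence = $sequence;
    }
}

// Hypothetical "uber-class": picks a module by format name and is the only
// place where seq objects get created.
class ParserHub
{
    private $modules = [];

    public function register(string $format, object $module): void
    {
        if (!method_exists($module, 'parse')) {
            throw new InvalidArgumentException("Module for $format lacks parse()");
        }
        $this->modules[$format] = $module;
    }

    public function parse(string $format, string $text): array
    {
        if (!isset($this->modules[$format])) {
            throw new InvalidArgumentException("No parser registered for $format");
        }
        $seqs = [];
        foreach ($this->modules[$format]->parse($text) as $record) {
            $seqs[] = new MiniSeq($record['label'], $record['sequence']);
        }
        return $seqs;
    }
}

$hub = new ParserHub();
$hub->register('tab', new TabSeqParser());
$seqs = $hub->parse('tab', "seqA\tAAGG\nseqB\tCCTT\n");
```

Because the hub only requires a parse() method and the agreed key names, a
contributed parser would need little more than a register() call.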

> I don't think so, your esearch modules should be immediately useful
> (now cvs works, I'll have a look at it soon).

Other than the obvious factor that I need to move the individual XML
parsing extended classes out to a separate file (one of the requirements
mentioned in the auto-documenter you linked to earlier...just a simple
cut-paste-save-and replace with a "require_once()" call will take
care of that) let me know what you think of the design (and, for that
matter, the readability of the code and such).  Next will probably
be either the URLAPI interface to the online BLAST databases or beginning
work on the variety of EFetch interfaces (which look like they'll all 
need to be fairly different between each type of database).

> We will all sometimes write code that
> becomes superfluous, but the fun thing here is that we can learn from
> each other and help each other and make something that is much better
> than when we were doing it alone.  Sometimes it can hurt a little, but
> if we all try to work together it will be much more rewarding in the long
> run.

Okay, okay, I should say right now that I *hope* I'm not coming across
in a "fine, you big meanies, I'll just take my ball and go home, so there!"
tone.  If so, that's NOT my intention by any means  (If I am, I give you and
everyone else on the list my explicit permission to make snide comments at me
until I stop :-) )

And if I just sound "cranky", well, it's been a very stressful month, but
I reiterate my permission in this case to make snide comments at me until
I cheer up...

> I think that independent programmers will be very thankful for simple,
> well designed data-structures that they can use without fuss in their
> programs.  I completely agree with you that we should work towards
> small self-contained modules that can also be used as much as possible
> without needing everything and the kitchen sink.  However, if you don't
> structure the data that are coming in, then why would an application
> programmer even bother to use your code?
> As an example, you can now throw 4 different types of sequence data
> file formats at the parse class (as a file or as a string), and it will
> return the same kind of data structure for each of them.  The only
> thing I have to do in my phplabware project is to write the class  to
> the structure I have in my SQL database.

I comment above on what I mean in regards to the parsers (the individual
parser "modules" being required to fit in as a single function and being
co-dependent on the parser uber-class and the seq class for their
existence)...

I don't mean to use "unstructured" data, merely data that is passed in a
form that CAN (not MUST, just CAN) be used easily on its own.  If someone
had an irrational hatred of the GenePHP seq class (again, I have no idea
WHY, it's just an example), they're out of luck as currently designed if
they want to use the parsers.  Whereas if the parsers pass the data
back as an "ordinary" array (with standardized key names), the parser
could be used directly without having to re-write it to 'divorce' it
from the GenePHP class.  They could just "drop in" their own class
to structure the data being sent by the parsers however they want...

> It will be easiest if we all decide on the underlying datastructures we
> are going to use (and Serge is doing a great job there, even though I
> don't agree with everything he suggests).  Once the datastructures are
> in place (and they are good), programming will be easy (actually, the
> better the datastructures, the easier the programming).

I am in complete agreement with all of that, for the record - really, I
think the only thing under discussion here is exactly where the balance
between "portable" and "integrated" should sit to reach the "peak" of
"better data structures" :-)

> B.t.w I just read the term 'standard arrays' in your previous
> paragraph.  Aren't 'standard arrays' and Genephp objects the same?  I
> mean the classes that Serge proposes are open to discussion (I hope)
> and classes are nothing more than arrays to which you can add
> functions.  So, aren't we talking about the same thing?

Yes and no - I just mean that (to put it another way) modules on the "edge"
of the GenePHP/BioPHP System (where it touches the metaphorical "rest of
the world", e.g. external databases and file formats) ought not to be
tightly tied to and dependent on the structures at the "core" of the system 
(where everything is converted to e.g. seq objects)...instead of being
required to CREATE (and therefore know about) seq objects, I'm advocating
that the parsers instead PASS DATA TO seq objects (or to an object that
in turn churns out seq objects - is that the proper meaning of "factory" 
in OO programming terms?)  In purely SELFISH terms (only in open-source
could someone describe demanding to be able to give something away as
"selfish" :-) ), if the parser system accepted "normal data" (i.e. a simple
array structure with agreed-upon key terms rather than a custom class) from
the object reading data from the outside world, it would take only a trivial
amount of adjustment for me to add MY fasta parser (as "fasta_stream"
perhaps, to differentiate from the memory-based parser) and clustal .aln
parsers to the framework...and the same will be true of anyone who has
written parsers elsewhere outside of GenePHP/BioPHP who comes along
and wants to contribute.  As it is NOW, both of those parsers get scrapped -
there's no way to fit them in the existing framework, and they'd need to be
completely re-written (well, the clustal alignment parser would - I'm not
sure it'd even be possible to do a stream-based parser in the existing
setup - there's no good way to "rewind" back to a previous entry unless you
load the whole thing into memory first, which isn't necessarily desirable
[e.g. when grep'ing through NCBI's GenBank data to separate out a particular
type of sequence - who wants to read 2GB+ of data into RAM first?])
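
For what it's worth, the stream-based approach can be sketched as below.
This is an illustration only, not the parser I actually wrote: it uses PHP
generators (PHP 5.5 or later), the function name fasta_stream() is borrowed
from the suggestion above, and the record keys are the same placeholder
"label"/"sequence" pair.  Only the current entry is held in memory.

```php
<?php
// Hypothetical stream-based FASTA reader: yields one record at a time from
// an open file handle instead of loading the whole file first.
function fasta_stream($handle): Generator
{
    $label = null;
    $sequence = '';
    while (($line = fgets($handle)) !== false) {
        $line = rtrim($line, "\r\n");
        if ($line === '') {
            continue;
        }
        if ($line[0] === '>') {
            // A new header means the previous record is complete.
            if ($label !== null) {
                yield ['label' => $label, 'sequence' => $sequence];
            }
            $label = (string) substr($line, 1);
            $sequence = '';
        } elseif ($label !== null) {
            $sequence .= trim($line);
        }
    }
    if ($label !== null) {
        yield ['label' => $label, 'sequence' => $sequence];
    }
}

// Demo input in an in-memory stream; a real script would fopen() a file.
$handle = fopen('php://memory', 'w+');
fwrite($handle, ">human_1\nAAGG\nCCTT\n>mouse_1\nGGGG\n>human_2\nTTTT\n");
rewind($handle);

// "grep" for one type of sequence: keep labels starting with "human".
$kept = [];
foreach (fasta_stream($handle) as $record) {
    if (strncmp($record['label'], 'human', 5) === 0) {
        $kept[] = $record['label'];
    }
}
fclose($handle);
```

The same loop would work on a multi-gigabyte file, since nothing ever needs
to "rewind" to an earlier entry.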

> Again, I hope we can do this all together

THAT was never in question - even at the extreme of having two "different" 
projects (which I don't see as being necessary) we'd still be one group
with one goal...just two different (but cross-pollinating) approaches,
somewhat like Xine and MPlayer.  Well, okay, more cohesive than THAT, but
you get the idea...

I was getting the impression from the design of the parsers that the goal
with GenePHP was extremely tight integration (to the extent that it
conflicted with my goal of extremely modular design).  It SOUNDS like my
impression is incorrect (similar to how the impression I seem to be giving of
MY goals appears to be more extreme than it really is), so this is not really
an issue, and we're merely discussing design philosophy at this point, and as
always, the end product will be somewhere in the middle ("between 'live free
or die' and 'famous potatoes'", as George Carlin once put it, describing state
license plate mottos...this always sticks in my head, since I'm now LIVING in
the "famous potatoes" one :-) [it's a lot cheaper and less crowded
than the "Sunshine State" where I lived before, though at least I could get
a decent Biotech education there...])

SO...

maybe the NEXT question is, aside from the parser design discussion, what
should we consider to be the level and type of functionality that needs to
be working for GenePHP/BioPHP's first formal Alpha release? :-)

(Discussion of where "BioGISPHP", "ChromatoPHP", and "MassSpecPHP" fit into
all this can wait until much later...)