[Biophp-dev] Abstracting the parser backend

S Clark biophp-dev@bioinformatics.org
Tue, 29 Apr 2003 22:29:03 -0600

On Tuesday 29 April 2003 08:59 pm, nicos@itsa.ucsf.edu wrote:

> No problem. I simply want it to work, to be as good as possible, and be
> available ASAP.

Thanks - I'm working on it now, starting with the fasta parser.

> OK.  So they will all have their own move_Next(), move_Previous, eof(),
> bof(), fetch() and probably also move_First(), move_Last(), move_To()
> functions?.  I guess the parser class constructors should take either a
> filename or a string as an argument.  Preferably, there should be a way
> to maintain and use an index in a file (as in Serge's seqdb class).
> Hmmm, this is all straight forward to do in memory (like it is now), but
> probably more difficult with a stream (what streams other than files are
> there in php?  php treats URLs almost exactly like a file, so...)  How
> important is it to deal with datastructures larger than the available
> memory?

I wouldn't have even thought of it before, except for the fact that I
once really DID parse through one of NCBI's multi-GB GenBank files
in order to create a file of ones matching particular criteria (it was 
also nice to be able to see the results as the file came down, rather
than having to wait for several hours while the file was downloaded first). 
That sort of thing, plus the occasional set of results from some
form of website query (or a system installed on an older machine using
GenePHP to redirect queries for multiple people at a time), is really
the only time I can think of when you NEED to differentiate between
stream-parsing and memory-based parsing.

The fact that I can treat an ftp or http URL or socket like a filehandle
is one of the features I abuse the heck out of in PHP...the only relevant
difference here is that there's no way to feasibly read back to the
previous record (without reading the whole stream into memory first), 
so the class at the "edge" of the system interacting with the stream
just needs to know to pass back "false" if asked to go back.
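A minimal sketch of that "edge" behavior (the class and method names here are illustrative, not actual GenePHP code, and an in-memory stream stands in for a remote file so the example is self-contained):

```php
<?php
// Hypothetical forward-only reader sitting at the "edge" of the system.
// The same code works whether $fp came from fopen("http://...", "r"),
// fopen("ftp://...", "r"), a socket, or a plain local file.
class StreamEdgeReader {
    var $fp;
    function __construct($fp) { $this->fp = $fp; }
    // Read the next line of the stream; false at end of input.
    function move_Next() {
        $line = fgets($this->fp);
        return ($line === false) ? false : rtrim($line, "\r\n");
    }
    // A stream can't feasibly be read backwards without buffering it
    // all in memory first, so going back just reports failure.
    function move_Previous() { return false; }
}

// Stand-in for a remote file: an in-memory stream with FASTA-ish content.
$fp = fopen('php://temp', 'r+');
fwrite($fp, ">seq1\nAGCT\n");
rewind($fp);
$reader = new StreamEdgeReader($fp);
echo $reader->move_Next(), "\n";       // >seq1
var_export($reader->move_Previous());  // false
```

Swapping the `php://temp` handle for an `http://` one changes nothing in the class itself, which is the whole point of PHP's stream wrappers.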

> B.t.w., since we now have this large list of methods that every class
> should have[...]
(I THINK I cover this further down...)

> > 2.add instantiation of the (filetype)_parser class
> Will this mean that I instantiate a class Parser, and class Parser finds
> the approriate (filetype)_parser class for me?  If so, it would be cool
> to keep the current include scheme, where only the required parser is
> actually included in the running script.

Yup - that is one of the things in particular that I really like about the
way you've got the Parser interface designed.  The end-user shouldn't
have to concern him/her/itself with the different classes way out on
the "edge" of the system.  I'm trying to change as little as possible
about the way you've got the Parser object designed to work - just
moving the parsing functions down to more abstracted classes.

The way that section of code reads (not yet tested) at the moment in
my version is:

	if ($this->seqfiletype) {
		// only include the parser we will need;
		// a variable class name instantiates the right one without eval():
		$parserClass = "parse_" . $this->seqfiletype;
		$this->parserObj = new $parserClass($this->flines);
		// to get the indexing right we'll have to fetch one first
		$this->func = $this->parserObj->fetch();
	} else {
		// if we don't know the seqfiletype, destruct the object:

> so fetch(),will simple call (filetype)_parser->fetch()?
> so fetch() will get a datastructure from the parser, feed this to object
> seq_factory,and get a Seq object back, which it sends back to the calling
> party?

Exactly so - on the off chance that anyone's already written a bunch of code
with the existing parser in the last few days, their code should CONTINUE to
work exactly the same - just "inside" there are a couple of additional
objects running.

The datastructure I'm using is just an array of key=>value pairs, with
the keys being the names of the variables in the seq object (they seem to
be named such that it's pretty obvious what they all are, so if someone
wanted to use the data outside of GenePHP it'd be easy for them to figure
out what it meant).

(Example: "id" => "somesequence", "sequence" => "AGCT", "seqlength" => 4, etc.)

Parse::fetch() calls the $parserObj->fetch() method, and passes the result to
$seqfactory->createSeq($fetchedArray), returning the resulting sequence.
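A sketch of that delegation (the factory internals and the stub backend here are assumptions for illustration, not the real GenePHP classes):

```php
<?php
// Sketch: Parse::fetch() gets a key=>value array from the backend parser,
// then hands it to the factory, which builds the seq object.
class SeqFactory {
    // Turn the parser's data array into an object whose members are
    // named by the array keys ("id", "sequence", "seqlength", ...).
    function createSeq($data) {
        $seq = new stdClass();
        foreach ($data as $key => $value) {
            $seq->$key = $value;
        }
        return $seq;
    }
}

// Stand-in for a real backend like parse_fasta.
class FastaParserStub {
    function fetch() {
        return array('id' => 'somesequence', 'sequence' => 'AGCT', 'seqlength' => 4);
    }
}

class Parse {
    var $parserObj;
    var $seqfactory;
    function __construct($parserObj) {
        $this->parserObj  = $parserObj;
        $this->seqfactory = new SeqFactory();
    }
    // Delegate to the backend parser, then to the factory.
    function fetch() {
        return $this->seqfactory->createSeq($this->parserObj->fetch());
    }
}

$parse = new Parse(new FastaParserStub());
$seq = $parse->fetch();
echo $seq->id, " ", $seq->seqlength, "\n";  // somesequence 4
```

Since the calling party only ever sees the seq object coming back out of fetch(), the extra objects in the middle stay invisible.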

(If desirable later, it'd be trivial to add an argument to Parse->fetch()
to make it return the data array directly rather than a seq object.  I don't
know how much demand there would be for that, though.)

Would it be helpful to add a "fetchNext()" wrapper function to the Parse
class (that would "fetch()" and then "moveNext()")?
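Such a wrapper might look like this (a sketch using a toy in-memory backend; the real Parse class would of course delegate to the file parsers):

```php
<?php
// Sketch of a fetchNext() convenience wrapper: fetch the current record,
// then advance the cursor, so repeated calls walk the whole input.
class ParseSketch {
    var $records;
    var $pos = 0;
    function __construct($records) { $this->records = $records; }
    // Current record, or false once past the end.
    function fetch() {
        return isset($this->records[$this->pos]) ? $this->records[$this->pos] : false;
    }
    // Advance; true while another record remains.
    function moveNext() {
        $this->pos++;
        return isset($this->records[$this->pos]);
    }
    // The proposed wrapper: fetch(), then moveNext().
    function fetchNext() {
        $rec = $this->fetch();
        $this->moveNext();
        return $rec;
    }
}

$p = new ParseSketch(array('seqA', 'seqB'));
while (($rec = $p->fetchNext()) !== false) {
    echo $rec, "\n";  // seqA, then seqB
}
```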

> Sounds OK to me.  I would still think about keeping the user functions
> (move_Next(), etc..) in the Parse class (provided it is possible, I don't
> think interleaved data are a big problem with this scheme).  That will
> keep the individual parsers more simple.

I agree, that'll keep it easy to use.  Those calls in the parse class
become "wrappers" for the equivalent calls down in the individual
file parsers, so they'll continue to operate the way you wrote them, 
at least presuming I do it properly.

> The seq_factory is fine with me (it is a good idea to keep the
> translation from what is in the file to our abstraction of the real world
> in one place).
> The only issue is how to deal with stuff both in memory (small files,
> strings, these are currently kept in an array of strings and an index of
> the line-numbers with sequence entries is maintained) and in streams (big
> files, we read from a file pointer).  Is there an easy way to deal with
> both?

Fortunately, two things SHOULD be true about parsing streams:
1) people should not have to do it too often
2) when they do, they will conceivably know in advance what
kind of stream it is, and so will be able to specify the
filetype (auto-detection should really be the only BIG
difference in use between streams and files/strings).

In order to keep the upper-level Parse object unified, I'd say simply
have the stream parsers return "false" for calls to "move_Previous()".
In most cases, I suspect that when reading a stream the user will just want
to extract some or all of the sequences as they come in, so this should
be no issue.
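A sketch of how the upper-level Parse object could stay oblivious to which kind of backend it holds (both backend classes are hypothetical stand-ins):

```php
<?php
// Sketch: the unified Parse wrapper forwards move_Previous() blindly;
// an in-memory backend can honor it, a stream backend just answers false.
class MemoryBackend {
    var $pos = 1;
    function move_Previous() {
        if ($this->pos > 0) { $this->pos--; return true; }
        return false;
    }
}
class StreamBackend {
    // Can't rewind a stream without buffering everything, so always false.
    function move_Previous() { return false; }
}
class ParseWrapper {
    var $backend;
    function __construct($backend) { $this->backend = $backend; }
    // Same call regardless of backend type.
    function move_Previous() { return $this->backend->move_Previous(); }
}

$mem    = new ParseWrapper(new MemoryBackend());
$stream = new ParseWrapper(new StreamBackend());
var_export($mem->move_Previous());     // true
echo "\n";
var_export($stream->move_Previous());  // false
```

The caller's code is identical either way; a "false" from a stream just means "can't go back", the same as hitting the beginning of an in-memory file.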

The other alternative that comes to mind is having a separate parse_stream 
object, but that kind of defeats some of the usefulness of the unified Parse
design (and still doesn't change the fact that you can't easily "rewind"...)

I probably won't have it finished tonight, but sometime tomorrow I should
have a working example to put up somewhere for evaluation before it goes
into the official tree.