[Biophp-dev] Parser object - streams and files

S Clark biophp-dev@bioinformatics.org
Wed, 30 Apr 2003 12:23:42 -0600


On Wednesday 30 April 2003 11:42 am, Nico Stuurman wrote:
> If I understand it correctly, you now want the Parse class to
> remember the sequence entries that have been read by the parser
> functions (instead of just remembering pointers to positions in the
> input).  When dealing with strings (data in memory) this has the
> (slight) disadvantage that you are doubling the memory usage.  When
> dealing with streams, it has the advantage that you can easily move
> back.  However, doesn't fseek() let you put the file pointer backwards
> (don't know how php handles this, does it re-stream the stream or is it
> intelligent and keeps a cache somewhere that you can dip in?).

fseek() does let you move around in a file, but as far as I have
been able to find out, the only way to approximate "reading backwards"
in a "stream" would be to close the stream and re-open it from the start - 
which works for web pages and ftp, but won't work at all for, e.g., stdin.
As far as I know there's no internal buffer that you can seek back into.
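
For illustration, a rough sketch (not BioPHP code; php://stdin is just
an example of a non-seekable source):

    <?php
    // fseek() returns -1 when the underlying stream can't seek,
    // e.g. a pipe or stdin.
    $fp = fopen('php://stdin', 'r');
    $line = fgets($fp, 4096);    // read forward one line
    if (fseek($fp, 0) == -1) {
        // No rewinding.  For http:// or ftp:// you could close and
        // re-open from the start; for stdin the data is simply gone.
        fclose($fp);
    }
    ?>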

In effect, the "stack" of results at the upper level of the parser
serves as the buffer that streams don't have :-)

That does still leave the system using roughly up to 2x the memory for
the data at its peak, though this can be mitigated somewhat by having the
filetype parser unset() its copy of the data once it has parsed the last
record and passed the results up to the Parser object.
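
Roughly what I have in mind (property and method names are hypothetical,
just to show the shape of it):

    <?php
    class Parser {
        var $results = array();   // parsed records: the "buffer"
                                  // that streams don't give us

        function add_record($record) {
            $this->results[] = $record;
        }
    }

    // ...and in a filetype parser, once the last record has been
    // parsed and handed up, free the raw copy of the data:
    //     unset($this->data);
    ?>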

> How are the parser functions going to deal with streams versus arrays?
> It would be nice if that could be abstracted for them at a higher
> level, so that not every parser has to have code for streams and
> arrays (yesterday night's idea to abstract the array functions 'next',
> 'current', 'each', and 'previous', so that they can be used for either
> arrays or streams was getting at that point).

What I'm planning is that the individual "file"- (or stream-) type
parsers will each stand alone, so in the end there will be two similar
but distinct parser classes for the handful of datatypes that are
commonly retrieved as either a stream or a file (e.g. fasta and
fasta_stream, genbank and genbank_stream, etc.).
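
So, naming-wise, something along these lines (class names purely
illustrative):

    <?php
    // In-memory version: takes the whole dataset as an array of lines
    // and can be walked in either direction.
    class parse_fasta { /* ... */ }

    // Stream version: reads forward-only from a filehandle resource,
    // one record at a time.
    class parse_fasta_stream { /* ... */ }
    ?>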

Since there's no realistic way to auto-detect a stream's filetype
(short of reading it entirely into memory), the user will HAVE to
specify a stream-parser if they want one.  Fortunately, this won't
affect MOST people, since the filetype parser itself only ever needs
to move forward through the records, never backwards.  And because PHP
deals with all "filehandle" resources the same way (even when they're
really streams), the memory-based versions of the parsers will still
work, auto-detection included, even when given a stream - as long as
the user doesn't try to feed the parser one of the aforementioned 2GB
files or streams.
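
Slurping a handle into memory is the same few lines no matter where the
handle came from, which is why auto-detection keeps working (sketch):

    <?php
    // Works identically whether $fp came from a local file, an
    // http:// or ftp:// URL, or a pipe; PHP hands back the same
    // kind of resource either way.
    $data = array();
    while (!feof($fp)) {
        $data[] = fgets($fp, 4096);
    }
    // ...now sniff the filetype from $data and pick a parser...
    ?>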

In practice, what this means is that if the user KNOWS he's about to
read a really, really big set of data, THEN he'll specify "new
Parser('fasta_stream')" (or whatever).  Otherwise, regardless of whether
the actual data is on a web page, being piped from stdin, or in a normal
file, it'll get read entirely into memory and auto-detected.
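
That is (assuming a no-argument constructor is what triggers
auto-detection - that part isn't settled yet):

    <?php
    // User knows the data is huge: ask for the stream parser outright.
    $parser = new Parser('fasta_stream');

    // Everything else: read fully into memory, filetype auto-detected
    // (assumed no-arg constructor).
    $parser = new Parser();
    ?>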

The end-user will see no change in what happens when they tell the
parser to go back and forth and fetch records, but instead of
next/previous re-parsing the original data on every move, they will
walk back and forth through the "pre-digested" results stored in the
upper-level Parser object, parsing only when they reach the end of the
current results and new records have to be added.  (As a minor added
bonus, this will later let us, if we want, have the same parser object
change sources and types in mid-run, so that it can collect results
from multiple sources with potentially different filetypes and stack
them together.)
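
A sketch of what next()/previous() would do at the Parser level (method
names hypothetical, with $this->source standing in for whichever
filetype parser is currently feeding it):

    <?php
    class Parser {
        var $results = array();   // the pre-digested records
        var $pos = -1;            // index of the current record
        var $source;              // the current filetype parser

        // Serve the next record from the stack; only call down to the
        // filetype parser when we run off the end of what's been read.
        function next() {
            $this->pos++;
            if ($this->pos >= count($this->results)) {
                $this->results[] = $this->source->parse_next();
            }
            return $this->results[$this->pos];
        }

        // Moving backwards never touches the source at all.
        function previous() {
            if ($this->pos <= 0) { return false; }
            $this->pos--;
            return $this->results[$this->pos];
        }
    }
    ?>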