On Wednesday 30 April 2003 11:42 am, Nico Stuurman wrote:

> If I understand it correctly, you now want the Parse class to
> remember the sequence entries that have been read by the parser
> functions (instead of just remembering pointers to positions in the
> input). When dealing with strings (data in memory) this has the
> (slight) disadvantage that you are doubling the memory usage. When
> dealing with streams, it has the advantage that you can easily move
> back. However, doesn't fseek() let you put the file pointer backwards
> (don't know how php handles this, does it re-stream the stream or is
> it intelligent and keeps a cache somewhere that you can dip in?).

fseek() does let you move around in a file, but as far as I have been able to find out, the only way to approximate "reading backwards" in a stream is to close the stream and re-open it from the start - which works for web pages and ftp, but won't work at all for, e.g., stdin. As far as I know there's no buffering that you can go back into. In effect, the "stack" of results at the upper level of the parser serves as the buffer that streams don't have :-)

That does still leave the system using up to roughly 2x the memory for the data at its peak, though this can be mitigated somewhat by having the filetype parser unset() its data once it has parsed the last record and passed the results up to the Parser object.

> How are the parser functions going to deal with streams versus arrays?
> It would be nice if that could be abstracted for them at a higher
> level, so that not every parser has to have code for streams and
> arrays (yesterday night's idea to abstract the array functions 'next',
> 'current', 'each', and 'previous', so that they can be used for either
> arrays or streams, was getting at that point).
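To make the "stack of results as a buffer" idea concrete, here is a minimal sketch (in Python rather than PHP, purely for illustration; the class and method names are hypothetical, not from the actual codebase): the underlying stream is only ever read forward, and next()/previous() walk a cache of already-parsed records instead of trying to rewind the source.

```python
# Sketch of the record-caching idea discussed above. A forward-only
# stream can't be rewound, so the top-level parser keeps a "stack" of
# already-parsed records and navigates that cache; the stream itself
# is only ever read forward. All names here are hypothetical.

class CachingParser:
    def __init__(self, record_stream):
        self._stream = record_stream   # any forward-only iterator of records
        self._cache = []               # parsed records kept in memory
        self._pos = -1                 # index of the "current" record

    def next(self):
        """Advance; parse a new record only when past the cache's end."""
        if self._pos + 1 < len(self._cache):
            self._pos += 1
            return self._cache[self._pos]
        try:
            record = next(self._stream)  # read forward, never backward
        except StopIteration:
            return None                  # end of the stream
        self._cache.append(record)
        self._pos += 1
        return record

    def previous(self):
        """Step back through the cache -- no rewinding of the stream."""
        if self._pos <= 0:
            return None
        self._pos -= 1
        return self._cache[self._pos]


# A generator stands in for a filetype parser reading, say, stdin:
p = CachingParser(iter(["rec1", "rec2", "rec3"]))
print(p.next())      # rec1  (parsed from the stream)
print(p.next())      # rec2  (parsed from the stream)
print(p.previous())  # rec1  (served from the cache)
print(p.next())      # rec2  (served from the cache again)
```

The memory cost is exactly the trade-off described above: the cache holds every record seen so far, which is what lets previous() work on stdin at all.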
What I'm planning is that the individual "file"- (or stream-) type parsers will all be separate, so in the end there will be two similar but distinct parser classes for the handful of datatypes that are commonly retrieved as either a stream or a file (e.g. fasta and fasta_stream, genbank and genbank_stream, etc.). Since there's no realistic way to auto-detect streams (short of reading them entirely into memory), the user will HAVE to specify a stream parser if they want one. Fortunately, this won't affect MOST people, since the filetype parser itself only ever needs to move forward through the records, never backwards. As long as the stream isn't excessively large, and since PHP treats all "filehandle" resources the same way (even if they're really streams), the memory-based version of a parser will still work with auto-detection even when given a stream - that is, as long as the user doesn't try to feed it one of the aforementioned 2gb files or streams.

In practice, this means that if the user KNOWS they're about to read a really, really big set of data, THEN they'll specify "new Parser('fasta_stream')" (or whatever); otherwise, regardless of whether the actual data is on a web page, being piped from stdin, or in a normal file, it'll get read entirely into memory and auto-detected.

The end user will see no change in what happens when they tell the parser to go back and forth and fetch records, but instead of next/previous re-parsing the original data each time, they will move back and forth through the "pre-digested" results stored in the upper-level Parser object, parsing only when the cursor reaches the end of the current results and new records are being appended. (As a minor added bonus, this will later allow us, if we want, to let the same parser object change sources and types in mid-run, so that it can collect results from multiple sources with potentially different filetypes and stack them together.)
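The selection logic described above - an explicit "*_stream" type picks the forward-only stream parser, while everything else is slurped into memory and auto-detected - can be sketched roughly like this (again in Python for illustration only; Parser, detect_format, and the 'fasta'/'fasta_stream' names mirror the email's examples but the code is a hypothetical sketch, not the real API):

```python
# Hypothetical sketch of choosing between stream and in-memory parsing.
# Anything not named in the email (detect_format, .open(), .streaming)
# is an assumption made up for this example.
import io

def detect_format(data):
    """Toy auto-detection: FASTA records start with '>'."""
    return "fasta" if data.lstrip().startswith(">") else "unknown"

class Parser:
    def __init__(self, filetype=None):
        self.filetype = filetype          # None means "auto-detect later"
        # Only an explicit '*_stream' type gets forward-only handling:
        self.streaming = bool(filetype and filetype.endswith("_stream"))

    def open(self, handle):
        if self.streaming:
            # Forward-only: hand the handle straight to the stream parser,
            # never reading the whole thing into memory.
            self.source = handle
        else:
            # Read everything into memory; auto-detect if not specified.
            data = handle.read()
            self.filetype = self.filetype or detect_format(data)
            self.source = data


p = Parser()                              # auto-detect, in-memory
p.open(io.StringIO(">seq1\nACGT\n"))
print(p.filetype)                         # fasta

q = Parser("fasta_stream")                # user KNOWS the data is huge
print(q.streaming)                        # True
```

The key design point the email makes is visible here: auto-detection requires having the whole input in hand, which is exactly why streams can only be handled when the user names the format up front.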