[Biophp-dev] Parser object - streams and files

Nico Stuurman biophp-dev@bioinformatics.org
Wed, 30 Apr 2003 10:42:47 -0700


> Okay, I think I've figured out how to handle it such that both
> streams and files behave exactly the same way from the end-user's
> perspective.

Cool!  I also like the rest of the approach you describe in your mail.

If I understand it correctly, you now want the Parser class to 
remember the sequence entries that have been read by the parser 
functions (instead of just remembering pointers to positions in the 
input).  When dealing with strings (data in memory) this has the 
(slight) disadvantage that you are doubling the memory usage.  When 
dealing with streams, it has the advantage that you can easily move 
back.  However, doesn't fseek() let you move the file pointer backwards? 
(I don't know how PHP handles this: does it re-stream the stream, or is 
it intelligent and keeps a cache somewhere that you can dip into?)
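For what it's worth, the answer seems to depend on the stream itself: plain files are seekable, but pipes and network streams generally are not, and fseek() just fails on them rather than replaying the data. A quick probe (helper name is made up):

```php
<?php
// Sketch: check whether a stream can be rewound with fseek().
// On non-seekable streams (pipes, sockets) fseek() returns -1
// instead of moving the pointer; on plain files it returns 0.
function can_rewind($fp) {
    return fseek($fp, 0, SEEK_SET) === 0;
}

$fp = fopen(__FILE__, 'r');   // a plain file on disk: seekable
var_dump(can_rewind($fp));
fclose($fp);
?>
```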

How are the parser functions going to deal with streams versus arrays?  
It would be nice if that could be abstracted for them at a higher 
level, so that not every parser has to have code for streams and 
arrays (last night's idea, to abstract the array functions 'next', 
'current', 'each', and 'previous' so that they can be used for either 
arrays or streams, was getting at that point).
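To make that concrete, here is one way such an abstraction could look (a sketch only; the class and method names are invented, not existing BioPHP code): a small wrapper that gives arrays and streams the same next()/current()/previous() face, reading stream lines lazily into an internal array.

```php
<?php
// Sketch (hypothetical names): one wrapper that lets a parser walk
// records the same way whether they come from an in-memory array
// or from a line-by-line stream.
class RecordSource {
    var $records = array();   // records seen so far
    var $pos     = -1;        // index of the current record
    var $fp      = null;      // stream handle, or null for arrays

    function __construct($input) {
        if (is_array($input)) {
            $this->records = $input;
        } else {
            $this->fp = $input;   // an already-opened stream
        }
    }

    // Return the next record, pulling from the stream on demand.
    function next() {
        if ($this->pos + 1 >= count($this->records)) {
            if ($this->fp === null || feof($this->fp)) {
                return false;               // nothing left
            }
            $line = fgets($this->fp);
            if ($line === false) {
                return false;               // hit EOF on this read
            }
            $this->records[] = rtrim($line);
        }
        return $this->records[++$this->pos];
    }

    function current() {
        return ($this->pos >= 0) ? $this->records[$this->pos] : false;
    }

    function previous() {
        return ($this->pos > 0) ? $this->records[--$this->pos] : false;
    }
}
?>
```

The filetype parsers would then only ever see a RecordSource, never a raw array or stream handle.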


Best,

Nico





> I was ORIGINALLY going to move all of the data storage and
> manipulation down to the individual filetype parsers, while
> the upper-level Parser object would be little more than a "wrapper".
>
> Instead, how about this: the Parser object would now acquire a
> "maximum history" attribute, and a "stack" to keep the returned
> data arrays on.  The maximum history attribute tells the upper-level
> parser how many records to keep track of before it starts discarding
> the earlier ones.  I move the "next/previous" functions back up
> to the upper-level parser where they were in the first place (hey, I'm
> learning), and now the filetype parser goes back to being much simpler
> and more portable, as it only needs to be able to open and parse the
> data and return the records one at a time when asked, rather than
> also tracking them as I was going to try to do.
>
> "fetch()" then returns the sequence object (and/or perhaps depending
> on a passed parameter, the data array itself) derived from the
> upper-level parser's current position in its stack.
>
> "moveNext()", then:
> 1) checks to see if it's on the last record of the stack
> 2) if it is, it calls the "gimme the next record" function of the
>    filetype parser and appends the resulting array to the stack
>    (unless it's at EOF).  If not, skip to step 4
> 3) if the size of the array is now larger than maximum history,
>    array_shift the first record into oblivion
> 4) advance the pointer to the next record
>
> There's still the limitation that you can't "movePrevious" any
> further than (maximum history) records, but I figure with a good
> default size it won't matter much, and you can then move both forward
> and backwards in the data even when parsing a stream [without fear of
> ending up with 2GB of data stored in memory].  (I was thinking 1000
> records as the default maximum size.)
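The quoted scheme might look roughly like this in PHP (a sketch under the assumptions in the mail; class and method names are invented, and the filetype parser is stood in for by anything with a nextRecord() method):

```php
<?php
// Sketch: upper-level Parser keeps a bounded stack of parsed records,
// pulling new ones from the filetype parser on demand and discarding
// the oldest once "maximum history" is exceeded.
class Parser {
    var $stack      = array();  // recently returned records
    var $pos        = -1;       // current position in the stack
    var $maxHistory = 1000;     // proposed default maximum history
    var $source     = null;     // the filetype parser

    function __construct($source, $maxHistory = 1000) {
        $this->source     = $source;
        $this->maxHistory = $maxHistory;
    }

    // Return the record at the current position in the stack.
    function fetch() {
        return ($this->pos >= 0) ? $this->stack[$this->pos] : false;
    }

    // Steps 1-4 from the mail above.
    function moveNext() {
        if ($this->pos + 1 >= count($this->stack)) {      // 1) on last record?
            $rec = $this->source->nextRecord();           // 2) ask filetype parser
            if ($rec === false) {
                return false;                             //    at EOF
            }
            $this->stack[] = $rec;
            if (count($this->stack) > $this->maxHistory) {
                array_shift($this->stack);                // 3) drop the oldest
                $this->pos--;   // everything shifted left by one
            }
        }
        $this->pos++;                                     // 4) advance the pointer
        return true;
    }

    // Limited by how much history was kept, as noted above.
    function movePrevious() {
        if ($this->pos > 0) {
            $this->pos--;
            return true;
        }
        return false;
    }
}
?>
```

With maxHistory = 3, for instance, reading a fourth record silently drops the first, so movePrevious() can only step back over the three records still on the stack.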
>
> What do you all think?
> _______________________________________________
> Biophp-dev mailing list
> Biophp-dev@bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/biophp-dev
>