[Biophp-dev] My brain hurts.
Nico Stuurman
biophp-dev@bioinformatics.org
Sat, 3 May 2003 19:33:20 -0700
>> So far, autodetection relied only on the first line. Even for a stream,
>> reading a single line is not a high price to pay. Simply rewrite the
>> autodetect code to read a single line instead of the whole file.
>
> My only worry there is that I think SOME formats might not be detectable
> that way (some XML records, or if someone ever adds a parser for HTML
> output from a site, or something of the sort), though the basic idea is
> still quite feasible - we could have it buffer a certain number of lines
> (just enough to ensure that it'll reach some identifiable characteristic
> in just about any format).
>
> It'll take a LITTLE bit of extra code then, in the event that someone
> wants to parse a stream (the filetype parser will need to be able to
> accept "some header text AND a file resource" and know to do them in
> sequence, but I think I can see how to deal with that without too much
> trouble.)
What about coding lazily and simply having the parser re-open the stream?
Unless opening a stream is expensive, the performance penalty will be
small, and it will make the code look prettier.
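
To make the two options concrete, here is a minimal sketch of the
"buffer a few lines, then hand them back" variant. It is written in Python
purely for illustration (the real code would be PHP), and every name in it
(detect_format, peek_lines, the prefix rules) is hypothetical. The chained
iterator is what lets the filetype parser see "some header text AND a file
resource" as one sequence. Re-opening is simpler still, but it only works
when the source can actually be opened a second time, which a pipe or
standard input typically cannot.

```python
import io
import itertools


def detect_format(lines):
    """Guess a format from a few buffered lines.

    The prefix rules here are illustrative assumptions, not a complete
    or authoritative detection scheme.
    """
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines and keep looking
        if text.startswith(">"):
            return "fasta"
        if text.startswith("LOCUS"):
            return "genbank"
        if text.startswith("<?xml"):
            return "xml"
    return "unknown"


def peek_lines(stream, max_lines=5):
    """Read up to max_lines lines without losing them.

    Returns the peeked lines plus an iterator that yields those same
    lines first and then continues with the rest of the stream.
    """
    peeked = list(itertools.islice(stream, max_lines))
    return peeked, itertools.chain(peeked, stream)


if __name__ == "__main__":
    stream = io.StringIO("\n>seq1 test\nACGT\nTTGA\n")
    peeked, lines = peek_lines(stream, max_lines=2)
    print(detect_format(peeked))
    for line in lines:
        print(line.rstrip("\n"))
```

The parser only ever consumes the chained iterator, so it does not need to
know whether detection looked at one line or twenty.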
>> Will it be possible to do the actual parsing from an array (in memory)?
>> For a FASTA parser it is not much work to write two parsers (one for
>> in-memory parsing, the other for streams); for GenBank, SwissProt and
>> others it is going to be a pain. Having the parsers read a record into
>> memory and then parse the whole thing would make it easier.
>
> Hmmm, perhaps having the filetype parsers read one line at a time into a
> "temporary record" variable, then parsing that variable when it hits the
> "end of record" marker for the format, you mean?
Something like a method readRecordinArray() which simply fills an array
with lines until it finds the end-of-record mark (or whatever clues
there are that it is at the end of a record). The array is then passed to
the 'real' parser, which only works with arrays of lines.
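
A rough sketch of that split, again in Python only for illustration and
with hypothetical names throughout (read_record_lines stands in for
readRecordinArray(), parse_record for the 'real' parser). It assumes a
format whose records end with a "//" line, as GenBank flat files do. A
format without an explicit terminator, such as FASTA, would need a
lookahead for the start of the next record instead.

```python
import io


def read_record_lines(stream, is_end_of_record):
    """Collect lines until the end-of-record test fires.

    Returns the list of lines for one record, or None when the input
    is exhausted and nothing was read.
    """
    lines = []
    for line in stream:
        line = line.rstrip("\n")
        lines.append(line)
        if is_end_of_record(line):
            return lines
    return lines or None


def genbank_end(line):
    """End-of-record test for formats that terminate records with '//'."""
    return line.strip() == "//"


def parse_record(lines):
    """Stand-in for the 'real' parser: it only ever sees a list of lines."""
    return {"first_line": lines[0], "line_count": len(lines)}


if __name__ == "__main__":
    stream = io.StringIO("LOCUS A\nORIGIN\n//\nLOCUS B\n//\n")
    while True:
        record_lines = read_record_lines(stream, genbank_end)
        if record_lines is None:
            break
        print(parse_record(record_lines))
```

Because parse_record takes a plain list, the same parser works whether the
lines came from a file, a stream, or an array already in memory.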
> The "set how big of a buffer (of parsed records) you will need" method
> I currently have in there is just the best compromise I've come up
> with so far, so there could easily be a better solution waiting just
> outside of my brain...
>
> Hmmm...under what circumstances will people need to move back and
> re-fetch a previous record? That may get me thinking a little clearer on
> the buffer issue...
>
Right, who would ever want to go back? It just seems like good design to
allow for it....
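
For what it is worth, one way a bounded buffer of parsed records could
allow stepping back is sketched below. This is Python for illustration
only, and the class and method names are made up for the example rather
than taken from the existing code. It keeps the last buffer_size records
and refuses to go back further than that.

```python
from collections import deque


class BufferedRecordReader:
    """Iterate over records while remembering the most recent few."""

    def __init__(self, records, buffer_size=3):
        self._source = iter(records)
        # deque(maxlen=...) silently drops the oldest record when full
        self._history = deque(maxlen=buffer_size)
        self._offset = 0  # steps back from the newest buffered record

    def next(self):
        """Return the following record, replaying the buffer if we went back."""
        if self._offset > 0:
            self._offset -= 1
            return self._history[-1 - self._offset]
        record = next(self._source)  # raises StopIteration at end of input
        self._history.append(record)
        return record

    def previous(self):
        """Step back one record; fails once the buffer is exhausted."""
        if self._offset + 1 >= len(self._history):
            raise IndexError("record is no longer in the buffer")
        self._offset += 1
        return self._history[-1 - self._offset]


if __name__ == "__main__":
    reader = BufferedRecordReader(["rec1", "rec2", "rec3"], buffer_size=2)
    reader.next()
    reader.next()
    print(reader.previous())
    print(reader.next())
```

The buffer size is then the only thing the caller has to decide up front,
and a size of 1 degrades to plain forward-only reading.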
> I have to run back out again (yesterday and today's schedule, in
> particular, is a horrendous mess on my end) but I will be back later this
> evening, and I'll put them up then. Shall I go ahead and rename the file I
> have as "parse2.inc.php" back to "parse.inc.php"?
>
Yes.
Nico