[Biophp-dev] My brain hurts.

Nico Stuurman biophp-dev@bioinformatics.org
Sat, 3 May 2003 19:33:20 -0700


>> So far, autodetection relied only on the first line.  Even for a stream,
>> reading a single line is not a high price to pay.  Simply rewrite the
>> autodetect code to read a single line instead of the whole file.
>
> My only worry there is that I think SOME formats might not be detectable
> that way (some XML records, or if someone ever adds a parser for HTML output
> from a site, or something of the sort), though the basic idea is still quite
> feasible - we could have it buffer a certain number of lines (just enough to
> ensure that it'll reach some identifiable characteristic in just about any
> format).
>
> It'll take a LITTLE bit of extra code then, in the event that someone wants
> to parse a stream (the filetype parser will need to be able to accept
> "some header text AND a file resource" and know to do them in sequence, but
> I think I can see how to deal with that without too much trouble.)

What about being lazy and simply having the parser re-open the stream?
Unless opening a stream is expensive, the performance penalty will be
small, and it will make the code look prettier.
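
A rough sketch of the re-open idea, written in Python only to keep it short.
The names detect_format and open_with_format are made up for illustration,
the three first-line signatures are just the obvious ones, and it assumes the
source is something that can be opened twice (a file path, say):

```python
def detect_format(first_line):
    """Guess the format from the first line alone (hypothetical helper)."""
    if first_line.startswith(">"):
        return "fasta"
    if first_line.startswith("LOCUS"):
        return "genbank"
    if first_line.startswith("ID   "):
        return "swissprot"
    return "unknown"


def open_with_format(path):
    """Peek at one line, then re-open so the parser starts at the top."""
    with open(path) as probe:
        fmt = detect_format(probe.readline())
    # Fresh handle positioned at the first line; the caller closes it.
    return fmt, open(path)
```

The parser never needs to know that a line was already read, which is the
whole attraction; the cost is one extra open per input.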

>> Will it be possible to do the actual parsing from an array (in memory)?
>> For a FASTA parser it is not much work to write two parsers (one for
>> in-memory parsing, the other for streams), but for GenBank, Swiss-Prot and
>> others it is going to be a pain. Having the parsers read a record into
>> memory and then parse the whole thing would make it easier.
>
> Hmmm, perhaps having the filetype parsers reading one line at a time into a
> "temporary record" variable, then parsing that variable when it hits the
> "end of record" marker for the format, you mean?

Something like a method readRecordinArray() that simply fills an array
with lines until it finds the end-of-record mark (or whatever other clue
signals the end of a record).  The array is then passed to the 'real'
parser, which only works with arrays of lines.
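
Roughly like this, again in Python just for brevity.  The function names are
placeholders, parse_record is a stand-in for a real format parser, and the
"//" terminator is assumed because GenBank and Swiss-Prot records end that
way; other formats would need their own end-of-record test:

```python
import io


def read_record_into_array(handle, end_marker="//"):
    """Collect lines until the end-of-record marker; None at end of input."""
    lines = []
    for line in handle:
        line = line.rstrip("\n")
        if line.strip() == end_marker:
            return lines
        lines.append(line)
    # A trailing record without a marker is still returned.
    return lines or None


def parse_record(lines):
    """Stand-in for the 'real' parser: it only ever sees a list of lines."""
    return {"first_line": lines[0] if lines else "", "line_count": len(lines)}


# Usage: the same loop works for a file, a socket wrapper, or an in-memory
# stream.
stream = io.StringIO("ID   A\nSQ   x\n//\nID   B\n//\n")
records = []
while (lines := read_record_into_array(stream)) is not None:
    records.append(parse_record(lines))
```

Only the reading step touches the stream, so there is one parser per format
instead of two.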


> The "set how big of a buffer (of parsed records) you will need" method
> I currently have in there is just the best compromise I've come up
> with so far, so there could easily be a better solution waiting just
> outside of my brain...
>
> Hmmm...under what circumstances will people need to move back and re-fetch
> a previous record?  That may get me thinking a little clearer on the buffer
> issue...
>

Right, who would ever want to go back?  It just seems like good design to
allow for it....
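
For what it is worth, a bounded buffer of parsed records does not have to be
complicated.  This is only a sketch in Python under my own assumptions (the
RecordBuffer class and its method names are invented here, not anything in
the current code): keep the last few records and let the caller step back
within that window.

```python
from collections import deque


class RecordBuffer:
    """Keep the last `size` records so the caller can step back a little."""

    def __init__(self, records, size=3):
        self._source = iter(records)
        self._history = deque(maxlen=size)  # oldest records fall off the front
        self._back = 0  # how many steps we are behind the newest record

    def next(self):
        if self._back > 0:
            # Replay a buffered record instead of reading a new one.
            self._back -= 1
            return self._history[-1 - self._back]
        record = next(self._source)  # raises StopIteration at end of input
        self._history.append(record)
        return record

    def previous(self):
        if self._back + 1 >= len(self._history):
            raise IndexError("record is no longer in the buffer")
        self._back += 1
        return self._history[-1 - self._back]
```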


> I have to run back out again (yesterday's and today's schedule, in
> particular, is a horrendous mess on my end) but I will be back later this
> evening, and I'll put them up then.  Shall I go ahead and rename the file I
> have as "parse2.inc.php" back to "parse.inc.php"?
>

Yes.



Nico