[Biophp-dev] My brain hurts.

Nico Stuurman biophp-dev@bioinformatics.org
Fri, 2 May 2003 11:30:31 -0700


>> Just from looking at the code:  It looks as if seq_factory does not know
>> what parser type it is dealing with.  I thought that every parser could
>> return their own datastructure and that the 'translation' only takes
>> place in seq_factory.  Now, it looks as if every parser should return an
>> 'id', 'sequence', and 'seqlength'.  If seq_factory knows what is coming,
>> we could even use the current genbank parser (just let it return a
>> seqobject, seqfactory will pass it through).  Should be easy to add.
>
> The seq_factory should never have to know what's inside another object,
> with the exception of the one object that it's built to deal with
> directly (the seq object).  The way I'm seeing the ideal design is that
> the parser's job is to convert the "proprietary" format to a "neutral"
> one that is easy for any other object to interpret...to the extent that
> they can.  If the seq_factory has to know all of the terms in the
> individual formats for the parsers, then the parsers really no longer
> have any purpose (i.e. that would mean the actual PARSING is being done
> outside of the parser...)
>


OK, that makes the seq_factory class completely useless.  Why not let 
the parsers return the very neutral seq object directly?  Why introduce 
another layer that does the exact same thing as the seq object?  Where 
is the (complete) design of the 'neutral, intermediary' format?

If this is really your idea, I vote against it.  It simply builds an 
extra layer that does exactly the same thing the seq object is supposed 
to be doing.  If you do not agree with the design of the seq object, 
that is another story, and something that should be discussed.



>>> 3)Parsers SHOULD also accept raw text, filehandles, or filenames.
>>> (only relevant when autodetection is being bypassed).
>>
>> Is the code for dealing with this already there?  I did not notice
>> anything about streams.  It would not be pretty to have to write
>> different parsers for arrays of lines and streams.
>
> The code for all of that is already in the memory-based FASTA parser.
>

Yes, but it puts the complete file in an array of strings (just like my 
previous version did).  So it cannot deal with a stream until it has 
read the whole thing into memory.


> I was ABOUT to say that there's no way around having separate stream and
> file based parsers...but as I think about it, I think I'm wrong :-)  It
> will add a choice of either more complexity (building two different
> means of working through the data, one memory-based and the other
> stream based) or losing a little of the speed/elegance available to an
> entirely memory-based parser (having to deal with even memory based
> data one line at a time rather than all records at once), but as I
> think about it I think you're right that the little extra work either
> way is outweighed by the convenience of not having to pick from two
> different parsers.  If the upper-level Parse object's design still
> looks okay, I'll go ahead and go back and fix the fasta parser to deal
> with streams and memory-based data the same.  (It WILL deal with
> streams as it's written, but only if the stream will fit entirely in
> memory...)

As I have written a couple of times now, an elegant way would be to 
write functions 'next', 'current', 'previous', and 'each' that work not 
only on arrays (simply pass the call on to the PHP function) but also on 
streams (do an fgets and keep track of the file pointer).
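
A minimal sketch of that idea, to make it concrete.  The class name 
LineCursor and its layout are hypothetical, not existing BioPHP code, 
and only 'next' and 'current' are shown.  It assumes the array is a 
plain zero-indexed list of lines and that the stream is already open.

```php
<?php
// Hypothetical helper: walks lines from a PHP array or from an open stream
// through the same next()/current() calls.
class LineCursor
{
    private $source;          // array of lines, or an open stream resource
    private $isStream;
    private $index = -1;      // array position of the current line
    private $current = false; // current line; false before the start or at the end

    public function __construct($source)
    {
        $this->source = $source;
        $this->isStream = is_resource($source);
    }

    // Advance one line and return it without its line ending; false at the end.
    public function next()
    {
        if ($this->isStream) {
            // The stream keeps track of its own file pointer.
            $line = fgets($this->source);
        } else {
            $this->index++;
            $line = $this->index < count($this->source)
                ? $this->source[$this->index]
                : false;
        }
        $this->current = ($line === false) ? false : rtrim($line, "\r\n");
        return $this->current;
    }

    // Return the line most recently produced by next().
    public function current()
    {
        return $this->current;
    }
}
```

A parser written against next() and current() would not need to know 
which kind of source it was given.  A 'previous' for streams would 
additionally have to remember line offsets (for example with ftell and 
fseek), which this sketch leaves out.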

>
>>> 4)Parsers MUST have a "fetchNext()" method, which returns the next
>>> parsed record (starting with the first one, obviously) as an array,
>>> made up of whatever key=>value pairs are available in the format.
>>> The keys MUST be named after the attributes in the seq object (e.g.
>>> "id"), and SHOULD begin with "id" and "sequence".  This method MUST
>>> return false if there are no more records.
>>
>> I think you don't have to require a certain naming scheme.
>> Seq_factory() could do the translation as long as it knows what is
>> coming and how to translate it.
>
> One ALTERNATIVE would be to move all but the basic information (name,
> sequence) in the seq object itself into a key=>value array (e.g. the
> "annotations" array in my original sequence object design on the
> module_code page.) - that would push "knowing exactly which fields
> you're dealing with" to a certain extent out to the end-user level, but
> would at least give us a place to store fields we haven't yet
> "hard-coded" into the seq object.
>


I am lost now about what you want to accomplish.  You don't want the 
parsers to return an object, but you do want them to return an array 
that has to adhere to a very specific description?  This looks more like 
a psychological hangup over classes and objects than anything 
meaningful.  EITHER seq_factory knows what file format it is dealing 
with and translates from the parser's own format to the seq class, OR we 
have the parsers return a seq object directly.  Creating yet another 
format to store the data, only to use it as an intermediate, is just 
plain silly.
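
To make the two options concrete, here is a rough sketch, not a 
finished design.  The Seq class is a stand-in holding only the three 
attributes named earlier in this thread.  The SeqFactory name, the 
create() signature, and above all the per-format field names in the 
maps are assumptions made up for illustration; the real parsers may 
name their fields quite differently.

```php
<?php
// Stand-in for the seq object; only id, sequence and seqlength come from
// this thread.
class Seq
{
    public $id;
    public $sequence;
    public $seqlength;
}

// Hypothetical factory.  The field maps below are invented for illustration.
class SeqFactory
{
    private $fieldMaps = [
        'fasta'   => ['name' => 'id', 'residues' => 'sequence'],
        'genbank' => ['LOCUS' => 'id', 'ORIGIN' => 'sequence'],
    ];

    public function create($format, $record)
    {
        // Second option: a parser that already built a seq object is passed
        // through untouched.
        if ($record instanceof Seq) {
            return $record;
        }
        if (!isset($this->fieldMaps[$format])) {
            throw new InvalidArgumentException("No field map for format: $format");
        }
        // First option: translate the parser's own field names into seq
        // attributes.
        $seq = new Seq();
        foreach ($this->fieldMaps[$format] as $parserKey => $seqAttribute) {
            if (isset($record[$parserKey])) {
                $seq->$seqAttribute = $record[$parserKey];
            }
        }
        $seq->seqlength = strlen((string) $seq->sequence);
        return $seq;
    }
}
```

Either way the format-specific knowledge lives in exactly one place: in 
the factory's maps, or in the parser that builds the seq object itself.  
Neither route needs a third, intermediate format.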

>> B.t.w. do we go for fetchNext() or fetch_Next()?  Although I used the
>> latter, I'd actually prefer the first one.
>
> Either way works for me - "fetchNext" is the style I've gotten used to
> using, but there's nothing at all wrong with fetch_Next() instead.  I
> think the only reason I didn't use that is laziness about typing
> the extra underscores...


Let's go for fetchNext() then.
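
For reference, a sketch of what the fetchNext() contract from point 4 
could look like on a stream, holding only one record in memory at a 
time.  The class name FastaStreamParser is hypothetical, and taking the 
first word after '>' as the id is an assumption, not something we have 
agreed on.

```php
<?php
// Hypothetical streaming FASTA parser: reads line by line and keeps only
// the record being built in memory.
class FastaStreamParser
{
    private $handle;
    private $pendingHeader = null; // header line already read for the next record

    public function __construct($handle)
    {
        $this->handle = $handle;
    }

    // Returns ['id' => ..., 'sequence' => ...] for the next record,
    // or false when there are no more records.
    public function fetchNext()
    {
        $header = $this->pendingHeader;
        $this->pendingHeader = null;
        $sequence = '';

        while (($line = fgets($this->handle)) !== false) {
            $line = rtrim($line, "\r\n");
            if ($line === '') {
                continue; // skip blank lines
            }
            if ($line[0] === '>') {
                if ($header !== null) {
                    // This header belongs to the following record.
                    $this->pendingHeader = $line;
                    break;
                }
                $header = $line;
                continue;
            }
            if ($header !== null) {
                $sequence .= $line;
            }
        }

        if ($header === null) {
            return false;
        }
        // Assumption: the id is the first whitespace-delimited word after '>'.
        $parts = preg_split('/\s+/', (string) substr($header, 1), 2);
        return ['id' => $parts[0], 'sequence' => $sequence];
    }
}
```

Calling fetchNext() in a loop until it returns false walks the whole 
file, whether the handle is a regular file or a pipe.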


I am now a little unsure what you have accomplished.  You replaced the 
seq object with an 'intermediate' array that is basically an object 
without functions (but otherwise exactly the same as what the seq object 
intends to be).  You also replaced the approach of tracking positions 
through an index (pointer) into the data with buffering the data in an 
array, which moves much more data around in memory than would otherwise 
be needed.  I don't think progress has been made in dealing with 
streams.  Am I missing something?





Nico