[Biophp-dev] My brain hurts.

S Clark biophp-dev@bioinformatics.org
Fri, 2 May 2003 10:51:15 -0600


On Thursday 01 May 2003 10:28 pm, nicos@itsa.ucsf.edu wrote:
> Actually, please commit them to the cvs.  That will make it much easier
> for me to work with them.  It is always possible to revert to older
> versions.  Once they are in cvs, I can also make changes and commit them
> without getting outof sync with what you are doing.

Okay- I've got to head out of town for a bit, but if nobody else has
already dumped them in before I get back, I'll commit them.

> Just from looking at the code:  It looks as if seq_factory does not know
> what parser type it is dealing with.  I thought that every parser could
> return their own datastructure and that the 'translation' only takes
> place in seq_factory.  Now, it looks as if every parser should return an
> 'id', 'sequence', and 'seqlength'.  If seq_factory knows what is coming,
> we could even use the current genbank parser (just let it return a
> seqobject, seqfactory will pass it through).  Should be easy to add.

The seq_factory should never have to know what's inside another object, with
the exception of the one object that it's built to deal with directly (the seq
object).  The way I'm seeing the ideal design is that the parser's job
is to convert the "proprietary" format to a "neutral" one that is easy
for any other object to interpret...to the extent that they can.  If the
seq_factory has to know all of the terms in the individual formats for
the parsers, then the parsers really no longer have any purpose (i.e. that
would mean the actual PARSING is being done outside of the parser...)

On the other hand, you bring up a good point - I would like to add a moderate
"term translation" for a reasonable set of common terms, to make things a 
little more flexible and easy for anyone writing parsers (i.e. conversion of
"ID" and "Id" and "Name" and so on to "id", etc.)  This will also give us
a mechanism in the seq_factory in the event that someday we completely change
the format - we could then use the term translation to support the older
deprecated format of the data for a while.

> And: bravo, you are using 4 spaces instead of tabs!

Thanks :-)  I'm trying to get myself into the habit of sticking
to the Pear coding standards and naming conventions.

> > 3)Parsers SHOULD also accept raw text, filehandles, or filenames.
> > (only relevant when autodetection is being bypassed).
>
> Is the code for dealing with this already there?  I did not notice
> anything about streams.  It would not be pretty to have to write
> different parsers for arrays of lines and streams.

The code for all of that is already in the memory-based FASTA parser.

I was ABOUT to say that there's no way around having separate stream and
file based parsers...but as I think about it, I think I'm wrong :-)  It
will add a choice of either more complexity (building two different means
of working through the data, one memory-based and the other stream based)
or losing a little of the speed/elegance available to an entirely memory-based
parser (having to deal with even memory based data one line at a time rather
than all records at once), but as I think about it I think you're right that
the little extra work either way are outweighed by the convenience of not
having to pick from two different parsers.  If the upper-level Parse
object's design still looks okay, I'll go ahead and go back and fix
the fasta parser to deal with streams and memory-based data the same.
(It WILL deal with streams as it's written, but only if the stream will
fit entirely in memory...)

> > 4)Parsers MUST have a "fetchNext()" method, which returns the next
> > parsed record (starting with the first one, obviously) as an array, made
> > up of whatever key=>value pairs are available in the format.  The
> > keys MUST
> > be named after the attributes in the seq object (e.g. "id"),and SHOULD
> > begin "id" and "sequence".  This method MUST return false if there are no
> > more records.
>
> I think you don't have to require a certain naming scheme.  Seq_factory()
> could do the translation as long as it knows what is coming and how to
> translate it.

As mentioned above, we'll probably want to avoid pushing the job of parsing
every format's term's out of the parsers, but more "lenience" in what the
seq_factory can understand would definitely be a good idea.
(I definitely need to, at the least, add a "if there's no key that translated
to 'id', assume the first and second fields are id and sequence, respectively"
bit of code - I was in a bit of a hurry to get stuff out for review and kinda
forgot to do that...)

Should we require a particular ORDER rather than naming convention?  I chose
keynames instead of order (except for the first two fields, which should both
exist in ANY sequence format) thinking that this would make it easier to
deal with the fact that different fields will be missing from different
formats.

One ALTERNATIVE would be to move all but the basic information (name,
sequence) in the seq object itself into a key=>value array (e.g. the
"annotations" array in my original sequence object design on the module_code
page.) - that would push "knowing exactly which fields you're dealing with"
to a certain extent out to the end-user level, but would at least give
us a place to store fields we haven't yet "hard-coded" into the seq object.

> OK, the parser fetches a record and does a (usually inherent)
> parserObj->moveNext(), so I guess fetchNext() is the right name.
>
> B.t.w. do we go for fetchNext() or fetch_Next()?  Although I used the
> latter, I'd actually prefer the first one.

Either way works for me - "fetchNext" is the style I've gotten used to
using, but there's nothing at all wrong with fetch_Next() instead.  I 
think the only reason I didn't use that is laziness about typing
the extra underscores...

Okay, I have to run - will commit changes to CVS upon return, if nobody's
beaten me to it before then.

Sean