[Biophp-dev] My brain hurts.

biophp-dev@bioinformatics.org
Sat, 3 May 2003 18:32:05 -0600


On Saturday 03 May 2003 05:51 pm, nicos@itsa.ucsf.edu wrote:
> First, the tone of my previous mail was a little too harsh (sorry!), but
> my gut feeling is that some harsh discussion now can save a whole lot of
> trouble later on

Well, that's okay too, I was also trying to answer with no food in my system
and not enough sleep last night, so my coherency may have suffered a bit...

> I realized that too (after sending my previous mail).  This seems
> especially important as Serge apparently already changed the Seq object!

Wow, things in the real world conspiring to make me look RIGHT about
something?!?!?  Is that legal? :-)
>
> The only real difficulty is to have a real good seq_factory object that
> is not going to change (additions are OK, but we don't want any name
> changes, since that will have to be fixed in many places).

I agree completely here. Ideally, I think the seq_factory should end up
very good at dealing with "generic" data: rather than being programmed for
specific individual formats, it should be good at "figuring out" generalized
terms, such as knowing that any field it is given that is named "id" or
"name" or "label" or whatever maps to the "id" attribute of the seq object.
That way, objects that want to generate seq objects need to "know" only a
minimum about the formats.
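
A minimal sketch of that alias idea, assuming a simple lookup table. The
constant SEQ_FIELD_ALIASES, the function normalize_seq_fields() and the
alias lists themselves are made up for illustration; none of them come from
the existing seq_factory code.

```php
<?php
// Hypothetical alias table: canonical seq attribute => names a parser might use.
const SEQ_FIELD_ALIASES = [
    'id'          => ['id', 'name', 'label', 'accession'],
    'sequence'    => ['sequence', 'seq', 'residues'],
    'description' => ['description', 'desc', 'definition'],
];

// Map loosely named input fields onto canonical attribute names.
// Fields that match no alias are kept under 'extra' rather than dropped.
function normalize_seq_fields(array $raw): array
{
    // Invert the table once: alias => canonical name.
    $lookup = [];
    foreach (SEQ_FIELD_ALIASES as $canonical => $aliases) {
        foreach ($aliases as $alias) {
            $lookup[$alias] = $canonical;
        }
    }

    $out = ['extra' => []];
    foreach ($raw as $key => $value) {
        $k = strtolower(trim((string) $key));
        if (isset($lookup[$k])) {
            $out[$lookup[$k]] = $value;
        } else {
            $out['extra'][$k] = $value;
        }
    }
    return $out;
}
```

With something like this, a format parser would only hand over whatever
field names it found, and the factory would decide where they belong.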

> > In the interest of more harmonious development, however, I propose
> > then that we go ahead and have the upper-level Parse object link
> > directly to the seq object and comment out seq_factory references for
> > now.
>
> Just when you started to convince me!

Oh, well, in that case, forget I wrote that :-)
If nothing else, it wouldn't be hard to comment out the seq_factory references
and temporarily replace them with direct seq object calls until the
seq_factory reaches a satisfactory state.

> Why not extend the seq_factory object so that it can take different types
> of input (and potentially) even different types of output?  You can
> default to seq_factory->Seq object, but could also allow other types of
> translations.

That's definitely the direction I think we should head with it - in its 
present state it's actually a bit of an "afterthought" that came up as I
was working on the other code, and I'm afraid it shows.  ("Okay, I can get
it to make valid seq objects, I'll put in the rest later..." )

> So far, autodetection has relied only on the first line.  Even for a stream
> reading a single line is not a high price to be paid.  Simply rewrite the
> autodetect code to read a single line instead of the whole file.

My only worry there is that I think SOME formats might not be detectable
that way (some XML records, or if someone ever adds a parser for HTML output
from a site, or something of the sort), though the basic idea is still quite
feasible - we could have it buffer a certain number of lines (just enough to
ensure that it'll reach some identifiable characteristic in just about any
format).
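
A rough sketch of the "buffer a few lines, then decide" approach. The
function names read_head() and detect_format() are hypothetical, and the
signatures checked are illustrative assumptions rather than a complete or
authoritative list.

```php
<?php
// Read at most $maxLines lines from an open stream and return them.
function read_head($handle, int $maxLines = 10): array
{
    $lines = [];
    while (count($lines) < $maxLines && ($line = fgets($handle)) !== false) {
        $lines[] = $line;
    }
    return $lines;
}

// Guess a format from the buffered lines. Lines that match nothing are
// skipped, so a format whose telltale line is not the first one can still
// be recognized as long as it falls inside the buffer.
function detect_format(array $headLines): string
{
    foreach ($headLines as $line) {
        $line = ltrim($line);
        if ($line === '') {
            continue; // ignore blank lines
        }
        if ($line[0] === '>') {
            return 'fasta';
        }
        if (strncmp($line, 'LOCUS', 5) === 0) {
            return 'genbank';
        }
        if (preg_match('/^ID\s/', $line)) {
            // EMBL flat files also start with an ID line; telling the two
            // apart would need a look at later lines.
            return 'swissprot';
        }
        if ($line[0] === '<') {
            return 'xml';
        }
    }
    return 'unknown';
}
```

The buffer size is the tunable part: it only has to be large enough to
reach an identifiable line in the formats that are supported.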

That will take a LITTLE bit of extra code in the event that someone wants
to parse a stream: the filetype parser will need to be able to accept
"some header text AND a file resource" and know to process them in sequence.
I think I can see how to deal with that without too much trouble, though.
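
One way the "header text AND a file resource" case might look is a small
wrapper that hands out the already-read lines first and then carries on
from the stream. The class name HeadThenStream and its nextLine() method are
hypothetical.

```php
<?php
// Serves lines from an in-memory header first, then from the open stream.
class HeadThenStream
{
    private $head;   // lines already consumed during autodetection
    private $handle; // stream positioned just after those lines

    public function __construct(array $headLines, $handle)
    {
        $this->head = $headLines;
        $this->handle = $handle;
    }

    // Returns the next line, or false once both sources are exhausted.
    public function nextLine()
    {
        if ($this->head) {
            return array_shift($this->head);
        }
        return fgets($this->handle);
    }
}
```

A filetype parser written against nextLine() would not need to know that
part of its input had already been read off the stream.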

> Will it be possible to do the actual parsing from an array (in memory)?
> For a fasta parser it is not much work writing two parsers (one for in
> memory parsing, the other for streams); for the genbank, swissprot and
> others it is going to be a pain. Having the parsers read a record into
> memory and then parse the whole thing would make it easier.

Hmmm, you mean having the filetype parsers read one line at a time into a
"temporary record" variable, then parse that variable when they hit the
"end of record" marker for the format?

Good idea, I think. (This is BASICALLY what the event-based XML parsers seem
to have to do anyway).
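
The "temporary record" idea could be sketched roughly as below. The function
name each_record() and the callback are hypothetical; "//" is used as the
default marker because it terminates GenBank records, and other formats
would need their own end-of-record test.

```php
<?php
// Collect lines until the end-of-record marker, then hand the whole record
// (an array of lines) to a parser callback. Written as a generator so only
// one record is held in memory at a time.
function each_record($handle, callable $parseRecord, string $endMarker = '//')
{
    $record = [];
    while (($line = fgets($handle)) !== false) {
        $line = rtrim($line, "\r\n");
        if ($line === $endMarker) {
            if ($record) {
                yield $parseRecord($record);
            }
            $record = [];
        } elseif ($line !== '' || $record) {
            // Skip blank lines between records, keep them inside a record.
            $record[] = $line;
        }
    }
    if ($record) {
        // Input ended without a terminator: still parse what was collected.
        yield $parseRecord($record);
    }
}
```

Because the callback only ever sees an array of lines, the same record
parser would work whether the lines came from a file, a stream or a string
split in memory.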

> Why not check the filesize, if small enough parse in memory, if too big
> or unknown parse the stream.  For streams, movePrevious() simply returns
> false.  That would allow for simple parsers that don't even have to know
> whether they deal with a stream or a (array of) string(s).  In my code
> the parser rewinds the pointer to the beginning of the record.  That
> could be avoided by buffering just a single record.
>
> Hmm.  I am still unsure myself, maybe buffering is a better/nicer idea.

The "set how big of a buffer (of parsed records) you will need" method
I currently have in there is just the best compromise I've come up
with so far, so there could easily be a better solution waiting just
outside of my brain...
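
For reference, a bounded buffer of parsed records might be sketched like
this. It is an illustration under assumptions, not the code currently in
the parser: the class name RecordBuffer and everything except the
movePrevious() name (which comes from the discussion above) are made up.

```php
<?php
// Keeps the last $capacity parsed records so a caller can step back a bit.
class RecordBuffer
{
    private $capacity;
    private $records = [];
    private $pos = -1; // index of the current record, -1 when empty

    public function __construct(int $capacity)
    {
        $this->capacity = max(1, $capacity);
    }

    // Store a newly parsed record and make it the current one.
    public function push($record): void
    {
        $this->records[] = $record;
        if (count($this->records) > $this->capacity) {
            array_shift($this->records); // forget the oldest record
        }
        $this->pos = count($this->records) - 1;
    }

    // Step back one record; false when nothing older is buffered.
    public function movePrevious(): bool
    {
        if ($this->pos <= 0) {
            return false;
        }
        $this->pos--;
        return true;
    }

    // Step forward inside the buffer; false means "parse a new record".
    public function moveNext(): bool
    {
        if ($this->pos >= count($this->records) - 1) {
            return false;
        }
        $this->pos++;
        return true;
    }

    public function current()
    {
        return $this->pos >= 0 ? $this->records[$this->pos] : null;
    }
}
```

With a capacity of 1 this degenerates to the pure stream case, where
movePrevious() always returns false.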

Hmmm...under what circumstances will people need to move back and re-fetch
a previous record?  That may get me thinking a little clearer on the buffer
issue...

> I would formulate this a little differently.  There are always many ways to
> design a system (be it OO or otherwise), and we are simply trying to find
> the best possible way. Since there are more of us, we have more
> viewpoints, which will make the discussion phase longer, but also -
> hopefully - lead to a better implementation.  I do think that it will be
> easier to use objects (but don't care much either way).

Very much agreed here - Even from a completely "selfish" perspective I still
think the open collaboration and discussion will benefit my own abilities as
well as everyone else's.  

Part of my problem is that since I've mentally committed myself to heavily
Object-Oriented design, I'm going through the internal struggle of "if I get
too lenient, I'll never bother to properly learn it" vs "if I get too
zealous, I may cause problems for others".  (I'm very much a "learn by doing"
sort of person - I find it difficult to pick up a skill by "merely" reading
about it, but as soon as I can work on a practical application, I can usually
pick things up fairly quickly.)

> Why don't you go ahead and put the code in cvs and show how you want to
> deal with streams versus strings?

I have to run back out again (yesterday and today's schedule, in particular,
is a horrendous mess on my end) but I will be back later this evening, and
I'll put them up then.  Shall I go ahead and rename the file I have as
"parse2.inc.php" back to "parse.inc.php"?

> In the meantime, Serge is developing a complete system of objects that
> we haven't spent a word on....
>
> What is the name of this class going to be?  IOseq_read?  Or should
> there be an IOseq object that has 'read()' as a method?

Nifty, I'll have to take a look at those after I've gotten back this evening
and caught up.

I get the feeling development here's about to seriously take off...