On Tuesday 29 April 2003 08:59 pm, nicos@itsa.ucsf.edu wrote:
> No problem. I simply want it to work, to be as good as possible, and be
> available ASAP.

Thanks - I'm working on it now, starting with the fasta parser.

> OK. So they will all have their own move_Next(), move_Previous(), eof(),
> bof(), fetch() and probably also move_First(), move_Last(), move_To()
> functions? I guess the parser class constructors should take either a
> filename or a string as an argument. Preferably, there should be a way
> to maintain and use an index in a file (as in Serge's seqdb class).
> Hmmm, this is all straightforward to do in memory (like it is now), but
> probably more difficult with a stream (what streams other than files are
> there in php? php treats URLs almost exactly like a file, so...) How
> important is it to deal with datastructures larger than the available
> memory?

I wouldn't have even thought of it before, except for the fact that I
once really DID parse through one of NCBI's multi-GB GenBank files in
order to create a file of the ones matching particular criteria (it was
also nice to be able to see the results as the file came down, rather
than having to wait for several hours while the file downloaded first).
That sort of thing, plus the occasional set of results from some form of
website query (or a system installed on an older machine using GenePHP
to redirect queries for multiple people at a time), is really the only
time I can think of when you NEED to differentiate between
stream parsing and memory-based parsing.

The fact that I can treat an ftp or http URL or a socket like a
filehandle is one of the features I abuse the heck out of in PHP. The
only relevant difference here is that there's no feasible way to read
back to the previous record (without reading the whole stream into
memory first), so the class at the "edge" of the system interacting with
the stream just needs to know to pass back "false" if asked to go back.
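The "URL as a filehandle" behaviour described above can be sketched like
this. A hypothetical helper that works unchanged on a local path, an
http:// URL, or an ftp:// URL, because PHP's fopen() wrappers treat them
all as streams (the function name and record-counting logic are
illustrative only, not part of GenePHP):

```php
<?php
// Sketch: the same read loop works for a local file or a URL, since
// PHP's fopen() wrappers expose both as forward-readable streams.
function count_fasta_records($source) {
    $fp = fopen($source, 'r');   // $source may be a path or a URL
    if (!$fp) {
        return false;
    }
    $count = 0;
    while (!feof($fp)) {
        $line = fgets($fp);
        // each '>' header line starts a new FASTA record
        if ($line !== false && strlen($line) > 0 && $line[0] == '>') {
            $count++;
        }
    }
    fclose($fp);
    return $count;
}
```

Note that the loop only ever moves forward; that one-way property is
exactly why a stream-backed parser can't support "go back".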
> B.t.w., since we now have this large list of methods that every class
> should have[...]

(I THINK I cover this further down...)

> > 2. add instantiation of the (filetype)_parser class
>
> Will this mean that I instantiate a class Parser, and class Parser
> finds the appropriate (filetype)_parser class for me? If so, it would
> be cool to keep the current include scheme, where only the required
> parser is actually included in the running script.

Yup - that is one of the things in particular that I really like about
the way you've got the Parser interface designed. The end user shouldn't
have to concern him/her/itself with the different classes way out on the
"edge" of the system. I'm trying to change as little as possible about
the way you've got the Parser object designed to work - just moving the
parsing functions down to more abstracted classes. The way that section
of code reads (not yet tested) at the moment in my version is:

if ($this->seqfiletype) {
    require_once(GENPHP_DIR.'/parsers/parse_'.$this->seqfiletype.'_class.inc.php');
    eval("\$this->parserObj = new parse_".$this->seqfiletype."(\$this->flines);");
    $this->func = & $this->parserObj->fetch();
    // only include the parser we will need
    //include_once(GENPHP_DIR.'/parsers/'.$this->func.'.inc.php');
    // to get the indexing right we'll have to fetch one first
    $this->fetch();
} else {
    // if we don't know the seqfiletype, destruct the object:
    $this = false;
}

> so fetch() will simply call (filetype)_parser->fetch()?
[...]
> so fetch() will get a datastructure from the parser, feed this to
> object seq_factory, and get a Seq object back, which it sends back to
> the calling party?

Exactly so - on the off chance that anyone's already written a bunch of
code with the existing parser in the last few days, their code should
CONTINUE to work exactly the same - just "inside" there are a couple of
additional objects running.
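For what it's worth, the eval() in the snippet above can be avoided: PHP
lets you instantiate from a variable class name directly. A minimal
standalone sketch of the dispatch idea (parse_fasta here is a stub
stand-in for the real class in parsers/parse_fasta_class.inc.php, and
make_parser() is a hypothetical helper, not GenePHP code):

```php
<?php
// Stub stand-in: in GenePHP this class would be pulled in with
// require_once(GENPHP_DIR.'/parsers/parse_fasta_class.inc.php').
class parse_fasta {
    public $flines;
    function __construct($flines) {
        $this->flines = $flines;
    }
}

// Build the class name from the detected filetype and instantiate it
// via a variable class name instead of eval().
function make_parser($seqfiletype, $flines) {
    $classname = 'parse_' . $seqfiletype;
    if (!class_exists($classname)) {
        // unknown filetype: fail, as the else-branch above does
        return false;
    }
    return new $classname($flines);
}
```

The variable-class-name form does the same job as the eval() and avoids
the quoting/escaping headaches.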
The datastructure I'm using is just an array of key=>value pairs, with
the keys being the names of the variables in the seq object (they seem
to be named such that it's pretty obvious what they all are, so if
someone wanted to use the data outside of GenePHP it'd be easy to figure
out what it meant). Example:

"id"=>"somesequence", "sequence"=>"AGCT", "seqlength"=>4 [etc.]

Parse::fetch() calls the $parserObj->fetch() method, passes the result
to $seqfactory->createSeq($fetchedArray), and returns the resulting
sequence. (If desirable later, it'd be trivial to add an argument to
Parse->fetch() to make it return the data array directly rather than a
seq object. I don't know how much demand there would be for that,
though.)

Would it be helpful to add a "fetchNext()" wrapper function to the Parse
class (that would "fetch()" and then "moveNext()")?

> Sounds OK to me. I would still think about keeping the user functions
> (move_Next(), etc.) in the Parse class (provided it is possible; I
> don't think interleaved data are a big problem with this scheme). That
> will keep the individual parsers more simple.

I agree, that'll keep it easy to use. Those calls in the Parse class
become "wrappers" for the equivalent calls down in the individual file
parsers, so they'll continue to operate the way you wrote them, at least
presuming I do it properly.

> The seq_factory is fine with me (it is a good idea to keep the
> translation from what is in the file to our abstraction of the real
> world in one place).
> The only issue is how to deal with stuff both in memory (small files,
> strings; these are currently kept in an array of strings, and an index
> of the line numbers with sequence entries is maintained) and in
> streams (big files, where we read from a file pointer). Is there an
> easy way to deal with both?
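The key=>value hand-off could look roughly like this. Both classes below
are minimal hypothetical stand-ins, not GenePHP's real seq and
seq_factory, just enough to show the array-to-object step:

```php
<?php
// Minimal stand-ins to illustrate the hand-off: the parser returns a
// plain array, and the factory turns it into a seq object.
class seq {
    public $id;
    public $sequence;
    public $seqlength;
}

class seq_factory {
    // Copy each key=>value pair onto the seq property of the same name.
    function createSeq($data) {
        $s = new seq();
        foreach ($data as $key => $value) {
            $s->$key = $value;
        }
        return $s;
    }
}
```

Keeping this translation in one place is exactly the point made above:
if the seq object's internals change, only createSeq() has to follow.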
Fortunately, two things SHOULD be true about parsing streams:

1) people should not have to do it too often
2) when they do, they will conceivably know in advance what kind of
   stream it is, and so will be able to specify the filetype
   (auto-detection should really be the only BIG difference in use
   between streams and files/strings)

In order to keep the upper-level Parse object unified, I'd say simply
have the stream parsers return "false" for calls to move_Previous(). In
most cases, I suspect that when reading a stream the user will just want
to extract some or all of the sequences as they come in, so this should
be no issue. The other alternative that comes to mind is a separate
parse_stream object, but that kind of defeats some of the usefulness of
the unified Parse design (and still doesn't change the fact that you
can't easily "rewind"...).

I probably won't have it finished tonight, but sometime tomorrow I
should have a working example to put up somewhere for evaluation before
it goes into the official tree.
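The "edge" behaviour proposed above can be sketched as a stub. The
method names follow the scheme in this thread, but the class itself is
an illustrative stand-in, not the GenePHP implementation:

```php
<?php
// Sketch: a stream-backed parser can only move forward, so
// move_Previous() just reports failure instead of trying to rewind.
class stream_parser_stub {
    public $fp;
    function __construct($fp) {
        $this->fp = $fp;    // an already-open forward-readable stream
    }
    function move_Next() {
        // forward motion is fine as long as the stream isn't exhausted
        return !feof($this->fp);
    }
    function move_Previous() {
        // one-way stream: the unified Parse object sees "false" and
        // can report it to the caller without special-casing streams
        return false;
    }
}
```

This keeps the unified Parse interface intact: callers that never go
backwards behave identically on files, strings, and streams.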