[Biophp-dev] Import modules structure and interrelation

Nico Stuurman biophp-dev@bioinformatics.org
22 Mar 2004 08:11:27 -0800


Great work here!

This should go straight into 
biophp/genephp/devdocs/parsers.txt
or something similar...


Best,

Nico

> Depending on how you count, there are either 2 or 3 'modules' that all
> go together to make up the import capabilities.
> 
> seqIOimport itself
> the 'specific file type' parser module
> and (depending how you count)
> seqFactory.
> 
> There's very little that needs to be done to seqIOimport (and nothing
> for seqFactory) to add a new import module.
> 
> There are only a few requirements to fit a 'specific file type'
> module (such as the locuslink parser you are working on):
> 1.)the class needs to be able to accept a file name, a file handle, or
> "text" (which, I suppose, could actually be binary data) as an input source.
> (this is so that we can handle data from a network socket connected to
> a server, http:// or ftp:// URLs, files on the local hard drive, or data
> already read into memory from other sources)
> 
> 2.)the class needs to accept the input source on instantiation
> (i.e. $parser = new locuslink_import($input_source) )
> 
> 3.)the class SHOULD have a "setSource()" interface (which sets or
> changes the input source - seqIOimport doesn't currently use this, but
> it could in the future - i.e. for parsing multiple files in one shot).
> 
> 4.)the class MUST have a fetchNext() interface, which returns an associative 
> array with the next parsed sequence data  (e.g. 'id'=>'(name of 
> sequence)','sequence'=>'ACGTACGTACGT...').  We're using this type of 
> 'generic' associative array as a format for exchanging sequence information
> between modules so as to make the individual modules usable by themselves
> (i.e. you can use the fasta parser module all by itself [outside of 
> seqIOimport] without knowing anything about the seq class format...)
> 
> 5.)When imported into the BioPHP framework, it goes into the 'parsers'
> section, named (filetype).inc.php (e.g. "swissprot.inc.php").  
> 
> That last requirement is just so that it can be found and auto-loaded
> by the seqIOimport module.
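
A minimal sketch of a parser that meets requirements 1 to 4 above. The class
name example_import, the FASTA-like record handling, and the choice to return
false when the source is exhausted are all assumptions made for illustration;
none of this is taken from the BioPHP source.

```php
<?php
// Hypothetical parser sketch; the class name and the FASTA-like format
// handling are illustrative assumptions, not BioPHP code.
class example_import
{
    private $lines = array(); // unread lines of the current source

    // Requirement 2: the input source is accepted on instantiation.
    public function __construct($source)
    {
        $this->setSource($source);
    }

    // Requirements 1 and 3: accept a file handle, a file name, or raw text.
    public function setSource($source)
    {
        if (is_resource($source)) {
            $text = stream_get_contents($source);
        } elseif (is_string($source) && strpos($source, "\n") === false && is_file($source)) {
            $text = file_get_contents($source);
        } else {
            $text = (string) $source;
        }
        $this->lines = preg_split('/\r\n|\r|\n/', trim($text));
    }

    // Requirement 4: return the next record as an associative array.
    // Returning false at the end of the data is an assumption of this sketch.
    public function fetchNext()
    {
        $record = null;
        while (count($this->lines) > 0) {
            $line = trim($this->lines[0]);
            if ($line !== '' && $line[0] === '>') {
                if ($record !== null) {
                    break; // next header: leave it for the next call
                }
                $record = array('id' => substr($line, 1), 'sequence' => '');
            } elseif ($record !== null) {
                $record['sequence'] .= $line;
            }
            array_shift($this->lines);
        }
        return $record === null ? false : $record;
    }
}
```

A script could then loop with
while (($rec = $parser->fetchNext()) !== false) { ... }
without knowing anything about seqIOimport or the seq class.
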
> 
> seqIOimport is only a 'go-between' - it handles (where possible) 
> auto-detection of filetypes and calling of the appropriate parser, and
> acting as a frontend to the parsed sequence data (it can either return
> the 'raw' associative array results from the 'filetype' parsers, or it
> can pass the data to 'seqFactory', which is in charge of generating
> seq objects from the data.)
> 
> Adding a new filetype parser to seqIOimport takes only one to three additional
> steps:
> 
> 1.)REQUIRED - add the name of the filetype (e.g. 'locuslink') to
> the list of recognized filetypes.
> ( $this->seqfiletypes=array('fasta','clustal','lasergene','pdraw','genbank','swissprot'); )
> 
> 2.)OPTIONAL (but desirable) - add the 'file extension' to the 'detect filetype
> by filename' feature, if applicable (the typeByName($name) method)
> 
> 3.)OPTIONAL (but desirable) - add a pattern matching the first line of data,
> by which seqIOimport can recognize the type of data (the autodetect() method)
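
The three steps above might look roughly like this. It is only a sketch: the
extension map, the first-line patterns, and the $firstLine parameter are
assumptions, and the real typeByName() and autodetect() are methods of
seqIOimport rather than free functions.

```php
<?php
// Illustrative only: the extension map and first-line patterns below are
// assumptions, not the tables used by the real seqIOimport.

// Step 1: the list of recognized filetypes, with 'locuslink' appended.
$seqfiletypes = array('fasta', 'clustal', 'lasergene', 'pdraw', 'genbank', 'swissprot', 'locuslink');

// Step 2: guess the filetype from the file extension (false if unknown).
function typeByName($name)
{
    $byExtension = array('fasta' => 'fasta', 'fa' => 'fasta', 'gb' => 'genbank');
    $ext = strtolower(pathinfo($name, PATHINFO_EXTENSION));
    return isset($byExtension[$ext]) ? $byExtension[$ext] : false;
}

// Step 3: guess the filetype from the first line of data (false if unknown).
function autodetect($firstLine)
{
    $patterns = array(
        'fasta'     => '/^>/',       // FASTA headers start with ">"
        'genbank'   => '/^LOCUS\s/', // GenBank records start with a LOCUS line
        'swissprot' => '/^ID\s/',    // Swiss-Prot entries start with an ID line
    );
    foreach ($patterns as $type => $regex) {
        if (preg_match($regex, $firstLine)) {
            return $type;
        }
    }
    return false;
}
```

A new parser would add its own extension and first-line pattern to these two
tables where such a convention exists for its format.
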
> 
> Everything's been designed as much as possible so far such that each 
> individual component needs to know only the barest minimum about the
> other components - seqIOimport only needs to know 'call the filetype parser
> with the data source' and 'call fetchNext() to get the next sequence', (and to 
> call seqFactory to generate sequence objects) and that's it.  The filetype
> parser only needs to know it's getting a data source on instantiation, and
> that it needs to respond to 'fetchNext()' with the next parsed sequence's 
> information.  seqFactory only needs to know that it's getting an associative
> array (and what common terms will be in the array) and how to feed that info
> to the seq object.
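
The division of knowledge described above can be sketched as follows. Every
name here (stub_import, example_seq, make_seq, import_all) is a hypothetical
stand-in, and the stub parser simply takes an array of ready-made records as
its "source" so that the example stays short.

```php
<?php
// Hypothetical stand-ins for illustration; none of this is BioPHP source.

// Parser role: gets a source on instantiation, answers fetchNext().
class stub_import
{
    private $records;

    public function __construct($source)
    {
        $this->records = $source;
    }

    public function fetchNext()
    {
        $next = array_shift($this->records); // null once the array is empty
        return $next === null ? false : $next;
    }
}

// Stand-in for the seq class.
class example_seq
{
    public $id;
    public $sequence;
}

// Factory role: knows only the array keys and the object it builds.
function make_seq(array $record)
{
    $seq = new example_seq();
    $seq->id = isset($record['id']) ? $record['id'] : '';
    $seq->sequence = isset($record['sequence']) ? $record['sequence'] : '';
    return $seq;
}

// Go-between role: knows only "construct with a source" and "fetchNext()".
function import_all($parserClass, $source, $asObjects = true)
{
    $parser = new $parserClass($source);
    $out = array();
    while (($record = $parser->fetchNext()) !== false) {
        $out[] = $asObjects ? make_seq($record) : $record;
    }
    return $out;
}
```

Swapping in a different parser class changes nothing in make_seq() or
import_all(), which is the point of keeping the shared format to a plain
associative array.
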
> 
> It's hoped that this will make it very easy for people to pop in and 
> contribute (in this case) import modules, since you don't need to 'learn' the 
> rest of the modules to do so.
> 
> Does any of this help?....
> 
> P.S. To answer your SPECIFIC question - $flines is the data read from the
> source passed to the swissprot parser - the swissprot parser has no
> knowledge at all of the existence of the seqIOimport module that loads
> it (and, indeed, might conceivably be called directly in a script rather 
> than through the seqIOimport 'wrapper').  I note that the version of
> the parser that I'm looking at reads:
> 
> while ( list($no, $linestr) = each($sourcelines) ) {
> 
> so you probably do have a slightly older version.
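
For anyone reading this later: each() walks an array, not a file handle, so
data arriving through a handle has to be read into an array of lines first.
each() was also deprecated in PHP 7.2 and removed in PHP 8.0, and foreach
gives the same key/value iteration. A small sketch, with made-up sample lines
standing in for real swissprot data:

```php
<?php
// $sourcelines is made-up sample data standing in for the lines of a record.
$sourcelines = array("ID   EXAMPLE", "SQ   SEQUENCE", "//");

// foreach yields the same ($no, $linestr) pairs that the each() loop did.
$seen = array();
foreach ($sourcelines as $no => $linestr) {
    $seen[] = $no . ':' . $linestr;
}

// A file handle is not an array; read it into one before iterating.
$fp = fopen('php://memory', 'r+');
fwrite($fp, implode("\n", $sourcelines));
rewind($fp);
$fromHandle = array();
while (($line = fgets($fp)) !== false) {
    $fromHandle[] = rtrim($line, "\r\n");
}
fclose($fp);
```

After this, $fromHandle holds the same lines as $sourcelines and can be
handed to the same line-by-line parsing loop.
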
> 
> I've probably mangled this whole explanation, so please feel free
> to ask me what the heck I mean :-)
> 
> Sean
> 
> On Friday 19 March 2004 03:24 am, Frederic.Fleche@aventis.com wrote:
> > Hello all,
> >
> > I am planning to do a locuslink-file parser.
> > So I read the swissprot parser in order to get some good ideas.
> > Since my knowledge in php is not as good as yours I have a newbie question
> > concerning the following line of the function parse_swissprot
> >
> > while (list($no, $linestr) = each($flines))
> >
> > if $flines is from $seqIOimport->flines, I understand because it is an array
> >
> > if $flines is from $seqIOimport->fp, I don't understand because it is a file
> > handle, or does it work in the same way?