[Biophp-dev] Import modules structure and interrelation

S Clark biophp-dev@bioinformatics.org
Sat, 20 Mar 2004 23:45:32 -0700



Depending on how you count, there are either 2 or 3 'modules' that all
go together to make up the import capabilities.

seqIOimport itself
the 'specific file type' parser module
and (depending how you count)
seqFactory.

There's very little that needs to be done to seqIOimport (and nothing
for seqFactory) to add a new import module.

There are only a couple of requirements to fit a 'specific file type'
module (such as the locuslink parser you are working on):
1.)the class needs to be able to accept a file name, a file handle, or
"text" (which, I suppose, could actually be binary data) as an input source.
(this is so that we can handle data from a network socket connected to
a server, http:// or ftp:// URLs, files on the local hard drive, or data
already read into memory from other sources)

2.)the class needs to accept the input source on instantiation
(i.e. $parser = new locuslink_import($input_source) )

3.)the class SHOULD have a "setSource()" interface (which sets or
changes the input source - seqIOimport doesn't currently use this, but
it could in the future - i.e. for parsing multiple files in one shot).

4.)the class MUST have a fetchNext() interface, which returns an associative
array with the next parsed sequence data (e.g. 'id'=>'(name of
sequence)','sequence'=>'ACGTACGTACGT...').  We're using this type of
'generic' associative array as a format for exchanging sequence information
between modules so as to make the individual modules usable by themselves
(i.e. you can use the fasta parser module all by itself [outside of
seqIOimport] without knowing anything about the seq class format...)

5.)When imported into the BioPHP framework, it goes into the 'parsers'
section, named (filetype).inc.php (e.g. "swissprot.inc.php").

That last requirement is just so that it can be found and auto-loaded
by the seqIOimport module.
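
To make requirements 1 through 4 concrete, here is a minimal sketch of a
parser that would fit.  It is NOT the actual BioPHP fasta module: the class
name fasta_sketch_import, the private members, and the "a string with no
newline that names an existing file is a file name" rule are all assumptions
made up for illustration, and it is written for a current PHP rather than
the PHP of this thread.

```php
<?php
// Hypothetical parser sketch (not the real BioPHP fasta parser).
class fasta_sketch_import
{
    private $lines = array(); // source data, split into lines
    private $pos = 0;         // index of the next unread line

    // Requirement 2: accept the input source on instantiation.
    public function __construct($source)
    {
        $this->setSource($source);
    }

    // Requirements 1 and 3: accept a file handle, a file name, or raw text.
    public function setSource($source)
    {
        if (is_resource($source)) {
            $text = (string) stream_get_contents($source);
        } elseif (is_string($source) && strpos($source, "\n") === false
                  && is_file($source)) {
            $text = (string) file_get_contents($source); // assumed file-name rule
        } else {
            $text = (string) $source;                    // treat as raw text
        }
        $this->lines = preg_split('/\r\n|\r|\n/', $text);
        $this->pos = 0;
    }

    // Requirement 4: return the next record as an associative array,
    // or false when there is nothing left to parse.
    public function fetchNext()
    {
        $count = count($this->lines);
        // Skip anything that comes before the next '>' header line.
        while ($this->pos < $count
               && substr($this->lines[$this->pos], 0, 1) !== '>') {
            $this->pos++;
        }
        if ($this->pos >= $count) {
            return false;
        }
        $id = trim(substr($this->lines[$this->pos], 1));
        $this->pos++;
        $sequence = '';
        // Collect sequence lines up to the next header or end of data.
        while ($this->pos < $count
               && substr($this->lines[$this->pos], 0, 1) !== '>') {
            $sequence .= trim($this->lines[$this->pos]);
            $this->pos++;
        }
        return array('id' => $id, 'sequence' => $sequence);
    }
}

// Used all by itself, outside of seqIOimport:
$parser = new fasta_sketch_import(">seq1\nACGT\nACGT\n>seq2\nTTGA\n");
while (($record = $parser->fetchNext()) !== false) {
    echo $record['id'], ': ', $record['sequence'], "\n";
}
```

Returning false at the end of the data is only one possible convention; the
real parsers may signal "no more sequences" differently.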

seqIOimport is only a 'go-between' - it handles (where possible)
auto-detection of filetypes and calling of the appropriate parser, and
acting as a frontend to the parsed sequence data (it can either return
the 'raw' associative array results from the 'filetype' parsers, or it
can pass the data to 'seqFactory', which is in charge of generating
seq objects from the data.)

Adding a new filetype parser to seqIOimport takes only one to three additional
steps:

1.)REQUIRED - add the name of the filetype (e.g. 'locuslink') to
the list of recognized filetypes.
( $this->seqfiletypes=array('fasta','clustal','lasergene','pdraw',
'genbank','swissprot'); )

2.)OPTIONAL (but desirable) - add the 'file extension' to the 'detect filetype
by filename' feature, if applicable (the typeByName($name) method)

3.)OPTIONAL (but desirable) - add a pattern for the first line of data by
which seqIOimport can recognize the type of data (the autodetect() method)
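
The three steps above might look roughly like the following.  This is only
a sketch: seqIOimport_sketch is a made-up stand-in, not the real class.  The
extension table, the return values, and every regular expression are
assumptions, in particular the 'll' extension and the '>>' record header
used for locuslink.  Only the seqfiletypes list and the two method names
come from the description above.

```php
<?php
// Hypothetical stand-in for the relevant parts of seqIOimport.
class seqIOimport_sketch
{
    // Step 1: the list of recognized filetypes, with 'locuslink' appended.
    public $seqfiletypes = array('fasta', 'clustal', 'lasergene', 'pdraw',
                                 'genbank', 'swissprot', 'locuslink');

    // Step 2: guess the filetype from the file extension.
    // All mappings here are assumed, for illustration only.
    public function typeByName($name)
    {
        $extensions = array(
            'fasta' => 'fasta',
            'fa'    => 'fasta',
            'gb'    => 'genbank',
            'll'    => 'locuslink', // made-up extension
        );
        $ext = strtolower(pathinfo($name, PATHINFO_EXTENSION));
        return isset($extensions[$ext]) ? $extensions[$ext] : false;
    }

    // Step 3: guess the filetype from the first line of data.
    public function autodetect($firstLine)
    {
        $patterns = array(
            // Assumed locuslink header; tested before fasta because
            // the plain '>' pattern would match it as well.
            'locuslink' => '/^>>\d+/',
            'fasta'     => '/^>/',
            'genbank'   => '/^LOCUS\s/',
            'swissprot' => '/^ID\s/',
        );
        foreach ($patterns as $type => $pattern) {
            if (preg_match($pattern, $firstLine)) {
                return $type;
            }
        }
        return false; // unknown: the caller has to name the type explicitly
    }
}

$io = new seqIOimport_sketch();
echo var_export($io->typeByName('human.FASTA'), true), "\n";
echo var_export($io->autodetect('ID   TEST_HUMAN'), true), "\n";
```

The order of the patterns matters whenever one format's first line is a
special case of another's, which is why the more specific one goes first.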

As much as possible, everything so far has been designed so that each
individual component needs to know only the barest minimum about the
other components - seqIOimport only needs to know 'call the filetype parser
with the data source' and 'call fetchNext() to get the next sequence' (and to
call seqFactory to generate sequence objects), and that's it.  The filetype
parser only needs to know it's getting a data source on instantiation, and
that it needs to respond to 'fetchNext()' with the next parsed sequence's
information.  seqFactory only needs to know that it's getting an associative
array (and what common terms will be in the array) and how to feed that info
to the seq object.
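
Here is a rough sketch of that division of labour, to show how little each
piece has to know.  Everything in it is hypothetical: toy_import,
seq_sketch, seqFactory_sketch, createSeq() and import_all() are made-up
names, and the toy "id<TAB>sequence" format exists only to keep the example
short.  The one thing borrowed from above is the (filetype)_import class
naming.

```php
<?php
// Toy parser: one record per line, "id<TAB>sequence". Illustration only.
class toy_import
{
    private $lines;

    public function __construct($source)
    {
        // Split the text into lines and drop the empty ones.
        $this->lines = array_values(
            array_filter(explode("\n", $source), 'strlen')
        );
    }

    public function fetchNext()
    {
        if (count($this->lines) === 0) {
            return false;
        }
        list($id, $sequence) = explode("\t", array_shift($this->lines), 2);
        return array('id' => $id, 'sequence' => $sequence);
    }
}

// Stand-in for the seq class; the real one holds much more.
class seq_sketch
{
    public $id = '';
    public $sequence = '';
}

// Stand-in for seqFactory: knows the array keys and the seq object only.
class seqFactory_sketch
{
    public function createSeq(array $data)
    {
        $seq = new seq_sketch();
        $seq->id = isset($data['id']) ? $data['id'] : '';
        $seq->sequence = isset($data['sequence']) ? $data['sequence'] : '';
        return $seq;
    }
}

// Stand-in for the go-between: knows the parser's two entry points only.
function import_all($filetype, $source, $asObjects = false)
{
    // In the framework, the parser file would be loaded first
    // (parsers/(filetype).inc.php); here the class is already defined.
    $class = $filetype . '_import';
    $parser = new $class($source);
    $factory = new seqFactory_sketch();
    $results = array();
    while (($record = $parser->fetchNext()) !== false) {
        $results[] = $asObjects ? $factory->createSeq($record) : $record;
    }
    return $results;
}

$seqs = import_all('toy', "s1\tACGT\ns2\tGGCC\n", true);
echo count($seqs), ' sequences, first id: ', $seqs[0]->id, "\n";
```

Swapping toy_import for another parser changes nothing in import_all() or
in the factory, which is the whole point of the design.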

It's hoped that this will make it very easy for people to pop in and
contribute (in this case) import modules, since you don't need to 'learn' the
rest of the modules to do so.

Does any of this help?....

P.S. to answer your SPECIFIC question - $flines is the data read from the
source passed to the swissprot parser - the swissprot parser has no
knowledge at all of the existence of the seqIOimport module that loads
it (and, indeed, might conceivably be called directly in a script rather
than through the seqIOimport 'wrapper').  I note that the version of
the parser that I'm looking at reads:

while ( list($no, $linestr) = each($sourcelines) ) {

so you probably do have a slightly older version.
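
On the array-versus-file-handle part of the question, here is a small
sketch of the general idea, assuming the parser turns whatever source it is
given into an array of lines before looping over it.  The helper name
source_to_lines() and the sample data (including the accession number) are
made up; this is not the swissprot parser's actual code.  Note also that
each() was deprecated in PHP 7.2 and removed in PHP 8.0, so on a current
PHP the same loop is written with foreach.

```php
<?php
// Hypothetical helper: turn an array or an open file handle into lines.
function source_to_lines($source)
{
    if (is_array($source)) {
        return $source; // already read into memory
    }
    $lines = array();
    // fgets() returns false at end of file, which ends the loop.
    while (($line = fgets($source)) !== false) {
        $lines[] = rtrim($line, "\r\n");
    }
    return $lines;
}

// An in-memory stream stands in for a real file handle.
$fp = fopen('php://memory', 'r+');
fwrite($fp, "ID   TEST_HUMAN\nAC   P00000;\n"); // made-up sample data
rewind($fp);

$flines = source_to_lines($fp);
fclose($fp);

// Equivalent of: while ( list($no, $linestr) = each($flines) ) { ... }
foreach ($flines as $no => $linestr) {
    echo $no, ': ', $linestr, "\n";
}
```

Either way, the loop itself only ever sees an array; a file handle has to
be read into one first.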

I've probably mangled this whole explanation, so please feel free
to ask me what the heck I mean :-)

Sean

On Friday 19 March 2004 03:24 am, Frederic.Fleche@aventis.com wrote:
> Hello all,
>
> I am planning to do a locuslink-file parser.
> So I read the swissprot parser in order to get some good ideas.
> Since my knowledge in php is not as good as yours I have a newbie question
> concerning the following line of the function parse_swissprot
>
> while (list($no, $linestr) = each($flines))
>
> if $flines is from $seqIOimport->flines, I understand cause it is an array
>
> if $flines is from $seqIOimport->fp, I don't understand cause it is a file
> handle or does it work in the same way ?