[Biophp-dev] export/write object

S Clark biophp-dev@bioinformatics.org
Wed, 14 May 2003 10:59:02 -0600


On Sunday 11 May 2003 11:52 am, nicos@itsa.ucsf.edu wrote:
[...]
>
> Clustalw.  Have a look at the code in cvs.

Which, incidentally, I'd accidentally broken on a previous
commit - I fixed it again yesterday.  Should be working now.

I also committed the initial implementation of the "handle common
synonyms" layer (_convertTerms()) for translating common
terms to the terms used in the seq object.

> Actually, I do think that names and naming conventions are going to be
> important in the long run.  How well we choose the names, naming
> conventions and how well we stick to them will determine how easy biophp
> can be used.

Some combination of "write" (or "export") and "seq" seems appropriate to me
for this particular section.  I kind of like "export" just because
it doesn't imply that the destination of the data going out is 
a file (or printout :-) ), but that's pure semantic niggling and
doesn't really matter...

> But first a structure for the IOwrite class. I would go for a constructor
> that takes an argument specifying the type of output desired (string,
> array, file, filehandle?, or simply always return a string?), and the
> type of sequence file desired (fasta, swissprot, genbank, etc.).  There
> should be a IO->write->add($seq) function that calls seq_factory, which
> should translate the items of object $seq in items that can be directly
> incorporated in the output.  The actual 'write' methods could almost be
> just a template where php's variable interpolation can do the work.

Hmmm, how's this:

1) Add a "getAsArray()" method to the seq object, which returns an
array containing all of the 'set' attributes and their values (key=attribute
["sequence","id", etc.], value=value of that attribute).  This
will also serve as a "wrapper" for all of the other interface methods
at once (i.e. so the user doesn't have to do "getId(); getSequence();"
(etc...) if they want all of the seq object's data.)

2) The IOwrite (or IOWriteSeq?) should include methods to set the destination
(as you describe above - string, array, file, handle...) and type.  (This way
the user can use the same instance of the writer object to produce multiple
files if desired.)

3) The IOwrite object can have a "stack" where the extracted attributes get
stored as "generic arrays" (this way someone can write a file converter
[e.g. genbank to fasta, or clustal to phylip] without the extra baggage of
creating seq objects [which are only going to be read back out of and
destroyed anyway in that case] - the 'fetchRawRecord()' method of the Parse
object is for this sort of thing).

4) If given (to an "add()" method) an "array" of attributes, IOWrite
just shoves them on the stack. If passed a seq object, IOWrite calls its
"getAsArray()" method and shoves the results of that on the stack.  (The
"stack" is necessary when export is to interleaved file formats).  We MIGHT
include a "write()" (or some similar name) method to allow bypassing
the "stack" and writing immediately for non-interleaved formats (returns false
if called while set to an interleaved format).

5) Perhaps I should move the "translation" layer back out of seq_factory
and into a separate class.  The "Translate" class wouldn't need to
be instantiated, but it would make a variety of minor "correction" functions
available everywhere as, e.g. "Translate::toSeq()".  More an "ease of re-use"
issue than anything technical, though.  There's no reason I can't make
"_convertTerms()" into a public method and have people call it from
outside as "seq_factory::convertTerms();"
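
A rough sketch of how points 1 to 4 might fit together.  Everything here
is hypothetical (modern PHP syntax, made-up method bodies, FASTA only,
and "export()" as a stand-in name for whatever flushes the stack); it is
not the code in cvs:

```php
<?php
// Hypothetical sketch only: not the actual BioPHP classes.

class Seq
{
    public $id;
    public $sequence;
    public $moltype;

    // Return every attribute that has been set, keyed by attribute name.
    public function getAsArray()
    {
        return array_filter(get_object_vars($this), function ($value) {
            return $value !== null;
        });
    }
}

class IOWriteSeq
{
    private $format = 'fasta';
    private $stack = array();

    public function setFormat($format)
    {
        $this->format = $format;
    }

    // Accept either a plain attribute array or a seq object.
    public function add($record)
    {
        if (is_object($record)) {
            $record = $record->getAsArray();
        }
        $this->stack[] = $record;
    }

    // Render everything on the stack as one string, then empty the stack.
    public function export()
    {
        if ($this->format !== 'fasta') {
            return false; // only FASTA is sketched here
        }
        $out = '';
        foreach ($this->stack as $record) {
            $out .= '>' . $record['id'] . "\n"
                  . wordwrap($record['sequence'], 60, "\n", true) . "\n";
        }
        $this->stack = array();
        return $out;
    }
}

// Usage: one seq object and one "generic array" go through the same writer.
$seq = new Seq();
$seq->id = 'test1';
$seq->sequence = 'ACGT';

$writer = new IOWriteSeq();
$writer->add($seq);
$writer->add(array('id' => 'test2', 'sequence' => 'GGCC'));
echo $writer->export();
```

Returning a string from export() keeps the "where does it go" decision
(file, handle, browser) out of the formatting code, which is one way to
read the "export" vs. "write" naming question.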

If I DID make a separate "Translate" class to be used like this, it might
also include things like "Translate::NCBIDeflineExtract($field)" which
one could use to get, e.g., just the accession number out of an NCBI Defline.
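
As a sketch of what such a static "Translate" class might look like:
the synonym table is invented for illustration, the extra $defline
argument to NCBIDeflineExtract() is my assumption (the call above only
shows $field), and the pipe-delimited "gi|number|db|accession|..."
layout is assumed for the defline:

```php
<?php
// Hypothetical sketch; the synonym table and method bodies are assumptions.

class Translate
{
    // Map common synonyms onto the attribute names used in the seq object.
    public static function toSeq(array $record)
    {
        $synonyms = array(
            'seq'       => 'sequence',
            'accession' => 'id',
            'type'      => 'moltype',
        );
        $out = array();
        foreach ($record as $key => $value) {
            $key = strtolower($key);
            if (isset($synonyms[$key])) {
                $key = $synonyms[$key];
            }
            $out[$key] = $value;
        }
        return $out;
    }

    // Pull one field out of a defline such as "gi|129295|sp|P01013|...".
    // Returns false if the field name is unknown or the field is missing.
    public static function NCBIDeflineExtract($defline, $field)
    {
        $positions = array('gi' => 1, 'accession' => 3);
        if (!isset($positions[$field])) {
            return false;
        }
        $parts = explode('|', ltrim($defline, '>'));
        if (!isset($parts[$positions[$field]])) {
            return false;
        }
        return trim($parts[$positions[$field]]);
    }
}
```

Because both methods are static, they can be called from anywhere
without creating an object first.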

It might also be worth the trouble to move a lot of the "common" functions
that are currently in the class files but not part of the classes (e.g.
the "complement()" function in seq.inc.php) to somewhere they can be accessed
by other objects (or so the file can be used on its own by other projects).
(I think doing that will also make the actual seq objects [and others] take
up fewer resources, since there'll only be one copy of the "common" methods
rather than a copy in each instance of the classes.)
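
For instance, a shared helper file might hold something like the
following.  The class name "SeqUtil" is hypothetical and this body is
only a guess at what "complement()" does (plain DNA, no ambiguity
codes), not the version in seq.inc.php:

```php
<?php
// Hypothetical sketch: a shared, stateless helper usable by any class.

class SeqUtil
{
    // Complement a DNA string base by base (A<->T, C<->G), keeping case.
    public static function complement($sequence)
    {
        return strtr($sequence, 'ACGTacgt', 'TGCAtgca');
    }
}
```

Anything that needs it can then call SeqUtil::complement($seq->sequence)
without pulling in the seq class itself.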

I'd strongly advocate getting interface methods implemented in the seq object
soon - as I read up on Object Oriented design I keep seeing it said that
you're "supposed to" use them instead of setting variables directly (even for
public variables, it would seem), and I'm beginning to see why - when you have
people using an interface method to set variables, you can do things like
validity checking, error correction, and transparently handling internal
changes (e.g. changing variable names [e.g. to meet PEAR standards on naming],
"splitting" variables, moving variables into an array for easier handling,
etc.) without breaking other objects, etc.

For example, right now everyone is expected to set $seq->sequence and
$seq->moltype directly, which means I can easily accidentally do:
$seq->sequence='ZXKUQYB'; $seq->moltype='DNA';

whereas if people are able to use a "setSequence()" method, we can add
auto-detection of the type whenever the sequence is set (and "setMolType()"
can check the existing sequence to see if it's valid for that type...)
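
A minimal sketch of that idea, assuming a deliberately crude detection
rule (letters only, first matching alphabet wins).  The class name, the
alphabets and the return values are all hypothetical:

```php
<?php
// Hypothetical sketch; the detection rule is a crude assumption.

class SeqWithSetters
{
    private $sequence = '';
    private $moltype = '';

    // Allowed letters per molecule type, checked in this order.
    private static $alphabets = array(
        'DNA'     => '/^[ACGTN]+$/',
        'RNA'     => '/^[ACGUN]+$/',
        'PROTEIN' => '/^[A-Z*]+$/',
    );

    // Store the sequence and auto-detect its type; false if nothing fits.
    public function setSequence($sequence)
    {
        $sequence = strtoupper(trim($sequence));
        foreach (self::$alphabets as $type => $pattern) {
            if (preg_match($pattern, $sequence)) {
                $this->sequence = $sequence;
                $this->moltype = $type;
                return true;
            }
        }
        return false;
    }

    // Refuse a type that the sequence already stored cannot belong to.
    public function setMolType($type)
    {
        $type = strtoupper($type);
        if (!isset(self::$alphabets[$type])) {
            return false;
        }
        if ($this->sequence !== ''
            && !preg_match(self::$alphabets[$type], $this->sequence)) {
            return false;
        }
        $this->moltype = $type;
        return true;
    }

    public function getSequence()
    {
        return $this->sequence;
    }

    public function getMolType()
    {
        return $this->moltype;
    }
}
```

With this in place, setSequence('ZXKUQYB') followed by setMolType('DNA')
fails on the second call instead of silently storing a contradiction.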

I was thinking about editing my old sequence class to make it "seq compatible"
and dropping it in as "alt_seq.inc.php", where we can compare them
side-by-side and merge the useful features of each.  Thoughts?

I'm thinking I should quit stalling and get back to finishing the NCBI
Blast query handler first, though...