[Pipet Devel] converters (was Re: SYNERGY)

Sun Oct 17 00:24:37 EDT 1999

Humberto Ortiz Zuazaga wrote:
> 
> Jeff, you've got this exactly backwards. We need an internal format, we
> decided it would be XML based, perhaps extended BSML. Converters should be
> written from any format to ours and from ours to any format, otherwise we get to
> write a converter for every pair of formats we support.
> 
> Example, imagine we want to support 4 file formats:
> 
> genbank - internal
> pdb - internal
> fasta - internal
> bsml - internal

On the other hand, requiring an internal format would force Loci to do
2 conversions even when the input and output formats are the same:

genbank - internal - genbank
pdb - internal - pdb
fasta - internal - fasta
bsml - internal - bsml

where NONE would be needed without an internal format:

genbank - genbank (not needed)
pdb - pdb         (not needed)
fasta - fasta     (not needed)
bsml - bsml       (not needed)

> vs converters between the same 4 formats:
> 
> genbank - pdb
> genbank - fasta
> genbank - bsml
> pdb - fasta
> pdb - bsml
> fasta - bsml
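
To put numbers on the two lists above, here is a minimal Python sketch. It
assumes each "a - b" line stands for one two-way converter, as the lists do.
The names hub_links and direct_links are made up for illustration and are not
part of Loci.

```python
from itertools import combinations

FORMATS = ["genbank", "pdb", "fasta", "bsml"]


def hub_links(formats):
    """One two-way converter per format, each paired with the shared format."""
    return [(fmt, "internal") for fmt in formats]


def direct_links(formats):
    """One two-way converter for every unordered pair of formats."""
    return list(combinations(formats, 2))


if __name__ == "__main__":
    for a, b in direct_links(FORMATS):
        print(f"{a} - {b}")
    # 4 formats: 4 hub converters versus 6 direct ones.
    print(len(hub_links(FORMATS)), len(direct_links(FORMATS)))
    # The gap widens quickly: with 10 formats it is 10 versus 45.
    ten = [f"format{i}" for i in range(10)]
    print(len(hub_links(ten)), len(direct_links(ten)))
```

The hub count grows linearly with the number of formats, while the direct
count grows as n(n-1)/2.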

Maybe if we use a temporary intermediate format, one that exists only during
conversion, plugging 2 formats together would be much simpler.  From my
experience writing converters, some intermediate format is usually needed anyway.

So, each converter is built by connecting 2 parts together:

genbank - internal  <---> internal - pdb
pdb - internal      <---> internal - bsml
fasta - internal    <---> internal - genbank
bsml - internal     <---> internal - fasta
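
The plugging-together described above could be sketched roughly like this in
Python. Everything here is an assumption made for illustration: the
intermediate is just a list of (name, sequence) pairs held in memory, "tab" is
a made-up one-line-per-record format, and READERS, WRITERS and make_converter
are hypothetical names, not Loci code.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical in-memory intermediate: a list of (name, sequence) pairs.
Records = List[Tuple[str, str]]


def read_fasta(text: str) -> Records:
    """FASTA text -> intermediate."""
    records: Records = []
    name = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            header = line[1:].split()
            name = header[0] if header else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def write_fasta(records: Records) -> str:
    """Intermediate -> FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)


def read_tab(text: str) -> Records:
    """Made-up 'name<TAB>sequence' text -> intermediate."""
    records: Records = []
    for line in text.splitlines():
        if line.strip():
            name, seq = line.split("\t", 1)
            records.append((name, seq))
    return records


def write_tab(records: Records) -> str:
    """Intermediate -> made-up 'name<TAB>sequence' text."""
    return "".join(f"{name}\t{seq}\n" for name, seq in records)


READERS: Dict[str, Callable[[str], Records]] = {"fasta": read_fasta, "tab": read_tab}
WRITERS: Dict[str, Callable[[Records], str]] = {"fasta": write_fasta, "tab": write_tab}


def make_converter(src: str, dst: str) -> Callable[[str], str]:
    """Build a converter by joining a reader half to a writer half."""
    reader, writer = READERS[src], WRITERS[dst]
    return lambda text: writer(reader(text))


if __name__ == "__main__":
    fasta_to_tab = make_converter("fasta", "tab")
    print(fasta_to_tab(">seq1 example\nACGT\nTTAA\n"), end="")
```

Supporting a new format then means writing one reader half and one writer
half, and the intermediate never touches the disk.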

The problem is, yeah, we're still doing 2 conversions.  But if the 'internal'
format is held in memory rather than written out as a file format (such as XML),
it should be quicker and require less disk space.

> What we had decided is that we can defer defining our file formats until we
> actually have any loci that use them, and that we can have many small
> languages instead of a big language that tries to capture all possible data
> types.

All I am against is just that: a big language that tries to capture all possible
types, therefore needing to be redefined each time we add a new file format to
Loci, and requiring file system reads/writes each time a conversion is done.

We have to ask ourselves this when thinking about the conversion process:

  How is Loci going to handle data from the Genome Projects, where an
  annotated file may be gigabytes to terabytes in size???
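
One possible way to approach that question, sketched in Python under some
loud assumptions: the input is FASTA, the output is a made-up
"name<TAB>sequence" format, and iter_fasta and stream_fasta_to_tab are
hypothetical names. The idea is only that records pass through one at a time
instead of the whole file being loaded or rewritten into a second file first.

```python
import io
from typing import Iterable, Iterator, List, TextIO, Tuple


def iter_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs one record at a time."""
    name = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name = header[0] if header else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def stream_fasta_to_tab(src: Iterable[str], dst: TextIO) -> int:
    """Convert record by record; return the number of records written."""
    count = 0
    for name, seq in iter_fasta(src):
        dst.write(f"{name}\t{seq}\n")
        count += 1
    return count


if __name__ == "__main__":
    source = io.StringIO(">a\nAC\nGT\n>b\nTT\n")
    target = io.StringIO()
    print(stream_fasta_to_tab(source, target))
    print(target.getvalue(), end="")
```

This only bounds memory by the largest single record, so one
chromosome-sized entry would still have to fit in memory; splitting within a
record would need more work than this sketch shows.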

> So we'll have an internal format for nucleotide sequences, one for amino acid
> sequences, one for multi sequence objects, one for sequence annotations, one
> for bibliographic references, ...

I'd agree to having an internal format that is really many smaller specialized
formats, if you'd agree that they are used to BUILD converters, for QUICK
conversions, only AS NEEDED, like I wrote above.
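
As a rough illustration of "many smaller specialized formats" used only as
needed, here is a Python sketch. The class and field names are invented for
this example and are not a proposed Loci schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NucleotideSequence:
    identifier: str
    residues: str


@dataclass
class Annotation:
    start: int
    end: int
    label: str


@dataclass
class Reference:
    title: str
    authors: List[str] = field(default_factory=list)


@dataclass
class AnnotatedEntry:
    """Small pieces combined only when a converter needs them together."""
    sequence: NucleotideSequence
    annotations: List[Annotation] = field(default_factory=list)
    references: List[Reference] = field(default_factory=list)


def to_fasta(entry: AnnotatedEntry) -> str:
    # FASTA has no place for annotations or references, so this
    # converter only ever touches the sequence piece.
    return f">{entry.sequence.identifier}\n{entry.sequence.residues}\n"


if __name__ == "__main__":
    entry = AnnotatedEntry(
        NucleotideSequence("seq1", "ACGT"),
        annotations=[Annotation(0, 4, "gene")],
    )
    print(to_fasta(entry), end="")
```

Adding a new small type (say, one for structures) would not change the
existing ones, which is the opposite of redefining one big language for every
new file format.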

Again...

<no-no>
A big language that tries to capture all possible types, therefore needing to be
redefined each time we add a new file format to Loci, and requiring file system
reads/writes each time a conversion is done
</no-no>

Cheers.
Jeff
-- 
                         +----------------------------+
                         |        J.W. Bizzaro        |
                         | jeff at bioinformatics.org |
                         |                            |
                         |        THE OPEN LAB        |
                         | Open Source Bioinformatics |
                         |                            |
                         | http://bioinformatics.org/ |
                         +----------------------------+