Humberto Ortiz Zuazaga wrote:
>
> Jeff, you've got this exactly backwards. We need an internal format, we
> decided it would be xml based, perhaps extended BSML. Converters should be
> written from any format to ours and from ours to any format, otherwise we
> get to write a converter for every pair of formats we support.
>
> Example, imagine we want to support 4 file formats:
>
> genbank - internal
> pdb - internal
> fasta - internal
> bsml - internal

On the other hand, the requirement of an internal format would force Loci to
do 2 conversions:

genbank - internal - genbank
pdb - internal - pdb
fasta - internal - fasta
bsml - internal - bsml

where NONE would be needed without an internal format:

genbank - genbank (not needed)
pdb - pdb (not needed)
fasta - fasta (not needed)
bsml - bsml (not needed)

> vs converters between the same 4 formats:
>
> genbank - pdb
> genbank - fasta
> genbank - bsml
> pdb - fasta
> pdb - bsml
> fasta - bsml

Maybe if we use a temporary, intermediate format used only during conversion,
it would be much simpler to plug 2 formats together. From my experience
writing converters, some intermediate format is usually needed anyway. So,
each converter is built by connecting 2 parts together:

genbank - internal <---> internal - pdb
pdb - internal <---> internal - bsml
fasta - internal <---> internal - genbank
bsml - internal <---> internal - fasta

The problem is, yeah, we're still doing 2 conversions. But if the 'internal'
format is not a file format (such as XML), it should be quicker and require
less disk space.

> What we had decided is that we can defer defining our file formats until we
> actually have any loci that use them, and that we can have many small
> languages instead of a big language that tries to capture all possible data
> types.
All I am against is just that: a big language that tries to capture all
possible types, therefore needing to be redefined each time we add a new file
format to Loci, and requiring file system reads/writes each time a conversion
is done.

We have to ask ourselves this when thinking about the conversion process: How
is Loci going to handle data from the Genome Projects, where an annotated file
may be gigabytes to terabytes in size???

> So we'll have an internal format for nucleotide sequences, one for amino acid
> sequences, one for multi sequence objects, one for sequence annotations, one
> for bibliographic references, ...

I'd agree to having an internal format that is really many smaller specialized
formats, if you'd agree that they are used to BUILD converters, for QUICK
conversions, only AS NEEDED, like I wrote above.

Again...

<no-no>
A big language that tries to capture all possible types, therefore needing to
be redefined each time we add a new file format to Loci, and requiring file
system reads/writes each time a conversion is done
</no-no>

Cheers.
Jeff

--
+----------------------------+
| J.W. Bizzaro               |
| jeff at bioinformatics.org |
|                            |
| THE OPEN LAB               |
| Open Source Bioinformatics |
|                            |
| http://bioinformatics.org/ |
+----------------------------+
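
A minimal sketch of the two-part converter idea discussed above, in Python.
Nothing here comes from Loci: the dict-based record, the function names and
the "tab" output format are all hypothetical stand-ins. The point it tries to
illustrate is that the intermediate form can live only in memory, one record
at a time, so no intermediate file is written and a large input never has to
be loaded whole.

```python
import io
from typing import Dict, Iterable, Iterator, TextIO

# ASSUMPTION: a plain dict stands in for the small, specialized in-memory
# intermediate format. It is never written to disk.
Record = Dict[str, str]


def read_fasta(handle: TextIO) -> Iterator[Record]:
    """Front half (fasta - internal): yield one record at a time."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield {"id": name, "seq": "".join(chunks)}
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield {"id": name, "seq": "".join(chunks)}


def write_tab(records: Iterable[Record], handle: TextIO) -> None:
    """Back half (internal - tab): hypothetical tab-separated target."""
    for rec in records:
        handle.write(f"{rec['id']}\t{rec['seq']}\n")


# Each format contributes at most one reader and one writer.
READERS = {"fasta": read_fasta}
WRITERS = {"tab": write_tab}


def convert(src_fmt: str, dst_fmt: str, src: TextIO, dst: TextIO) -> None:
    """Plug a reader into a writer; records stream through memory."""
    if src_fmt == dst_fmt:
        # Same format in and out: no conversion needed, copy lines through.
        for line in src:
            dst.write(line)
        return
    WRITERS[dst_fmt](READERS[src_fmt](src), dst)


if __name__ == "__main__":
    source = io.StringIO(">seq1\nACGT\nTTGA\n>seq2\nGGCC\n")
    target = io.StringIO()
    convert("fasta", "tab", source, target)
    print(target.getvalue(), end="")
```

Because the reader is a generator, the writer pulls records as it needs them,
which is one way the "only AS NEEDED" conversion could look in practice.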