[Pipet Devel] converters (was Re: SYNERGY)

Mon Oct 18 00:41:25 EDT 1999

> > Example, imagine we want to support 4 file formats:
> >
> > genbank - internal
> > pdb - internal
> > fasta - internal
> > bsml - internal
> 
> On the other hand, the requirement of an internal format would force Loci to do
> 2 conversions:
> 
> genbank - internal - genbank
> pdb - internal - pdb
> fasta - internal - fasta
> bsml - internal - bsml

Like Humberto mentioned recently, pbmtools automatically converts
everything to an internal format, then reparses it out to the desired
format.  NCBI does the same with their all-encompassing ASN.1 format.
So, in the data-independent Loci model, how would Loci's internal format
be implemented as a plug-in? Would the plug-in developer be responsible
for creating a locus that converts all incoming data to an internal
format, like so:

______
|Data|<----idl/api<----Request document
|Base|
|    |
______----->idl/api---->document--->conversion--->processing-->result--->storage
                                    to "loci" internal                   (database, file, etc.)
                                    format, then parse
                                    to required format

If this is the model, it seems to me that a great deal of work would
fall on the plug-in developers, and the Loci framework itself would be
quite minimal (which is not a bad thing).
This raises the question: how will the loci plug in to the Loci
architecture? What would Loci be, at this data-independent core?

I'm not an expert on network-object models, or data object models, or
databases for that matter, so these issues frighten and confuse me.  I'm
beginning to write up the Loci white-pages, so some enlightenment on
these issues would go a long way to help me write intelligible stuff! I
have read a bit about AppLab (a Java-based command-line application
wrapper that runs through CORBA). AppLab is very similar to our design,
although it is bioinfo-centric, as is NetGenics SYNERGY. How do we
decouple the nature of the data from the data-framework itself?

> genbank - genbank (not needed)
> pdb - pdb         (not needed)
> fasta - fasta     (not needed)
> bsml - bsml       (not needed)

Interesting.  What scenario do you envision for this data 'passthru'
scheme? A genbank doc could be connected to a genbank-readable
processor/widget/whatever without the need to pass through a converter,
so no conversion time or resources are wasted. Similarly, a
converter could be constructed to recognize that internal format
conversion is not necessary and simply relay the data (i.e. in dynamic
situations where the format of the incoming data, or the requirements of
the receiving locus, are not known in advance).  Comments anyone?
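
To make the 'passthru' idea concrete, here is a rough sketch of a
converter that relays data untouched when source and destination formats
match, and otherwise goes through an internal representation. Everything
in it is an assumption for illustration only: the dict-based internal
format, the made-up "tab" format, and the names PARSERS, WRITERS and
convert() are not part of Loci.

```python
from typing import Callable, Dict

# Hypothetical internal representation: a plain dict with "id" and "seq".
Internal = Dict[str, str]


def parse_fasta(text: str) -> Internal:
    """Parse a single-record FASTA string into the internal dict."""
    lines = text.strip().splitlines()
    return {"id": lines[0][1:].strip(),
            "seq": "".join(line.strip() for line in lines[1:])}


def write_fasta(rec: Internal) -> str:
    """Write the internal dict back out as single-record FASTA."""
    return f">{rec['id']}\n{rec['seq']}\n"


def parse_tab(text: str) -> Internal:
    """Parse a made-up one-line 'id<TAB>sequence' format."""
    ident, seq = text.strip().split("\t")
    return {"id": ident, "seq": seq}


def write_tab(rec: Internal) -> str:
    """Write the internal dict as 'id<TAB>sequence'."""
    return f"{rec['id']}\t{rec['seq']}\n"


# One parser and one writer per format: 2N pieces for N formats,
# instead of one hand-written converter per pair of formats.
PARSERS: Dict[str, Callable[[str], Internal]] = {
    "fasta": parse_fasta, "tab": parse_tab}
WRITERS: Dict[str, Callable[[Internal], str]] = {
    "fasta": write_fasta, "tab": write_tab}


def convert(data: str, src: str, dst: str) -> str:
    """Convert between formats, relaying the data when none is needed."""
    if src == dst:
        return data  # passthru: no parse, no rewrite
    return WRITERS[dst](PARSERS[src](data))
```

With this layout the decision to skip conversion is made at run time, so
it would also cover the dynamic case where the incoming format is only
known when the data arrives.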

> 
> > vs converters between the same 4 formats:
> >
> > genbank - pdb
> > genbank - fasta
> > genbank - bsml
> > pdb - fasta
> > pdb - bsml
> > fasta - bsml
> 
> Maybe if we use a temporary, intermediate format used only during conversion, it
> would be much simpler plugging 2 formats together.  From my experience writing
> converters, some intermediate format is usually needed anyway.
> 
> So, each converter is built by connecting 2 parts together:
> 
> genbank - internal  <---> internal - pdb
> pdb - internal      <---> internal - bsml
> fasta - internal    <---> internal - genbank
> bsml - internal     <---> internal - fasta
> 
> The problem is, yeah, we're still doing 2 conversions.  But if the 'internal'
> format is not a file format (such as XML), it should be quicker and require less
> disk space.

What are you thinking? I recall from the bioobjects project (from the
bioinformatics journal .pdf that got circulated a while back) that the
biosequence data is abstracted into its basic types: raw sequence data,
internal id, Locus or Accession number, references (including
bibliographic, organism, etc), x-refs to other databases, and feature
information. These data structures are then assembled into an object and
stored in an object database for access via CORBA....
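
As a rough illustration of that kind of abstraction, the basic types
listed above could be assembled into one in-memory object along these
lines. The class and field names are my own guesses for the sake of the
example, not the actual bioobjects schema, and there is no CORBA or
object database here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Feature:
    """One annotated region; coordinates are 0-based, end-exclusive."""
    kind: str
    start: int
    end: int


@dataclass
class SequenceRecord:
    """Hypothetical object built from the basic types listed above."""
    sequence: str                  # raw sequence data
    internal_id: int               # internal id
    accession: str                 # Locus or Accession number
    references: List[str] = field(default_factory=list)  # bibliographic, organism, ...
    xrefs: List[str] = field(default_factory=list)       # x-refs to other databases
    features: List[Feature] = field(default_factory=list)

    def feature_sequence(self, feature: Feature) -> str:
        """Return the stretch of raw sequence covered by a feature."""
        return self.sequence[feature.start:feature.end]
```

An object like this is not a file format, so a converter could hand it
from its reading half to its writing half without touching the disk.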
> 
> > What we had decided is that we can defer defining our file formats until we
> > actually have any loci that use them, and that we can have many small
> > languages instead of a big language that tries to capture all possible data
> > types.
> 
> All I am against is just that: a big language that tries to capture all possible
> types, therefore needing to be redefined each time we add a new file format to
> Loci, and requiring file system reads/writes each time a conversion is done.
> 
> We have to ask ourselves this when thinking about the conversion process:
> 
>   How is Loci going to handle data from the Genome Projects, where an
>   annotated file may be gigabytes to terabytes in size???

I'm stymied on this one... ;-)
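
One partial idea, offered only as a sketch and not as something anyone
in this thread has proposed: if a converter reads its input one record
at a time instead of loading the whole file, memory use is bounded by
the largest single record. The iter_fasta() name is made up for this
example, and it does not help when one record is itself enormous.

```python
import io
from typing import Iterator, List, Optional, TextIO, Tuple


def iter_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Lazily yield (header, sequence) pairs from a FASTA stream."""
    header: Optional[str] = None
    chunks: List[str] = []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    # Emit the final record once the stream is exhausted.
    if header is not None:
        yield header, "".join(chunks)


# Usage: only one record is held in memory at a time.
for name, seq in iter_fasta(io.StringIO(">a\nAC\nGT\n>b\nTT\n")):
    print(name, len(seq))
```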

> 
> > So we'll have an internal format for nucleotide sequences, one for amino acid
> > sequences, one for multi sequence objects, one for sequence annotations, one
> > for bibliographic references, ...
> 
> I'd agree to having an internal format that is really many smaller specialized
> formats, if you'd agree that they are used to BUILD converters, for QUICK
> conversions, only AS NEEDED, like I wrote above.

--gary 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Gary Van Domselaar		gvd at redpoll.pharmacy.ualberta.ca
Faculty of Pharmacy 		Phone: (780) 492-4493
University of Alberta		FAX:   (780) 492-5305
Edmonton, Alberta, Canada       http://redpoll.pharmacy.ualberta.ca/~gvd