[Pipet Devel] converters (was Re: SYNERGY)

Wed Oct 27 19:59:19 EDT 1999

It's about time I got back to this :-)

Gary Van Domselaar wrote:
>
> Like Humberto mentioned recently, pbmtools automatically converts
> everything to an internal format, then reparses it out to the desired
> format.  NCBI does the same with their all-encompassing ASN.1 format.
> So, in the data-independent Loci model, how would Loci's internal format
> be implemented as a plug-in?

Yes.  Although an 'internal plug-in' is an oxymoron :-)

I want (and I think everyone else here wants) as many aspects of Loci as
possible to be plugged-in or modular.  So instead of an 'internal format', I
would call it a 'neutral conversion format'.  And like any other data format, it
is only present in Loci as long as the locuses are there to work with it.

> Would the plug-in developer be responsible
> to create a locus that converts all incoming data to an internal format,
> like so:
> 
> ______
> |Data|<----idl/api<----Request document
> |Base|
> |    |
> ______----->idl/api---->document--->conversion--------->processing-->result---storage(database,
>                                    to "loci" internal                         (file, etc).
>                                    format-then parse
>                                    to required format

If the plug-in/locus developer is dealing with a data format new to Loci, they
should provide SOME sort of converter.  A converter that outputs data in the
neutral conversion format would be the minimum.
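
A minimal sketch of such a converter, in Python.  Everything here is an
assumption for illustration: the function name fasta_to_neutral and the XML
element names (neutral, record) are made up, since no neutral conversion format
has been defined yet.

```python
import xml.etree.ElementTree as ET


def fasta_to_neutral(fasta_text):
    """Convert FASTA text into a hypothetical XML-based neutral format."""
    root = ET.Element("neutral")  # made-up root element
    record = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: the first word after '>' becomes the record id.
            words = line[1:].split()
            record = ET.SubElement(root, "record", id=words[0] if words else "")
            record.text = ""
        elif record is not None:
            # Sequence line: append it to the current record.
            record.text += line
    return ET.tostring(root, encoding="unicode")


print(fasta_to_neutral(">seq1 example\nACGT\nTTGA\n"))
```

A locus that only reads the neutral format could then consume this output
without knowing anything about FASTA.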

> If this is the model, it seems to me that a great deal of work would
> fall on the plug-in developers, and the Loci framework itself would be
> quite minimal (which is not a bad thing).

I think the less that Loci comes standard with, the more malleable it will be,
which is a Good Thing.

> This begs the question, how will the loci plug in to the Loci
> architecture? What would Loci be, at this data-independent core?

Locuses/loci should be able to

  (1) Communicate with the Workspace to send/receive GUI information
  (2) Communicate with a directory service or 'hub' to establish
      the CORBA connections

> I'm not an expert on network-object models, or data object models, or
> databases for that matter, so these issues frighten and confuse me.

I too am frightened and confused.

> I'm
> beginning to write up the Loci white-pages, so some enlightenment on
> these issues would go a long way to help me write intelligible stuff! I
> have read a bit about AppLab (a Java-based command-line application
> wrapper that runs through CORBA). AppLab is very similar to our design,
> although it is bioinfo-centric, as is NetGenics SYNERGY. How do we
> decouple the nature of the data from the data-framework itself?

As an example of how separate the data is from the data-framework, the data is
kept as XML, and the data-framework communicates via CORBA.  These are very
different models for data management and do not mix very well.  All the CORBA
system needs to know is that there is some text (XML) that needs to go
somewhere.

At least that's the way I see it.
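
A rough sketch of that separation, in Python.  The Hub class below is a
hypothetical stand-in for the CORBA side, not real CORBA code; the point is
only that the transport moves text around and never parses it.

```python
class Hub:
    """Hypothetical stand-in for the transport layer: routes text, never parses it."""

    def __init__(self):
        self.inboxes = {}

    def register(self, name):
        # Give each locus a queue of incoming payloads.
        self.inboxes[name] = []

    def send(self, dest, payload):
        # The only thing checked is that the payload is text.
        if not isinstance(payload, str):
            raise TypeError("payload must be text")
        self.inboxes[dest].append(payload)

    def receive(self, name):
        # Hand back the oldest payload exactly as it was sent.
        return self.inboxes[name].pop(0)


hub = Hub()
hub.register("viewer")
hub.send("viewer", "<sequence id='x1'>ACGT</sequence>")
print(hub.receive("viewer"))
```

Whether the text is well-formed XML, or XML at all, is a matter for the
receiving locus, not for the hub.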

> > genbank - genbank (not needed)
> > pdb - pdb         (not needed)
> > fasta - fasta     (not needed)
> > bsml - bsml       (not needed)
> 
> Interesting.  What scenario do you envision for this data 'passthru'
> scheme?

It's simple.  We're just connecting output to input in every case.  Data
conversion means making 2 extra connections (adding a converter).  If data
conversion is not needed, which is true in the case I mentioned above, you just
leave the converter out.  If everything must be converted to an 'internal
format', which is something I'm arguing against, then data conversion can never
be left out.
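
The same idea as a small Python sketch.  The Locus class, the connect function
and the format labels are all hypothetical names chosen for illustration; they
are not part of any existing Loci code.

```python
class Locus:
    """Hypothetical locus: a named step that accepts one format and produces one."""

    def __init__(self, name, accepts, produces):
        self.name = name
        self.accepts = accepts    # format this locus reads, or None for a source
        self.produces = produces  # format this locus writes, or None for a sink


def connect(source, sink, converters):
    """Return the chain of loci linking source to sink.

    converters maps (from_format, to_format) to a converter Locus.
    """
    if source.produces == sink.accepts:
        # Formats already match: connect output straight to input.
        return [source, sink]
    converter = converters.get((source.produces, sink.accepts))
    if converter is None:
        raise LookupError(
            "no converter from %s to %s" % (source.produces, sink.accepts)
        )
    # Formats differ: splice the converter in between.
    return [source, converter, sink]


genbank_doc = Locus("genbank document", None, "genbank")
genbank_viewer = Locus("genbank viewer", "genbank", None)
fasta_tool = Locus("fasta tool", "fasta", None)
gb2fasta = Locus("genbank-to-fasta", "genbank", "fasta")
converters = {("genbank", "fasta"): gb2fasta}

direct = connect(genbank_doc, genbank_viewer, converters)
converted = connect(genbank_doc, fasta_tool, converters)
print([locus.name for locus in direct])
print([locus.name for locus in converted])
```

With a mandatory 'internal format', the first branch of connect would not
exist and every chain would carry a conversion step.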

> A genbank doc could be connected to a genbank-readable
> processor/widget/whatever without the need of passing thru a convertor,
> therefore, no wasted conversion time or resources.

Right.

> Similarly, a
> convertor could be constructed to realize that internal format
> conversion is not necessary and simply relay the data (ie in dynamic
> situations where the format of the incoming data, or the requirements of
> the receiving locus are not known in advance).  Comments anyone?

IMO, you do not want to relay data when the format is unknown.  I don't know,
maybe it should just be saved.

> What are you thinking? I recall from the bioobjects project (from the
> bioinformatics journal .pdf that got circulated a while back)

Shhh!

> that the
> biosequence data is abstracted into its basic types: raw sequence data,
> internal id, Locus or Accession number,  references (including
> bibliographic, organism, etc), x-refs to other databases, and feature
> information. These data structures are then assembled into an object and
> stored in an object database for access via CORBA....

It's good to see someone reads the references I send them :-)

The BioObjects project is certainly not data format independent.  We could come
up with a nice XML-based, cross-referenced data format for our neutral
conversions.

> > We have to ask ourselves this when thinking about the conversion process:
> >
> >   How is Loci going to handle data from the Genome Projects, where an
> >   annotated file may be gigabytes to terabytes in size???
> 
> I'm stymied on this one... ;-)

Well, we have to consider this.  The trend in bioinformatics has been toward a
great increase in the size of documents.

Cheers.
Jeff
-- 
                         +----------------------------+
                         |        J.W. Bizzaro        |
                         |  jeff at bioinformatics.org   |
                         |                            |
                         |        THE OPEN LAB        |
                         | Open Source Bioinformatics |
                         |                            |
                         | http://bioinformatics.org/ |
                         +----------------------------+