> > Example, imagine we want to support 4 file formats:
> >
> > genbank - internal
> > pdb - internal
> > fasta - internal
> > bsml - internal
>
> On the other hand, the requirement of an internal format would force
> Loci to do 2 conversions:
>
> genbank - internal - genbank
> pdb - internal - pdb
> fasta - internal - fasta
> bsml - internal - bsml

Like Humberto mentioned recently, pbmtools automatically converts
everything to an internal format, then reparses it out to the desired
format. NCBI does the same with their all-encompassing ASN.1 format.

So, in the data-independent Loci model, how would Loci's internal format
be implemented as a plug-in? Would the plug-in developer be responsible
for creating a locus that converts all incoming data to an internal
format, like so:

 ______
|Data |<----idl/api<----request document
|Base |
|     |---->idl/api---->document--->conversion--------->processing-->result-->storage
 ------                             to "loci" internal                        (database,
                                    format, then parse                        file, etc.)
                                    to required format

If this is the model, it seems to me that a great deal of work would
fall on the plug-in developers, and the Loci framework itself would be
quite minimal (which is not a bad thing). This begs the question: how
will the loci plug in to the Loci architecture? What would Loci be, at
this data-independent core?

I'm not an expert on network-object models, or data object models, or
databases for that matter, so these issues frighten and confuse me. I'm
beginning to write up the Loci white-pages, so some enlightenment on
these issues would go a long way toward helping me write intelligible
stuff!

I have read a bit about AppLab (a Java-based command-line application
wrapper that runs through CORBA). AppLab is very similar to our design,
although it is bioinfo-centric, as is NetGenics SYNERGY. How do we
decouple the nature of the data from the data-framework itself?

> genbank - genbank (not needed)
> pdb - pdb (not needed)
> fasta - fasta (not needed)
> bsml - bsml (not needed)

Interesting.
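The "not needed" cases above could be sketched in code: a converter that
relays data untouched when the source and target formats match, and only
routes through the internal format when they differ. This is a minimal
Python sketch; the registry names and `convert` signature are my own
illustrative assumptions, not any actual Loci API.

```python
# Registries of per-format parsers (format -> internal) and writers
# (internal -> format). A plug-in would register its own entries.
parsers = {}
writers = {}

def register(fmt, parser, writer):
    parsers[fmt] = parser
    writers[fmt] = writer

def convert(data, src_fmt, dst_fmt):
    """Convert data from src_fmt to dst_fmt, skipping work when possible."""
    if src_fmt == dst_fmt:
        return data                      # passthrough: no conversion needed
    internal = parsers[src_fmt](data)    # source format -> internal
    return writers[dst_fmt](internal)    # internal -> target format

# Toy stand-ins for real formats: the "formats" are just letter case.
register("upper", str.lower, str.upper)
register("lower", lambda s: s, lambda s: s)

print(convert("ACGT", "upper", "upper"))  # passthrough -> "ACGT"
print(convert("ACGT", "upper", "lower"))  # via internal -> "acgt"
```

The same `convert` function also covers the dynamic case mentioned later
in the thread, where neither endpoint's format is known in advance.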
> What scenario do you envision for this data 'passthru' scheme?

A genbank doc could be connected to a genbank-readable
processor/widget/whatever without the need of passing through a
converter; therefore, no wasted conversion time or resources.
Similarly, a converter could be constructed to realize that internal
format conversion is not necessary and simply relay the data (i.e., in
dynamic situations where the format of the incoming data, or the
requirements of the receiving locus, are not known in advance).

Comments anyone?

> > > vs converters between the same 4 formats:
> > >
> > > genbank - pdb
> > > genbank - fasta
> > > genbank - bsml
> > > pdb - fasta
> > > pdb - bsml
> > > fasta - bsml
>
> Maybe if we use a temporary, intermediate format used only during
> conversion, it would be much simpler plugging 2 formats together. From
> my experience writing converters, some intermediate format is usually
> needed anyway.
>
> So, each converter is built by connecting 2 parts together:
>
> genbank - internal <---> internal - pdb
> pdb - internal     <---> internal - bsml
> fasta - internal   <---> internal - genbank
> bsml - internal    <---> internal - fasta
>
> The problem is, yeah, we're still doing 2 conversions. But if the
> 'internal' format is not a file format (such as XML), it should be
> quicker and require less disk space. What are you thinking?

I recall from the bioobjects project (from the bioinformatics journal
.pdf that got circulated a while back) that the biosequence data is
abstracted into its basic types: raw sequence data, internal id, Locus
or Accession number, references (including bibliographic, organism,
etc.), x-refs to other databases, and feature information. These data
structures are then assembled into an object and stored in an object
database for access via CORBA....
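The bioobjects-style decomposition described above could look roughly
like this in Python. The field names and the `to_fasta` helper are my
own illustrative guesses, not the actual bioobjects schema; the point is
only that a converter built from such typed parts never touches the
fields it doesn't need.

```python
# A biosequence broken into its basic typed parts, then assembled into
# one object (which could live in an object database behind CORBA).
from dataclasses import dataclass, field

@dataclass
class BioSequence:
    raw_sequence: str                 # the sequence data itself
    internal_id: int                  # database-internal identifier
    accession: str                    # Locus or Accession number
    references: list = field(default_factory=list)  # bibliographic, organism, ...
    xrefs: dict = field(default_factory=dict)       # cross-refs to other databases
    features: list = field(default_factory=list)    # feature table entries

# A hypothetical internal -> fasta writer needs only two of the parts:
def to_fasta(seq: BioSequence) -> str:
    return ">%s\n%s\n" % (seq.accession, seq.raw_sequence)

seq = BioSequence(raw_sequence="ACGT", internal_id=1, accession="U00001")
print(to_fasta(seq))  # -> ">U00001\nACGT\n"
```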
> > > What we had decided is that we can defer defining our file formats
> > > until we actually have any loci that use them, and that we can have
> > > many small languages instead of a big language that tries to
> > > capture all possible data types.
>
> All I am against is just that: a big language that tries to capture all
> possible types, therefore needing to be redefined each time we add a
> new file format to Loci, and requiring file system reads/writes each
> time a conversion is done.
>
> We have to ask ourselves this when thinking about the conversion
> process:
>
> How is Loci going to handle data from the Genome Projects, where an
> annotated file may be gigabytes to terabytes in size???

I'm stymied on this one... ;-)

> > So we'll have an internal format for nucleotide sequences, one for
> > amino acid sequences, one for multi sequence objects, one for
> > sequence annotations, one for bibliographic references, ...
>
> I'd agree to having an internal format that is really many smaller
> specialized formats, if you'd agree that they are used to BUILD
> converters, for QUICK conversions, only AS NEEDED, like I wrote above.

--gary

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Gary Van Domselaar            gvd at redpoll.pharmacy.ualberta.ca
Faculty of Pharmacy           Phone: (780) 492-4493
University of Alberta         FAX:   (780) 492-5305
Edmonton, Alberta, Canada     http://redpoll.pharmacy.ualberta.ca/~gvd
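On the "gigabytes to terabytes" question raised in the thread, one
possible answer is to convert record by record as a stream, so the whole
annotated file never sits in memory and no intermediate file is written.
This is a toy sketch under that assumption; the formats and function
names are stand-ins, not a real Loci design.

```python
def parse_records(lines):
    """Yield (header, sequence) records lazily from a FASTA-like stream."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        else:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)

def write_tabular(records):
    """Re-emit records one at a time in a tab-separated format."""
    for header, sequence in records:
        yield "%s\t%s\n" % (header, sequence)

# The pipeline is lazy end to end: only one record is buffered at a
# time, regardless of how large the input stream is.
fasta = [">seq1\n", "ACGT\n", "TTAA\n", ">seq2\n", "GGCC\n"]
print("".join(write_tabular(parse_records(fasta))))
```

In practice `fasta` would be an open file object (or a socket), and the
output would be written incrementally rather than joined into a string.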