[Pipet Devel] Data Storage Interfaces

Mon Jun 14 15:16:34 EDT 1999

justin at ukans.edu said:
> That's a good point.  We haven't considered slow networks or very
> large files.

> I have, and that's my problem with 1) how Paos passes objects -- it
> sends the whole thing. I would prefer just sending updates. Breaking
> up the data into linked objects could be an adequate compromise.

Paos makes me nervous too. It looks complex, and I can't see what it buys us
over CORBA. ORBit is already a standard part of GNOME, and we may as well
leverage as much as we can from other efforts.

> 2) the independently roaming object concept where it's passed
> directly from tool to tool. Without a "home" everything has to be
> passed, and by the end of a complex series, that could be a large
> object.

> I'm beginning to think the optimal solution is a virtual interface (or
> set of optional interfaces) across all junctions. It's the most
> efficient (only what the receiving end wants is sent [and only the
> receiving end really knows what it wants]).

So, data objects have a URI, and a locus can request the data it needs by URI.
The local locid can fetch remote data objects and cache them. Each locus in a
pipeline can request only the data objects it needs. Your local locus asks to
be sent the results that it wants, and only those, and displays them for you.
This way only the necessary data objects need be transferred.
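
As a rough illustration of fetch-by-URI with a local cache, here is a minimal
Python sketch. Everything in it is an assumption made for the example: the
LocidCache class, the "pipet://" URI scheme, and the in-memory dictionary that
stands in for a remote host. Nothing here is an existing Pipet or Paos
interface.

```python
from typing import Callable, Dict


class LocidCache:
    """Hypothetical per-host cache that fetches data objects by URI on demand."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self._fetch = fetch              # how to retrieve a URI we do not hold yet
        self._store: Dict[str, bytes] = {}
        self.remote_fetches = 0          # counts actual transfers, for illustration

    def get(self, uri: str) -> bytes:
        # Only go to the remote side the first time a URI is requested.
        if uri not in self._store:
            self._store[uri] = self._fetch(uri)
            self.remote_fetches += 1
        return self._store[uri]


# Stand-in for a remote host; a real locid would do network I/O here.
# The "pipet://" scheme and these URIs are invented for this example.
REMOTE = {
    "pipet://blast.example.org/results/1": b"hit list",
    "pipet://blast.example.org/results/2": b"alignments",
}

cache = LocidCache(REMOTE.__getitem__)
first = cache.get("pipet://blast.example.org/results/1")
again = cache.get("pipet://blast.example.org/results/1")  # served from the cache
```

The second object is never requested, so it is never transferred, and the
repeated request for the first one costs only one remote fetch.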

Imagine a service that annotates a BLAST search:

Your locus sends the sequence data to the BLAST server. The BLAST server sends
the matching GenBank UIDs to the annotation server, which may have a local copy
of GenBank and can get the sequences from there. The annotation server then
sends the UIDs and the feature annotations back to your local locus, which may
have to fetch the sequences for some of the UIDs from GenBank before it applies
the annotations and displays the result.

> It's completely language
> independent, as well as "junction" independent (each end has a standard
> interface, regardless of whether a C, Python, or Perl script is on the
> other end, or whether the two are communicating via CORBA, TCP/IP,
> UDP/IP, shared memory, a pipe, or a dynamically-loaded plug-in interface).

This sounds good, and can help make sure we don't overcommit to Paos. We just
need a simple way of communicating between loci: "here's this data, please run
foo v2 on it", and "here are your results, formatted for bar v1".
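
One way such an exchange could look is sketched below in Python. The field
names, the Request and Reply classes, the "pipet://" URIs, and the use of JSON
as the wire format are all assumptions chosen to keep the example short. The
point is only that a request names the data by URI, the tool and its version,
and the format the results should come back in.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Request:
    data_uri: str        # where the input data lives
    tool: str            # e.g. "foo"
    tool_version: int    # e.g. 2
    reply_format: str    # e.g. "bar"
    reply_version: int   # e.g. 1


@dataclass
class Reply:
    result_uri: str      # where the results can be fetched from
    format: str
    format_version: int


def encode(msg) -> str:
    # Plain text on the wire, so either end can be C, Python, or Perl.
    return json.dumps(asdict(msg), sort_keys=True)


def handle(wire: str) -> str:
    """Hypothetical receiving locus: decode the request, answer with a reply."""
    req = Request(**json.loads(wire))
    reply = Reply(
        result_uri=f"{req.data_uri}/results/{req.tool}-v{req.tool_version}",
        format=req.reply_format,
        format_version=req.reply_version,
    )
    return encode(reply)


wire_request = encode(Request("pipet://client.example.org/seq/42", "foo", 2, "bar", 1))
wire_reply = handle(wire_request)
```

The reply carries a URI rather than the results themselves, so the requesting
locus can decide whether and when to fetch them.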

> This interface method requires a home location where the object
> resides throughout its processing life-time. This is what I had
> envisioned the work flow system to be (i.e. coordinating its various
> objects, where and when they connected, etc). This could be located on
> the client machine, and it allows the various other loci to be really
> dumb (which means small). 

Data objects can be identified by URIs, with special URIs for data on a local
disk (the locid will need some way to service requests for your local data,
possibly from multiple loci).

But now say we want to run a five-step pipeline on 2GB worth of genomic
sequences. Each of the five loci may want a copy of the sequence, which means
our machine will send the file five times. Try that over a modem!

Caching at loci hubs can help solve this problem.
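
To put rough numbers on the modem case, here is a back-of-the-envelope
calculation in Python. It assumes a 56 kbit/s modem running at full line rate
with no protocol overhead, and reads "2GB" as 2 * 10^9 bytes. Both are
assumptions for the sake of the estimate, and real throughput would be lower.

```python
FILE_BYTES = 2 * 10**9        # "2GB", taken here as 2 * 10^9 bytes
COPIES = 5                    # one copy per locus in the five-step pipeline
MODEM_BITS_PER_SEC = 56_000   # assumed modem speed, no overhead
SECONDS_PER_DAY = 86_400


def transfer_days(n_bytes: int, bits_per_sec: int) -> float:
    """Days needed to push n_bytes through a link of the given speed."""
    return n_bytes * 8 / bits_per_sec / SECONDS_PER_DAY


# Without a hub cache: the client uploads the file once per locus.
five_copies_days = transfer_days(FILE_BYTES * COPIES, MODEM_BITS_PER_SEC)

# With a hub cache: the client uploads once and the hub serves the rest.
one_copy_days = transfer_days(FILE_BYTES, MODEM_BITS_PER_SEC)

print(f"five uploads: {five_copies_days:.1f} days")
print(f"one upload:   {one_copy_days:.1f} days")
```

Under these assumptions, five uploads take about 16.5 days and a single upload
about 3.3 days. A hub cache removes the repeated uploads, but the first copy
still has to cross the slow link.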
-- 
Humberto Ortiz Zuazaga
Bioinformatics Specialist
Institute of Neurobiology
hortiz at neurobio.upr.clu.edu