[Biodevelopers] RDBMS and Bioinformatics

Tue Mar 16 18:51:45 EST 2004

On Tue, 2004-03-16 at 17:37, Dan Bolser wrote:

> > Basically the object<->relational mapping is on one-to-one and onto in
> > most cases, so you have to resort to "hacks" like serialization to make
> 
> Sorry, do you mean 'is not 1 to 1' ?

s/on one-to-one/not one-to-one/

yes. 

> 
> > sure information and state are not lost (my apologies to those who do
> > not consider serialization to be a hack).  Objects can be rich and
> > dynamic data structures which can be represented by an XML document to a
> > degree (apart from the code elements), and can better represent dynamic
> > data.
> 
> I follow. I guess it is rare that people make large amounts of data
> available via XML (i.e. using XML as a database). The way you describe

Well... there are some folks who are using XML as the database.  Some
systems are known to spit out several gigabytes of XML, which makes
parsing it in the traditional whole-tree fashion rather hard.  This
happens often enough that people write modules to handle the symptom.
See the XML::Twig Perl module.

Basically parsing a tree is easy if it is all in memory.  It gets ...
more complicated ... if portions of the tree have to reside in a
secondary storage mechanism.  
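
A minimal sketch of the streaming idea, in Python rather than Perl, and
not a rendering of how XML::Twig itself works: handle one subtree at a
time and throw it away before reading the next.  The <entries>/<entry>
layout and the stream_entries name are made up for illustration; only
xml.etree.ElementTree.iterparse is a real standard-library call.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data; a real file would be far larger and read from disk.
SAMPLE = b"""<entries>
  <entry id="P1"><name>alpha</name></entry>
  <entry id="P2"><name>beta</name></entry>
  <entry id="P3"><name>gamma</name></entry>
</entries>"""


def stream_entries(source, tag="entry"):
    """Yield (id, name) for each <entry>, discarding each subtree once handled."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield elem.get("id"), elem.findtext("name")
            root.clear()  # drop processed children so memory use stays bounded


records = list(stream_entries(io.BytesIO(SAMPLE)))
print(records)
```

Only one <entry> subtree is held in memory at any moment, so the size
of the whole document stops mattering much.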

I look at XML as more of a "portable" way to represent complex data.
RDBMSs are not portable in a binary sense (in most cases I am aware of)
across ABIs.  Look at XML as akin to ASN.1.  They are not the same, but
they generally serve similar functions.  It is, however, somewhat hard
to read binary ASN.1 data and infer the structure from the file.  What
is really nice about XML is that it is, for the most part, programming
language and platform independent.  I am not sure if the tags can be
Unicode, so it might not be human language independent.

The nice thing about XML is that the structure of the document maps well
into the structure of the data it represents.  

> sounds like a good use of XML - giving / transporting data about a
> program's internal state.
> 
> > They generally solve different problems, though there is overlap.  
> 
> I am still a bit confused. I can't help thinking of dia, which makes
> excellent use of XML to represent diagrams, and so has easy interchange with
> lots of tools - i.e. good use of XML, it would be crazy to run dia off an
> RDB. 

To a degree this is correct.  If the XML document represented a
connected set of tables, you could map that to an RDBMS.  However, it
would be hard to generate the diagram itself from the RDBMS (i.e. it is
easy to encode data in an RDBMS, but hard to encode structure, though
searching is easy).  The XML could instead represent a richer
non-tabular system, in which case the XML can take on the necessary
structure to represent the system (i.e. it is easy to encode structure
in XML, as well as the data which resides in the structure, though
searching is hard).
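
One way to see the tradeoff, as a rough sketch only: take a small
nested document, flatten it into a single table, and compare what it
takes to search it versus what it takes to recover the nesting.  The
<diagram>/<group>/<box> layout, the node table, and the helper names
are all hypothetical; this is not dia's file format.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical nested diagram: a group holding a box and another group.
DOC = """<diagram>
  <group name="outer">
    <box label="A"/>
    <group name="inner">
      <box label="B"/>
    </group>
  </group>
</diagram>"""

root = ET.fromstring(DOC)

# In XML the nesting is the structure: a path expression reads it directly.
inner_boxes = [b.get("label") for b in root.findall("./group/group/box")]

# Flatten the tree into rows of (id, parent_id, kind, label).
rows = []


def flatten(elem, parent_id):
    node_id = len(rows) + 1
    rows.append((node_id, parent_id, elem.tag,
                 elem.get("label") or elem.get("name")))
    for child in elem:
        flatten(child, node_id)


flatten(root, None)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE node "
           "(id INTEGER PRIMARY KEY, parent_id INTEGER, kind TEXT, label TEXT)")
db.executemany("INSERT INTO node VALUES (?, ?, ?, ?)", rows)

# Searching the table is a one-liner...
boxes = [r[0] for r in db.execute(
    "SELECT label FROM node WHERE kind = 'box' ORDER BY label")]


# ...but recovering the nesting means following parent_id links row by row.
def path_to(label):
    query = "SELECT id, parent_id, kind, label FROM node WHERE {} = ?"
    row = db.execute(query.format("label"), (label,)).fetchone()
    parts = []
    while row is not None:
        parts.append(row[2] if row[3] is None else f"{row[2]}:{row[3]}")
        row = db.execute(query.format("id"), (row[1],)).fetchone()
    return "/".join(reversed(parts))


print(inner_boxes, boxes, path_to("B"))
```

The data survives the trip into the table just fine; it is the shape
that has to be rebuilt by hand on the way back out.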

> But what is the point in creating biological data in this form, when the
> 'data model' is basically our own concept about the data?

One of these days someone is going to extend Gödel's incompleteness
proof for biological systems. 

> Wouldn't a SwissProt RDB be much more sensible than an XML document?

Only if SwissProt never changes format.  The whole point of XML is the
"X": extensible.  If you want to integrate portions of SwissProt into
your own research DB, you can do this, but you would either have to
deal with the SwissProt normalization model, or build a data mart from
SwissProt and create your own normalization.
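
A small sketch of what the "X" buys you, assuming readers look fields
up by name: a reader written against the old layout keeps working when
somebody adds an element it has never heard of.  The element names
below are invented for illustration and are not the real SwissProt XML
schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry in an "old" layout.
OLD = ("<entry><accession>X00001</accession>"
       "<name>EXAMPLE_PROT</name></entry>")

# Same entry after someone extends it with a local annotation.
NEW = ("<entry><accession>X00001</accession>"
       "<name>EXAMPLE_PROT</name>"
       "<myAnnotation source='local'>kinase?</myAnnotation></entry>")


def read_entry(xml_text):
    """Reader written against the old layout; it looks fields up by name."""
    entry = ET.fromstring(xml_text)
    return {
        "accession": entry.findtext("accession"),
        "name": entry.findtext("name"),
    }


# The old reader ignores the element it does not know about.
print(read_entry(OLD) == read_entry(NEW))
```

A newer reader can pick up myAnnotation by name without anyone having
to migrate the older documents.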

Some of this comes from the bias of the developers as well.  It is hard
to transport RDBMSs portably.  There are whole companies devoted to EDI
that do nothing but this (for other industries).  XML greatly simplifies
EDI.  It is not a silver bullet, but it is helpful for data
exchange.  If you get your results back in tabular form (RDBMS) or
structural form (XML) from a query, does it matter what the underlying
data storage technology is?