[BiO BB] Final Call: Data Integration in the Life Sciences, DILS04

Dan Bolser dmb at mrc-dunn.cam.ac.uk
Mon Dec 1 06:30:41 EST 2003

Dear Sir/Madam,

I am having trouble meeting the deadline, but I believe my paper is
very relevant to the conference. May I still send my paper in the next
couple of days?

Below I have reproduced the abstract.

I would be very grateful if you could consider this paper for submission
in a couple of days.

Yours sincerely,
Dan Bolser.

A major challenge in the post-genomic era is the integration, classification and
dissemination of diverse sources of biological data. Data integration is
attractive for several reasons: it provides context for biological analysis, allows
different types of data to be cross-correlated, and provides a validation framework
for different sources of experimentally and computationally derived data.
Centralised data repositories have proven very effective at organising specific
kinds of data, both helping to coordinate the research community and providing
standards for the categorisation and distribution of the accumulated knowledge.
However, the underlying conceptual complexity of data in the biological domain, as
well as the continual development of new concepts, makes the task of producing a
'universal' data repository much more challenging. As a consequence, many integrated
databases have become arbitrarily complex, with no intrinsic classification value.

Here, we emphasise a point that is not widely acknowledged by the database
community: data modelling in the scientific domain is equivalent to the
scientific process itself. To this end we propose a 'distributed data model'
framework for data integration in the scientific domain. Different sources of data
to be integrated will naturally have conceptual overlap (if they are to be
integrated at all), but may have very complex conceptual and/or algorithmic
associations. In this framework the burden of integration is put back on the domain
expert (to implement or devise specific integration strategies), but the underlying
issues of data access and subsequent distribution are made transparent via the data
model framework. The advantage of integrating and distributing data within this
framework is that new data and new concepts (produced as the result of integrative
analysis for example) are naturally accommodated by specific extensions to the data
models of the underlying data.

Here we develop components of a high-level data model relating to the principal axes
of an integrated protein classification database (called the protein periodic
table). Additionally, we have developed conceptually clear ORM-style models for
integrated protein interaction data and metabolic pathway data (suitable for
metabolic reconstruction). The subsequent analysis of these models highlights
several key areas for model development, and thus areas for scientific
research into specific integration and classification techniques.

