[Pipet Devel] databases

Thu Dec 2 11:03:37 EST 1999

Hey all...

I went to an interesting seminar yesterday at U of Chicago.  Susan
Davidson (co-director, Center for Bioinformatics, UPenn) gave a talk on
"Refreshing the Tower of Babel." 

Caveat: I know very little about databases.

The application: EpoDB, a database created at UPenn Center for
Bioinformatics, designed to study gene regulation during differentiation
and development of vertebrate red blood cells.

The problems:  extracting data from all sorts of databases with different
underlying structures; cleansing the data (error removal); integration;
annotation; updating (particularly, updating without losing the
information added/removed during data cleansing).

I gather Susan is a strong proponent, within the DB field, of complex
value databases (blah blah blah ginger... don't ask me what those are).
However, for this problem, she and her colleagues have chosen to use
XML, modifying it a bit into something they call WHAX.

The data can be represented as a "WHAX tree", with the tags representing
the branches and the tag values representing the nodes.  Additions to a
subset of the data can be integrated into the larger database by
simple manipulations of WHAX trees.
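
To make the tree idea a bit more concrete, here is a minimal sketch in
Python.  It uses plain XML and the standard xml.etree.ElementTree module
as a stand-in, since I don't know what WHAX actually changes.  The
merge() function, the rule of matching nodes by tag plus tag value, and
the tiny gene/species documents are all made up for illustration; they
are not EpoDB's schema or the UPenn group's method.

```python
import xml.etree.ElementTree as ET


def merge(target, addition):
    """Merge the children of `addition` into `target`, in place.

    Children are matched on (tag, tag value). That matching rule is an
    assumption made for this sketch, not something taken from WHAX.
    """
    index = {(c.tag, (c.text or "").strip()): c for c in target}
    for child in addition:
        key = (child.tag, (child.text or "").strip())
        if key in index:
            # Same branch and same node value: descend and merge subtrees.
            merge(index[key], child)
        else:
            # New branch: graft the whole subtree onto the larger tree.
            target.append(child)
            index[key] = child
    return target


# Hypothetical "larger database" and a hypothetical update to a subset of it.
database = ET.fromstring(
    "<db><gene>alpha-globin<species>mouse</species></gene></db>"
)
update = ET.fromstring(
    "<db><gene>alpha-globin<species>chicken</species></gene>"
    "<gene>beta-globin<species>mouse</species></gene></db>"
)

merge(database, update)
print(ET.tostring(database, encoding="unicode"))
```

The point is only that integrating an update becomes a walk over two
trees: matching branches are merged, unmatched ones are grafted on.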

I originally went because of the application to genetic data.  But then
I got sidetracked...  Here at the Museum, we have specimen data (21+
million specimens in total) in which species names change, higher
taxonomic information changes, and so on, all of which should be tracked
within the database.  In some cases, we are integrating the traditional
genetic data into our specimen databases; i.e., in newer portions of our
collection of specimens, we have a one-to-one correspondence between the
dead dried pressed plant (or the stuffed animal and corresponding
skeleton), the DNA extracted from said plant (or animal), and a record
in our developing databases (birds are separate from plants are separate
from fishes...).  The computer scientists were intrigued by this type of
data :)  This WHAX "thing" would be perfect for tracking all that
information.
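
As a rough sketch of what "tracking" could mean here, again in Python
with plain XML standing in for WHAX: a name change is appended to the
specimen's record rather than overwriting the old name, so the history
stays in the tree.  The element names, the ids, the dates, and the
placeholder names "Aus bus" / "Aus cus" are all hypothetical; none of
this reflects our actual databases.

```python
import xml.etree.ElementTree as ET


def rename(specimen, new_name, date):
    """Record a name change by appending, so older names are kept."""
    ET.SubElement(specimen, "name", {"assigned": date}).text = new_name


def current_name(specimen):
    """Return the most recently appended name."""
    return specimen.findall("name")[-1].text


# One hypothetical record tying a specimen to its DNA extraction.
specimen = ET.fromstring(
    '<specimen id="EXAMPLE-0001">'
    '<dna id="DNA-0001"/>'
    '<name assigned="1999-01-01">Aus bus</name>'
    "</specimen>"
)

rename(specimen, "Aus cus", "1999-12-02")
print(current_name(specimen))
print([n.text for n in specimen.findall("name")])
```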

Perhaps "bioinformatics" is currently too narrowly defined (organisms
have more characteristics than just their DNA).  If we, the
community of manipulators of biological data, do come up with an open
standard for representing said data, that standard should be flexible
enough to encompass all the characteristics of the organisms.  And,
in light of all the stupid patenting going on, perhaps an open standard
is needed before some big bad multinational corporation patents one first.

Just a few thoughts...
-jennifer

--------------------------
J. Steinbachs, PhD
Computational Biologist
Dept of Botany
The Field Museum
Chicago, IL 60605-2496

office: 312-665-7810
fax: 312-665-7158
--------------------------