Hey all...

I went to an interesting seminar yesterday at U of Chicago. Susan Davidson (co-director, Center for Bioinformatics, UPenn) gave a talk on "Refreshing the Tower of Babel." Caveat: I know very little about databases.

The application: EpoDB, a database created at the UPenn Center for Bioinformatics, designed to study gene regulation during differentiation and development of vertebrate red blood cells.

The problems: extracting data from all sorts of databases with different underlying structures; cleansing the data (error removal); integration; annotation; updating (particularly, updating without losing the information added/removed during data cleansing).

I guess Susan is a strong proponent of complex value databases in the DB field (blah blah blah ginger... don't ask me what those are). However, for this problem, she and her colleagues have chosen to use XML, modifying it a bit into something they call WHAX. The data can be represented as a "WHAX tree", with the tags representing the branches and the tag values representing the nodes. Additions to a subset of the data can be integrated into the larger database by simple manipulations of WHAX trees.

I originally went because of the application to genetic data. But then I got sidetracked... Here at the Museum, we have specimen data (21+ million specimens in total) in which species names change, higher taxonomic information changes, and so on, all of which should be tracked within the database. In some cases, we are integrating the traditional genetic data into our specimen databases; i.e., in newer portions of our collection, we have a one-to-one correspondence between the dead dried pressed plant (or the stuffed animal and corresponding skeleton), the DNA extracted from said plant (or animal), and a record in our developing databases (birds are separate from plants are separate from fishes...).
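
For anyone who wants the tree idea made concrete, here is a rough Python sketch. It is NOT the actual WHAX model or any UPenn code; it only assumes what was described in the talk: tags label the branches, values sit at the nodes, and an update to a subset is folded into the larger tree. The nested-dict layout, the `merge` function, the history list, and the specimen values are all made up for illustration.

```python
from copy import deepcopy


def merge(base, update, path=(), history=None):
    """Return (merged_tree, history) with `update` folded into a copy of `base`.

    Trees are nested dicts: keys are tags (branches), non-dict values are
    the values at the nodes. Whenever an existing value is overwritten, the
    old and new values are appended to `history`, so nothing is silently lost.
    """
    if history is None:
        history = []
    merged = deepcopy(base)  # leave the caller's tree untouched
    for tag, value in update.items():
        here = path + (tag,)
        if isinstance(value, dict) and isinstance(merged.get(tag), dict):
            # Both sides have a subtree under this tag: descend into it.
            merged[tag], _ = merge(merged[tag], value, here, history)
        else:
            if tag in merged and merged[tag] != value:
                history.append(("/".join(here), merged[tag], value))
            merged[tag] = deepcopy(value)
    return merged, history


# Illustrative specimen record (made-up catalog number and sample id).
specimen = {
    "specimen": {
        "catalog_id": "F-0001",
        "taxon": {"genus": "Aster", "species": "novae-angliae"},
        "dna_sample": "D-0001",
    }
}

# A small update touching only one branch: the genus name has changed.
update = {"specimen": {"taxon": {"genus": "Symphyotrichum"}}}

merged, history = merge(specimen, update)
for where, old, new in history:
    print(f"{where}: {old} -> {new}")
```

The point of the history list is the "updating without losing information" problem: the old name is still on record after the new one goes in.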
The computer scientists were intrigued by this type of data :) This WHAX "thing" would be perfect for tracking all that information.

Perhaps "bioinformatics" is currently too narrowly defined (organisms have more characteristics than just their DNA). If we, the community of manipulators of biological data, do come up with an open standard for representing said data, that standard should be flexible enough to encompass all the characteristics of the organisms. And, in light of all the stupid patenting going on, perhaps an open standard is needed before some big bad multinational corporation patents one first.

Just a few thoughts...

-jennifer

--------------------------
J. Steinbachs, PhD
Computational Biologist
Dept of Botany
The Field Museum
Chicago, IL 60605-2496
office: 312-665-7810
fax: 312-665-7158
--------------------------