[OpenMMS] Clean up of PDB data? Community Freebase Gridworks project?

Dan Bolser dan.bolser at gmail.com
Thu May 13 04:16:14 EDT 2010


Hi all,

There is an interesting tool called 'Freebase Gridworks':

http://code.google.com/p/freebase-gridworks/


Basically it makes 'cleaning up' tables of data really easy,
automating most of the things you find yourself doing when presented
with 'user input', including interrogating the data for
inconsistencies.

It seems ideal for 'looking at' biological data, so I decided to test
it out on the mmCIF Entity table from a recent dump of the PDB (May
9th).

The tool quickly turned up the following 22 inconsistencies in the
data (focusing initially on the 52,289 water entities, which are by
far the most standard of the three types of entity):

* 52288 water entities have the 'details' field set to '?'; in entry
1dcn it's set to "ARGININOSUCCINATE BOUND TO ONE ACTIVE SITE".

* 41030 water entities have the 'pdbx_mutation' field set to '?' and
another 11256 are NULL. In entry 3igf it's "A127T", in 1tm0 it's
"protein has C-tag (LEHHHHHH)", and in 2huk it's "C97A".

* The 'pdbx_ec' field is set to something other than '?' or NULL in
four cases: 1fp3, 3dhy, 3d7s, and 1mpx.

* The 'pdbx_fragment' field is set to something other than '?' or NULL
in 14 cases. In 8 cases it's set to water, Water or WATER (1dqd, 1em8, 1ijk,
1pnz, 1po0, 1rc5, 1yvm, and 2g75). The six remaining cases, including
the interesting "WATER MOLECULES WITH RESIDUE NUMBER 1011 AND 1053 ARE
MOST PROBABLY AMMONIUM IONS." are: 1d5w, 1ee4, 1eke, 1ijq, 1rc7, and
3l0l.
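
For anyone who wants to run the same kind of check outside the tool,
here is a rough Python sketch. It is not what Gridworks does
internally. It assumes the Entity table has already been loaded as a
list of dicts, and that each row carries 'type' and 'entry_id' keys;
both of those names are assumptions about how the dump is laid out.
The toy rows are made up, apart from the 1dcn value quoted above.

```python
from collections import Counter

# Values treated as "nothing recorded": the mmCIF '?' and a database NULL.
PLACEHOLDERS = {"?", None}

def summarize_field(rows, field, entity_type="water"):
    """Count the values of `field` for one entity type and list the outliers.

    `rows` is a list of dicts; the 'type' and 'entry_id' keys are assumed names.
    """
    selected = [r for r in rows if r.get("type") == entity_type]
    counts = Counter(r.get(field) for r in selected)
    outliers = [
        (r.get("entry_id"), r.get(field))
        for r in selected
        if r.get(field) not in PLACEHOLDERS
    ]
    return counts, outliers

# Toy data: 'toy1' to 'toy3' are invented identifiers for illustration.
rows = [
    {"entry_id": "toy1", "type": "water", "details": "?"},
    {"entry_id": "toy2", "type": "water", "details": "?"},
    {"entry_id": "1dcn", "type": "water",
     "details": "ARGININOSUCCINATE BOUND TO ONE ACTIVE SITE"},
    {"entry_id": "toy3", "type": "polymer", "details": "some polymer note"},
]

counts, outliers = summarize_field(rows, "details")
for entry_id, value in outliers:
    print(entry_id, repr(value))
```

Calling summarize_field once per field ('details', 'pdbx_mutation',
'pdbx_ec', 'pdbx_fragment') gives the value counts and the list of
entries that need a closer look.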



I'm planning to keep looking at the various tables in the PDB, but my
hope is that these changes can be pushed back onto the central PDB
archive. Previously, changes like this were not readily incorporated
into the archive. Is this still true (post-remediation)?

I'm not sure how best to use Freebase Gridworks collaboratively.
Ideally we could all work together on the same remediation project,
but I'm not sure if it supports that kind of collaborative editing.
In any case, I'm happy to share my current project file with anyone
who is interested. The next steps I'd like to try are clustering
values to detect category typos and reconciling species names against
Freebase.
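
To give an idea of what clustering values to catch typos can look
like, here is a simplified Python sketch of one common approach, key
collision on a normalised "fingerprint" of each value. It is only an
illustration of the idea and is not taken from Gridworks itself; the
function names are made up, and the species strings are invented
examples.

```python
import re
from collections import defaultdict

def fingerprint(value):
    """Normalise a string so trivially different spellings share one key."""
    # Trim, lowercase, and drop punctuation.
    cleaned = re.sub(r"[^\w\s]", "", value.strip().lower())
    # Split on whitespace, then de-duplicate and sort the tokens.
    return " ".join(sorted(set(cleaned.split())))

def cluster(values):
    """Group distinct values by fingerprint; keep groups with more than one spelling."""
    groups = defaultdict(set)
    for v in values:
        groups[fingerprint(v)].add(v)
    return {key: sorted(vs) for key, vs in groups.items() if len(vs) > 1}

values = [
    "water", "Water", "WATER",
    "Homo sapiens", "homo sapiens.", "Sapiens, Homo",
    "Mus musculus",
]

for key, variants in cluster(values).items():
    print(key, "<-", variants)
```

Each group that comes back is a set of spellings that probably mean
the same thing and could be merged into one value.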


Cheers,
Dan.


