[OpenMMS] Batch load from gz files and incremental update

Thu Jul 28 10:04:54 EDT 2011

So, some good news regarding the OpenMMS pdbase batch pipeline.

I've managed to fix the problem with reading directly from gzip files!
Basically the support for reading from gz was already there, but there was a
bug in the actual decompression: it seems the code was using an old zip parser
from Java that expects a different compressed format. I don't know exactly
why, but it worked once I switched to the GZIPInputStream class. I can only
guess that the pre-remediation cif files were compressed in zip format and the
post-remediation ones in gzip, and the two formats are not the same (sorry, I
don't know much about compression).
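
To illustrate the idea, here is a minimal sketch of reading a file that may or
may not be gzip-compressed. It is not the actual OpenMMS code: the class name
GzipCifReader and its methods are made up for this example. It only relies on
the fact that gzip data starts with the two bytes 0x1f 0x8b, and on the
standard java.util.zip.GZIPInputStream class.

```java
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not part of OpenMMS.
public class GzipCifReader {

    /** Returns true if the stream starts with the gzip magic bytes 0x1f 0x8b. */
    public static boolean looksLikeGzip(BufferedInputStream in) throws IOException {
        in.mark(2);            // remember the current position
        int b0 = in.read();
        int b1 = in.read();
        in.reset();            // rewind so the caller sees the whole stream
        return b0 == 0x1f && b1 == 0x8b;
    }

    /** Opens a text file, decompressing on the fly if it is gzip-compressed. */
    public static BufferedReader open(Path file) throws IOException {
        BufferedInputStream raw = new BufferedInputStream(Files.newInputStream(file));
        InputStream in = looksLikeGzip(raw) ? new GZIPInputStream(raw) : raw;
        return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Print the first line of the file given on the command line.
        try (BufferedReader reader = open(Paths.get(args[0]))) {
            System.out.println(reader.readLine());
        }
    }
}
```

Checking the magic bytes rather than the file extension means the same code
path handles both compressed and plain files.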

The new fixed jar is called OpenMMSbatch.jar and is in the pdbase dir in
svn. To recreate it one only needs to check out the openmms/java dir from
svn; it should all be self-contained and build directly in Eclipse. Once
there, to generate the jar do Export->as Runnable Jar, which creates a
self-contained jar that includes the mysql connector.

The second thing I've done is fix a few issues with hard-coded db parameters
and with upper-case table names in the load scripts (they worked in lower case
in molgen because we were running mysql in ignore-case mode).

So now it should be portable enough to work at any site, as long as you have
an rsync copy of the PDB mmCIF repo.

The next step would be making it work in incremental mode. The good news is
that one can load in batches (BTW the loader does nothing if one tries to load
an already loaded file). The loader can also delete entries, by passing a
command like:

java -cp OpenMMSbatch.jar org.rcsb.openmms.apps.rdb.PDBase LenientParse \
data=/nfs/data/dbs/pdb/data/structures/all/mmCIF \
manifest=file:///nfs/data/dbs/pdb/ls-lR \
log=PDBASE.LOG \
exclude=ExcludeStructureIDs.list \
entries=102M,102D,103D,105D \
pdblist=allpdb_ex1.list \
dbUrl=jdbc:mysql://localhost/pdbase dbDrv=com.mysql.jdbc.Driver dbUsr=user \
dbPwd=pwd \
action=DeleteSingleEntry

At the moment loadpdb.sh will only do a full load from scratch (or a batch
load of a few entries). In principle it's possible to modify it to work in
an incremental mode. I'll post again if I do so.
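
As a rough idea of what an incremental mode might need to compute, here is a
small sketch that compares the entry IDs present in the local mmCIF copy with
the IDs already in the database. Everything here is hypothetical (the class
IncrementalPlan, and the assumption that both ID sets are already available);
it is not part of loadpdb.sh or OpenMMS. Entries that changed upstream but
kept the same ID are not handled.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of OpenMMS or loadpdb.sh.
public class IncrementalPlan {

    /** IDs present in the repo copy but not yet in the database. */
    public final Set<String> toLoad;

    /** IDs present in the database but no longer in the repo copy. */
    public final Set<String> toDelete;

    public IncrementalPlan(Set<String> inRepo, Set<String> inDb) {
        toLoad = new TreeSet<>(inRepo);
        toLoad.removeAll(inDb);
        toDelete = new TreeSet<>(inDb);
        toDelete.removeAll(inRepo);
    }

    /** Formats IDs as an entries=ID1,ID2,... argument like the one in the command above. */
    public static String asEntriesArg(Set<String> ids) {
        return "entries=" + String.join(",", ids);
    }
}
```

The toDelete set could then be passed via entries= with
action=DeleteSingleEntry as in the command above. Since the loader already
skips files that are loaded, computing toLoad is only an optimization.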

For the record, the relevant dirs in the svn repo are
svn://bioinformatics.org/svnroot/pdbwiki/trunk/openmms (modified java source
code) and svn://bioinformatics.org/svnroot/pdbwiki/trunk/pdbase (stand-alone
jar for openmms-batch and scripts to create pdbase from scratch).

Jose