From jose.m.duarte at gmail.com  Thu Jul 28 10:04:54 2011
From: jose.m.duarte at gmail.com (Jose M. Duarte)
Date: Thu, 28 Jul 2011 16:04:54 +0200
Subject: [OpenMMS] Batch load from gz files and incremental update
Message-ID: <CAML8USwnnSEFTGdB0jLg2vrq01d_kGacVPHyTnTuX09+iKBdBQ@mail.gmail.com>

Some good news regarding the OpenMMS pdbase batch pipeline.

I've managed to fix the issue with reading directly from gzip files! Support
for reading from gz was already there, but the actual decompression step was
broken: it seems they were using some old zip parser from Java that supported
a different zip format. I don't know exactly why, but it worked once I
switched to the GZIPInputStream class. I can only guess that the
pre-remediation cif files were compressed in zip format and the
post-remediation ones in gzip, and that the two formats are not the same
(sorry, I don't know much about compression).
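
For anyone curious what reading through GZIPInputStream looks like, here is a
minimal sketch. It is not the actual OpenMMS change; the class name
GzCifReader, the method readGzLines and the demo file contents are made up for
illustration. Only java.util.zip.GZIPInputStream itself is the class mentioned
above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of OpenMMS.
public class GzCifReader {

    // Read all lines of a gzip-compressed text file (e.g. an mmCIF .cif.gz).
    public static List<String> readGzLines(Path gzFile) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(gzFile)),
                StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Write a tiny gzip file first so the demo runs without a PDB mirror.
        Path tmp = Files.createTempFile("demo", ".cif.gz");
        try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(tmp))) {
            out.write("data_102D\n_entry.id 102D\n".getBytes(StandardCharsets.UTF_8));
        }
        for (String line : readGzLines(tmp)) {
            System.out.println(line);
        }
        Files.delete(tmp);
    }
}
```

GZIPInputStream only understands the gzip format, so a file compressed some
other way would fail on the header check rather than decode to garbage.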

The new fixed jar is called OpenMMSbatch.jar and is in the pdbase dir in
svn. To recreate it, one only needs to check out the openmms/java dir from
svn; it should be self-contained and build directly in Eclipse. Once
there, generate the jar with Export -> Runnable JAR file, which creates a
self-contained jar that includes the MySQL connector.

The second thing I've done is fix a few issues with hard-coded db parameters
and with the upper case of table names in the load scripts (lower case worked
at molgen because we were running MySQL in ignore-case mode).

So now it should be portable enough to work at any site, as long as you have
an rsync copy of the PDB mmCIF repo.

The next step would be to make it work in incremental mode. The good news is
that one can load in batches (BTW, the loader does nothing if one tries to
load an already loaded file). The loader can also delete entries, with a
command like:

java -cp OpenMMSbatch.jar org.rcsb.openmms.apps.rdb.PDBase LenientParse \
data=/nfs/data/dbs/pdb/data/structures/all/mmCIF \
manifest=file:///nfs/data/dbs/pdb/ls-lR \
log=PDBASE.LOG \
exclude=ExcludeStructureIDs.list \
entries=102M,102D,103D,105D \
pdblist=allpdb_ex1.list \
dbUrl=jdbc:mysql://localhost/pdbase dbDrv=com.mysql.jdbc.Driver dbUsr=user dbPwd=pwd \
action=DeleteSingleEntry

At the moment loadpdb.sh will only do a full load from scratch (or a batch
load of a few entries). In principle it's possible to modify it to work in
an incremental mode. I'll post again if I do so.
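
One possible way to plan an incremental run, sketched under assumptions: take
the set of entry IDs found in the rsync mirror and the set already in the
database, load the difference one way and delete the difference the other way.
The class IncrementalPlan and its methods are hypothetical and not part of
OpenMMS or loadpdb.sh; only the comma-separated entries= format is taken from
the command above. Entries whose files changed upstream are not handled here
and would presumably need a delete followed by a reload.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical planning helper, not part of OpenMMS.
public class IncrementalPlan {

    // IDs present in the mirror but not yet in the database.
    public static Set<String> toLoad(Set<String> inMirror, Set<String> inDb) {
        Set<String> ids = new TreeSet<>(inMirror);
        ids.removeAll(inDb);
        return ids;
    }

    // IDs in the database that no longer exist in the mirror.
    public static Set<String> toDelete(Set<String> inMirror, Set<String> inDb) {
        Set<String> ids = new TreeSet<>(inDb);
        ids.removeAll(inMirror);
        return ids;
    }

    // Format IDs as the comma-separated value of an entries= argument.
    public static String entriesArg(Set<String> ids) {
        return "entries=" + String.join(",", ids);
    }

    public static void main(String[] args) {
        // Made-up example sets; real ones would come from the mirror listing
        // and from a query against the pdbase database.
        Set<String> mirror = new TreeSet<>(Arrays.asList("102D", "103D", "105D"));
        Set<String> db = new TreeSet<>(Arrays.asList("102D", "102M", "103D"));
        System.out.println("load:   " + entriesArg(toLoad(mirror, db)));
        System.out.println("delete: " + entriesArg(toDelete(mirror, db)));
    }
}
```

Since the loader skips files that are already loaded, the load set could also
be a superset without harm; the delete set is the one that has to be exact.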

For the record, the relevant dirs in the svn repo are
svn://bioinformatics.org/svnroot/pdbwiki/trunk/openmms (modified java source
code) and svn://bioinformatics.org/svnroot/pdbwiki/trunk/pdbase (stand-alone
jar for openmms-batch and scripts to create pdbase from scratch).

Jose

From dan.bolser at gmail.com  Thu Jul 28 10:25:49 2011
From: dan.bolser at gmail.com (Dan Bolser)
Date: Thu, 28 Jul 2011 15:25:49 +0100
Subject: [OpenMMS] Batch load from gz files and incremental update
In-Reply-To: <CAML8USwnnSEFTGdB0jLg2vrq01d_kGacVPHyTnTuX09+iKBdBQ@mail.gmail.com>
References: <CAML8USwnnSEFTGdB0jLg2vrq01d_kGacVPHyTnTuX09+iKBdBQ@mail.gmail.com>
Message-ID: <CAPBO=2nRpWwW42_zwaowR=ys4OAvX1AybibdFuiEbUZx6X1MWQ@mail.gmail.com>

Great work Jose!

Good to see that the project is alive!



From stehr at molgen.mpg.de  Thu Jul 28 10:50:25 2011
From: stehr at molgen.mpg.de (Henning Stehr)
Date: Thu, 28 Jul 2011 16:50:25 +0200
Subject: [OpenMMS] Batch load from gz files and incremental update
In-Reply-To: <CAML8USwnnSEFTGdB0jLg2vrq01d_kGacVPHyTnTuX09+iKBdBQ@mail.gmail.com>
References: <CAML8USwnnSEFTGdB0jLg2vrq01d_kGacVPHyTnTuX09+iKBdBQ@mail.gmail.com>
Message-ID: <CAPLouure=96qrF7VkwYUn1KRCLyhkChfhU73beK=yU8vtnzn-Q@mail.gmail.com>

> I can only guess that the pre-remediated cif files were compressed in zip format and the

It is as you say, so it's not really a bug. In the old days the PDB used a
different compression format with a .Z extension (the one produced by Unix
compress). They switched to .gz after support for OpenMMS had already
stopped. Good to see that it is still useful, though.

