From jose.m.duarte at gmail.com Thu Jul 28 10:04:54 2011
From: jose.m.duarte at gmail.com (Jose M. Duarte)
Date: Thu, 28 Jul 2011 16:04:54 +0200
Subject: [OpenMMS] Batch load from gz files and incremental update

So, some good news regarding the OpenMMS pdbase batch pipeline.

I've managed to fix the issue with reading directly from gzip files!
Basically the support for reading from gz was already there, but there
was a bug in the actual zip reading: it seems that they were using some
old zip parser from Java which supported another zip format. I don't
know exactly why, but it worked when I used the GZIPInputStream class.
I can only guess that the pre-remediated cif files were compressed in
zip format and the post-remediation ones in gzip, and the 2 formats are
not the same (sorry, I don't know much about compression).

The new fixed jar is called OpenMMSbatch.jar and is in the pdbase dir in
svn. To recreate it one only needs to check out the openmms/java dir
from svn; it should all be self-contained and build directly in Eclipse.
Once there, to generate the jar do Export->as Runnable Jar, which
creates a self-contained jar that includes the mysql connector.

The second thing I've done is fix a few issues with hard-coded db
parameters and upper case of table names in the load scripts (it worked
in lower case in molgen because we were using mysql in ignore-case
mode).

So now it should be portable enough to work at any site, as long as you
have an rsync copy of the PDB mmCIF repo.

The next step would be making it work in incremental mode. The good news
is that one can upload in batches (BTW the loader does nothing if one
tries to load an already loaded file). The loader can also delete
entries when passed a command like:

    java -cp OpenMMSbatch.jar org.rcsb.openmms.apps.rdb.PDBase LenientParse \
      data=/nfs/data/dbs/pdb/data/structures/all/mmCIF \
      manifest=file:///nfs/data/dbs/pdb/ls-lR \
      log=PDBASE.LOG \
      exclude=ExcludeStructureIDs.list \
      entries=102M,102D,103D,105D \
      pdblist=allpdb_ex1.list \
      dbUrl=jdbc:mysql://localhost/pdbase dbDrv=com.mysql.jdbc.Driver dbUsr=user dbPwd=pwd \
      action=DeleteSingleEntry

At the moment loadpdb.sh will only do a full load from scratch (or a
batch load of a few entries). In principle it's possible to modify it to
work in an incremental mode. I'll post again if I do so.

For the record, the relevant dirs in the svn repo are
svn://bioinformatics.org/svnroot/pdbwiki/trunk/openmms (modified java
source code) and svn://bioinformatics.org/svnroot/pdbwiki/trunk/pdbase
(stand-alone jar for openmms-batch and scripts to create pdbase from
scratch).

Jose

From dan.bolser at gmail.com Thu Jul 28 10:25:49 2011
From: dan.bolser at gmail.com (Dan Bolser)
Date: Thu, 28 Jul 2011 15:25:49 +0100
Subject: [OpenMMS] Batch load from gz files and incremental update

Great work Jose!

Good to see that the project is alive!

From stehr at molgen.mpg.de Thu Jul 28 10:50:25 2011
From: stehr at molgen.mpg.de (Henning Stehr)
Date: Thu, 28 Jul 2011 16:50:25 +0200
Subject: [OpenMMS] Batch load from gz files and incremental update

> I can only guess that the pre-remediated cif files were compressed in
> zip format and the

It is as you say, so it's not really a bug. In the old days the PDB used
a different compression format, with a .Z extension. They switched to
.gz after they had already stopped support for OpenMMS. Good to see that
it is still useful though.

_______________________________________________
OpenMMSusers-general mailing list
OpenMMSusers-general at bioinformatics.org
http://www.bioinformatics.org/mailman/listinfo/openmmsusers-general
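As an illustration of the format mismatch discussed in this thread (old .Z files versus newer .gz files), here is a minimal Java sketch that sniffs the first two bytes of a stream before choosing a decoder. It is not the OpenMMS code: the class name CifStreams and its methods are hypothetical. It relies only on the well-known magic numbers (0x1f 0x8b for gzip, 0x1f 0x9d for Unix compress, "PK" for zip) and on java.util.zip.GZIPInputStream. The Java standard library has no decoder for .Z files, so the sketch simply rejects them.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of OpenMMS.
public class CifStreams {
    enum Format { GZIP, COMPRESS_Z, ZIP, PLAIN }

    // Classify a stream by its first two bytes (magic number).
    static Format sniff(byte[] head, int len) {
        if (len >= 2) {
            int b0 = head[0] & 0xff;
            int b1 = head[1] & 0xff;
            if (b0 == 0x1f && b1 == 0x8b) return Format.GZIP;        // .gz
            if (b0 == 0x1f && b1 == 0x9d) return Format.COMPRESS_Z;  // .Z
            if (b0 == 'P' && b1 == 'K') return Format.ZIP;           // .zip
        }
        return Format.PLAIN;
    }

    // Open a text reader over a gzip-compressed or uncompressed stream.
    static BufferedReader open(InputStream raw) throws IOException {
        PushbackInputStream in = new PushbackInputStream(raw, 2);
        byte[] head = new byte[2];
        int n = 0;
        while (n < 2) {
            int r = in.read(head, n, 2 - n);
            if (r < 0) break;
            n += r;
        }
        if (n > 0) in.unread(head, 0, n);  // put the peeked bytes back
        Format f = sniff(head, n);
        switch (f) {
            case GZIP:
                return reader(new GZIPInputStream(in));
            case PLAIN:
                return reader(in);
            default:
                // No .Z decoder in the JDK; zip archives need ZipInputStream.
                throw new IOException("Unsupported compression format: " + f);
        }
    }

    private static BufferedReader reader(InputStream in) {
        return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    }

    // Test helper: gzip a string in memory.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = gzip("data_102D\n_entry.id 102D\n");
        try (BufferedReader r = open(new ByteArrayInputStream(data))) {
            System.out.println(r.readLine());  // prints "data_102D"
        }
    }
}
```

Checking the magic number rather than the file extension means a mislabeled or legacy file fails with a clear message instead of a parser error further down the pipeline.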
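The incremental mode mentioned in this thread is not implemented in loadpdb.sh. As a rough sketch of what the planning step could look like, the Java snippet below compares the set of entry IDs already in the database with the set present in the local mmCIF mirror. The class IncrementalPlan, its methods, and the sample IDs are hypothetical; only the entries=ID,ID,... argument format is taken from the delete command quoted in the thread. Entries whose files changed in place are not detected by this sketch.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical planning helper, not part of OpenMMS.
public class IncrementalPlan {

    // IDs that are in 'a' but not in 'b', sorted for stable output.
    static SortedSet<String> minus(Set<String> a, Set<String> b) {
        SortedSet<String> out = new TreeSet<>(a);
        out.removeAll(b);
        return out;
    }

    // Format IDs the way the quoted delete command passes them.
    static String entriesArg(Collection<String> ids) {
        return "entries=" + String.join(",", ids);
    }

    public static void main(String[] args) {
        // Assumed inputs: IDs currently in the database and IDs in the mirror.
        Set<String> loaded = new HashSet<>(Arrays.asList("102D", "102M", "103D"));
        Set<String> mirror = new HashSet<>(Arrays.asList("102D", "103D", "105D"));

        // Obsolete entries: loaded earlier but gone from the mirror.
        System.out.println("delete: " + entriesArg(minus(loaded, mirror)));
        // New entries: in the mirror but not yet loaded.
        System.out.println("load:   " + minus(mirror, loaded));
    }
}
```

With the sample sets above, 102M would be passed to the delete action and 105D would be queued for loading. Since the loader reportedly skips files that are already loaded, passing a few extra IDs to a batch load should be harmless.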