[Pdbwiki-devel] Solutions for hosting the PDBWiki update

Henning Stehr stehr at molgen.mpg.de
Thu Dec 2 12:09:28 EST 2010

Hi guys,

Following up on Jeff's mail, I have checked what resources we would
need for the PDB mirror / PDBWiki update:

The main requirement is disk space. There are different scenarios
which differ in the requirements for space and coding effort:

Current situation (no coding effort but not very efficient use of resources):
114GB for the complete PDB file tree (excl. unzipped files)
62GB for unzipped mmCIF files
49GB for unzipped pdb files
100GB for pdbase DB
100GB for pdbase_save DB (last week's snapshot)
100GB for pdbase_update DB (only required during the update)
29GB for image files for all pdbs
plus some space for software (pymol, openMMS, pywikipedia), log files,
temp storage
Total: 555GB

Realistic solution (minimal coding effort, use only what is strictly
required for PDBWiki update):
14GB for zipped mmCIF files
62GB for unzipped mmCIF files
100GB for pdbase DB
<1GB for software, logs, temp storage
Total: 177GB (200GB with some safety margin)

Minimal solution (we'd have to change the update pipeline but resource
usage would be drastically reduced):
14GB for zipped mmCIF files
1GB of temp space for unzipped files, pdbase, new wiki pages and images
Total: 15GB (20GB with safety margin)
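
To double-check the totals for the three scenarios, here is a small tally in Python. The numbers are the ones listed above; the dictionary names are only labels I made up for this sketch, and "<1GB" is counted as 1GB.

```python
# Disk space per scenario in GB, copied from the lists above.
SCENARIOS = {
    "current": {
        "PDB file tree": 114,
        "unzipped mmCIF": 62,
        "unzipped pdb": 49,
        "pdbase": 100,
        "pdbase_save": 100,
        "pdbase_update": 100,
        "images": 29,
    },
    "realistic": {
        "zipped mmCIF": 14,
        "unzipped mmCIF": 62,
        "pdbase": 100,
        "software, logs, temp": 1,  # listed as "<1GB", rounded up
    },
    "minimal": {
        "zipped mmCIF": 14,
        "temp space": 1,
    },
}

totals = {name: sum(parts.values()) for name, parts in SCENARIOS.items()}

for name, total in totals.items():
    print(f"{name}: {total}GB")
```

The "current" sum comes out at 554GB because software, logs and temp storage have no figure attached there; the 555GB above includes roughly 1GB for them.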

- currently, we first load the whole pdbase and then generate two
lists of files: 1. entries in pdbwiki, 2. entries in pdbase
- then we compare the timestamps of the two to decide what needs to be updated
- to avoid loading all ~70,000 entries into pdbase, we could parse the
zipped cif files and extract the timestamp to generate the update list
- then we can unzip only the ones to be updated, load them into pdbase,
create the images, update PDBWiki and then delete all temp files
- this will also speed up the update from >3 days to hopefully less than one day
- the disadvantage is that we would no longer have a convenient PDB mirror for other projects
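
A rough sketch of how the update list could be built directly from the zipped cif files, without touching pdbase. This is not the existing pipeline code. It assumes that the timestamp we want is stored under the mmCIF item `_database_PDB_rev.date` (that is my guess and needs checking against real files), that files are named like `1abc.cif.gz`, and that we can get a dict of the dates currently known to PDBWiki (`wiki_dates` is hypothetical). The parser only splits loop rows on whitespace, which is good enough for date columns but not a full mmCIF parser.

```python
import gzip
import re
from datetime import date
from pathlib import Path

DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def latest_date(lines, tag="_database_PDB_rev.date"):
    """Return the newest ISO date stored under `tag`, or None if absent.

    Handles both the single "tag value" form and the loop_ form.
    The default tag name is an assumption, not confirmed by the pipeline.
    """
    dates = []
    columns = []       # item names of the loop_ currently being read
    in_header = False  # True while reading the item names of a loop_
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            columns, in_header = [], False
            continue
        if line == "loop_":
            columns, in_header = [], True
            continue
        if line.startswith("_"):
            parts = line.split(None, 1)
            if in_header and len(parts) == 1:
                columns.append(parts[0])
            else:
                columns, in_header = [], False
                if parts[0] == tag and len(parts) == 2:
                    dates.append(parts[1].strip())
            continue
        in_header = False
        if tag in columns:
            fields = line.split()
            idx = columns.index(tag)
            if idx < len(fields):
                dates.append(fields[idx])
    valid = [date.fromisoformat(d) for d in dates if DATE_RE.match(d)]
    return max(valid) if valid else None


def entries_to_update(cif_dir, wiki_dates):
    """List the PDB ids whose zipped cif file is newer than the wiki copy.

    wiki_dates maps pdb id -> date of the version in PDBWiki (hypothetical).
    Ids missing from wiki_dates are treated as new entries.
    """
    todo = []
    for path in sorted(Path(cif_dir).rglob("*.cif.gz")):
        pdb_id = path.name.split(".")[0].lower()
        # Stream the file; nothing is unzipped to disk.
        with gzip.open(path, "rt", encoding="ascii", errors="replace") as fh:
            stamp = latest_date(fh)
        known = wiki_dates.get(pdb_id)
        if stamp is not None and (known is None or stamp > known):
            todo.append(pdb_id)
    return todo
```

Only the ids returned by `entries_to_update` would then be unzipped, loaded into pdbase and rendered, which is where the savings in disk space and run time would come from.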

Did I forget anything?
What options do we have for either solution?
A Linux server at work?
A Linux server at home?

Oh, I forgot, we also need 8GB RAM for loading the largest PDB entries
into pdbase. That actually rules out the Amazon solution (which was too
expensive anyway).
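
For checking a candidate server quickly, something like the following could be run on the machine. It is only a sketch: the default thresholds are the 200GB and 8GB figures from this mail, the function names are made up, and the RAM probe relies on `os.sysconf`, which I would only expect to work on Linux and similar Unix systems.

```python
import os
import shutil

GIB = 1024 ** 3


def meets_requirements(free_disk_gb, ram_gb, need_disk_gb=200, need_ram_gb=8):
    """True if a host has enough free disk and RAM for the chosen scenario."""
    return free_disk_gb >= need_disk_gb and ram_gb >= need_ram_gb


def probe(path="/"):
    """Return (free disk in GB at `path`, physical RAM in GB) for this host."""
    free_disk_gb = shutil.disk_usage(path).free / GIB
    ram_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / GIB
    return free_disk_gb, ram_gb


if __name__ == "__main__":
    disk, ram = probe()
    print(f"free disk: {disk:.1f}GB, RAM: {ram:.1f}GB")
    print("realistic solution fits:", meets_requirements(disk, ram))
    print("minimal solution fits:", meets_requirements(disk, ram, need_disk_gb=20))
```

A machine sold as "8GB" usually reports slightly less than that, so the RAM threshold probably needs a little slack in practice.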


