[Pdbwiki-devel] Solutions for hosting the PDBWiki update

Henning Stehr stehr at molgen.mpg.de
Thu Dec 2 12:09:28 EST 2010


Hi guys,

Following up on Jeff's mail, I have checked what resources we would
need for the PDB mirror / PDBWiki update.

The main requirement is disk space. There are different scenarios
which differ in the requirements for space and coding effort:

Current situation (no coding effort but not very efficient use of resources):
114GB for the complete PDB file tree (excl. unzipped files)
62GB for unzipped mmCIF files
49GB for unzipped pdb files
100GB for pdbase DB
100GB for pdbase_save DB (last week's snapshot)
100GB for pdbase_update DB (only required during the update)
29GB for image files for all pdbs
plus some space for software (pymol, openMMS, pywikipedia), log files,
temp storage
Total: 555GB

Realistic solution (minimal coding effort, use only what is strictly
required for PDBWiki update):
14GB for zipped mmCIF files
62GB for unzipped mmCIF files
100GB for pdbase DB
<1GB for software, logs, temp storage
Total: 177GB (200GB with some safety margin)

Minimal solution (we'd have to change the update pipeline but resource
usage would be drastically reduced):
14GB for zipped mmCIF files
1GB of temp space for unzipped files, pdbase, new wiki pages and images
Total: 15GB (20GB with safety margin)
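
As a quick cross-check of the three totals above, here is a small Python
sketch that just adds up the figures from the lists. The 1GB entry for
software, logs and temp storage is an assumption on my part (the lists only
say "some space" and "<1GB"), and the dictionary keys are made-up labels.

```python
# Disk space per scenario in GB, copied from the lists above.
# Assumption: "some space" / "<1GB" for software, logs and temp is counted as 1GB.
SCENARIOS = {
    "current": {
        "PDB file tree": 114,
        "unzipped mmCIF": 62,
        "unzipped pdb": 49,
        "pdbase": 100,
        "pdbase_save": 100,
        "pdbase_update": 100,
        "images": 29,
        "software/logs/temp": 1,
    },
    "realistic": {
        "zipped mmCIF": 14,
        "unzipped mmCIF": 62,
        "pdbase": 100,
        "software/logs/temp": 1,
    },
    "minimal": {
        "zipped mmCIF": 14,
        "temp space": 1,
    },
}

# Sum the components of each scenario.
totals = {name: sum(parts.values()) for name, parts in SCENARIOS.items()}

for name, total in totals.items():
    print(f"{name}: {total}GB")
```

With the 1GB assumption this reproduces the 555GB, 177GB and 15GB totals.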

Explanation:
- currently, we first load the whole pdbase and then generate two
lists of files: 1. entries in pdbwiki, 2. entries in pdbase
- then we compare the timestamps of the two to decide what needs to be updated
- to avoid loading all ~70,000 entries into pdbase, we could parse the
zipped cif files and extract the timestamp to generate the update list
- then we can unzip only the ones to be updated, load them to pdbase,
create the images, update PDBWiki and then delete all temp files
- this will also speed up the update from >3 days to hopefully less than one day
- disadvantage is that we don't have a convenient PDB mirror for other projects
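
The "parse the zipped cif files" step could look roughly like the Python
sketch below. It is only an illustration, not the existing pipeline. The
assumptions: the timestamp we want is the latest value of the mmCIF item
`_database_PDB_rev.date`, the files are named like `1abc.cif.gz`, and
`wiki_dates` is a hypothetical dict mapping PDB id to the date currently
shown in PDBWiki. The function names are made up as well.

```python
import gzip
import re
import shlex
from pathlib import Path

# Assumption: the revision date is stored under this mmCIF item name.
DATE_TAG = "_database_PDB_rev.date"
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def latest_date(lines, tag=DATE_TAG):
    """Return the latest YYYY-MM-DD value stored under `tag`, or None."""
    dates = []
    columns = None  # item names of the current loop_, None outside a loop
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            columns = None  # blank or '#' line ends the current category
            continue
        if line == "loop_":
            columns = []
            continue
        if line.startswith("_"):
            parts = line.split(None, 1)
            if len(parts) == 2 and parts[0] == tag:
                dates.append(parts[1].strip("'\""))  # simple "name value" form
            elif len(parts) == 1 and columns is not None:
                columns.append(parts[0])  # loop header line
            continue
        if columns and tag in columns:
            try:
                fields = shlex.split(line)
            except ValueError:
                continue  # unbalanced quotes: not a plain data row
            if len(fields) == len(columns):
                dates.append(fields[columns.index(tag)])
    # Drop placeholders such as '?' and keep only real dates.
    dates = [d for d in dates if ISO_DATE.fullmatch(d)]
    return max(dates) if dates else None  # ISO dates sort correctly as strings


def build_update_list(cif_dir, wiki_dates):
    """Return ids whose zipped cif file is newer than the PDBWiki entry."""
    stale = []
    for path in sorted(Path(cif_dir).glob("*.cif.gz")):
        pdb_id = path.name.split(".")[0].lower()
        # Read the compressed file as text without unpacking it to disk.
        with gzip.open(path, "rt", encoding="ascii", errors="replace") as fh:
            file_date = latest_date(fh)
        known = wiki_dates.get(pdb_id)
        if file_date is not None and (known is None or file_date > known):
            stale.append(pdb_id)
    return stale
```

Only the ids returned by `build_update_list` would then need to be unzipped
and loaded, which is where the savings in disk space and run time come from.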

Did I forget anything?
What options do we have for either solution?
bio.cc?
work Linux server?
home Linux server?

Oh, I forgot, we also need 8GB RAM for loading the largest PDB entries
into pdbase. That actually rules out the Amazon solution (which was too
expensive anyway).

Cheers,
Henning


