[Pdbwiki-devel] Solutions for hosting the PDBWiki update

Dan Bolser dan.bolser at gmail.com
Fri Dec 3 09:53:03 EST 2010


Thanks very much for this breakdown Henning.

Of course I favour the 'zero effort' solution, so I'll keep trying to
set things up on bio.cc.


The problem with setting up the mirror was the mysterious death of rsync,
and some weird zombie files that I still have to look into. Assuming I can
kill the zombies, rsync should be more stable in 'weekly update'
mode. I'll grab the loader code and set up a full mirror on bio.cc.
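
For the weekly update I'm thinking of wrapping rsync in a small retry
script run from cron, so that one dead rsync doesn't leave the mirror
half-synced. A rough sketch of what I mean (the wwPDB endpoint, port and
local path are just placeholders for whatever we settle on for bio.cc):

#!/usr/bin/env python
# weekly_pdb_rsync.py: hypothetical cron wrapper for the weekly mirror update.
# Retries a few times so a single failed rsync run doesn't abort the update.
import subprocess
import sys
import time

# Assumed endpoint and paths; adjust to what bio.cc should actually mirror.
REMOTE = "rsync.wwpdb.org::ftp_data/structures/divided/mmCIF/"
LOCAL = "/data/pdb/mmCIF/"
RETRIES = 3

def sync():
    cmd = ["rsync", "-rlpt", "-z", "--delete", "--timeout=600",
           "--port=33444", REMOTE, LOCAL]
    return subprocess.call(cmd)

for attempt in range(RETRIES):
    if sync() == 0:
        sys.exit(0)                 # mirror is up to date
    time.sleep(60 * (attempt + 1))  # back off before retrying
sys.exit("rsync failed %d times, giving up" % RETRIES)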

If nothing else, it will be useful to have something running in
parallel, even if it isn't 100% stable.
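
Regarding the minimal solution you describe below (parsing the zipped
CIF files for the timestamp instead of loading everything into pdbase
first), something along these lines is roughly what I'd try. Just a
sketch, assuming the revision date can be pulled out of the
_database_PDB_rev records and that we can build a code-to-timestamp map
of the existing wiki pages (the wiki_timestamps dict below is made up):

#!/usr/bin/env python
# build_update_list.py: sketch of the minimal update; decides which entries
# changed without loading anything into pdbase first.
import glob
import gzip
import os
import re

CIF_DIR = "/data/pdb/mmCIF"               # assumed mirror location on bio.cc
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")

def latest_revision_date(path):
    """Return the newest date found in the _database_PDB_rev records."""
    dates = []
    in_rev = False
    for line in gzip.open(path):
        if line.startswith("_database_PDB_rev"):
            in_rev = True
        elif line.startswith(("_", "#", "loop_")):
            in_rev = False
        if in_rev:
            dates.extend(DATE_RE.findall(line))
    return max(dates) if dates else None

def build_update_list(wiki_timestamps):
    """wiki_timestamps: dict mapping PDB code -> 'YYYY-MM-DD' of its wiki page."""
    todo = []
    for path in glob.glob(os.path.join(CIF_DIR, "*", "*.cif.gz")):
        code = os.path.basename(path)[:4]
        rev = latest_revision_date(path)
        if rev and rev > wiki_timestamps.get(code, "0000-00-00"):
            todo.append(code)             # new or updated since the wiki page
    return todo

Only the codes in the returned list would then need to be unzipped,
loaded into pdbase, rendered and pushed to the wiki, after which the
temp files can be thrown away.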


Thanks again,
Dan.

On 2 December 2010 17:09, Henning Stehr <stehr at molgen.mpg.de> wrote:
> Hi guys,
>
> following up on Jeff's mail, I have checked what resources we would
> need for the PDB mirror / PDBWiki update:
>
> The main requirement is disk space. There are several scenarios,
> differing in how much disk space and coding effort they require:
>
> Current situation (no coding effort, but not a very efficient use of resources):
> 114 GB for the complete PDB file tree (excl. unzipped files)
> 62 GB for unzipped mmCIF files
> 49 GB for unzipped PDB files
> 100 GB for the pdbase DB
> 100 GB for the pdbase_save DB (last week's snapshot)
> 100 GB for the pdbase_update DB (only required during the update)
> 29 GB for image files for all PDB entries
> plus some space for software (pymol, openMMS, pywikipedia), log files
> and temp storage
> Total: 555 GB
>
> Realistic solution (minimal coding effort, use only what is strictly
> required for the PDBWiki update):
> 14 GB for zipped mmCIF files
> 62 GB for unzipped mmCIF files
> 100 GB for the pdbase DB
> <1 GB for software, logs, temp storage
> Total: 177 GB (200 GB with some safety margin)
>
> Minimal solution (we'd have to change the update pipeline, but resource
> usage would be drastically reduced):
> 14 GB for zipped mmCIF files
> 1 GB of temp space for unzipped files, pdbase, new wiki pages and images
> Total: 15 GB (20 GB with safety margin)
>
> Explanation:
> - Currently, we first load the whole pdbase and then generate two
> lists: 1. entries in PDBWiki, 2. entries in pdbase.
> - Then we compare the timestamps of the two to decide what needs to be updated.
> - To avoid loading all ~70,000 entries into pdbase, we could parse the
> zipped CIF files and extract the timestamp to generate the update list.
> - Then we unzip only the entries that need updating, load them into pdbase,
> create the images, update PDBWiki and finally delete all temp files.
> - This should also speed up the update from >3 days to hopefully less than one day.
> - The disadvantage is that we no longer have a convenient PDB mirror for other projects.
>
> Did I forget anything?
> What options do we have for either solution?
> bio.cc?
> work Linux server?
> home Linux server?
>
> Oh, I forgot: we also need 8 GB of RAM for loading the largest PDB entries
> into pdbase. That actually rules out the Amazon solution (which was too
> expensive anyway).
>
> Cheers,
> Henning
>
> _______________________________________________
> Pdbwiki-devel mailing list
> Pdbwiki-devel at bioinformatics.org
> http://www.bioinformatics.org/mailman/listinfo/pdbwiki-devel
> http://www.pdbwiki.org
>


