[Pdbwiki-devel] Hosting the PDBWiki update pipeline

J.W. Bizzaro jeff at bioinformatics.org
Wed Oct 27 20:10:25 EDT 2010

So, let's see if I've got this right:

There's an RDBMS running on a server somewhere, and you can't use it after sometime in December.

This RDBMS requires a lot of time to initially import the PDB data.

Once the RDB is set up, you only need to make small updates to it (it connects to a mirror?).

OK, so let's say the new host for the RDBMS is on a separate server and IP address from the main Bioinformatics.Org webserver.  Where then would you want to run the PDBWiki website?   Does it matter?

And, are you talking about something like MySQL for the RDBMS?


On 10/27/10 6:34 PM, Dan Bolser wrote:
> No I don't think so, but Jose or Henning will know better.
> Currently we sync the PDB every week, but since only a tiny fraction
> of the PDB changes in that time, the synchronization step takes about
> 1 hour. Similarly, we only unzip those files that a) we will use and
> b) that have changed.
> The 'big job' is running the 'importer', which loads the PDB files
> into the relational database. This takes 24 hours, and reads all the
> files in one division of the PDB (the PDB has about 5 different
> divisions of roughly equal size).
> However, so long as that job takes < 1 week, we don't have any problem
> i.e. we could severely throttle IO on the 'import' process, and not
> get concerned until the job starts taking > 4 days. (Not to mention
> that this job could be optimized by only importing those entries that
> have changed, but we need to devote some time to developing and
> testing that part of the pipeline, and we aren't really interested in
> doing that).
> We then query the resulting RDB to generate the required data for
> PDBWiki, and that is automatically uploaded. Here again we only update
> those wiki pages that need to be changed.
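
The "only touch what has changed" idea described above (for the unzip step, and potentially for the importer) can be sketched as follows. This is a minimal illustration, not the actual PDBWiki pipeline: the function names `scan` and `changed_entries` are hypothetical, and it assumes a change shows up as a different file size or modification time, which is a common cheap check rather than anything stated in this thread.

```python
from pathlib import Path
from typing import Dict, List, Tuple

# Hypothetical signature for a mirrored file: (size in bytes, mtime in ns).
Signature = Tuple[int, int]


def scan(mirror_dir: Path) -> Dict[str, Signature]:
    """Map each file's path (relative to the mirror root) to its signature."""
    result: Dict[str, Signature] = {}
    for path in mirror_dir.rglob("*"):
        if path.is_file():
            st = path.stat()
            key = path.relative_to(mirror_dir).as_posix()
            result[key] = (st.st_size, st.st_mtime_ns)
    return result


def changed_entries(previous: Dict[str, Signature],
                    current: Dict[str, Signature]) -> List[str]:
    """Return files that are new or whose signature differs from the last run."""
    return sorted(p for p, sig in current.items() if previous.get(p) != sig)
```

A weekly run would keep the result of the previous `scan`, call `scan` again after the sync, and hand only the paths from `changed_entries` to the unzip or import step. Removed entries would need separate handling, which this sketch leaves out.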
> On 27 October 2010 23:44, J.W. Bizzaro <jeff at bioinformatics.org> wrote:
>> We could probably do something like 3 x 2 TB SATA drives in a RAID5 config,
>> which would give us approx. 4 TB.
>> Also, what would be the bottleneck for such a plan: the network, drive or CPU
>> speed?  Will you be transferring a lot of data, or will the DB (DBMS?) be
>> grinding through a lot of data?
>> Cheers,
>> Jeff
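
For reference, the RAID5 figure quoted above follows from the usual rule that one drive's worth of space goes to parity. A minimal sketch of that arithmetic (the function name `raid5_usable_tb` is made up for illustration):

```python
def raid5_usable_tb(drive_count: int, drive_size_tb: float) -> float:
    """Usable capacity of a RAID5 array built from equally sized drives."""
    if drive_count < 3:
        raise ValueError("RAID5 needs at least three drives")
    # One drive's worth of capacity is consumed by distributed parity.
    return (drive_count - 1) * drive_size_tb


print(raid5_usable_tb(3, 2.0))  # 4.0
```

The formatted capacity will come out somewhat lower once filesystem overhead and decimal-versus-binary units are accounted for, hence "approx."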
>> On 10/27/10 4:03 AM, Dan Bolser wrote:
>>> To be more specific, especially for database mirrors and the databases
>>> or datasets created directly from them, large cheap disk is fine,
>>> because there is no intellectual content that needs to be carefully
>>> backed up.
>>> Currently I'm looking at other options for hosting our copy of the PDB
>>> relational database. We will lose our current host around December,
>>> and we want to be ready to switch over at that time.
>>> Cheers,
>>> Dan.
>>> On 6 October 2010 21:26, Dan Bolser <dan.bolser at gmail.com> wrote:
>>>> Sounds great!
>>>> On 6 October 2010 21:19, J.W. Bizzaro <jeff at bioinformatics.org> wrote:
>>>>> Hi Guys,
>>>>> We're waiting on a payment from a sponsor, so the upgrades will probably be
>>>>> in the Oct.-Nov. timeframe.
>>>>> Also, since the per-GB storage charge for the hosted Web server (Dallas) is
>>>>> expensive, we'll likely set up a local server (Boston) with 2 TB and mirror
>>>>> some important bioinformatics DBs here.  And I propose that the PDB DB be
>>>>> located here.
>>>>> How does that sound?
>>>>> Cheers,
>>>>> Jeff
>>>>> On 10/6/2010 10:57 AM, Dan Bolser wrote:
>>>>>> Hi Jeff,
>>>>>> Did you upgrade the disk space on bifx.org yet? I seem to remember
>>>>>> that it was in the pipeline.
>>>>>> We'd like to mirror the PDB and host a PDB relational database
>>>>>> (probably much less than 200 GB).
>>>>>> Cheers,
>>>>>> Dan.

J.W. Bizzaro
Bioinformatics Organization, Inc. (Bioinformatics.Org)
E-mail: jeff at bioinformatics.org
Phone:  +1 978 621 8258
