[Pdbwiki-devel] Hosting the PDBWiki update pipeline

Dan Bolser dan.bolser at gmail.com
Wed Oct 27 18:34:26 EDT 2010


No, I don't think so, but Jose or Henning will know better.

Currently we sync the PDB every week, and since only a tiny fraction
of the PDB changes in that time, the synchronization step takes about
1 hour. Similarly, we only unzip those files that a) we will use and
b) have changed.
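
As a rough illustration of the "only touch what changed" step, here is a
minimal sketch. It assumes change detection by file size and modification
time against a manifest saved from the previous run; the function name and
the manifest layout are hypothetical, not the pipeline's actual code.

```python
import os


def changed_files(root, previous):
    """Compare files under `root` with the manifest from the last run.

    `previous` maps a relative path to a (size, mtime_ns) tuple
    (an assumed manifest format). Returns the sorted list of new or
    modified paths, plus the manifest to save for the next run.
    """
    current = {}
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            signature = (st.st_size, st.st_mtime_ns)
            rel = os.path.relpath(path, root)
            current[rel] = signature
            # New file, or size/mtime differs from the last sync.
            if previous.get(rel) != signature:
                changed.append(rel)
    return sorted(changed), current
```

Only the paths returned in the first element would then be unzipped, and
the second element would be stored for the following week.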

The 'big job' is running the 'importer', which loads the PDB files
into the relational database. This takes 24 hours, and reads all the
files in one division of the PDB (the PDB has about 5 different
divisions of roughly equal size).

However, so long as that job takes < 1 week, we don't have any
problem, i.e. we could severely throttle IO on the 'import' process
and not get concerned until the job starts taking > 4 days. (Not to
mention that this job could be optimized by only importing those
entries that have changed, but we need to devote some time to
developing and testing that part of the pipeline, and we aren't
really interested in doing that.)
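
To put numbers on that headroom, a small sketch using only the figures
above (24 hours per import, concern at 4 days, hard limit of 1 week). The
names are hypothetical and the status labels are just for illustration.

```python
IMPORT_HOURS = 24        # current run time of the importer
WARN_HOURS = 4 * 24      # start worrying beyond 4 days
CYCLE_HOURS = 7 * 24     # the weekly sync is the hard limit

# How much slower the import could run before we get concerned.
MAX_SLOWDOWN = WARN_HOURS / IMPORT_HOURS


def import_status(elapsed_hours):
    """Classify an import run time against the weekly cycle."""
    if elapsed_hours >= CYCLE_HOURS:
        return "overrun"
    if elapsed_hours > WARN_HOURS:
        return "warn"
    return "ok"
```

So the import could run up to about four times slower than it does now
before it crosses the 4-day mark.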

We then query the resulting RDB to generate the required data for
PDBWiki, and that is automatically uploaded. Here again we only update
those wiki pages that need to be changed.
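
One way to decide which wiki pages need to be changed is to compare a
digest of the newly generated text with the digest recorded at the last
upload. This is only a sketch under that assumption; the function names
and the digest store are hypothetical, and the real pipeline may well do
this differently.

```python
import hashlib


def digest(text):
    """Stable fingerprint of a page's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def pages_to_update(generated, published_digests):
    """Return the titles whose generated text differs from the last upload.

    `generated` maps page title -> freshly generated page text.
    `published_digests` maps page title -> digest stored at the last
    upload (an assumed bookkeeping table). Titles missing from it are
    treated as new pages.
    """
    return sorted(
        title
        for title, text in generated.items()
        if published_digests.get(title) != digest(text)
    )
```

Only the returned titles would be uploaded, after which their stored
digests would be refreshed.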


On 27 October 2010 23:44, J.W. Bizzaro <jeff at bioinformatics.org> wrote:
> We could probably do something like 3 x 2 TB SATA drives in a RAID5 config,
> which would give us approx. 4 TB.
>
> Also, what would be the bottleneck for such a plan: the network, drive or CPU
> speed?  Will you be transferring a lot of data, or will the DB (DBMS?) be
> grinding through a lot of data?
>
> Cheers,
> Jeff
>
> On 10/27/10 4:03 AM, Dan Bolser wrote:
>>
>> To be more specific, especially for database mirrors and the databases
>> or datasets created directly from them, large cheap disk is fine,
>> because there is no intellectual content that needs to be carefully
>> backed up.
>>
>> Currently I'm looking at other options for hosting our copy of the PDB
>> relational database. We will lose our current host around December,
>> and we want to be ready to switch over at that time.
>>
>> Cheers,
>> Dan.
>>
>> On 6 October 2010 21:26, Dan Bolser<dan.bolser at gmail.com>  wrote:
>>>
>>> Sounds great!
>>>
>>>
>>>
>>> On 6 October 2010 21:19, J.W. Bizzaro<jeff at bioinformatics.org>  wrote:
>>>>
>>>> Hi Guys,
>>>>
>>>> We're waiting on a payment from a sponsor, so the upgrades will
>>>> probably be in the Oct.-Nov. timeframe.
>>>>
>>>> Also, since the per-GB storage charge for the hosted Web server
>>>> (Dallas) is expensive, we'll likely set up a local server (Boston)
>>>> with 2 TB and mirror some important bioinformatics DBs here.  And I
>>>> propose that the PDB DB be located here.
>>>>
>>>> How does that sound?
>>>>
>>>> Cheers,
>>>> Jeff
>>>>
>>>>
>>>> On 10/6/2010 10:57 AM, Dan Bolser wrote:
>>>>>
>>>>> Hi Jeff,
>>>>>
>>>>> Did you upgrade the disk space on bifx.org yet? I seem to remember
>>>>> that it was in the pipeline.
>>>>>
>>>>> We'd like to mirror the PDB and host a PDB relational database
>>>>> (probably much less than 200 GB).
>>>>>
>>>>>
>>>>> Cheers,
>>>>> Dan.


