[Pdbwiki-devel] Hosting the PDBWiki update pipeline

Dan Bolser dan.bolser at gmail.com
Thu Oct 28 01:51:23 EDT 2010


On 28 October 2010 01:10, J.W. Bizzaro <jeff at bioinformatics.org> wrote:
> So, let's see if I've got this right:
>
> There's an RDBMS running on a server somewhere, and you can't use it after
> sometime in December.

Yup, the RDB, the PDB mirror process and the importer process.


> This RDBMS requires a lot of time to initially import the PDB data.
>
> Once the RDB is set up, you only need to make small updates to it (it
> connects to a mirror?).

We need 24 hours once a week.


> OK, so let's say the new host for the RDBMS is on a separate server and IP
> address from the main Bioinformatics.Org webserver.  Where then would you
> want to run the PDBWiki website?   Does it matter?

No, it doesn't matter. We update PDBWiki once a week, and that update
process needs access to the RDB and the webserver, but all three can
(in theory) run on different boxes on different continents.


> And, are you talking about something like MySQL for the RDBMS?

Yeah. There are a number of PDB to RDB converters, and most of them
work with several different RDBMSs. We picked MySQL.


> Cheers,
> Jeff
>
>
> On 10/27/10 6:34 PM, Dan Bolser wrote:
>>
>> No, I don't think so, but Jose or Henning will know better.
>>
>> Currently we sync the PDB every week, but since only a tiny fraction
>> of the PDB changes in that time, the synchronization step takes about
>> 1 hour. Similarly, we only unzip those files that a) we will use and
>> b) that have changed.
>>
>> The 'big job' is running the 'importer', which loads the PDB files
>> into the relational database. This takes 24 hours, and reads all the
>> files in one division of the PDB (the PDB has about 5 different
>> divisions of roughly equal size).
>>
>> However, so long as that job takes < 1 week, we don't have any problem,
>> i.e. we could severely throttle IO on the 'import' process, and not
>> get concerned until the job starts taking > 4 days. (Not to mention
>> that this job could be optimized by only importing those entries that
>> have changed, but we need to devote some time to developing and
>> testing that part of the pipeline, and we aren't really interested in
>> doing that).
>>
>> We then query the resulting RDB to generate the required data for
>> PDBWiki, and that is automatically uploaded. Here again we only update
>> those wiki pages that need to be changed.
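
A minimal sketch of the "only touch what has changed" idea that the
sync, unzip and wiki-update steps above rely on. It assumes each file or
page can be summarised by a fingerprint (a modification time or a
checksum, say). The function name and the file names are hypothetical
and are not taken from the actual pipeline.

```python
def changed_entries(previous, current):
    """Return the names that are new, or whose fingerprint differs.

    `previous` and `current` map a name (e.g. a PDB file name) to a
    fingerprint such as a modification time or a checksum.
    """
    return sorted(
        name
        for name, fingerprint in current.items()
        if previous.get(name) != fingerprint
    )


if __name__ == "__main__":
    # Hypothetical manifests from last week's run and this week's run.
    last_week = {"pdb1abc.ent.gz": "aa11", "pdb2xyz.ent.gz": "bb22"}
    this_week = {
        "pdb1abc.ent.gz": "aa11",  # unchanged, skipped
        "pdb2xyz.ent.gz": "cc33",  # modified, reprocessed
        "pdb3new.ent.gz": "dd44",  # new, processed
    }
    for name in changed_entries(last_week, this_week):
        print(name)
```

Only the names returned here would need to be unzipped or re-uploaded;
entries that disappeared from the current manifest would need separate
handling.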
>>
>>
>> On 27 October 2010 23:44, J.W. Bizzaro <jeff at bioinformatics.org> wrote:
>>>
>>> We could probably do something like 3 x 2 TB SATA drives in a RAID5
>>> config, which would give us approx. 4 TB.
>>>
>>> Also, what would be the bottleneck for such a plan: the network, drive or
>>> CPU speed?  Will you be transferring a lot of data, or will the DB (DBMS?)
>>> be grinding through a lot of data?
>>>
>>> Cheers,
>>> Jeff
>>>
>>> On 10/27/10 4:03 AM, Dan Bolser wrote:
>>>>
>>>> To be more specific, especially for database mirrors and the databases
>>>> or datasets created directly from them, large cheap disk is fine,
>>>> because there is no intellectual content that needs to be carefully
>>>> backed up.
>>>>
>>>> Currently I'm looking at other options for hosting our copy of the PDB
>>>> relational database. We will lose our current host around December,
>>>> and we want to be ready to switch over at that time.
>>>>
>>>> Cheers,
>>>> Dan.
>>>>
>>>> On 6 October 2010 21:26, Dan Bolser <dan.bolser at gmail.com> wrote:
>>>>>
>>>>> Sounds great!
>>>>>
>>>>>
>>>>>
>>>>> On 6 October 2010 21:19, J.W. Bizzaro <jeff at bioinformatics.org> wrote:
>>>>>>
>>>>>> Hi Guys,
>>>>>>
>>>>>> We're waiting on a payment from a sponsor, so the upgrades will
>>>>>> probably be in the Oct.-Nov. timeframe.
>>>>>>
>>>>>> Also, since the per-GB storage charge for the hosted Web server
>>>>>> (Dallas) is expensive, we'll likely set up a local server (Boston)
>>>>>> with 2 TB and mirror some important bioinformatics DBs here.  And I
>>>>>> propose that the PDB DB be located here.
>>>>>>
>>>>>> How does that sound?
>>>>>>
>>>>>> Cheers,
>>>>>> Jeff
>>>>>>
>>>>>>
>>>>>> On 10/6/2010 10:57 AM, Dan Bolser wrote:
>>>>>>>
>>>>>>> Hi Jeff,
>>>>>>>
>>>>>>> Did you upgrade the disk space on bifx.org yet? I seem to remember
>>>>>>> that it was in the pipeline.
>>>>>>>
>>>>>>> We'd like to mirror the PDB and host a PDB relational database
>>>>>>> (probably much less than 200 GB).
>>>>>>>
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Dan.
>
> --
> J.W. Bizzaro
> Bioinformatics Organization, Inc. (Bioinformatics.Org)
> E-mail: jeff at bioinformatics.org
> Phone:  +1 978 621 8258
> --
>


