[Bioclusters] NCBI updates and how you do them

jason.calvert at novartis.com
Sat Apr 30 10:23:08 EDT 2005

We have many databases to distribute, so I have written a few scripts to 
benchmark different distribution methods.  udpcast seems to work best for 
us currently.  I have a script that checks the size of each file in the 
database directory on each node, and then makes a list of the files that 
need to be udpcast from the master copy on the distribution node, along 
with the nodes each file has to go to.  It then starts up a listener on 
the appropriate nodes for each file and sends it out.  Since udpcast is 
slow for a small number of clients, the script also checks how many 
clients need the file, and if that number is below a configurable break 
point, it uses NFS (a command line switch could select rsync instead) to 
distribute the file at the same time as the udpcast is going on.  There 
is also a setpoint for the file size to decide whether to use udpcast or 
NFS.
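
A minimal sketch of the planning step described above, not the actual 
script.  The function name, the two threshold parameters and their 
defaults are hypothetical, and it assumes that files below the size 
setpoint go over NFS (the message does not say which direction the 
setpoint works in).  It only builds the job lists; starting the 
listeners and the sender is left out.

```python
from collections import defaultdict


def plan_distribution(master_sizes, node_sizes, min_clients=4,
                      min_bytes=100 * 1024**2):
    """Split stale files into udpcast jobs and NFS jobs.

    master_sizes: {filename: size_in_bytes} on the distribution node
    node_sizes:   {node: {filename: size_in_bytes}} as reported per node
    Returns (udpcast_jobs, nfs_jobs), each mapping filename -> node list.
    The thresholds are illustrative assumptions, not values from the post.
    """
    # A file is stale on a node if it is missing or its size differs.
    stale = defaultdict(list)
    for node, sizes in node_sizes.items():
        for name, size in master_sizes.items():
            if sizes.get(name) != size:
                stale[name].append(node)

    udpcast_jobs, nfs_jobs = {}, {}
    for name, nodes in stale.items():
        nodes.sort()
        # udpcast only pays off with enough receivers and a large enough file.
        if len(nodes) >= min_clients and master_sizes[name] >= min_bytes:
            udpcast_jobs[name] = nodes
        else:
            nfs_jobs[name] = nodes
    return udpcast_jobs, nfs_jobs
```

Keeping the decision in a pure function like this makes the break points 
easy to benchmark without touching the network.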

We use file size as the indicator because it takes 4 hours just to 
checksum all our files each night, and this time will grow with our 
databases.  The comparison method could easily be made a command line 
switch.  udpcast has its own data checking in its protocol to make up 
for UDP's lack of delivery guarantees.
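
To illustrate the trade-off between the two comparison methods, here is 
a small sketch with a selectable method.  The function names and the 
choice of MD5 are assumptions for illustration; the message does not say 
which checksum is used.

```python
import hashlib
import os


def file_signature(path, method="size", chunk_size=1 << 20):
    """Return a cheap (size) or thorough (md5) signature for a file."""
    if method == "size":
        # One stat call; does not read the file contents.
        return os.path.getsize(path)
    if method == "md5":
        # Reads the whole file in chunks, which is what makes it slow.
        digest = hashlib.md5()
        with open(path, "rb") as handle:
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()
    raise ValueError("unknown method: %r" % method)


def needs_update(master_path, node_path, method="size"):
    """True if the node copy is missing or differs from the master copy."""
    if not os.path.exists(node_path):
        return True
    return (file_signature(master_path, method)
            != file_signature(node_path, method))
```

The size check misses files that changed without changing length, which 
is the case a less frequent checksum pass would catch.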

I have also written a script that uses a tree-structured rsync to 
distribute the data, but rsync had far too much overhead for databases 
of our size, and these will keep growing.
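
One way to read "tree-structured" here is a doubling scheme: in each 
round, every node that already holds the data copies it to one node that 
does not.  The message does not describe the actual layout, so the 
sketch below is an assumption, and the function name is hypothetical.  
It only computes who sends to whom; running rsync for each pair is left 
out.

```python
def tree_schedule(source, targets):
    """Return rounds of (sender, receiver) pairs for a doubling tree.

    After each round the receivers become senders, so the number of
    nodes holding the data roughly doubles per round.
    """
    holders = [source]
    pending = list(targets)
    rounds = []
    while pending:
        # Each current holder serves at most one new node this round.
        batch = pending[:len(holders)]
        pending = pending[len(holders):]
        rounds.append(list(zip(holders, batch)))
        holders.extend(batch)
    return rounds
```

With N targets this takes about log2(N + 1) rounds instead of N 
sequential copies from the master.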

I was planning on updating the script to do checksums weekly, but I found 
a problem with our kernel I had to solve first.  I will be starting to 
develop the scripts again this coming week. 

Is anybody interested in such a project?

Well I am out of wind,


Jan van Haarst <jvhaarst at gmail.com>
Sent by: 
bioclusters-bounces+jason.calvert=pharma.novartis.com at bioinformatics.org
04/30/2005 03:55 AM
Please respond to jan; Please respond to "Clustering,  compute farming & 
distributed computing in life science informatics"

        To:     jeremy at biochem.uthscsa.edu, "Clustering,  compute farming & distributed 
computing in life science informatics" <bioclusters at bioinformatics.org>
        cc:     (bcc: Jason Calvert/PH/Novartis)
        Subject:        Re: [Bioclusters] NCBI updates and how you do them

On our cluster we use UDPcast ( http://udpcast.linux.lu/ ) to push the data to the nodes, and rsync afterwards to double check the 
The way I understood it, rsync and the (non FASTA) blast databases don't 
work well together, you end up sending the complete database through 
rsync, which isn't the best solution if you want to push data to a lot of 
nodes at the same time. 
But maybe that isn't the case anymore.  What do you see when you update 
the database through rsync?
UDPcast works by broadcasting the data to the nodes, on which listeners 
pick up the data. 
There are other ways to distribute data from one to many, but UDPcast 
works fine for us.
Kind regards,
2005/4/26, Jeremy Mann jeremy at biochem.uthscsa.edu: 

Is rsync the way to push to all nodes? If not, what other alternatives 

Jeremy Mann 
jeremy at biochem.uthscsa.edu