Downloading sequence files from remote sites is something that we have
just been working on at Roslin, where bandwidth into the site is often
restrictive. Two things that we found would make management of the whole
process much easier are md5 digest signatures and (subject to careful
naming conventions) diff files. Given a set of 30 enormous FASTA files,
it is far more efficient for us to pull 30 md5 signatures, work out
which files have changed, look for a diff file or files whose names
carry date-stamps that post-date our last download, and pull just those
across, rather than download the whole set.

I recognise that this is more than just mirroring what the primary sites
often provide, but it certainly could be useful.

Later,

Andy

--
Dr. Andy Law
--------------------
Head of Bioinformatics - Roslin Institute

Unfortunately, legal niceties require me to add the following to this
message...

The information contained in this e-mail (including any attachments) is
confidential and is intended for the use of the addressee only. The
opinions expressed within this e-mail (including any attachments) are
the opinions of the sender and do not necessarily constitute those of
Roslin Institute (Edinburgh) ("the Institute") unless specifically
stated by a sender who is duly authorised to do so on behalf of the
Institute.

> -----Original Message-----
> From: Tim Harsch [mailto:harsch1@llnl.gov]
> Sent: 23 September 2003 00:56
> To: bioclusters@bioinformatics.org
> Subject: Re: [Bioclusters] Local copy of NCBI
>
>
> I also wanted to cast my vote in line with Joe's. I'd like to figure
> out the best, or even a very reasonable, algorithm for doing the
> downloads/formats etc. Your site offering the rsync method is most
> welcome. But I may need those FASTA files for updates, and would have
> to have them in order to develop a method that would use your site.
>
> Thanks much for providing the mirror with or without the FASTA files!!
>
> ----- Original Message -----
> From: "Joe Landman" <landman@scalableinformatics.com>
> To: "biocluster" <bioclusters@bioinformatics.org>
> Sent: Sunday, September 21, 2003 10:49 PM
> Subject: Re: [Bioclusters] Local copy of NCBI
>
>
> > Hi Josh:
> >
> > Last I checked, you had only the binary databases up there. As a
> > fair number of users need to segment the databases for performance
> > and other reasons, it might help to have the FASTA formatted files
> > there as well. It would save processing time (no additional steps).
> >
> > Joe
> >
> > On Mon, 2003-09-22 at 13:39, Josh Goodman wrote:
> > > In addition to the NCBI server you may want to take a look at our
> > > database mirroring service at http://www.bio-mirror.net. We offer
> > > most of the NCBI dbs and other important dbs with mirrors all over
> > > the world. Most servers support ftp and http but the USA server
> > > also mirrors data via rsync. If you don't see a database that you
> > > think we should have let us know and we will try to get it up
> > > there.
> > >
> > > Josh Goodman
> > > Indiana University
> > >
> > > ------------------------------------
> > > Subject: Re: [Bioclusters] Local copy of NCBI
> > > From: Nox <pheusion@snet.net>
> > > To: bioclusters@bioinformatics.org
> > > Cc: "Tang, Kevin" <kht7@cdc.gov>
> > > Date: Thu, 18 Sep 2003 13:16:38 -0400
> > > Reply-To: bioclusters@bioinformatics.org
> > >
> > > We are using in-house perl scripts, in crontab, that
> > > use wget to pull updates from the DB.
> > > Perl is great for parsing, so that's what my developers are using.
> > >
> > > Unfortunately I can't copy the script in here,
> > > but I can tell you it relies heavily on wget,
> > > and perl provides the transition to populate our DB.
> > >
> > > Hope that helps
> > >
> > > Nox
> > > GenMicro Systems
> > >
> > > On Thu, 2003-09-18 at 09:46, Osborne, John wrote:
> > > > Hi everyone,
> > > > What are people out there doing to get a local copy of NCBI's
> > > > databases? I mean RefSeq, dbSNP, taxonomy, etc... We've been
> > > > updating our copy ad-hoc by ftp, are most people just putting
> > > > this into a cron job?
> > > >
> > > > I've heard that the NCBI toolkit offers something like this (to
> > > > get daily updates via web services or something) but I don't
> > > > know where to look. getseq looks suspicious but I need to
> > > > configure it using entrez2, which needs X Windows, which needs
> > > > vibrant, which means RH dependency hell... Is there a simple
> > > > commandline way to get a sequence from NCBI and keep a local
> > > > copy of NCBI?
> >
> > --
> > Joseph Landman, Ph.D
> > Scalable Informatics LLC,
> > email: landman@scalableinformatics.com
> > web  : http://scalableinformatics.com
> > phone: +1 734 612 4615
> >
> > _______________________________________________
> > Bioclusters maillist  -  Bioclusters@bioinformatics.org
> > https://bioinformatics.org/mailman/listinfo/bioclusters
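
P.S. For anyone who wants to try the md5-then-diff idea from the top of
this message, here is a rough Python sketch of the decision logic only
(no networking). It is an illustration under assumptions, not a
description of what any mirror actually provides: it assumes the remote
site publishes an md5sum-style manifest, and that diff files follow a
hypothetical naming convention of <file>.<YYYYMMDD>.diff. The function
names are made up for this example.

```python
import hashlib
import re
from datetime import date, datetime


def parse_md5_manifest(text):
    """Parse md5sum-style lines ('<hex digest>  <filename>') into a dict."""
    digests = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            digest, name = parts
            # md5sum marks binary mode with a leading '*' on the filename.
            digests[name.strip().lstrip("*")] = digest.lower()
    return digests


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a local file without loading it whole."""
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def changed_files(local, remote):
    """Return remote file names whose digest is new or differs locally."""
    return sorted(name for name, digest in remote.items()
                  if local.get(name) != digest)


# Assumed (hypothetical) naming convention: <file>.<YYYYMMDD>.diff
DIFF_NAME = re.compile(r"^(?P<base>.+)\.(?P<stamp>\d{8})\.diff$")


def diffs_to_fetch(changed, available, last_download):
    """Pick diffs for changed files dated after last_download, oldest first."""
    wanted = []
    for name in available:
        match = DIFF_NAME.match(name)
        if not match or match.group("base") not in changed:
            continue
        try:
            stamp = datetime.strptime(match.group("stamp"), "%Y%m%d").date()
        except ValueError:
            continue  # eight digits that are not a real date
        if stamp > last_download:
            wanted.append((stamp, name))
    return [name for _, name in sorted(wanted)]
```

Typical use would be to build the local dict with md5_of_file (or from a
saved copy of the previous manifest), call changed_files against the
freshly pulled manifest, and then pass the remote directory listing to
diffs_to_fetch. Any changed file for which no suitable diff turns up
would still need a full download.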