[Bioclusters] Local copy of NCBI

andy law (RI) bioclusters@bioinformatics.org
Tue, 23 Sep 2003 10:18:36 +0100


Downloading sequence files from remote sites is something that we have just been working on at Roslin, where bandwidth into the site is often a limiting factor. Two things that we found would make management of the whole process much easier are md5 digest signatures and (subject to careful naming conventions) diff files.

Given a set of 30 enormous FASTA files, it is far more efficient for us to pull 30 md5 signatures, work out which files have changed, look for any diff files whose names carry date-stamps later than our last download, and pull just those across, rather than having to download the whole set.
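
A minimal sketch of that workflow in Python, under stated assumptions: the remote site publishes an md5sum-style manifest (one "<digest>  <filename>" pair per line), and diff files are named "<base>.<YYYYMMDD>.diff". Both conventions, and all the function names, are hypothetical illustrations and not something any primary site is known to provide.

```python
import hashlib
import re
from datetime import date, datetime


def parse_md5_manifest(text):
    """Parse md5sum-style lines into {filename: digest}.

    Assumes file names contain no whitespace; other lines are ignored.
    """
    digests = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            digest, name = parts
            # md5sum marks binary-mode entries with a leading '*'.
            digests[name.lstrip("*")] = digest.lower()
    return digests


def md5_of_file(path, chunk_size=1 << 20):
    """Digest a local file in chunks so a huge FASTA file never sits in memory."""
    h = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def changed_files(remote, local):
    """Names whose remote digest differs from, or is absent in, the local set."""
    return sorted(name for name, digest in remote.items()
                  if local.get(name) != digest)


# Hypothetical naming convention: <base>.<YYYYMMDD>.diff
DIFF_NAME = re.compile(r"^(?P<base>.+)\.(?P<stamp>\d{8})\.diff$")


def diffs_to_fetch(listing, base, last_download):
    """Diff files for `base` stamped after `last_download`, oldest first."""
    hits = []
    for name in listing:
        m = DIFF_NAME.match(name)
        if not m or m.group("base") != base:
            continue
        try:
            stamp = datetime.strptime(m.group("stamp"), "%Y%m%d").date()
        except ValueError:
            continue  # eight digits that are not a real calendar date
        if stamp > last_download:
            hits.append((stamp, name))
    return [name for _, name in sorted(hits)]
```

In use, one would fetch the small manifest, compare it with digests recorded at the last download via changed_files(), and for each changed file call diffs_to_fetch() on the remote directory listing. If no suitable diff exists, the fallback is to pull the whole file. The diffs are returned oldest first because they would need to be applied in order.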

I recognise that this goes beyond simply mirroring what the primary sites provide, but it could certainly be useful.

Later,

Andy
--
Dr. Andy Law
--------------------
Head of Bioinformatics - Roslin Institute


> -----Original Message-----
> From: Tim Harsch [mailto:harsch1@llnl.gov]
> Sent: 23 September 2003 00:56
> To: bioclusters@bioinformatics.org
> Subject: Re: [Bioclusters] Local copy of NCBI
> 
> 
> I also wanted to cast my vote in line with Joe's.  I'd like to figure out the
> best, or even a very reasonable, algorithm for doing the downloads/formats
> etc.  Your site offering the rsync method is most welcome.  But I may
> need those FASTA files for updates, and would have to have them in order to
> develop a method that would use your site.
> 
> Thanks much for providing the mirror with or without the FASTA files!!
> 
> ----- Original Message ----- 
> From: "Joe Landman" <landman@scalableinformatics.com>
> To: "biocluster" <bioclusters@bioinformatics.org>
> Sent: Sunday, September 21, 2003 10:49 PM
> Subject: Re: [Bioclusters] Local copy of NCBI
> 
> 
> > Hi Josh:
> >
> >   Last I checked, you had only the binary databases up there.  As a fair
> > number of users need to segment the databases for performance and other
> > reasons, it might help to have the FASTA formatted files there as
> > well.   It would save processing time (no additional steps).
> >
> > Joe
> >
> > On Mon, 2003-09-22 at 13:39, Josh Goodman wrote:
> > > In addition to the NCBI server you may want to take a look at our
> > > database mirroring service at http://www.bio-mirror.net.  We offer most of
> > > the NCBI dbs and other important dbs with mirrors all over the world.
> > > Most servers support ftp and http but the USA server also mirrors data via
> > > rsync.  If you don't see a database that you think we should have let us
> > > know and we will try to get it up there.
> > >
> > > Josh Goodman
> > > Indiana University
> > >
> > >
> > >
> > >
> > > ------------------------------------
> > > Subject: Re: [Bioclusters] Local copy of NCBI
> > > From: Nox <pheusion@snet.net>
> > > To: bioclusters@bioinformatics.org
> > > Cc: "Tang, Kevin" <kht7@cdc.gov>
> > > Date: Thu, 18 Sep 2003 13:16:38 -0400
> > > Reply-To: bioclusters@bioinformatics.org
> > >
> > > We are using in-house Perl scripts, run from crontab, that
> > > use wget to pull updates from the DB.
> > > Perl is great for parsing, so that's what my developers are using.
> > >
> > > Unfortunately I can't copy the script in here,
> > > but I can tell you it relies heavily on wget,
> > > and Perl provides the transition to populate our DB.
> > > Hope that helps
> > >
> > > Nox
> > > GenMicro Systems
> > >
> > > On Thu, 2003-09-18 at 09:46, Osborne, John wrote:
> > > > Hi everyone,
> > > > What are people out there doing to get a local copy of NCBI's databases?  I
> > > > mean RefSeq, dbSNP, taxonomy, etc...  We've been updating our copy ad-hoc by
> > > > ftp; are most people just putting this into a cron job?
> > > >
> > > > I've heard that the NCBI toolkit offers something like this (to get daily
> > > > updates via web services or something) but I don't know where to look.
> > > > getseq looks suspicious but I need to configure it using entrez2, which
> > > > needs X Windows, which needs vibrant, which means RH dependency hell...  Is
> > > > there a simple command-line way to get a sequence from NCBI and keep a
> > > > local copy of NCBI?
> >
> > -- 
> > Joseph Landman, Ph.D
> > Scalable Informatics LLC,
> > email: landman@scalableinformatics.com
> > web  : http://scalableinformatics.com
> > phone: +1 734 612 4615
> >
> > _______________________________________________
> > Bioclusters maillist  -  Bioclusters@bioinformatics.org
> > https://bioinformatics.org/mailman/listinfo/bioclusters