[Bioclusters] Local copy of NCBI

Tim Harsch bioclusters@bioinformatics.org
Thu, 25 Sep 2003 11:48:28 -0700


It seems the real problem here is that NCBI has not provided an adequate
system for downloading data.  That is not meant to slam them; they have been
doing a decent job at the difficult task of being the public repository for
sequences.  Perhaps the real win for us would be to get someone from NCBI in
on a thread about an enhanced design of the process, so that we could all
suggest and agree on a process that would better suit the community.  I
don't know if NCBI would be agreeable, but even if they don't have the
manpower at this very moment to implement the proposal, they would,
hopefully, by the time we are through, have a goal in mind that they
could work towards, perhaps in increments, over time.

Does anyone know who, as an NCBI contact, might be willing to get in on this
discussion?  Perhaps we could entice them, and maybe members of their
development team, into joining the list (if they are not already members)...

----- Original Message ----- 
From: "andy law (RI)" <andy.law@bbsrc.ac.uk>
To: <bioclusters@bioinformatics.org>
Sent: Tuesday, September 23, 2003 2:18 AM
Subject: RE: [Bioclusters] Local copy of NCBI


> Downloading sequence files from remote sites is something that we have
> just been working on at Roslin, where bandwidth into the site is often
> restrictive. A couple of things that we found would make management of the
> whole process much easier are md5 digest signatures and (subject to careful
> naming conventions) diff files.
>
> Given a set of 30 enormous FASTA files, it is far more efficient for us to
> pull 30 md5 signatures, work out which files have changed, look for a diff
> file or files that have date-stamps in their name that post-date our last
> download and just pull them across rather than have to download the whole
> set.
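
The selection step Andy describes could be sketched roughly as below. This is
only an illustration under stated assumptions, not anything Roslin or NCBI
actually provides: it assumes the remote site publishes md5sum-style manifest
lines ("<digest>  <file name>") and names its diff files
"<base>.<YYYY-MM-DD>.diff". Both conventions, and all function names, are
hypothetical.

```python
import hashlib
import re
from datetime import date


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a local file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_manifest(text):
    """Parse md5sum-style lines into {file name: lowercase hex digest}."""
    manifest = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            digest, name = parts
            # md5sum marks binary-mode entries with a leading "*".
            manifest[name.strip().lstrip("*")] = digest.lower()
    return manifest


def changed_files(local, remote):
    """Remote names that are new, or whose digest differs from the local one."""
    return sorted(
        name for name, digest in remote.items() if local.get(name) != digest
    )


# Hypothetical naming convention: <base>.<YYYY-MM-DD>.diff
DIFF_NAME = re.compile(r"^(?P<base>.+)\.(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})\.diff$")


def diffs_since(diff_names, base, last_download):
    """Diff files for `base` stamped after `last_download`, oldest first."""
    selected = []
    for name in diff_names:
        match = DIFF_NAME.match(name)
        if match and match.group("base") == base:
            stamp = date(
                int(match.group("y")), int(match.group("m")), int(match.group("d"))
            )
            if stamp > last_download:
                selected.append((stamp, name))
    return [name for _, name in sorted(selected)]


if __name__ == "__main__":
    # Made-up digests and file names, purely for illustration.
    local = {"nt.fasta": "aaa", "nr.fasta": "bbb"}
    remote = parse_manifest("aaa  nt.fasta\nccc  nr.fasta\nddd  est.fasta\n")
    print(changed_files(local, remote))
    listing = ["nr.fasta.2003-09-10.diff", "nr.fasta.2003-09-22.diff"]
    print(diffs_since(listing, "nr.fasta", date(2003, 9, 15)))
```

For each changed file, one would then fetch the selected diffs if any exist
and fall back to the full file otherwise; md5_of_file can build the local
manifest and confirm that a patched file matches the published signature.
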
>
> I recognise that this is more than just mirroring what the primary sites
> often provide, but it certainly could be useful.
>
> Later,
>
> Andy
> --
> Dr. Andy Law
> --------------------
> Head of Bioinformatics - Roslin Institute
>
>
> > -----Original Message-----
> > From: Tim Harsch [mailto:harsch1@llnl.gov]
> > Sent: 23 September 2003 00:56
> > To: bioclusters@bioinformatics.org
> > Subject: Re: [Bioclusters] Local copy of NCBI
> >
> >
> > I also wanted to cast my vote in line with Joe's.  I'd like to figure out
> > the best, or even a very reasonable, algorithm for doing the
> > downloads/formats etc.  Your site offering the rsync method is most
> > welcome.  But I may need those FASTA files for updates, and would have to
> > have them in order to develop a method that would use your site.
> >
> > Thanks much for providing the mirror with or without the FASTA files!!
> >
> > ----- Original Message ----- 
> > From: "Joe Landman" <landman@scalableinformatics.com>
> > To: "biocluster" <bioclusters@bioinformatics.org>
> > Sent: Sunday, September 21, 2003 10:49 PM
> > Subject: Re: [Bioclusters] Local copy of NCBI
> >
> >
> > > Hi Josh:
> > >
> > >   Last I checked, you had only the binary databases up there.  As a
> > > fair number of users need to segment the databases for performance and
> > > other reasons, it might help to have the FASTA formatted files there as
> > > well.   It would save processing time (no additional steps).
> > >
> > > Joe
> > >
> > > On Mon, 2003-09-22 at 13:39, Josh Goodman wrote:
> > > > In addition to the NCBI server you may want to take a look at our
> > > > database mirroring service at http://www.bio-mirror.net.  We offer
> > > > most of the NCBI dbs and other important dbs, with mirrors all over
> > > > the world.  Most servers support ftp and http, but the USA server
> > > > also mirrors data via rsync.  If you don't see a database that you
> > > > think we should have, let us know and we will try to get it up there.
> > > >
> > > > Josh Goodman
> > > > Indiana University
> > > >
> > > >
> > > >
> > > >
> > > > ------------------------------------
> > > > Subject: Re: [Bioclusters] Local copy of NCBI
> > > > From: Nox <pheusion@snet.net>
> > > > To: bioclusters@bioinformatics.org
> > > > Cc: "Tang, Kevin" <kht7@cdc.gov>
> > > > Date: Thu, 18 Sep 2003 13:16:38 -0400
> > > > Reply-To: bioclusters@bioinformatics.org
> > > >
> > > > We are using in-house perl scripts, in crontab, that
> > > > use wget to pull updates from the DB.
> > > > Perl is great for parsing, so that's what my developers are using.
> > > >
> > > > Unfortunately I can't copy the script in here,
> > > > but I can tell you it relies heavily on wget,
> > > > and perl provides the transition to populate our DB.
> > > >
> > > > Hope that helps
> > > >
> > > > Nox
> > > > GenMicro Systems
> > > >
> > > > On Thu, 2003-09-18 at 09:46, Osborne, John wrote:
> > > > > Hi everyone,
> > > > > What are people out there doing to get a local copy of NCBI's
> > > > > databases?  I mean RefSeq, dbSNP, taxonomy, etc...  We've been
> > > > > updating our copy ad hoc by ftp; are most people just putting this
> > > > > into a cron job?
> > > > >
> > > > > I've heard that the NCBI toolkit offers something like this (to get
> > > > > daily updates via web services or something) but I don't know where
> > > > > to look.  getseq looks suspicious but I need to configure it using
> > > > > entrez2, which needs X Windows, which needs vibrant, which means RH
> > > > > dependency hell...  Is there a simple command-line way to get a
> > > > > sequence from NCBI and keep a local copy of NCBI?
> > >
> > > -- 
> > > Joseph Landman, Ph.D
> > > Scalable Informatics LLC,
> > > email: landman@scalableinformatics.com
> > > web  : http://scalableinformatics.com
> > > phone: +1 734 612 4615
> > >
> > > _______________________________________________
> > > Bioclusters maillist  -  Bioclusters@bioinformatics.org
> > > https://bioinformatics.org/mailman/listinfo/bioclusters