[Bioclusters] Forwarded from the blast-announce list

Joe Landman bioclusters@bioinformatics.org
26 Mar 2003 14:03:16 -0500


Folks:

  FASTA will continue to be offered, but it will be moved.  You will
likely have to adjust your download scripts.

Joe
-- 
Joseph Landman, Ph.D
Scalable Informatics LLC,
email: landman@scalableinformatics.com
web  : http://scalableinformatics.com
phone: +1 734 612 4615

-----Forwarded Message-----

From: Scott McGinnis <mcginnis@ncbi.nlm.nih.gov>
To: blast-announce@ncbi.nlm.nih.gov
Subject: [blast-announce] [blast-announce #033] Relocation of BLAST database files on FTP server
Date: 26 Mar 2003 13:17:36 -0500

Moving of BLAST FASTA Database files.

Based upon input from the user community we will continue to offer
FASTA files. However, we will be reorganizing our FTP site in order to
allow easier access to the preformatted BLAST databases that users
of NCBI BLAST should be using.

For users of standalone BLAST, the NCBI already offers preformatted
BLAST databases for download, so there is no need to download FASTA
files (from ftp://ftp.ncbi.nih.gov/blast/db/) and run formatdb on them.
This offers several advantages to users who need these files mainly to
produce BLAST databases:

1.) no need to have disk space for both FASTA files and BLAST databases
at the same time.

2.) no need to use CPU cycles to uncompress the FASTA files and run
formatdb on them.

3.) the original FASTA file, individual sequences, or even parts of
individual sequences within the FASTA file can be recovered using the
utility fastacmd that is packaged with the NCBI BLAST executable
archives (see below for details).

4.) somewhat less data to transfer over FTP, allowing the downloads to
complete faster.

5.) taxonomic and related source information (for individual entries in
the database) is embedded in the BLAST databases (this is not
available in the FASTA files).  Some of this information may be useful
for formatting; some can be recovered by fastacmd (see below).

As most users need only the BLAST databases, they will be moved up one
level, from their current location of
ftp://ftp.ncbi.nih.gov/blast/db/FormattedDatabases/, and the FASTA
files will be moved down a level to
ftp://ftp.ncbi.nih.gov/blast/db/FASTA.  The new FASTA directory,
containing the files, will appear by March 31, 2003.  The FASTA files
will be removed from the "db" directory on April 8, 2003.  At that
point BLAST databases will start appearing in the "db" directory.
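
For anyone adjusting a download script, the path change described above
can be sketched roughly as follows.  This is only an illustration in
Python, not an NCBI tool: the function names and the example file name
nt.Z are hypothetical, and the sketch assumes that files keep their
names when they move.

```python
# Directory URLs taken from the announcement above.
DB_DIR = "ftp://ftp.ncbi.nih.gov/blast/db/"
FASTA_DIR = DB_DIR + "FASTA/"
FORMATTED_DIR = DB_DIR + "FormattedDatabases/"


def relocate_fasta_url(url):
    """Map a FASTA file URL directly under db/ to db/FASTA/ (hypothetical helper).

    Assumes the file name itself is unchanged by the move.
    """
    if not url.startswith(DB_DIR):
        return url
    rest = url[len(DB_DIR):]
    # Only files sitting directly in db/ are affected; leave subdirectories alone.
    if not rest or "/" in rest:
        return url
    return FASTA_DIR + rest


def relocate_formatted_url(url):
    """Map a preformatted database URL up one level into db/ (hypothetical helper)."""
    if url.startswith(FORMATTED_DIR):
        return DB_DIR + url[len(FORMATTED_DIR):]
    return url


if __name__ == "__main__":
    print(relocate_fasta_url(DB_DIR + "nt.Z"))
    print(relocate_formatted_url(FORMATTED_DIR + "nt.tar.gz"))
```

Note that the two mappings should not be chained: once the preformatted
databases sit directly in "db", their URLs look like the old FASTA ones.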

Note that the procedure at the NCBI is to produce the BLAST databases
directly from our relational databases, then produce the FASTA files
from the BLAST databases (using fastacmd).  This means that the FASTA
files could be available on our FTP site up to three hours after the
BLAST databases have appeared.


Notes on fastacmd:
------------------

1.) fastacmd can print a summary of database statistics:

fastacmd -d nt -I

Database: All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or
phase 0,1 or 2 HTGS sequences)
1,655,079 sequences; 7,754,000,938 total letters

File name:
/usr/ncbi/db/blast/nt
Date: Jan 14, 2003  2:55 AM
Version: 4
Longest sequence: 27,890,790 bp


2.) fastacmd can dump a FASTA file from a BLAST database using the -D
option:

fastacmd -d nt -D nt.fsa

3.) fastacmd can dump out only part of a sequence (handy for very long
sequences):

fastacmd -d nt -s 555 -L0,32
gi|555:1-32 B.taurus microsatellite DNA (624bp)
ACCTCCACTAGCTTTGTTTGTAGTGATGCTCT

4.) fastacmd can print taxonomic information for a given sequence if the
BLAST database came from ftp://ftp.ncbi.nih.gov/blast/db/FormattedDatabases/
(this information is not in a FASTA file, so formatdb cannot add it):

fastacmd -d nt -s 555 -T
NCBI sequence id: gi|555|emb|X65215.1|BTMISATN
NCBI taxonomy id: 9913
Common name: cow
Scientific name: Bos taurus
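
For scripted use, the fastacmd invocations above could be assembled with
a small helper like the one below.  This is a hedged sketch in Python,
not part of the NCBI distribution: the function name fastacmd_args and
its parameters are hypothetical, and it uses only the options that
appear in the examples above (-d, -I, -D, -s, -L, -T).

```python
def fastacmd_args(db, gi=None, seq_range=None, taxonomy=False,
                  dump_to=None, info=False):
    """Build a fastacmd argument list (hypothetical helper, not an NCBI API)."""
    args = ["fastacmd", "-d", db]
    if info:
        args.append("-I")                 # print database statistics
    if dump_to is not None:
        args += ["-D", dump_to]           # dump the database, as in example 2
    if gi is not None:
        args += ["-s", str(gi)]           # select one sequence by identifier
    if seq_range is not None:
        start, stop = seq_range
        # Same attached form as example 3 (-L0,32).
        args.append(f"-L{start},{stop}")
    if taxonomy:
        args.append("-T")                 # print taxonomic information
    return args


if __name__ == "__main__":
    # Reproduces the command line from example 3.
    print(" ".join(fastacmd_args("nt", gi=555, seq_range=(0, 32))))
```

The resulting list can be handed to subprocess.run() on a machine where
fastacmd and the nt database are installed.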

------------- End Forwarded Message -------------