[Bioclusters] Local blast server, beowulf vs mosix

Eric Engelhard bioclusters@bioinformatics.org
Thu, 07 Mar 2002 23:02:26 -0800


"Chris Dwan (CCGB)" wrote:
> 
> The December 2001 version of "formatdb" will split up your targets
> into chunks of arbitrary size for you, via the "-v
> <max_size_of_a_chunk>" flag.  I think that it was intended to get
> around file size limitations on some larger datasets / older OS's, but
> it also works nicely for my group to keep things under the RAM / CPU
> performance transition point.

Thanks Chris, you made my day!

I had read the release notes and took them to mean that the -v flag
imposed a _fixed_ maximum of 2 billion letters per volume, for really
large custom databases only. Here is the pertinent section of the
release notes:

3.) A volume option ('-v') has been added to formatdb.  This option
breaks up large FASTA files into 'volumes' (each with a maximum size of
2 billion letters).  As part of the creation of a volume, formatdb
writes a new type of BLAST database file, called an alias file, with
the extension 'nal' or 'pal'.  This option should be used if one wishes
to formatdb large databases (e.g., over 2 billion base pairs).

The README.formatdb Section C is much clearer:

One may also specify a smaller size for the volume databases by using
the -v option:
 
formatdb -i hugefasta -p F -v 2000000000
 
This command line will format the "hugefasta" FASTA file as a number of
database "volumes," each containing a maximum of two billion base
pairs, as specified by the "-v" option. Two billion is the current
limitation on the NCBI toolkit command-line parser. The volumes will
have names consisting of the root database name, "hugefasta" followed
by a two-digit volume extension, followed by the usual BLAST database
extensions. These smaller databases can be searched as if they were a
single entity using:
 
blastall -i infile -d hugefasta -p blastn -o out
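
To make the volume idea concrete, here is a rough Python sketch of how a
FASTA file could be packed into volumes with a letter cap. This is not
formatdb's actual algorithm, only an illustration under my own
assumptions: whole records are packed greedily, a record longer than the
cap gets a volume to itself, and volume numbering starts at 00. The
function names (read_fasta, split_into_volumes, volume_names) are
hypothetical and are not part of the NCBI toolkit.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence letters)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_into_volumes(records: Iterable[Record],
                       max_letters: int) -> List[List[Record]]:
    """Greedily pack whole records into volumes of at most max_letters.

    Assumption (mine, not formatdb's): records are never cut in half, so
    a single record longer than max_letters ends up alone in a volume.
    """
    volumes: List[List[Record]] = []
    current: List[Record] = []
    current_size = 0
    for header, seq in records:
        # Start a new volume if this record would overflow the current one.
        if current and current_size + len(seq) > max_letters:
            volumes.append(current)
            current, current_size = [], 0
        current.append((header, seq))
        current_size += len(seq)
    if current:
        volumes.append(current)
    return volumes


def volume_names(root: str, count: int) -> List[str]:
    """Root name plus a two-digit volume extension, as the README describes.

    Starting the numbering at 00 is an assumption.
    """
    return [f"{root}.{i:02d}" for i in range(count)]


if __name__ == "__main__":
    demo = [">a", "ACGT", ">b", "ACGTAC", ">c", "AC", ">d", "ACGTACGTACGT"]
    vols = split_into_volumes(read_fasta(demo), max_letters=8)
    for name, vol in zip(volume_names("hugefasta", len(vols)), vols):
        print(name, [h for h, _ in vol], sum(len(s) for _, s in vol))
```

Picking max_letters small enough that one volume fits in RAM is the
trick Chris describes. The cap itself has to come from measuring your
own nodes, so the sketch does not try to guess it.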