Sorry for the Opteron spam, but I hope this will help folks doing this in the future ;)

We now believe that the aberrant behavior of NCBI blast in some configurations can be traced entirely to a single-character change in the source code...

In recent releases of the NCBI toolkit, the formatdb option to create ASN.1 structured deflines (-A) has been turned on by default, a divergence from previous behavior. Unpredictable (and wrong!) things happen when this option is selected and the sequences input to formatdb do not follow the arcane NCBI fasta naming convention (foo|bar|etc|blah).

In our case, we were using very simple naming conventions:

    >name1
    >name2
    >name3

etc. (NCBI would have demanded something like >lcl|name1.) This is not compatible with the new default behavior of formatdb.

Solution: if you do not follow the NCBI fasta naming structure exactly, use the -A F option of formatdb and/or change the default in formatdb.c.

NCBI toolkit versions somewhere after 2.2.1 have this problem. Classic NCBI.

Nathan

Nathan O. Siemers wrote:

> All:
>
> Joe Landman from Scalable Informatics, Lawrence Hannon from IBM, and
> I have been working on issues running blast on the AMD Opteron platform.
> I've summarized my results (with much help from Joe and Lawrence) in
> validating the blastall and formatdb code. There are quirks with the
> latest versions of the NCBI toolkit, producing corrupt blast results in
> some situations. They only appear with some (large) databases but we
> are not sure what exactly causes this behavior at the present time. We
> have tentative workarounds, listed below.
>
> Thanks to everyone who has helped me over the past few weeks - the
> bottom line is that *none* of the problems I have seen over the past
> weeks could actually be traced to problems with Opteron hardware (other
> than a RAM chip) or Linux OS. This is great news for Opteron.
>
> SUMMARY
>
> Builds of formatdb and blastall from the NCBI Toolkit version 2.2.6
> can produce corrupted output when used with some formatdb parameters
> in all builds so far tested on the AMD Opteron 64 bit platform.
> Symptoms include failure to produce a correctly named .nal or .pal
> file when databases are split up into volumes. Pointer errors produce
> incorrect results and alignments with some large databases. NCBI
> Toolkit 2.2.1 does not show this behavior. Some of these errors have
> been reproduced by us on SGI MIPS IRIX platforms with SGI compilers,
> suggesting that the errors are neither Opteron nor compiler specific.
>
> Current workarounds are to:
>
> 1. explicitly name the formatdb output database with the -n option
>
> 2. use the '-o T' option in formatdb to alter the way blast indices
>    are created.
>
> Alternatively:
>
> 3. Use the 2.2.1 version of the blastall tools.
>
> _______________________________________
>
> TESTS
>
> Machine, OS, libs:
>
> 2 CPU AMD Opteron (Penguin), 6G RAM, SUSE Linux 8, 2.4.19 SMP Linux
> Kernel.
>
> Current configuration:
>
> opt:/gcgblast # gcc -v
> Reading specs from /usr/lib64/gcc-lib/x86_64-suse-linux/3.2.2/specs
> Configured with: ../configure --enable-threads=posix --prefix=/usr
> --with-local-prefix=/usr/local --infodir=/usr/share/info
> --mandir=/usr/share/man --libdir=/usr/lib64
> --enable-languages=c,c++,f77,objc,java,ada --enable-libgcj
> --with-gxx-include-dir=/usr/include/g++ --with-slibdir=/lib
> --with-system-zlib --enable-shared --enable-__cxa_atexit x86_64-suse-linux
> Thread model: posix
> gcc version 3.2.2 (SuSE Linux)
>
> (gcc-3.2.2-26.x86_64.rpm)
> (glibc-2.2.5-184.x86_64.rpm)
>
> ldd /usr/local/bin/blastall:
>
> libm.so.6 => /lib64/libm.so.6 (0x0000002a9566d000)
> libpthread.so.0 => /lib64/libpthread.so.0 (0x0000002a957c6000)
> libc.so.6 => /lib64/libc.so.6 (0x0000002a958e2000)
> /lib64/ld-linux-x86-64.so.2 => /lib64/ld-linux-x86-64.so.2 (0x0000002a95556000)
>
> _______________________________________
>
> Databases:
>
> ncbi: Human genome scaffold broken into 100KB pieces, 50KB overlap
> (5.9G)
>
> sncbi: same as above but long sequence names converted to shorter form
> (some names were very long and I wanted to make sure this was not a
> name indexing problem)
>
> htg: 20 August download of NCBI htg sequence file (11G uncompressed)
>
> _______________________________________
>
> Formatdb options:
>
> o: using '-o T' option for indexing
>
> no_o: no -o option
>
> Other formatdb options used: '-p F -n <name> -i <fasta_file>'
>
> _______________________________________
>
> blastall options: '-p tblastn -v 3 -b 3 -a 2 -d <db> -i <input_file>'
>
> _______________________________________
>
> Input file: 12 protein sequences from fly refseq:
> >BMSPROT:NP_478140
> >BMSPROT:NP_523807
> >BMSPROT:NP_609725
> >BMSPROT:NP_524716
> >BMSPROT:NP_524665
> >BMSPROT:NP_524468
> >BMSPROT:NP_523392
> >BMSPROT:NP_572997
> >BMSPROT:NP_524671
> >BMSPROT:NP_608480
> >BMSPROT:NP_524763
> >BMSPROT:NP_524817
>
> (I've checked, the 'BMSPROT:' prefix
> doesn't seem to affect the analysis).
> _______________________________________
>
> R E S U L T S
> ____________________________________________________________________
>
> NCBI Toolkit  ncbi-o  ncbi-no_o  sncbi-o  sncbi-no_o  htg-o  htg-no_o
>
> 2.2.1         pass    pass       pass     pass        pass   pass
>
> 2.2.6         pass    FAIL*      pass     FAIL*       pass   pass
>
> ____________________________________________________________________
>
> * - FAIL symptoms include error messages such as '[blastall] ERROR:
> ncbiapi [000.000] BMSPROT:NP_478140: ObjMgrChoice: pointer [0] type [1]
> not found', missing sequence names for db hits in the BLAST summary,
> and sporadic nonsense alignments.
>
> CONFIGURATION
>
> IBM,Siemers Opteron linux.ncbi.mk directives for 2.2.6 (April 2003),
> SUSE 8.1 Opteron Linux
>
> NCBI_DEFAULT_LCL = lnx
> NCBI_MAKE_SHELL = /bin/sh
> NCBI_CC = gcc -pipe -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE -O3
> -DOS_UNIX_PPCLINUX -I../include -I/usr/X11R6/include -L/usr/X11R6/lib64
> -DWIN_MOTIF
> # should probably be /usr/X11R6/lib64 above on SUSE 8.1
> NCBI_CFLAGS1 = -c
> NCBI_LDFLAGS1 =
> NCBI_OPTFLAG =
>
> Opteron linux.ncbi.mk directives for 2.2.1 NCBI Toolkit:
>
> NCBI_DEFAULT_LCL = lnx
> NCBI_MAKE_SHELL = /bin/sh
> NCBI_CC = gcc -pipe -D__USE_FILE_OFFSET64 -D__USE_LARGEFILE64
> NCBI_CFLAGS1 = -c -DOS_UNIX_PPCLINUX
> NCBI_LDFLAGS1 = -O2
> NCBI_OPTFLAG = -O2
>
> _______________________________________________
> Bioclusters maillist - Bioclusters@bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/bioclusters

--
Nathan Siemers|Associate Director|Applied Genomics|Bristol-Myers Squibb Pharmaceutical Research Institute|HW3-0.07|P.O. Box 5400|Princeton, NJ 08543-5400|(609)818-6568|nathan.siemers@bms.com
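
P.S. For anyone who would rather fix their fasta names than rely on -A F, here is a rough Python sketch of the idea described at the top of this message: plain deflines such as >name1 are rewritten as >lcl|name1, and anything that already looks NCBI-style is left alone. The function names and the tag list are my own assumptions for illustration (the list is a subset, not an authoritative one), and formatdb_args just assembles the options quoted in this thread ('-p F -n <name> -i <fasta_file>' plus '-A F'); treat it as a starting point rather than a tested pipeline.

```python
# Database tags that may start an NCBI-style identifier (tag|value...).
# Assumption: an illustrative subset, not taken from the message itself.
NCBI_TAGS = {"gi", "gb", "emb", "dbj", "pir", "prf", "sp",
             "pdb", "pat", "bbs", "gnl", "ref", "lcl"}


def looks_ncbi_style(defline):
    """True if the first word of a '>' line starts with a known 'tag|' prefix."""
    words = defline.lstrip(">").split(None, 1)
    if not words:
        return False
    tag, sep, rest = words[0].partition("|")
    return bool(sep) and bool(rest) and tag in NCBI_TAGS


def to_local_id(defline):
    """Rewrite '>name1' as '>lcl|name1'; leave NCBI-style deflines alone."""
    if looks_ncbi_style(defline):
        return defline
    return ">lcl|" + defline[1:].lstrip()


def fix_fasta(lines):
    """Yield FASTA lines with plain deflines converted to the lcl| form."""
    for line in lines:
        yield to_local_id(line) if line.startswith(">") else line


def formatdb_args(fasta_file, name):
    """Argument list for a nucleotide formatdb run with -A turned off.

    Only uses options quoted in this thread; file and db names are hypothetical.
    """
    return ["formatdb", "-p", "F", "-n", name, "-i", fasta_file, "-A", "F"]
```

Something like "".join(fix_fasta(open("db.fasta"))) would give the rewritten file contents, and the list from formatdb_args("db.fasta", "mydb") could be handed to subprocess.run if you prefer the -A F route instead. Either one should be enough on its own; the file and database names here are placeholders.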