SOLVED Re: [Bioclusters] Opteron Perl64 segfault issues

Nathan O. Siemers bioclusters@bioinformatics.org
Wed, 27 Aug 2003 08:48:54 -0400


	Sorry for the Opteron spam, but I hope this will help folks doing this 
in the future ;)

	We now believe that the aberrant behavior in NCBI blast in some 
configurations can be completely traced to a single character change in 
the source code...

	In recent releases of the ncbi toolkit, the formatdb option to create 
ASN.1 structured deflines (-A) has been turned on by default, a 
divergence from previous behavior.  Unpredictable (and wrong!) things 
happen when sequences are input to formatdb that do not follow the 
arcane NCBI fasta naming convention (foo|bar|etc|blah) when this option 
is selected.  In our case, we were using very simple naming conventions:

 >name1
 >name2
 >name3

and so on (ncbi would have demanded something like >lcl|name1).  Names 
like these are not compatible with the new default behavior of formatdb.
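
For anyone who wants to check a FASTA file before running formatdb, here 
is a minimal Python sketch.  It is not part of the NCBI toolkit, and the 
function names are made up for illustration.  It assumes that an 
identifier containing a '|' is already in NCBI form, which is only a 
heuristic, and that the lcl| prefix mentioned above is enough to keep 
formatdb happy.

```python
def bare_deflines(lines):
    """Return (line_number, identifier) for deflines without an NCBI-style ID."""
    found = []
    for number, line in enumerate(lines, start=1):
        if line.startswith(">"):
            fields = line[1:].split(None, 1)
            seq_id = fields[0] if fields else ""
            # Heuristic (an assumption): an ID containing '|' is already structured.
            if "|" not in seq_id:
                found.append((number, seq_id))
    return found


def add_local_prefix(lines):
    """Yield FASTA lines, rewriting bare deflines such as '>name1' to '>lcl|name1'."""
    for line in lines:
        header = line[1:].strip() if line.startswith(">") else ""
        if header and "|" not in header.split(None, 1)[0]:
            yield ">lcl|" + header + "\n"
        else:
            # Sequence lines and already-structured deflines pass through unchanged.
            yield line
```

Calling bare_deflines(open(path)) lists the offending headers, and 
add_local_prefix can be used to write a corrected copy of the file.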

Solution:  if you do not follow the NCBI fasta naming structure exactly, 
use the -A F option of formatdb and/or change the default in formatdb.c.
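
As a rough sketch of the first option, the small Python helper below 
builds a formatdb command line with -A F added to the other options used 
in the tests quoted further down (-p, -n, -i).  The helper name and the 
file and database names are hypothetical, and the flags are the 
formatdb flags as they stand in these toolkit versions.

```python
import subprocess


def formatdb_command(fasta_path, db_name, protein=False):
    """Build a formatdb argument list with ASN.1 structured deflines switched off."""
    return [
        "formatdb",
        "-i", fasta_path,               # input FASTA file
        "-p", "T" if protein else "F",  # F = nucleotide database, T = protein
        "-n", db_name,                  # explicit output database name
        "-A", "F",                      # the fix: no ASN.1 structured deflines
    ]


if __name__ == "__main__":
    # Hypothetical file and database names, for illustration only.
    cmd = formatdb_command("scaffolds.fasta", "scaffolds")
    print(" ".join(cmd))
    # Uncomment to actually run it (requires formatdb on the PATH):
    # subprocess.run(cmd, check=True)
```

Keeping -A F in a wrapper like this means the database is built the old 
way without having to patch the default in formatdb.c.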

NCBI toolkit versions somewhere after 2.2.1 have this problem.
	
	Classic NCBI.

	Nathan

Nathan O. Siemers wrote:
> All:
> 
>     Joe Landman from Scalable Informatics, Lawrence Hannon from IBM, and 
> I have been working on issues running blast on the AMD opteron platform. 
> I've summarized my results (with much help from Joe and Lawrence) in 
> validating the blastall and formatdb code.  There are quirks with the 
> latest versions of the NCBI toolkit, producing corrupt blast results in 
> some situations.  They only appear with some (large) databases but we 
> are not sure what exactly causes this behavior at the present time.  We 
> have tentative workarounds, listed below.
> 
> 
> Thanks to everyone who has helped me over the past few weeks - the 
> bottom line is that *none* of the problems I have seen in that time 
> could actually be traced to problems with Opteron hardware (other 
> than a RAM chip) or the Linux OS.  This is great news for Opteron.
> 
> 
> 
> SUMMARY
> 
> Builds of formatdb and blastall from the NCBI Toolkit version 2.2.6
> can produce corrupted output when used with some formatdb parameters
> in all builds so far tested on the AMD Opteron 64 bit platform.
> Symptoms include failure to produce a correctly named .nal or .pal
> file when databases are split up into volumes.  Pointer errors produce
> incorrect results and alignments with some large databases.  NCBI
> Toolkit 2.2.1 does not show this behavior.  Some of these errors have
> been reproduced by us on SGI MIPS IRIX platforms with SGI compilers,
> suggesting that the errors are neither Opteron nor compiler specific.
> 
> 
> 
> 
> 
> Current workarounds are to:
> 
>     1.  explicitly name the formatdb output database with the -n option
> 
>     2.  use the '-o T' option in formatdb to alter the way blast indices
>         are created.
> 
>     Alternatively:
>     
>     3.  Use the 2.2.1 version of the blastall tools.
> 
> 
> 
> 
> 
> _______________________________________
> 
> TESTS
> 
> Machine, OS, libs:
> 
> 2 CPU AMD Opteron (Penguin), 6G RAM, SUSE Linux 8, 2.4.19 SMP Linux
> Kernel.
> 
> Current configuration:
> 
> opt:/gcgblast # gcc -v
> Reading specs from /usr/lib64/gcc-lib/x86_64-suse-linux/3.2.2/specs
> Configured with: ../configure --enable-threads=posix --prefix=/usr 
> --with-local-prefix=/usr/local --infodir=/usr/share/info 
> --mandir=/usr/share/man --libdir=/usr/lib64 
> --enable-languages=c,c++,f77,objc,java,ada --enable-libgcj 
> --with-gxx-include-dir=/usr/include/g++ --with-slibdir=/lib 
> --with-system-zlib --enable-shared --enable-__cxa_atexit x86_64-suse-linux
> Thread model: posix
> gcc version 3.2.2 (SuSE Linux)
> 
> (gcc-3.2.2-26.x86_64.rpm)
> (glibc-2.2.5-184.x86_64.rpm)
> 
> ldd /usr/local/bin/blastall:
> 
>         libm.so.6 => /lib64/libm.so.6 (0x0000002a9566d000)
>         libpthread.so.0 => /lib64/libpthread.so.0 (0x0000002a957c6000)
>         libc.so.6 => /lib64/libc.so.6 (0x0000002a958e2000)
>         /lib64/ld-linux-x86-64.so.2 => /lib64/ld-linux-x86-64.so.2 
> (0x0000002a95556000)
> 
> 
> _______________________________________
> 
> 
> Databases:
> 
> ncbi:  Human genome scaffold broken into 100KB pieces, 50KB overlap
> (5.9G)
> 
> sncbi:  same as above but long sequence names converted to shorter form
> (some names were very long and I wanted to make sure this was not a
> name indexing problem)
> 
> htg:  20 August download of NCBI htg sequence file (11G uncompressed)
> 
> _______________________________________
> 
> Formatdb options:
> 
> o:  using '-o T' option for indexing
> 
> no_o:     no -o option
> 
> Other formatdb options used:  '-p F -n <name> -i <fasta_file>'
> 
> _______________________________________
> 
> blastall options:  '-p tblastn -v 3 -b 3 -a 2 -d <db> -i <input_file>'
> 
> _______________________________________
> 
> Input file:  12 protein sequences from fly refseq:
>  >BMSPROT:NP_478140
>  >BMSPROT:NP_523807
>  >BMSPROT:NP_609725
>  >BMSPROT:NP_524716
>  >BMSPROT:NP_524665
>  >BMSPROT:NP_524468
>  >BMSPROT:NP_523392
>  >BMSPROT:NP_572997
>  >BMSPROT:NP_524671
>  >BMSPROT:NP_608480
>  >BMSPROT:NP_524763
>  >BMSPROT:NP_524817
> 
> (I've checked, the 'BMSPROT:' prefix doesn't seem to affect the analysis).
> _______________________________________
> 
> R E S U L T S
> ____________________________________________________________________
> 
> NCBI Toolkit  ncbi-o  ncbi-no_o  sncbi-o  sncbi-no_o  htg-o  htg-no_o
> 
> 2.2.1         pass    pass       pass     pass        pass   pass
> 
> 2.2.6         pass    FAIL*      pass     FAIL*       pass   pass
> 
> ____________________________________________________________________
> 
> 
> * - FAIL symptoms include error messages: '[blastall] ERROR: ncbiapi
> [000.000] BMSPROT:NP_478140: ObjMgrChoice: pointer [0] type [1] not
> found', missing sequence names for db hits in the BLAST summary, and
> sporadic nonsense alignments.
> 
> CONFIGURATION
> 
> IBM, Siemers Opteron linux.ncbi.mk directives for 2.2.6 (April 2003),
> SUSE 8.1 Opteron Linux
> 
> NCBI_DEFAULT_LCL = lnx
> NCBI_MAKE_SHELL = /bin/sh
> NCBI_CC = gcc -pipe -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE -O3 
> -DOS_UNIX_PPCLINUX  -I../include -I/usr/X11R6/include -L/usr/X11R6/lib64 
> -DWIN_MOTIF
> # should probably be /usr/X11R6/lib64 above on SUSE 8.1
> NCBI_CFLAGS1 = -c
> NCBI_LDFLAGS1 =
> NCBI_OPTFLAG =
> 
> Opteron linux.ncbi.mk directives for 2.2.1 NCBI Toolkit:
> 
> 
> NCBI_DEFAULT_LCL = lnx
> NCBI_MAKE_SHELL = /bin/sh
> NCBI_CC = gcc -pipe -D__USE_FILE_OFFSET64 -D__USE_LARGEFILE64
> NCBI_CFLAGS1 = -c -DOS_UNIX_PPCLINUX
> NCBI_LDFLAGS1 = -O2
> NCBI_OPTFLAG = -O2
> 
> _______________________________________________
> Bioclusters maillist  -  Bioclusters@bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/bioclusters

-- 
Nathan Siemers|Associate Director|Applied Genomics|
Bristol-Myers Squibb Pharmaceutical Research Institute|HW3-0.07|
P.O. Box 5400|Princeton, NJ 08543-5400|(609)818-6568|nathan.siemers@bms.com