SOLVED Re: [Bioclusters] Opteron Perl64 segfault issues

Sat, 30 Aug 2003 09:22:32 -0400

Hello Tim,

	There are exact examples of the aberrant BLAST output in my 16 August 
post to the list.  Yes, I'll definitely submit a bug report to the NCBI.

	To our knowledge, the -A F option obviates the need for the other 
workarounds.
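
	For anyone who wants to check a FASTA file before running formatdb, here is a minimal sketch in Python.  It flags deflines whose identifier lacks a '|'-separated tag and builds a formatdb argument list that adds -A F when any such defline is found.  The "at least two non-empty fields" test is only my rough assumption about what counts as NCBI-style, not the toolkit's own parsing rule, and the function names, file name, and sample identifiers are hypothetical.

```python
def non_ncbi_deflines(lines):
    """Return identifiers from '>' lines that lack a 'tag|id' structure.

    Assumption: an identifier counts as NCBI-style if it has at least two
    non-empty '|'-separated fields (e.g. 'lcl|name1'). This is a rough
    heuristic, not the toolkit's own parsing rule.
    """
    bad = []
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence data, not a defline
        fields = line[1:].split()
        ident = fields[0] if fields else ""
        parts = ident.split("|")
        if len(parts) < 2 or not parts[0] or not parts[1]:
            bad.append(ident)
    return bad


def formatdb_args(fasta_path, db_name, deflines_ok):
    """Build a formatdb argument list for a nucleotide database (-p F)."""
    args = ["formatdb", "-p", "F", "-n", db_name, "-i", fasta_path]
    if not deflines_ok:
        # Turn off ASN.1 structured deflines when names are plain.
        args += ["-A", "F"]
    return args


# Hypothetical sample input: one plain name, two tagged names.
sample = [">name1", "ACGT", ">lcl|name2", "GGCC", ">gi|123|gb|ABC| demo"]
bad = non_ncbi_deflines(sample)
print(bad)
print(formatdb_args("seqs.fa", "seqs", deflines_ok=not bad))
```

	The argument list can be handed to subprocess.run() as is; I kept the check separate from the command builder so the check can be run on its own.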

	Someone told me that there was a flaming thread about this on the 
emboss lists (the bug breaking lots of stuff in emboss), but I haven't 
checked that list.

	Take care,

	Nathan

Tim Harsch wrote:
> I would like to get some clarification from you because I have some
> processes that do not use the -A F parameter, but do not use the ASN.1
> deflines.  I'm worried this issue may be causing problems I'm not yet aware
> of.  Can you summarize what the exact symptoms are and include
> blast-help@ncbi.nlm.nih.gov in your reply so that they might have a chance
> to fix the problem in future releases.
> 
> Also, setting -A F obviates the need for the workarounds you talked about,
> right?
> 
> ----- Original Message ----- 
> From: "Nathan O. Siemers" <Nathan.Siemers@bms.com>
> To: <bioclusters@bioinformatics.org>
> Sent: Wednesday, August 27, 2003 5:48 AM
> Subject: SOLVED Re: [Bioclusters] Opteron Perl64 segfault issues
> 
> 
> 
>>
>>Sorry for the Opteron spam, but I hope this will help folks doing this
>>in the future ;)
>>
>>We now believe that the aberrant behavior in NCBI BLAST in some
>>configurations can be completely traced to a single character change in
>>the source code...
>>
>>In recent releases of the NCBI toolkit, the formatdb option to create
>>ASN.1 structured deflines (-A) has been turned on by default, a
>>divergence from previous behavior.  Unpredictable (and wrong!) things
>>happen when sequences are input to formatdb that do not follow the
>>arcane NCBI fasta naming terminology (foo|bar|etc|blah) when this option
>>is selected.  In our case, we were using very simple naming conventions:
>>
>> >name1
>> >name2
>> >name3
>>
>>(ncbi would have demanded something like >lcl|name1  )
>>
>>
>>etc.  This is not compatible with the new default behavior of formatdb.
>>
>>Solution:  if you do not follow the NCBI fasta naming structure exactly,
>>use the -A F option of formatdb and/or change the default in formatdb.c.
>>
>>NCBI toolkit versions somewhere after 2.2.1 have this problem.
>>
>>Classic NCBI.
>>
>>Nathan
>>
>>
>>
>>
>>
>>
>>
>>
>>Nathan O. Siemers wrote:
>>
>>>All:
>>>
>>>    Joe Landman from Scalable Informatics, Lawrence Hannon from IBM, and
>>>I have been working on issues running blast on the AMD opteron platform.
>>>I've summarized my results (with much help from Joe and Lawrence) in
>>>validating the blastall and formatdb code.  There are quirks with the
>>>latest versions of the NCBI toolkit, producing corrupt blast results in
>>>some situations.  They only appear with some (large) databases but we
>>>are not sure what exactly causes this behavior at the present time.  We
>>>have tentative workarounds, listed below.
>>>
>>>
>>>Thanks to everyone who has helped me over the past few weeks - the
>>>bottom line is that *none* of the problems I have seen over the past
>>>weeks could actually be traced to problems with Opteron hardware (other
>>>than a RAM chip) or Linux OS.  This is great news for Opteron.
>>>
>>>
>>>
>>>SUMMARY
>>>
>>>Builds of formatdb and blastall from the NCBI Toolkit version 2.2.6
>>>can produce corrupted output when used with some formatdb parameters
>>>in all builds so far tested on the AMD Opteron 64 bit platform.
>>>Symptoms include failure to produce a correctly named .nal or .pal
>>>file when databases are split up into volumes.  Pointer errors produce
>>>incorrect results and alignments with some large databases.  NCBI
>>>Toolkit 2.2.1 does not show this behavior.  Some of these errors have
>>>been reproduced by us on SGI MIPS IRIX platforms with SGI compilers,
>>>suggesting that the errors are neither Opteron nor compiler specific.
>>>
>>>
>>>
>>>
>>>
>>>Current workarounds are to:
>>>
>>>    1.  explicitly name the formatdb output database with the -n option
>>>
>>>    2.  use the '-o T' option in formatdb to alter the way blast indices
>>>        are created.
>>>
>>>    Alternatively:
>>>
>>>    3.  Use the 2.2.1 version of the blastall tools.
>>>
>>>
>>>
>>>
>>>
>>>_______________________________________
>>>
>>>TESTS
>>>
>>>Machine, OS, libs:
>>>
>>>2 CPU AMD Opteron (Penguin), 6G RAM, SUSE Linux 8, 2.4.19 SMP Linux
>>>Kernel.
>>>
>>>Current configuration:
>>>
>>>opt:/gcgblast # gcc -v
>>>Reading specs from /usr/lib64/gcc-lib/x86_64-suse-linux/3.2.2/specs
>>>Configured with: ../configure --enable-threads=posix --prefix=/usr
>>>--with-local-prefix=/usr/local --infodir=/usr/share/info
>>>--mandir=/usr/share/man --libdir=/usr/lib64
>>>--enable-languages=c,c++,f77,objc,java,ada --enable-libgcj
>>>--with-gxx-include-dir=/usr/include/g++ --with-slibdir=/lib
>>>--with-system-zlib --enable-shared --enable-__cxa_atexit x86_64-suse-linux
>>>Thread model: posix
>>>gcc version 3.2.2 (SuSE Linux)
>>>
>>>(gcc-3.2.2-26.x86_64.rpm)
>>>(glibc-2.2.5-184.x86_64.rpm)
>>>
>>>ldd /usr/local/bin/blastall:
>>>
>>>        libm.so.6 => /lib64/libm.so.6 (0x0000002a9566d000)
>>>        libpthread.so.0 => /lib64/libpthread.so.0 (0x0000002a957c6000)
>>>        libc.so.6 => /lib64/libc.so.6 (0x0000002a958e2000)
>>>        /lib64/ld-linux-x86-64.so.2 => /lib64/ld-linux-x86-64.so.2 (0x0000002a95556000)
>>>
>>>
>>>_______________________________________
>>>
>>>
>>>Databases:
>>>
>>>ncbi:  Human genome scaffold broken into 100KB pieces, 50KB overlap
>>>(5.9G)
>>>
>>>sncbi:  same as above but long sequence names converted to shorter form
>>>(some names were very long and I wanted to make sure this was not a
>>>name-indexing problem)
>>>
>>>htg:  20 August download of NCBI htg sequence file (11G uncompressed)
>>>
>>>_______________________________________
>>>
>>>Formatdb options:
>>>
>>>o:  using '-o T' option for indexing
>>>
>>>no_o:     no -o option
>>>
>>>Other formatdb options used:  '-p F -n <name> -i <fasta_file>'
>>>
>>>_______________________________________
>>>
>>>blastall options:  '-p tblastn -v 3 -b 3 -a 2 -d <db> -i <input_file>'
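
The test matrix described above (three databases, each formatted with and without '-o T', then searched with the same blastall options) can be sketched in Python as follows.  This only prints the command lines; the FASTA and query file names and the '<db>-<mode>' naming of the output databases are hypothetical stand-ins, while the flags are the ones listed above.

```python
from itertools import product

DATABASES = ["ncbi", "sncbi", "htg"]
# 'o' adds the '-o T' indexing option; 'no_o' leaves it out.
INDEX_MODES = {"o": ["-o", "T"], "no_o": []}


def formatdb_cmd(db, mode):
    """formatdb command line for one database/indexing combination."""
    name = f"{db}-{mode}"   # hypothetical output database name
    fasta = f"{db}.fa"      # hypothetical input file name
    return ["formatdb", "-p", "F", "-n", name, "-i", fasta] + INDEX_MODES[mode]


def blastall_cmd(db_name, query):
    """blastall command line with the options used in these tests."""
    return ["blastall", "-p", "tblastn", "-v", "3", "-b", "3",
            "-a", "2", "-d", db_name, "-i", query]


for db, mode in product(DATABASES, INDEX_MODES):
    print(" ".join(formatdb_cmd(db, mode)))
    print(" ".join(blastall_cmd(f"{db}-{mode}", "fly12.fa")))
```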
>>>
>>>_______________________________________
>>>
>>>Input file:  12 protein sequences from fly refseq:
>>> >BMSPROT:NP_478140
>>> >BMSPROT:NP_523807
>>> >BMSPROT:NP_609725
>>> >BMSPROT:NP_524716
>>> >BMSPROT:NP_524665
>>> >BMSPROT:NP_524468
>>> >BMSPROT:NP_523392
>>> >BMSPROT:NP_572997
>>> >BMSPROT:NP_524671
>>> >BMSPROT:NP_608480
>>> >BMSPROT:NP_524763
>>> >BMSPROT:NP_524817
>>>
>>>(I've checked, the 'BMSPROT:' prefix doesn't seem to affect the analysis).
>>>
>>>_______________________________________
>>>
>>>R E S U L T S
>>>____________________________________________________________________
>>>
>>>NCBI Toolkit  ncbi-o  ncbi-no_o  sncbi-o  sncbi-no_o  htg-o  htg-no_o
>>>
>>>2.2.1         pass    pass       pass     pass        pass   pass
>>>
>>>2.2.6         pass    FAIL*      pass     FAIL*       pass   pass
>>>
>>>____________________________________________________________________
>>>
>>>
>>>* - FAIL symptoms include error messages such as
>>>'[blastall] ERROR: ncbiapi [000.000] BMSPROT:NP_478140: ObjMgrChoice: pointer [0] type [1] not found',
>>>missing sequence names for db hits in the BLAST summary, and sporadic
>>>nonsense alignments.
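
A quick way to screen a batch of runs for the first symptom is to count matches of the ObjMgrChoice message in blastall's error output.  Below is a minimal Python sketch; the pattern is taken from the error text quoted above, the function name is hypothetical, and it assumes the message arrives on a single line.  It does not detect the missing names or the nonsense alignments.

```python
import re

# Pattern based on the error text quoted in this thread; the bracketed
# numbers are matched loosely in case they vary between runs.
OBJMGR_ERROR = re.compile(r"ObjMgrChoice: pointer \[\d+\] type \[\d+\] not found")


def count_objmgr_errors(stderr_text):
    """Count lines of error output that contain the ObjMgrChoice symptom."""
    return sum(1 for line in stderr_text.splitlines()
               if OBJMGR_ERROR.search(line))


log = (
    "[blastall] ERROR: ncbiapi [000.000] BMSPROT:NP_478140: "
    "ObjMgrChoice: pointer [0] type [1] not found\n"
    "some other line\n"
)
print(count_objmgr_errors(log))
```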
>>>
>>>CONFIGURATION
>>>
>>>IBM, Siemers Opteron linux.ncbi.mk directives for 2.2.6 (April 2003),
>>>SUSE 8.1 Opteron Linux
>>>
>>>NCBI_DEFAULT_LCL = lnx
>>>NCBI_MAKE_SHELL = /bin/sh
>>>NCBI_CC = gcc -pipe -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE -O3
>>>-DOS_UNIX_PPCLINUX  -I../include -I/usr/X11R6/include -L/usr/X11R6/lib64
>>>-DWIN_MOTIF
>>># should probably be /usr/X11R6/lib64 above on SUSE 8.1
>>>NCBI_CFLAGS1 = -c
>>>NCBI_LDFLAGS1 =
>>>NCBI_OPTFLAG =
>>>
>>>Opteron linux.ncbi.mk directives for 2.2.1 NCBI Toolkit:
>>>
>>>
>>>NCBI_DEFAULT_LCL = lnx
>>>NCBI_MAKE_SHELL = /bin/sh
>>>NCBI_CC = gcc -pipe -D__USE_FILE_OFFSET64 -D__USE_LARGEFILE64
>>>NCBI_CFLAGS1 = -c -DOS_UNIX_PPCLINUX
>>>NCBI_LDFLAGS1 = -O2
>>>NCBI_OPTFLAG = -O2
>>>
>>>_______________________________________________
>>>Bioclusters maillist  -  Bioclusters@bioinformatics.org
>>>https://bioinformatics.org/mailman/listinfo/bioclusters
>>
>>-- 
>>Nathan Siemers|Associate Director|Applied Genomics|Bristol-Myers Squibb
>>Pharmaceutical Research
>>Institute|HW3-0.07|P.O. Box 5400|Princeton, NJ
>>08543-5400|(609)818-6568|nathan.siemers@bms.com
