Hi Chris and Justin:

On Thu, 2004-06-17 at 12:38, Chris Dwan wrote:
> Justin,
>
> I've poked around a bit, and run your queries on a variety of machines
> (P-III and Athlon...as well as a few others) which I have sitting
> around the shop here. I was unable to replicate your observed
> behavior.

Hmmm. I have had crashes when the accession lines were somehow mangled. But this occurred regardless of memory size.

[...]

> On Jun 16, 2004, at 10:46 AM, Justin Powell wrote:
>
> > Hi Chris
> >
> > A short query which goes wrong is
> >
> > actacgactagcatcagctacgctagatgactacgatcagctacgactagcatcgactacg
> >
> > I just have this in a text file on its own with no name line. The nt
> > database I'm using is from the ncbi ftp site blast/db directory and the
> > unzipped database files have the date June 11 2004.

So you do not have

    >accession data
    actacgactagcatcagctacgctagatgactacgatcagctacgactagcatcgactacg

in the test file, just

    actacgactagcatcagctacgctagatgactacgatcagctacgactagcatcgactacg

? If this is the case, try making a simple accession line such as

    >abc123|my random label
    actacgactagcatcagctacgctagatgactacgatcagctacgactagcatcgactacg

and see if it still crashes.

> > I've found the intermittency varies. Sometimes it seems it can be
> > provoked by running a blast against est first, and sometimes it seems
> > to work correctly time after time.

Oh... If it is not repeatable (repeatable meaning the same input file always generates the same error at the same place), then it is likely to be unrelated to the program itself. That is, the program happens to be hitting the case in the system which triggers the error. This usually comes about when you hit a bad physical memory location somewhere, or you have an OS bug or driver bug of some sort. SEGVs usually come about when a process touches memory the OS has not mapped for it (through a stray pointer, for example), so there could be other explanations. If you are swapping to a partition with some bad blocks, this could be a problem.

First: Do you have swap enabled?
What is the output of

    swapon -s

Second: What other programs are running? Is this an overclocked system?

Third: Have you run memtest86 on the unit for an extended period of time? You can pull the memtest86 3.1 iso from http://downloads.scalableinformatics.com

> > A second longer sequence I've had go wrong is
> >
> > TCCCCCGAATTTAAACGCGTTGAAAGGGTCATCCTTACTAGAAAAGAGAGTTG
> > ATTCTCTCCGACAGCTTAACACTACCACGGTTAACCAGCTGCTGGGGTTGCCGGGGATGACCTCTACATT
> > CACGGCTCCGCAACTGTTGCAGTTAAGAATAATAGCTATAACTGCGTCTGCCGTGTCCCTTATTGCCGGT
> > TGCCTCGGAATGTTCTTCCTTTCTAAAATGGATAAGAGACGAAAAGTCTTCAGACATGATCTCATCGCAT
> > TTTTGATAATTTGCGACTTTCTTAAAGCTTTTATTCTGATGATTTATCCCATGATTATCCTTATTAATAA
> > TAGTGTGTATGCAACACCTGCATTTTTTAATACCTTGGGTTGGTTTACGGCCTTTGCCATCGAAGGTGCA
> > GACATGGCCATAATGATATTCGCCATACATTTTGCTATTTTGATCTTCAAGCCTAATTGGAAATGGCGAA
> > ATAAAAGATCGGGAAATATGGAGGGTGGCTTGTACAAAAAAAGGTCATATATCTGGCCAATTACTGCATT
> > AGTACCTGCCATTTTAGCAAGCTTAGCCTTCATTAATTATAATAAACTCAATGACGATTCTGACACCACT
> > ATTATACTGGATAATAATAACTACAACTTTCCCGATTCTCCCAGGCAAGGTGGCTACAAACCTTGGAGTG
> > CATGGTGCTATTTACCACCCAAGCCGTACTGGTATAAAATTGTTTTAAGCTGGGGTCCCAGATATTTCAT
> > TATTATTTTCATATTTGCAGTCTACCTCAGTATTTATATTTTCATTACCAGTGAAAGTAAAAGAATTAAA
> > GCGCAAATTGGAGACTTTAACC
> >
> > I've tried recompiling with the -g flag on (and the -O3 flag off) and
> > run gdb on the coredump. However I'm not a C programmer (though I did
> > once read a book on it) and am not at all familiar with either C, gdb
> > or even the details of the call stack, so I'm not sure I've done all
> > this correctly. An example backtrace is like this, though others I've
> > had looked different:
> >
> > [root@prada bin]# gdb blastall core.9520
> > GNU gdb Red Hat Linux (5.3post-0.20021129.18rh)
> > Copyright 2003 Free Software Foundation, Inc.
> > GDB is free software, covered by the GNU General Public License, and you are
> > welcome to change it and/or distribute copies of it under certain conditions.
> > Type "show copying" to see the conditions.
> > There is absolutely no warranty for GDB. Type "show warranty" for details.
> > This GDB was configured as "i386-redhat-linux-gnu"...
> > Core was generated by `./blastall -p blastn -a 2 -d /usr/blasttest/nt -i /usr/blasttest/tempdna'.
> > Program terminated with signal 11, Segmentation fault.
> > Reading symbols from /lib/tls/libm.so.6...done.
> > Loaded symbols for /lib/tls/libm.so.6
> > Reading symbols from /lib/tls/libpthread.so.0...done.
> > Loaded symbols for /lib/tls/libpthread.so.0
> > Reading symbols from /lib/tls/libc.so.6...done.
> > Loaded symbols for /lib/tls/libc.so.6
> > Reading symbols from /lib/ld-linux.so.2...done.
> > Loaded symbols for /lib/ld-linux.so.2
> > Reading symbols from /lib/libnss_files.so.2...done.
> > Loaded symbols for /lib/libnss_files.so.2
> > #0  0x0805ea52 in BlastNtWordFinder (search=0x84363e8, lookup=0x842e6b8) at blast.c:9265
> > 9265        next_lindex = (((lookup_index) & mask)<<char_size) + *(s+1);

Ok. This is part of the word search section of BLAST. Basically it walks along the linear array looking for a match. This should not fail, though if it does, then the likely problem is in *(s+1). You could translate *(s+1) as "the contents of the location one element (one sizeof the pointed-to data type) past the address held in pointer s".

If s points to a valid location, but s+1 does not, it is possible that the memory allocation somehow failed to allocate sufficient memory for the array (unlikely, you would have seen this elsewhere). It is also possible that some OS-imposed boundary (a page boundary, say) falls between the addresses s and s+1 (the pointers that is, not their contents), and that by dereferencing s+1 as BLAST was doing, you happened to trigger the protection fault (which is what a SEGV is). For some reason, the OS thinks that the memory at s+1 does not belong to this process.
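To make the *(s+1) reading concrete, here is a minimal sketch in C. It is not the NCBI code: the function names roll_index and walk, the unsigned types, and the loop bound are my assumptions for illustration. Only the form of the expression is taken from line 9265 of the backtrace.

```c
#include <assert.h>
#include <stddef.h>

/* One index update in the same form as the expression at blast.c:9265.
 * Hypothetical helper: the parameter types are assumptions. */
unsigned roll_index(unsigned lookup_index, unsigned mask,
                    unsigned char_size, const unsigned char *s)
{
    /* *(s + 1) reads the element one past s. For unsigned char that is
     * exactly one byte further on; it is only legal if s + 1 is still
     * inside the array. */
    return ((lookup_index & mask) << char_size) + *(s + 1);
}

/* Walk a buffer of len bytes without ever letting s + 1 leave it.
 * Returns the index value after the last update. */
unsigned walk(const unsigned char *buf, size_t len,
              unsigned mask, unsigned char_size)
{
    assert(buf != NULL && len >= 1);

    unsigned index = buf[0];
    const unsigned char *s = buf;
    const unsigned char *last = buf + len - 1;  /* last valid element */

    while (s < last) {      /* s < last guarantees s + 1 <= last */
        index = roll_index(index, mask, char_size, s);
        s++;
    }
    return index;
}
```

If the loop bound were off by one (s <= last, say), the final *(s+1) would read one byte past the array. Whether that read faults depends on what the OS has mapped at that address, which would fit a crash that comes and goes.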
> > (gdb) backtrace
> > #0  0x0805ea52 in BlastNtWordFinder (search=0x84363e8, lookup=0x842e6b8) at blast.c:9265
> > #1  0x0805a473 in BlastWordFinder (search=0x84363e8) at blast.c:6847
> > #2  0x0805a336 in BlastExtendWordSearch (search=0x84363e8, multiple_hits=0 '\0') at blast.c:6803
> > #3  0x08059d7c in BLASTPerformFinalSearch (search=0x84363e8, subject_length=117793,
> >     subject_seq=0x7e12b129 <Address 0x7e12b129 out of bounds>) at blast.c:6612

Yup. Looks like memory somehow got mangled. You might have a look at using ddd (a graphical frontend to gdb), and do the run under it. Then we can look through the process a bit more easily. Basically, run the program completely from the debugger, see where it crashes, and then poke at it to find out why. Note: The location of the crash should not change by running it in the debugger. If it does, we might start to think more of a hardware problem (bad swap, bad memory chip, etc.) than of a program/OS bug.

> > #4  0x080596c8 in BLASTPerformSearch (search=0x84363e8, subject_length=117793,
> >     subject_seq=0x7e12b129 <Address 0x7e12b129 out of bounds>) at blast.c:6365
> > #5  0x0805967b in BLASTPerformSearchWithReadDb (search=0x84363e8, sequence_number=1629625) at blast.c:6344
> > #6  0x0805066f in do_blast_search (ptr=0x84363e8) at blast.c:3335
> > #7  0x0804d600 in NlmThreadWrapper (wrapper_arg=0x8439c80) at ncbithr.c:647
> > #8  0x400522b6 in start_thread () from /lib/tls/libpthread.so.0
> > (gdb) quit

One more thought. Do you get a crash with -a 1 (or with no -a option)? If not, has your code been compiled on an NPTL box? This has been a common problem in using NPTL (in RH9) versus linuxthreads, and caused some interesting crashes (though I seem to remember that they were not SEGVs).

Would you try some of my compiled 2.2.9 binaries or the ones from NCBI and let us know if you still get the crash? I am thinking this is a problem in the OS interacting with the program, and not a program bug per se.
If the problem persists across versions, and is repeatable, I would like to get a copy of the input file which causes it.

Joe

--
Joseph Landman, Ph.D
Scalable Informatics LLC,
email: landman@scalableinformatics.com
web  : http://scalableinformatics.com
phone: +1 734 612 4615