Hi Joe,

Thanks for the info. I've tested with -a 1; it does indeed only go wrong
with -a 2, so I've kludged it for the time being. However, as to your
theory about RedHat9 NPTL being involved, I also get exactly the same
behaviour on a RedHat7.1 system running ncbi blast 2.2.6 (i.e. it goes
wrong on the nt database but not the est database, and only with -a 2,
not with -a 1). So I guess if the -a switch changes things, it's not
likely to be bad RAM?

In reply to your other questions, the output from swapon -s is

Filename    Type        Size     Used   Priority
/dev/sda2   partition   1807304  15036  -1

for the rh7.1 system, and

Filename    Type        Size     Used   Priority
/dev/sda3   partition   1020116  10496  -1

for the rh9 system.

Adding a name line to the query makes no difference. Neither system is
overclocked. I've not run the memory checker yet, but I have two
identical RedHat9 boxes and they both do it. So that makes 3 systems,
and I can test a 4th shortly too.

I've not had time to run the graphical debugger - I'm pretty snowed
under till Monday.

Justin

On Fri, 18 Jun 2004, Joe Landman wrote:

> Hi Chris and Justin:
>
> On Thu, 2004-06-17 at 12:38, Chris Dwan wrote:
> > Justin,
> >
> > I've poked around a bit, and run your queries on a variety of machines
> > (P-III and Athlon...as well as a few others) which I have sitting
> > around the shop here. I was unable to replicate your observed
> > behavior.
>
> Hmmm. I have had crashes when the accession lines were somehow
> mangled. But this occurred regardless of memory size.
>
> [...]
>
> > > On Jun 16, 2004, at 10:46 AM, Justin Powell wrote:
> > >
> > > Hi Chris
> > >
> > > A short query which goes wrong is
> > >
> > > actacgactagcatcagctacgctagatgactacgatcagctacgactagcatcgactacg
> > >
> > > I just have this in a text file on its own with no name line. The nt
> > > database I'm using is from the ncbi ftp site blast/db directory and the
> > > unzipped database files have the date June 11 2004.
>
> So you do not have
>
> >accession data
> actacgactagcatcagctacgctagatgactacgatcagctacgactagcatcgactacg
>
> in the test file, just
>
> actacgactagcatcagctacgctagatgactacgatcagctacgactagcatcgactacg
>
> ?
>
> If this is the case, try making a simple accession line such as
>
> >abc123|my random label
> actacgactagcatcagctacgctagatgactacgatcagctacgactagcatcgactacg
>
> and see if it still crashes.
>
> > > I've found the intermittency varies. Sometimes it seems it can be
> > > provoked by running a blast against est first, and sometimes it
> > > seems to work correctly time after time.
>
> Oh... If it is not repeatable (e.g. repeatable == same input file always
> generates the same error at the same place), then it is likely to be
> unrelated to the program itself. That is, the program happens to be
> hitting the case in the system which triggers the error. This usually
> comes about when you hit a bad physical memory location somewhere, or
> you have an OS bug or driver bug of some sort.
>
> SEGV's usually come about when one process stamps on another process's
> memory, so there could be other explanations. If you are swapping to a
> partition with some bad bytes, this could be a problem.
>
> First: Do you have swap enabled? What is the output of
>
> swapon -s
>
> Second: What other programs are running? Is this an overclocked system?
>
> Third: have you run memtest86 on the unit for an extended period of
> time?
> You can pull the memtest86 3.1 iso from
> http://downloads.scalableinformatics.com
>
> > > A second longer sequence I've had go wrong is
> > >
> > > TCCCCCGAATTTAAACGCGTTGAAAGGGTCATCCTTACTAGAAAAGAGAGTTG
> > > ATTCTCTCCGACAGCTTAACACTACCACGGTTAACCAGCTGCTGGGGTTGCCGGGGATGACCTCTACATT
> > > CACGGCTCCGCAACTGTTGCAGTTAAGAATAATAGCTATAACTGCGTCTGCCGTGTCCCTTATTGCCGGT
> > > TGCCTCGGAATGTTCTTCCTTTCTAAAATGGATAAGAGACGAAAAGTCTTCAGACATGATCTCATCGCAT
> > > TTTTGATAATTTGCGACTTTCTTAAAGCTTTTATTCTGATGATTTATCCCATGATTATCCTTATTAATAA
> > > TAGTGTGTATGCAACACCTGCATTTTTTAATACCTTGGGTTGGTTTACGGCCTTTGCCATCGAAGGTGCA
> > > GACATGGCCATAATGATATTCGCCATACATTTTGCTATTTTGATCTTCAAGCCTAATTGGAAATGGCGAA
> > > ATAAAAGATCGGGAAATATGGAGGGTGGCTTGTACAAAAAAAGGTCATATATCTGGCCAATTACTGCATT
> > > AGTACCTGCCATTTTAGCAAGCTTAGCCTTCATTAATTATAATAAACTCAATGACGATTCTGACACCACT
> > > ATTATACTGGATAATAATAACTACAACTTTCCCGATTCTCCCAGGCAAGGTGGCTACAAACCTTGGAGTG
> > > CATGGTGCTATTTACCACCCAAGCCGTACTGGTATAAAATTGTTTTAAGCTGGGGTCCCAGATATTTCAT
> > > TATTATTTTCATATTTGCAGTCTACCTCAGTATTTATATTTTCATTACCAGTGAAAGTAAAAGAATTAAA
> > > GCGCAAATTGGAGACTTTAACC
> > >
> > >
> > > I've tried recompiling with the -g flag on (and the -O3 flag off) and
> > > run gdb on the coredump. However I'm not a c programmer (though I did
> > > once read a book on it) and am not at all familiar with either C, gdb
> > > or even the details of the call stack, so I'm not sure I've done all
> > > this correctly. An example backtrace is like this, though others I've
> > > had looked different:
> > >
> > > [root@prada bin]# gdb blastall core.9520
> > > GNU gdb Red Hat Linux (5.3post-0.20021129.18rh)
> > > Copyright 2003 Free Software Foundation, Inc.
> > > GDB is free software, covered by the GNU General Public License, and
> > > you are welcome to change it and/or distribute copies of it under
> > > certain conditions.
> > > Type "show copying" to see the conditions.
> > > There is absolutely no warranty for GDB. Type "show warranty" for
> > > details.
> > > This GDB was configured as "i386-redhat-linux-gnu"...
> > > Core was generated by `./blastall -p blastn -a 2 -d /usr/blasttest/nt
> > > -i /usr/blasttest/tempdna'.
> > > Program terminated with signal 11, Segmentation fault.
> > > Reading symbols from /lib/tls/libm.so.6...done.
> > > Loaded symbols for /lib/tls/libm.so.6
> > > Reading symbols from /lib/tls/libpthread.so.0...done.
> > > Loaded symbols for /lib/tls/libpthread.so.0
> > > Reading symbols from /lib/tls/libc.so.6...done.
> > > Loaded symbols for /lib/tls/libc.so.6
> > > Reading symbols from /lib/ld-linux.so.2...done.
> > > Loaded symbols for /lib/ld-linux.so.2
> > > Reading symbols from /lib/libnss_files.so.2...done.
> > > Loaded symbols for /lib/libnss_files.so.2
> > > #0  0x0805ea52 in BlastNtWordFinder (search=0x84363e8,
> > >     lookup=0x842e6b8) at blast.c:9265
> > > 9265    next_lindex = (((lookup_index) & mask)<<char_size) + *(s+1);
>
> Ok. This is part of the word search section of BLAST. Basically it
> walks along the linear array looking for a match. This should not fail,
> though if it does, then the likely problem is in *(s+1). You could
> translate *(s+1) as "the contents of the location pointed to by pointer
> s incremented by one sizeof data type". If s points to a valid
> location, but s+1 does not, it is possible that the memory allocation
> somehow failed to allocate sufficient memory for the array (unlikely,
> you would have seen this elsewhere). It is also possible that there is
> some OS imposed boundary between the values of s and s+1 (the pointers
> that is, not their contents), and by accessing the contents
> (dereferencing) the pointer as BLAST was doing, you happened to trigger
> the protection fault (which is what SEGV is).
>
> For some reason, the OS thinks that *(s+1) is owned by someone else.
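
To illustrate the point about *(s+1): a minimal C sketch, not taken from
blast.c, assuming a hypothetical byte buffer and a rolling lookup index
built in the same shape as the quoted line 9265. The names next_index and
scan_words, the explicit len parameter, and the use of uint8_t are all
invented for this example. The loop stops one byte short of the end so
that s+1 never points past the allocation. Reading beyond the buffer is
undefined behaviour, and it only shows up as a SEGV when the next address
happens to land on an unmapped or protected page, which is one reason such
a fault can look intermittent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper with the same shape as the quoted statement:
 *   next_lindex = ((lookup_index & mask) << char_size) + *(s+1);
 * It folds the byte AFTER s into the rolling index, so the caller
 * must guarantee that s+1 is still inside the buffer. */
static uint32_t next_index(uint32_t lookup_index, uint32_t mask,
                           unsigned char_size, const uint8_t *s)
{
    return ((lookup_index & mask) << char_size) + *(s + 1);
}

/* Walk a buffer of len bytes and return the final rolling index.
 * The bound "i + 1 < len" keeps buf + i + 1 in range on every
 * iteration; a bound of "i < len" would read one byte too far. */
static uint32_t scan_words(const uint8_t *buf, size_t len,
                           uint32_t mask, unsigned char_size)
{
    uint32_t index;
    size_t i;

    assert(buf != NULL || len == 0);
    if (len == 0)
        return 0;

    index = buf[0];
    for (i = 0; i + 1 < len; i++)
        index = next_index(index, mask, char_size, buf + i);
    return index;
}
```

With two threads (-a 2) the same kind of read can also go wrong if one
thread is handed a pointer or length that another thread has changed, so a
sketch like this only covers the single-threaded bounds question.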
>
> > > (gdb) backtrace
> > > #0  0x0805ea52 in BlastNtWordFinder (search=0x84363e8,
> > >     lookup=0x842e6b8) at blast.c:9265
> > > #1  0x0805a473 in BlastWordFinder (search=0x84363e8) at blast.c:6847
> > > #2  0x0805a336 in BlastExtendWordSearch (search=0x84363e8,
> > >     multiple_hits=0 '\0') at blast.c:6803
> > > #3  0x08059d7c in BLASTPerformFinalSearch (search=0x84363e8,
> > >     subject_length=117793,
> > >     subject_seq=0x7e12b129 <Address 0x7e12b129 out of bounds>)
> > >     at blast.c:6612
>
> Yup. Looks like memory somehow got mangled. You might have a look at
> using ddd (graphical frontend to gdb), and do the run. Then we can look
> through the process a bit easier. Basically run the system completely
> from the debugger, and see where it crashes, and then poke at it as to
> why.
>
> Note: The location of the crash should not change by running it in the
> debugger. If it does, we might start to think more of a hardware
> problem (bad swap, bad memory chip, etc) than of a program/OS bug.
>
> > > #4  0x080596c8 in BLASTPerformSearch (search=0x84363e8,
> > >     subject_length=117793,
> > >     subject_seq=0x7e12b129 <Address 0x7e12b129 out of bounds>)
> > >     at blast.c:6365
> > > #5  0x0805967b in BLASTPerformSearchWithReadDb (search=0x84363e8,
> > >     sequence_number=1629625) at blast.c:6344
> > > #6  0x0805066f in do_blast_search (ptr=0x84363e8) at blast.c:3335
> > > #7  0x0804d600 in NlmThreadWrapper (wrapper_arg=0x8439c80)
> > >     at ncbithr.c:647
> > > #8  0x400522b6 in start_thread () from /lib/tls/libpthread.so.0
> > > (gdb) quit
>
> One more thought. Do you get a crash with -a 1 (or no -a line)? If
> not, has your code been compiled on an NPTL box? This has been a common
> problem in using NPTL (in RH9) versus linuxthreads, and caused some
> interesting crashes (though I seem to remember that they were not
> segv's).
>
> Would you try some of my compiled 2.2.9 binaries or the ones from NCBI
> and let us know if you still get the crash?
> I am thinking this is a
> problem in the OS interacting with the program, and not a program bug
> per se. If the problem persists across versions, and is repeatable, I
> would like to get a copy of the input file which causes it.
>
> Joe
>
> --
> Joseph Landman, Ph.D
> Scalable Informatics LLC,
> email: landman@scalableinformatics.com
> web  : http://scalableinformatics.com
> phone: +1 734 612 4615
>
> _______________________________________________
> Bioclusters maillist - Bioclusters@bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/bioclusters