[Bioclusters] ncbi blast

Fri, 18 Jun 2004 07:24:49 -0400

Hi Chris and Justin:

On Thu, 2004-06-17 at 12:38, Chris Dwan wrote:
> Justin,
> 
> I've poked around a bit, and run your queries on a variety of machines 
> (P-III and Athlon...as well as a few others) which I have sitting 
> around the shop here.  I was unable to replicate your observed 
> behavior.

Hmmm.  I have had crashes when the accession lines were somehow
mangled.  But this occurred regardless of memory size.

[...]

> On Jun 16, 2004, at 10:46 AM, Justin Powell wrote:
> 
> >
> > Hi Chris
> >
> > A short query which goes wrong is
> >
> > actacgactagcatcagctacgctagatgactacgatcagctacgactagcatcgactacg
> >
> > I just have this in a text file on its own with no name line. The nt
> > database I'm using is from the ncbi ftp site blast/db directory and the
> > unzipped database files have the date June 11 2004.

So you do not have 

	>accession data
	actacgactagcatcagctacgctagatgactacgatcagctacgactagcatcgactacg

in the test file, just

	actacgactagcatcagctacgctagatgactacgatcagctacgactagcatcgactacg	

?

If this is the case, try making a simple accession line such as

	>abc123|my random label
	actacgactagcatcagctacgctagatgactacgatcagctacgactagcatcgactacg	

and see if it still crashes.
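
If you want to normalize query files before handing them to blastall, a small Python sketch along these lines would prepend a definition line to any bare sequence.  The function name and the default ID and label are my own placeholders, not anything BLAST requires.

```python
def ensure_fasta_header(text, seq_id="abc123", label="my random label"):
    """Return text as FASTA, adding a '>' definition line if it has none."""
    body = text.strip()
    if not body:
        raise ValueError("empty query")
    if body.startswith(">"):
        return body + "\n"              # already has a definition line
    # Join the bare sequence onto one line, dropping stray tabs and newlines.
    sequence = "".join(body.split())
    return ">%s|%s\n%s\n" % (seq_id, label, sequence)


if __name__ == "__main__":
    bare = "actacgactagcatcagctacgctagatgactacgatcagctacgactagcatcgactacg\t\n"
    print(ensure_fasta_header(bare), end="")
```

Input that already starts with ">" is passed through untouched, so it is safe to run on every query file.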

> > I've found the intermittency varies. Sometimes it seems it can be provoked
> > by running a blast against est first, and sometimes it seems to work
> > correctly time after time.

Oh... If it is not repeatable (i.e. the same input file does not always
generate the same error at the same place), then it is likely to be
unrelated to the program itself.  That is, the program just happens to
be exercising whatever part of the system triggers the error.  This
usually comes about when you hit a bad physical memory location
somewhere, or you have an OS bug or driver bug of some sort.
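
One way to put a number on "repeatable" is to run the identical command many times and tally the exit codes.  The Python sketch below does that; the helper name is mine, and the demo command is a harmless stand-in for your blastall line.

```python
import subprocess
import sys
from collections import Counter


def run_repeatedly(cmd, runs=10):
    """Run cmd `runs` times and return the list of exit codes.

    On POSIX, subprocess reports death by signal N as return code -N,
    so a segmentation fault (signal 11) shows up as -11.
    """
    codes = []
    for _ in range(runs):
        result = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
        codes.append(result.returncode)
    return codes


if __name__ == "__main__":
    # Stand-in command; substitute your own blastall command line here.
    demo_cmd = [sys.executable, "-c", "pass"]
    print(Counter(run_repeatedly(demo_cmd, runs=3)))
```

A tally that mixes 0 and -11 for the same input is the non-repeatable case described above.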

A SEGV is raised when a process tries to touch memory that the OS has
not mapped for it, so there could be other explanations.  If you are
swapping to a partition with some bad blocks, this could be a problem.

First:  Do you have swap enabled?  What is the output of

	swapon -s

Second: What other programs are running?  Is this an overclocked system?

Third:  Have you run memtest86 on the unit for an extended period of
time?  You can pull the memtest86 3.1 iso from
http://downloads.scalableinformatics.com
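
If you would rather check swap from a script than read it by eye, the Python sketch below parses the table that swapon -s prints (on Linux, /proc/swaps holds the same table).  I am assuming the usual five columns (Filename, Type, Size, Used, Priority); the sample text in the demo is made up, not output from your machine.

```python
def parse_swaps(text):
    """Parse `swapon -s` (or /proc/swaps) output into a list of dicts."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    entries = []
    for line in lines[1:]:              # first line is the column header
        fields = line.split()
        if len(fields) < 5:
            continue                    # skip anything that is not a swap entry
        entries.append({
            "device": fields[0],
            "type": fields[1],
            "size_kb": int(fields[2]),
            "used_kb": int(fields[3]),
            "priority": int(fields[4]),
        })
    return entries


if __name__ == "__main__":
    # Made-up sample, for illustration only.
    sample = (
        "Filename\t\t\t\tType\t\tSize\tUsed\tPriority\n"
        "/dev/hda2                               partition\t1052248\t20480\t-1\n"
    )
    for entry in parse_swaps(sample):
        print(entry["device"], entry["used_kb"], "of", entry["size_kb"], "kB used")
```

An empty result means no swap is enabled, which rules out the bad swap partition theory.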

> > A second longer sequence I've had go wrong is
> >
> > TCCCCCGAATTTAAACGCGTTGAAAGGGTCATCCTTACTAGAAAAGAGAGTTG
> > ATTCTCTCCGACAGCTTAACACTACCACGGTTAACCAGCTGCTGGGGTTGCCGGGGATGACCTCTACATT
> > CACGGCTCCGCAACTGTTGCAGTTAAGAATAATAGCTATAACTGCGTCTGCCGTGTCCCTTATTGCCGGT
> > TGCCTCGGAATGTTCTTCCTTTCTAAAATGGATAAGAGACGAAAAGTCTTCAGACATGATCTCATCGCAT
> > TTTTGATAATTTGCGACTTTCTTAAAGCTTTTATTCTGATGATTTATCCCATGATTATCCTTATTAATAA
> > TAGTGTGTATGCAACACCTGCATTTTTTAATACCTTGGGTTGGTTTACGGCCTTTGCCATCGAAGGTGCA
> > GACATGGCCATAATGATATTCGCCATACATTTTGCTATTTTGATCTTCAAGCCTAATTGGAAATGGCGAA
> > ATAAAAGATCGGGAAATATGGAGGGTGGCTTGTACAAAAAAAGGTCATATATCTGGCCAATTACTGCATT
> > AGTACCTGCCATTTTAGCAAGCTTAGCCTTCATTAATTATAATAAACTCAATGACGATTCTGACACCACT
> > ATTATACTGGATAATAATAACTACAACTTTCCCGATTCTCCCAGGCAAGGTGGCTACAAACCTTGGAGTG
> > CATGGTGCTATTTACCACCCAAGCCGTACTGGTATAAAATTGTTTTAAGCTGGGGTCCCAGATATTTCAT
> > TATTATTTTCATATTTGCAGTCTACCTCAGTATTTATATTTTCATTACCAGTGAAAGTAAAAGAATTAAA
> > GCGCAAATTGGAGACTTTAACC
> >
> >
> > I've tried recompiling with the -g flag on (and the -O3 flag off) and run
> > gdb on the coredump. However I'm not a C programmer (though I did once
> > read a book on it) and am not at all familiar with either C, gdb or even
> > the details of the call stack, so I'm not sure I've done all this
> > correctly. An example backtrace is like this, though others I've had
> > looked different:
> >
> > [root@prada bin]# gdb blastall core.9520
> > GNU gdb Red Hat Linux (5.3post-0.20021129.18rh)
> > Copyright 2003 Free Software Foundation, Inc.
> > GDB is free software, covered by the GNU General Public License, and you are
> > welcome to change it and/or distribute copies of it under certain conditions.
> > Type "show copying" to see the conditions.
> > There is absolutely no warranty for GDB.  Type "show warranty" for details.
> > This GDB was configured as "i386-redhat-linux-gnu"...
> > Core was generated by `./blastall -p blastn -a 2 -d /usr/blasttest/nt -i /usr/blasttest/tempdna'.
> > Program terminated with signal 11, Segmentation fault.
> > Reading symbols from /lib/tls/libm.so.6...done.
> > Loaded symbols for /lib/tls/libm.so.6
> > Reading symbols from /lib/tls/libpthread.so.0...done.
> > Loaded symbols for /lib/tls/libpthread.so.0
> > Reading symbols from /lib/tls/libc.so.6...done.
> > Loaded symbols for /lib/tls/libc.so.6
> > Reading symbols from /lib/ld-linux.so.2...done.
> > Loaded symbols for /lib/ld-linux.so.2
> > Reading symbols from /lib/libnss_files.so.2...done.
> > Loaded symbols for /lib/libnss_files.so.2
> > #0  0x0805ea52 in BlastNtWordFinder (search=0x84363e8, lookup=0x842e6b8)
> >     at blast.c:9265
> > 9265			 next_lindex = (((lookup_index) & mask)<<char_size) + *(s+1);

Ok.  This is part of the word search section of BLAST.  Basically it
walks along the linear array looking for a match.  This should not fail,
but if it does, the likely culprit is *(s+1).  You can read *(s+1) as
"the contents of the location one element (one sizeof the data type)
past where pointer s points".  If s points to a valid location but s+1
does not, it is possible that the memory allocation somehow failed to
allocate sufficient memory for the array (unlikely, you would have seen
this elsewhere).  It is also possible that there is some OS-imposed
boundary between the addresses s and s+1 (the pointers, that is, not
their contents), and by dereferencing the pointer as BLAST was doing,
you happened to trigger the protection fault (which is what SEGV is).

For some reason, the OS thinks that the memory at s+1 does not belong to
your process.
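
To make the arithmetic on line 9265 concrete, here is a small Python sketch of the same update, ((lookup_index & mask) << char_size) + *(s+1), walking a byte array.  The char_size of 8 and mask of 0xFF are assumptions picked for illustration, not values read out of blast.c.  The thing to notice is the loop bound: each step reads position pos+1, so the walk has to stop one element early.

```python
from typing import List


def rolling_indices(data: bytes, char_size: int = 8, mask: int = 0xFF) -> List[int]:
    """Apply next = ((index & mask) << char_size) + data[pos + 1] along data."""
    if len(data) < 2:
        return []                       # need at least s and s+1
    lookup_index = data[0]
    indices = []
    # Stop at len(data) - 1: each step reads data[pos + 1], the analogue of *(s+1).
    for pos in range(len(data) - 1):
        lookup_index = ((lookup_index & mask) << char_size) + data[pos + 1]
        indices.append(lookup_index)
    return indices


if __name__ == "__main__":
    print([hex(i) for i in rolling_indices(bytes([0x12, 0x34, 0x56]))])
    # Python refuses the out-of-range read that C would simply attempt:
    try:
        bytes([0x12])[1]
    except IndexError as err:
        print("IndexError:", err)
```

Python raises an IndexError on the out-of-range read.  C performs no such check, so the same read either returns garbage or, if the address is not mapped, ends in the SEGV you are seeing.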

> > (gdb) backtrace
> > #0  0x0805ea52 in BlastNtWordFinder (search=0x84363e8, lookup=0x842e6b8)
> >     at blast.c:9265
> > #1  0x0805a473 in BlastWordFinder (search=0x84363e8) at blast.c:6847
> > #2  0x0805a336 in BlastExtendWordSearch (search=0x84363e8,
> >     multiple_hits=0 '\0') at blast.c:6803
> > #3  0x08059d7c in BLASTPerformFinalSearch (search=0x84363e8,
> >     subject_length=117793,
> >     subject_seq=0x7e12b129 <Address 0x7e12b129 out of bounds>) at blast.c:6612

Yup.  Looks like memory somehow got mangled.  You might have a look at
using ddd (a graphical frontend to gdb), and do the run under it.  Then
we can look through the process a bit more easily.  Basically run the
program completely from the debugger, see where it crashes, and then
poke at why.

Note:  The location of the crash should not change by running it in the
debugger.  If it does, we might start to think more of a hardware
problem (bad swap, bad memory chip, etc) than of a program/OS bug.

> > #4  0x080596c8 in BLASTPerformSearch (search=0x84363e8, subject_length=117793,
> >     subject_seq=0x7e12b129 <Address 0x7e12b129 out of bounds>) at blast.c:6365
> > #5  0x0805967b in BLASTPerformSearchWithReadDb (search=0x84363e8,
> >     sequence_number=1629625) at blast.c:6344
> > #6  0x0805066f in do_blast_search (ptr=0x84363e8) at blast.c:3335
> > #7  0x0804d600 in NlmThreadWrapper (wrapper_arg=0x8439c80) at ncbithr.c:647
> > #8  0x400522b6 in start_thread () from /lib/tls/libpthread.so.0
> > (gdb) quit

One more thought.  Do you get a crash with -a 1 (or with no -a option
at all)?  If not, has your code been compiled on an NPTL box?  This has
been a common problem in using NPTL (in RH9) versus linuxthreads, and
caused some interesting crashes (though I seem to remember that they
were not SEGVs).
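
For the -a comparison, a Python sketch like the following builds the same command line that appears in your core dump, varying only the thread count, and turns an exit code into something readable.  The helper names are mine; the flags and paths are copied from your gdb output.

```python
import signal


def build_blastall_cmd(threads, db="/usr/blasttest/nt",
                       query="/usr/blasttest/tempdna"):
    """Build the blastall command line from the core dump, varying only -a."""
    return ["./blastall", "-p", "blastn", "-a", str(threads),
            "-d", db, "-i", query]


def describe_exit(returncode):
    """Turn a subprocess return code into a short description."""
    if returncode == 0:
        return "exited normally"
    if returncode < 0:                  # POSIX: -N means killed by signal N
        try:
            return "killed by " + signal.Signals(-returncode).name
        except ValueError:
            return "killed by signal %d" % -returncode
    return "exited with status %d" % returncode


if __name__ == "__main__":
    for threads in (1, 2):
        print(" ".join(build_blastall_cmd(threads)))
    print(describe_exit(-int(signal.SIGSEGV)))
```

Pass each command to subprocess.run and feed its returncode to describe_exit.  If -a 1 always exits normally and -a 2 sometimes reports SIGSEGV, that points at the threading library rather than at BLAST.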

Would you try some of my compiled 2.2.9 binaries or the ones from NCBI
and let us know if you still get the crash?  I am thinking this is a
problem in the OS interacting with the program, and not a program bug
per se.  If the problem persists across versions, and is repeatable, I
would like to get a copy of the input file which causes it.

Joe

-- 
Joseph Landman, Ph.D
Scalable Informatics LLC,
email: landman@scalableinformatics.com
web  : http://scalableinformatics.com
phone: +1 734 612 4615