[Bioclusters] ncbi blast

Wed, 23 Jun 2004 10:16:15 -0400

On Wed, 2004-06-23 at 10:02, Justin Powell wrote:
> Hi Joe,
> 
> Thanks for the info.  I've tested with the -a 1, it does indeed only go
> wrong with -a 2, so I've kludged it for the time being.  However as to

Interesting.  This does implicate the threading somehow.  The "-a N"
option invokes the pthread library code paths.

> your theory about RedHat9 NPTL being involved, I also get exactly the same
> behaviour on a RedHat7.1 system running ncbi blast 2.2.6. (i.e. goes wrong
> on nt database but not est database, and only if -a 2, not if -a 1).

These are SMP systems I presume.  

> So I guess if the -a switch changes things it's not likely to be bad RAM?

Well, I would think that it makes that possibility more remote.  It is a
good idea to beat on the systems overnight or over a weekend with
Memtest86 3.1 just to be sure (not a guarantee, but a good filter for
failing stuff).

Is this behavior seen with other compiled binary versions of the
libraries?  It would help if you could wrap the blastall execution with a

	strace -f -o trace.blastall ...

where ... is the blastall -a 2 [ ] command you have which fails.  The
trace.blastall file will be pretty large.  Don't post it here; try
compressing it and mailing it to me, or let me know and I can enable a
one-off ftp.
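
If it helps, here is a rough sketch of wrapping the run and compressing
the trace in one step.  The function name trace_run is just something I
made up for illustration, and it assumes strace and gzip are on your
path; pass your real failing command in place of the arguments.

```shell
# Hypothetical helper: run a command under strace, then gzip the trace.
# Usage: trace_run trace.blastall <your failing blastall command>
trace_run() {
    trace_file=$1
    shift
    status=0
    # -f follows child processes and threads, -o writes the trace to a file
    strace -f -o "$trace_file" "$@" || status=$?
    # traces get large; gzip replaces trace_file with trace_file.gz
    gzip -f "$trace_file"
    # hand back the exit status of the traced command
    return "$status"
}
```

The exit status of the traced command is passed through, so you can
still tell a crash from a clean run.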

The idea is that with -a 2 the program tries to use the threading
library.  If it is dying, it should return an error message in the
threading calls, which we are not seeing.  You might also wish to make
sure the code sees all the libraries it needs by doing a

	ldd /path/to/blastall

and making sure that none of the libraries say "not found".  That output
would be interesting to see here.
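
A quick way to filter that output, as a sketch only (check_libs is a
name I invented, and it simply reads whatever ldd printed on stdin):

```shell
# Hypothetical helper: scan ldd output on stdin for unresolved libraries.
# Usage: ldd /path/to/blastall | check_libs
check_libs() {
    # keep only the lines ldd marked as "not found" (empty if none)
    missing=$(grep "not found" || true)
    if [ -n "$missing" ]; then
        printf 'unresolved libraries:\n%s\n' "$missing"
        return 1
    fi
    echo "all libraries resolved"
}
```

The full ldd listing is still the more useful thing to post, since it
also shows which copy of libpthread gets picked up.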

> 
> In reply to your other questions, the output from swapon -s is
> 
> Filename			Type		Size	Used	Priority
> /dev/sda2                       partition	1807304	15036	-1

Ok, this is good.

> 
> for the rh7.1 system
> 
> Filename			Type		Size	Used	Priority
> /dev/sda3                       partition	1020116	10496	-1
> 
> for the rh9 system.
> 

Interesting.  You have enough swap.  It seems unlikely to be a VM
issue.

> Adding a name line to the query makes no difference.
> 
> Neither system is overclocked. I've not run the memory checker yet, but I
> have two identical Redhat9 boxes and they both do it. So that makes 3
> systems, and I can test a 4th shortly too.
> 
> I've not had time to run the graphical debugger - I'm pretty snowed under
> till Monday.

Ok.  Have you isolated it to a single sequence and a single db?  This
would let some of us try it.

Joe

> 
> Justin
> 
> On Fri, 18 Jun 2004, Joe Landman wrote:
> 
> > Hi Chris and Justin:
> >
> > On Thu, 2004-06-17 at 12:38, Chris Dwan wrote:
> > > Justin,
> > >
> > > I've poked around a bit, and run your queries on a variety of machines
> > > (P-III and Athlon...as well as a few others) which I have sitting
> > > around the shop here.  I was unable to replicate your observed
> > > behavior.
> >
> > Hmmm.  I have had crashes when the accession lines were somehow
> > mangled.  But this occurred regardless of memory size.
> >
> > [...]
> >
> > > On Jun 16, 2004, at 10:46 AM, Justin Powell wrote:
> > >
> > > >
> > > > Hi Chris
> > > >
> > > > A short query which goes wrong is
> > > >
> > > > actacgactagcatcagctacgctagatgactacgatcagctacgactagcatcgactacg
> > > >
> > > > I just have this in a text file on its own with no name line. The nt
> > > > database I'm using is from the ncbi ftp site blast/db directory and the
> > > > unzipped database files have the date June 11 2004.
> >
> > So you do not have
> >
> > 	>accession data
> > 	actacgactagcatcagctacgctagatgactacgatcagctacgactagcatcgactacg
> >
> > in the test file, just
> >
> > 	actacgactagcatcagctacgctagatgactacgatcagctacgactagcatcgactacg
> >
> > ?
> >
> > If this is the case, try making a simple accession line such as
> >
> > 	>abc123|my random label
> > 	actacgactagcatcagctacgctagatgactacgatcagctacgactagcatcgactacg
> >
> > and see if it still crashes.
> >
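
For anyone else trying this, a small sketch of adding the line only
when it is missing.  The helper name with_defline and the file names in
the usage comment are made up; the only thing it relies on is that a
FASTA definition line starts with ">".

```shell
# Hypothetical helper: print a sequence file, adding a definition line
# if the first line does not already start with ">".
# Usage: with_defline query.txt "abc123|my random label" > query.fa
with_defline() {
    if head -n 1 "$1" | grep -q '^>'; then
        # already has a definition line, pass it through untouched
        cat "$1"
    else
        printf '>%s\n' "$2"
        cat "$1"
    fi
}
```
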
> > > > I've found the intermittency varies. Sometimes it seems it can be
> > > > provoked by running a blast against est first, and sometimes it
> > > > seems to work correctly time after time.
> >
> > Oh... If it is not repeatable (e.g. repeatable == same input file always
> > generates the same error at the same place), then it is likely to be
> > unrelated to the program itself.  That is, the program happens to be
> > hitting the case in the system which triggers the error.  This usually
> > comes about when you hit a bad physical memory location somewhere, or
> > you have an OS bug or driver bug of some sort.
> >
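
One way to check repeatability, sketched under the assumption that you
run from a POSIX shell (repeat_run is a made-up name).  It runs the same
command several times and prints each exit status; a shell normally
reports a process killed by signal 11 as status 139 (128 + 11).

```shell
# Hypothetical helper: run the same command N times, print each exit status.
# Usage: repeat_run 20 <your blastall command>
repeat_run() {
    runs=$1
    shift
    i=1
    while [ "$i" -le "$runs" ]; do
        status=0
        # output is discarded; only the exit status matters here
        "$@" >/dev/null 2>&1 || status=$?
        echo "run $i: exit status $status"
        i=$((i + 1))
    done
}
```

If the statuses differ from run to run with identical input, that points
away from the program and toward the machine.
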
> > SEGVs usually come about when a process touches memory it has no right
> > to (unmapped, or protected), so there could be other explanations.  If
> > you are swapping to a partition with some bad bytes, this could be a
> > problem.
> >
> > First:  Do you have swap enabled?  What is the output of
> >
> > 	swapon -s
> >
> > Second: What other programs are running?  Is this an overclocked system?
> >
> > Third:  have you run memtest86 on the unit for an extended period of
> > time?  You can pull the memtest86 3.1 iso from
> > http://downloads.scalableinformatics.com
> >
> > > > A second longer sequence I've had go wrong is
> > > >
> > > > TCCCCCGAATTTAAACGCGTTGAAAGGGTCATCCTTACTAGAAAAGAGAGTTG
> > > > ATTCTCTCCGACAGCTTAACACTACCACGGTTAACCAGCTGCTGGGGTTGCCGGGGATGACCTCTACATT
> > > > CACGGCTCCGCAACTGTTGCAGTTAAGAATAATAGCTATAACTGCGTCTGCCGTGTCCCTTATTGCCGGT
> > > > TGCCTCGGAATGTTCTTCCTTTCTAAAATGGATAAGAGACGAAAAGTCTTCAGACATGATCTCATCGCAT
> > > > TTTTGATAATTTGCGACTTTCTTAAAGCTTTTATTCTGATGATTTATCCCATGATTATCCTTATTAATAA
> > > > TAGTGTGTATGCAACACCTGCATTTTTTAATACCTTGGGTTGGTTTACGGCCTTTGCCATCGAAGGTGCA
> > > > GACATGGCCATAATGATATTCGCCATACATTTTGCTATTTTGATCTTCAAGCCTAATTGGAAATGGCGAA
> > > > ATAAAAGATCGGGAAATATGGAGGGTGGCTTGTACAAAAAAAGGTCATATATCTGGCCAATTACTGCATT
> > > > AGTACCTGCCATTTTAGCAAGCTTAGCCTTCATTAATTATAATAAACTCAATGACGATTCTGACACCACT
> > > > ATTATACTGGATAATAATAACTACAACTTTCCCGATTCTCCCAGGCAAGGTGGCTACAAACCTTGGAGTG
> > > > CATGGTGCTATTTACCACCCAAGCCGTACTGGTATAAAATTGTTTTAAGCTGGGGTCCCAGATATTTCAT
> > > > TATTATTTTCATATTTGCAGTCTACCTCAGTATTTATATTTTCATTACCAGTGAAAGTAAAAGAATTAAA
> > > > GCGCAAATTGGAGACTTTAACC
> > > >
> > > >
> > > > I've tried recompiling with the -g flag on (and the -O3 flag off) and
> > > > run gdb on the coredump. However I'm not a C programmer (though I did
> > > > once read a book on it) and am not at all familiar with either C, gdb
> > > > or even the details of the call stack, so I'm not sure I've done all
> > > > this correctly. An example backtrace is like this, though others I've
> > > > had looked different:
> > > >
> > > > [root@prada bin]# gdb blastall core.9520
> > > > GNU gdb Red Hat Linux (5.3post-0.20021129.18rh)
> > > > Copyright 2003 Free Software Foundation, Inc.
> > > > GDB is free software, covered by the GNU General Public License, and you are
> > > > welcome to change it and/or distribute copies of it under certain conditions.
> > > > Type "show copying" to see the conditions.
> > > > There is absolutely no warranty for GDB.  Type "show warranty" for details.
> > > > This GDB was configured as "i386-redhat-linux-gnu"...
> > > > Core was generated by `./blastall -p blastn -a 2 -d /usr/blasttest/nt -i /usr/blasttest/tempdna'.
> > > > Program terminated with signal 11, Segmentation fault.
> > > > Reading symbols from /lib/tls/libm.so.6...done.
> > > > Loaded symbols for /lib/tls/libm.so.6
> > > > Reading symbols from /lib/tls/libpthread.so.0...done.
> > > > Loaded symbols for /lib/tls/libpthread.so.0
> > > > Reading symbols from /lib/tls/libc.so.6...done.
> > > > Loaded symbols for /lib/tls/libc.so.6
> > > > Reading symbols from /lib/ld-linux.so.2...done.
> > > > Loaded symbols for /lib/ld-linux.so.2
> > > > Reading symbols from /lib/libnss_files.so.2...done.
> > > > Loaded symbols for /lib/libnss_files.so.2
> > > > #0  0x0805ea52 in BlastNtWordFinder (search=0x84363e8, lookup=0x842e6b8)
> > > >     at blast.c:9265
> > > > 9265			 next_lindex = (((lookup_index) & mask)<<char_size) + *(s+1);
> >
> > Ok.  This is part of the word search section of BLAST.  Basically it
> > walks along the linear array looking for a match.  This should not fail,
> > though if it does, then the likely problem is in *(s+1).  You could
> > translate *(s+1) as "the contents of the location one element (one
> > sizeof the data type) past where pointer s points".  If s points to a valid
> > location, but s+1 does not, it is possible that the memory allocation
> > somehow failed to allocate sufficient memory for the array (unlikely,
> > you would have seen this elsewhere).  It is also possible that there is
> > some OS imposed boundary between the values of s and s+1 (the pointers
> > that is, not their contents), and by accessing the contents
> > (dereferencing) the pointer as BLAST was doing, you happened to trigger
> > the protection fault (which is what SEGV is).
> >
> > For some reason, the OS thinks that *(s+1) is owned by someone else.
> >
> > > > (gdb) backtrace
> > > > #0  0x0805ea52 in BlastNtWordFinder (search=0x84363e8, lookup=0x842e6b8)
> > > >     at blast.c:9265
> > > > #1  0x0805a473 in BlastWordFinder (search=0x84363e8) at blast.c:6847
> > > > #2  0x0805a336 in BlastExtendWordSearch (search=0x84363e8,
> > > >     multiple_hits=0 '\0') at blast.c:6803
> > > > #3  0x08059d7c in BLASTPerformFinalSearch (search=0x84363e8,
> > > >     subject_length=117793,
> > > >     subject_seq=0x7e12b129 <Address 0x7e12b129 out of bounds>) at blast.c:6612
> >
> > Yup.  Looks like memory somehow got mangled. You might have a look at
> > using ddd (graphical frontend to gdb), and do the run.  Then we can look
> > through the process a bit easier.  Basically run the system completely
> > from the debugger, and see where it crashes, and then poke at it as to
> > why.
> >
> > Note:  The location of the crash should not change by running it in the
> > debugger.  If it does, we might start to think more of a hardware
> > problem (bad swap, bad memory chip, etc) than of a program/OS bug.
> >
> > > > #4  0x080596c8 in BLASTPerformSearch (search=0x84363e8, subject_length=117793,
> > > >     subject_seq=0x7e12b129 <Address 0x7e12b129 out of bounds>) at blast.c:6365
> > > > #5  0x0805967b in BLASTPerformSearchWithReadDb (search=0x84363e8,
> > > >     sequence_number=1629625) at blast.c:6344
> > > > #6  0x0805066f in do_blast_search (ptr=0x84363e8) at blast.c:3335
> > > > #7  0x0804d600 in NlmThreadWrapper (wrapper_arg=0x8439c80) at ncbithr.c:647
> > > > #8  0x400522b6 in start_thread () from /lib/tls/libpthread.so.0
> > > > (gdb) quit
> >
> > One more thought.  Do you get a crash with -a 1 (or no -a line)?  If
> > not, has your code been compiled on an NPTL box?  This has been a common
> > problem in using NPTL (in RH9) versus linuxthreads, and caused some
> > interesting crashes (though I seem to remember that they were not
> > segv's).
> >
> > Would you try some of my compiled 2.2.9 binaries or the ones from NCBI
> > and let us know if you still get the crash?  I am thinking this is a
> > problem in the OS interacting with the program, and not a program bug
> > per se.  If the problem persists across versions, and is repeatable, I
> > would like to get a copy of the input file which causes it.
> >
> > Joe
> >
> > --
> > Joseph Landman, Ph.D
> > Scalable Informatics LLC,
> > email: landman@scalableinformatics.com
> > web  : http://scalableinformatics.com
> > phone: +1 734 612 4615
> >
> > _______________________________________________
> > Bioclusters maillist  -  Bioclusters@bioinformatics.org
> > https://bioinformatics.org/mailman/listinfo/bioclusters
> >
-- 
Joseph Landman, Ph.D
Scalable Informatics LLC,
email: landman@scalableinformatics.com
web  : http://scalableinformatics.com
phone: +1 734 612 4615