This seems to be quite an elusive problem... The rank 1 process is
crashing with signal 11, which is usually a segmentation fault,
indicating an invalid memory access. (The signal 15s on the other ranks
are just the SIGTERMs they receive when the job is torn down after rank
1 dies, so rank 1 is the one to investigate.) Assuming you get the same
behavior (no output) when running with --debug, it crashes very early
in the program, before writing any debug output.

I can see two ways to debug the problem on your cluster, both of which
will require some patience. The red pill would be running mpiblast
under an MPI-aware debugger and seeing where the process crashes. I'm
unsure how debugging works under OpenMPI, but there should be some
mechanism to attach to the rank 1 process (see the P.S. below for one
possible recipe). The blue pill involves running mpiblast with a few
different command-line options to see how far along in the program it
gets before crashing. That might narrow down the crash point enough to
give a clue for solving the problem.

If you take the blue pill, run the following mpiblast commands:

1. mpiblast --version
   This prints the version and is the first thing the program does at
   startup. If the program doesn't get that far, something is very
   wrong.

2. mpiblast
   Run with no arguments, this causes the program to exit before
   parsing the command line and print an error message. If the program
   doesn't get that far, something is very wrong.

3. mpiblast -a blah -b blah -c blah blah blah
   Run with bogus arguments. The program should exit with "mpiBLAST
   requires the following options: -d [database] -i [query file] -p
   [blast program name]". This check happens after initializing the MPI
   libraries, so if you get this error, the MPI libs were initialized
   successfully.

4. mpirun -np 2 -machinefile ./machines /home/local/bin/mpiblast \
       -p blastp -i ./bait.fasta -d ecoli.aa
   mpiblast should report that it needs to be run on at least three
   nodes.

5. mpiblast --copy-via=none -p blastp -i ./bait.fasta -d ecoli.aa
   This should exit with the error message "Error: Shared and Local
   storage must be identical when --copy_via=none".

6. mpiblast --pro-phile=asdfasdf --debug=logfile.txt -p blastp \
       -i ./bait.fasta -d ecoli.aa
   This should write out "WARNING: --pro-phile is no longer supported"
   and "logging to logfile.txt".

Depending on how far down the list you get the expected error messages,
we should be able to pin down where the program crashes. Let me know
how it goes.

-aaron
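P.S. Here is a minimal sketch of the red pill. I haven't tried this on
your setup, and it assumes your compute nodes allow core dumps and that
mpiblast was compiled with debugging symbols (-g); treat it as the
generic recipe for getting a backtrace out of a segfaulting MPI rank,
not anything mpiblast-specific:

    # allow core files; this must take effect for the processes on the
    # compute nodes (e.g. set it in your shell rc file), since the
    # segfault happens on node001
    ulimit -c unlimited

    # reproduce the crash
    /opt/openmpi.gcc/bin/mpirun -np 16 -machinefile ./machines \
        /home/local/bin/mpiblast -p blastp -i ./bait.fasta -d ecoli.aa

    # a core file from the rank 1 process should appear in its working
    # directory on node001; the exact name (core, core.<pid>, ...)
    # depends on the kernel's core_pattern setting
    gdb /home/local/bin/mpiblast core.<pid>

    # inside gdb, print the stack at the point of the crash
    (gdb) bt

If X11 forwarding to the compute nodes works, another common trick is
to start every rank under gdb in its own xterm and watch rank 1 die
interactively (three processes is the minimum mpiblast accepts):

    mpirun -np 3 -machinefile ./machines xterm -e gdb --args \
        /home/local/bin/mpiblast -p blastp -i ./bait.fasta -d ecoli.aa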
Zhiliang Hu wrote:
> Thanks Aaron,
>
> Indeed I got it compiled before (and now again, without my last
> reported "CC/CPP" exports, and with or without the non-specific
> "export CC=mpicc" and "export CXX=mpicxx" suggested by Zhao Xu).
>
> The problem was, when I ran mpiblast with:
>
>   /opt/openmpi.gcc/bin/mpirun -np 16 -machinefile ./machines \
>       /home/local/bin/mpiblast -p blastp -i ./bait.fasta -d ecoli.aa
>
> I got the following error, and I don't have a clue where to look for
> the cause:
>
> 1 0.095628 Bailing out with signal 11
> [node001:13406] MPI_ABORT invoked on rank 1 in communicator
> MPI_COMM_WORLD with errorcode 0
> 0 0.101815 Bailing out with signal 15
> [node001:13405] MPI_ABORT invoked on rank 0 in communicator
> MPI_COMM_WORLD with errorcode 0
> 15 0.157852 Bailing out with signal 15
> [node001:13420] MPI_ABORT invoked on rank 15 in communicator
> MPI_COMM_WORLD with errorcode 0
> 2 0.105103 Bailing out with signal 15
> [node001:13407] MPI_ABORT invoked on rank 2 in communicator
> MPI_COMM_WORLD with errorcode 0
> 3 0.109706 Bailing out with signal 15
> [node001:13408] MPI_ABORT invoked on rank 3 in communicator
> MPI_COMM_WORLD with errorcode 0
> 4 0.114032 Bailing out with signal 15
> [node001:13409] MPI_ABORT invoked on rank 4 in communicator
> MPI_COMM_WORLD with errorcode 0
> 5 0.117891 Bailing out with signal 15
> [node001:13410] MPI_ABORT invoked on rank 5 in communicator
> MPI_COMM_WORLD with errorcode 0
> 6 0.122292 Bailing out with signal 15
> [node001:13411] MPI_ABORT invoked on rank 6 in communicator
> MPI_COMM_WORLD with errorcode 0
> 7 0.125675 Bailing out with signal 15
> [node001:13412] MPI_ABORT invoked on rank 7 in communicator
> MPI_COMM_WORLD with errorcode 0
> 8 0.129363 Bailing out with signal 15
> [node001:13413] MPI_ABORT invoked on rank 8 in communicator
> MPI_COMM_WORLD with errorcode 0
> 9 0.134528 Bailing out with signal 15
> [node001:13414] MPI_ABORT invoked on rank 9 in communicator
> MPI_COMM_WORLD with errorcode 0
> 10 0.138087 Bailing out with signal 15
> [node001:13415] MPI_ABORT invoked on rank 10 in communicator
> MPI_COMM_WORLD with errorcode 0
> 11 0.141622 Bailing out with signal 15
> [node001:13416] MPI_ABORT invoked on rank 11 in communicator
> MPI_COMM_WORLD with errorcode 0
> 12 0.145868 Bailing out with signal 15
> [node001:13417] MPI_ABORT invoked on rank 12 in communicator
> MPI_COMM_WORLD with errorcode 0
> 13 0.149375 Bailing out with signal 15
> [node001:13418] MPI_ABORT invoked on rank 13 in communicator
> MPI_COMM_WORLD with errorcode 0
> 14 0.152966 Bailing out with signal 15
> [node001:13419] MPI_ABORT invoked on rank 14 in communicator
> MPI_COMM_WORLD with errorcode 0
>
> [As related information, mpirun is working fine when tested with a
> small "hello" program that showed responses from all nodes.]
>
> --
> Zhiliang
>
> On Sun, 9 Sep 2007, Aaron Darling wrote:
>
>> Date: Sun, 09 Sep 2007 08:04:14 +1000
>> From: Aaron Darling <darling at cs.wisc.edu>
>> Reply-To: HPC in Bioinformatics <bioclusters at bioinformatics.org>
>> To: HPC in Bioinformatics <bioclusters at bioinformatics.org>
>> Subject: Re: [Bioclusters] Re: new on using clusters: problem running
>>     mpiblast (2)
>>
>> Hi Zhiliang
>>
>> For reasons that are beyond me, the version of autoconf that we used
>> to package mpiBLAST 1.4.0 does not approve of setting CC and/or CXX
>> to mpicc or mpicxx. Doing so results in the autoconf error you have
>> observed. For that reason we added the --with-mpi=/path/to/mpi
>> configure option. It should be sufficient to use that option alone
>> to set the preferred compiler path. If not, then it's a bug in the
>> mpiblast configure system.
>>
>> In response to your other query, I personally have not used mpiblast
>> with OpenMPI, but I believe others have. The 1.4.0 release was tested
>> against mpich1/2 and LAM.
>>
>> Regards,
>> -Aaron
>>
>
> _______________________________________________
> Bioclusters maillist  -  Bioclusters at bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/bioclusters