Aaron,

Many thanks for your hints. Below is what I got when trying each of your suggestions:

On Mon, 17 Sep 2007, Aaron Darling wrote:

> Date: Mon, 17 Sep 2007 13:25:31 -0700
> From: Aaron Darling <darling at cs.wisc.edu>
> Reply-To: HPC in Bioinformatics <bioclusters at bioinformatics.org>
> To: HPC in Bioinformatics <bioclusters at bioinformatics.org>
> Subject: Re: [Bioclusters] Re: new on using clusters: problem running mpiblast (2)
>
> This seems to be quite an elusive problem...
> The rank 1 process is crashing with signal 11, which is usually a
> segmentation fault, indicating an invalid memory access. Assuming you
> get the same behavior (no output) when running with --debug, it crashes
> very early in the program, prior to writing any debug output. I can see
> two ways to debug the problem on your cluster, both of which will
> require some patience. The red pill would be running mpiblast in an mpi
> debugger and seeing where the process crashes. I'm unsure how the openmpi
> debugger works, but there should be some mechanism to attach to the rank
> 1 process.
> The blue pill involves running mpiblast with a few different command
> line options to see how far along in the program it gets before
> crashing. That might narrow down the crash point enough to give a clue
> for solving the problem. If you take the blue pill, run the
> following mpiblast commands:
>
> mpiblast --version
> (this prints the version and is the first thing the program does at
> startup. if the program doesn't get that far then something is very wrong.)

Yes, it gives version 1.4.0.

> mpiblast
> (run with no arguments, this causes the program to exit before parsing
> the command-line and print an error message. if the program doesn't get
> that far then something is very wrong.)
Indeed, it responded with the expected option message:

-------
mpiBLAST requires the following options: -d [database] -i [query file] -p [blast program name]
-------

> mpiblast -a blah -b blah -c blah blah blah
> (run with bogus arguments. the program should exit with "mpiBLAST
> requires the following options: -d [database] -i [query file] -p [blast
> program name]".
> This check happens after initializing the MPI libraries, so if you get
> this error, then the mpi libs were init'ed successfully )

Same as above.

> mpirun -np 2 -machinefile ./machines /home/local/bin/mpiblast -p blastp
> -i ./bait.fasta -d ecoli.aa
> (mpiblast should report that it needs to be run on at least three nodes)

Here is the error -- not what you expected:

---------------------------------------
bash: orted: command not found
bash: orted: command not found
[ansci.iastate.edu:03916] ERROR: A daemon on node node001 failed to start as expected.
[ansci.iastate.edu:03916] ERROR: There may be more information available from
[ansci.iastate.edu:03916] ERROR: the remote shell (see above).
[ansci.iastate.edu:03916] ERROR: The daemon exited unexpectedly with status 127.
[ansci.iastate.edu:03916] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
[ansci.iastate.edu:03916] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1164
[ansci.iastate.edu:03916] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
[ansci.iastate.edu:03916] ERROR: A daemon on node node002 failed to start as expected.
[ansci.iastate.edu:03916] ERROR: There may be more information available from
[ansci.iastate.edu:03916] ERROR: the remote shell (see above).
[ansci.iastate.edu:03916] ERROR: The daemon exited unexpectedly with status 127.
[ansci.iastate.edu:03916] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
[nagrp2.ansci.iastate.edu:03916] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1196
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons for this job.
Returned value Timeout instead of ORTE_SUCCESS.
--------------------------------------------------------------------------

> mpiblast --copy-via=none -p blastp -i ./bait.fasta -d ecoli.aa
> (this should exit with the error message "Error: Shared and Local
> storage must be identical when --copy_via=none")

Here is the error -- not as expected:

--------------------------------------
Sorry, mpiBLAST must be run on 3 or more nodes
[ansci.iastate.edu:04099] MPI_ABORT invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode 0
--------------------------------------

> mpiblast --pro-phile=asdfasdf --debug=logfile.txt -p blastp -i
> ./bait.fasta -d ecoli.aa
> (this should write out "WARNING: --pro-phile is no longer supported" and
> "logging to logfile.txt")

Here is the error -- not as expected:

--------------------------------------
Sorry, mpiBLAST must be run on 3 or more nodes
[ansci.iastate.edu:04102] MPI_ABORT invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode 0
--------------------------------------

> So, depending on how far through the list of commands you're able to get
> error messages, we should be able to pin down where the program crashes.
> Let me know how it goes.
>
> -aaron

I noticed my errors differ somewhat from what you expected, so I repeated the commands and made sure I matched each error message to the right trial command. I hope these errors make some sense to you, and that you can suggest what to try next...

Best regards,
Zhiliang
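P.S. One observation, offered tentatively: exit status 127 is the shell's standard "command not found" code, which matches the "bash: orted: command not found" lines in the mpirun output above. So it looks like the non-interactive shells on node001/node002 cannot find Open MPI's orted daemon on their PATH. A tiny sketch demonstrating what status 127 means (the command name here is deliberately made up):

```shell
# bash returns exit status 127 when a command cannot be found --
# the same status the Open MPI errors above report for orted.
bash -c 'some_nonexistent_command_xyz' 2>/dev/null
echo "exit status: $?"   # prints: exit status: 127
```

If that diagnosis is right, a plausible next step (my assumption, not something Aaron suggested) would be checking each compute node with something like `ssh node001 'command -v orted'` and, if it comes up empty, adding the Open MPI bin directory to the PATH that non-interactive ssh sessions see on those nodes.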