[Bioclusters] Re: new on using clusters: problem running mpiblast (2)

Joe Landman landman at scalableinformatics.com
Sun Sep 23 11:42:44 EDT 2007


Hi Zhiliang

Zhiliang Hu wrote:

> which gives different errors:
> ----------------------
> [host.ansci.iastate.edu:07014] mca: base: component_find: unable to open 
> ras tm: file not found (ignored)
> [host.ansci.iastate.edu:07014] mca: base: component_find: unable to open 
> pls tm: file not found (ignored)

You wouldn't happen to have Torque installed on this machine, perchance ...

tm is the Torque launcher for OpenMPI.  I get the sense that your 
OpenMPI build (or Torque install) may be broken, since OpenMPI is 
trying to launch via Torque but cannot load the tm components.

Notice that each node is complaining that it cannot find the Torque 
launcher.
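
A quick sanity check here (a sketch, assuming one of the installs from 
your PATH, /opt/openmpi.gcc) is to ask OpenMPI what components it 
thinks it has, and whether the component files actually exist:

    ompi_info | grep tm
    ls /opt/openmpi.gcc/lib/openmpi/mca_*_tm.so

If ompi_info lists tm components for ras and pls but those .so files 
are missing or unreadable on the compute nodes, that would match the 
messages above.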

> [node001:24985] mca: base: component_find: unable to open ras tm: file 
> not found (ignored)
> [node001:24985] mca: base: component_find: unable to open pls tm: file 
> not found (ignored)
> [node003:15979] mca: base: component_find: unable to open ras tm: file 
> not found (ignored)
> [node002:19464] mca: base: component_find: unable to open ras tm: file 
> not found (ignored)
> [node003:15979] mca: base: component_find: unable to open pls tm: file 
> not found (ignored)
> [node002:19464] mca: base: component_find: unable to open pls tm: file 
> not found (ignored)

Each node cannot find the appropriate launcher, so ...

> 1       0.0736248       Bailing out with signal 11

... it appears to die ...

> [node002:19464] MPI_ABORT invoked on rank 1 in communicator 
> MPI_COMM_WORLD with errorcode 0
> 0       0.0795131       Bailing out with signal 15
> [node001:24985] MPI_ABORT invoked on rank 0 in communicator 
> MPI_COMM_WORLD with errorcode 0
> 2       0.0794392       Bailing out with signal 15
> [node003:15979] MPI_ABORT invoked on rank 2 in communicator 
> MPI_COMM_WORLD with errorcode 0

... on each node.

> ----------------------
> 
> By the way, from the head node, 'ssh node001 which orted' does not
> find it, but 'ssh node001 whereis orted' finds it (from both MPI 
> installations).  Also, after I do 'ssh node001', both 'which' and
> 'whereis' can find it in the two MPI installations.
> 
> I do have '/opt/openmpi121.gcc/bin' and '/opt/openmpi.gcc/bin' on
> my path (I am using bash; I tried 'tcsh', which gave even more errors).
> 
> I hope this provides a more useful clue to dig further?

I am now of the opinion that this is strictly an MPI stack problem, 
and has little to do with mpiblast itself.
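
(As an aside: the which/whereis difference you saw usually means PATH 
is only set for interactive logins; 'ssh node001 which orted' runs a 
non-interactive shell that does not read the same startup files.  A 
minimal sketch of a fix for bash, using the install paths from your 
message, is to export the paths near the top of ~/.bashrc on every 
node, before any early return for non-interactive shells:

    # ~/.bashrc on each node -- prefixes taken from your PATH
    export PATH=/opt/openmpi.gcc/bin:$PATH
    export LD_LIBRARY_PATH=/opt/openmpi.gcc/lib:$LD_LIBRARY_PATH

and then retest with 'ssh node001 which orted'.)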

Did you build OpenMPI yourself, or did someone else?  Could you try a 
different MPI stack?  Could you rebuild OpenMPI without the Torque 
launcher/support?
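
If you do rebuild, a minimal sketch (the prefix is just an example, 
adjust to your layout):

    ./configure --prefix=/opt/openmpi.gcc --without-tm
    make && make install

The --without-tm switch tells configure to skip Torque/PBS support 
entirely, so the tm components for ras and pls are never built and 
OpenMPI falls back to its ssh/rsh launcher.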

We have built OpenMPI for a number of customers without problems, though 
in some cases, due to incomplete or broken installations of 
prerequisites, we were not able to use all of its features.  That said, 
we have also run into some serious (show-stopping) OpenMPI problems 
with various codes (Overflow, etc.) that do not manifest themselves 
with other MPI stacks (MVAPICH2).

I would suggest trying a different MPI stack if at all possible, and 
specifically disabling the TM launcher in OpenMPI.
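
You can also disable it at run time without rebuilding.  In the 
1.2-era MCA syntax a leading ^ excludes a component, so something like

    mpirun --mca ras ^tm --mca pls ^tm -np 4 mpiblast ...

(with your usual mpiblast arguments in place of the ...) should keep 
OpenMPI from even trying to open the tm components.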


Joe
> 
> Zhiliang
> 
> 
> On Thu, 20 Sep 2007, Joe Landman wrote:
> 
>> Date: Thu, 20 Sep 2007 16:13:50 -0400
>> From: Joe Landman <landman at scalableinformatics.com>
>> To: HPC in Bioinformatics <bioclusters at bioinformatics.org>
>> Subject: Re: [Bioclusters] Re: new on using clusters: problem running 
>> mpiblast (2)
>>
>> Zhiliang Hu wrote:
>>
>>> ---------------------------------------
>>> bash: orted: command not found
>>> bash: orted: command not found
>>
>>
>> Ah-hah!
>>
>> Could you do a
>>
>>     which orted
>>
>> on the head node from where you launch the mpiblast, and then
>>
>>     ssh node001 which orted
>>
>> and report that back?
>>
>>> [ansci.iastate.edu:03916] ERROR: A daemon on node node001 failed to
>>> start as expected.
>>
>> This suggests that a) orted wasn't found, and b) since orted is required
>> for OpenMPI to set up the remote process, that process never gets
>> started.
>>
>>> [ansci.iastate.edu:03916] ERROR: There may be more information available
>>> from
>>> [ansci.iastate.edu:03916] ERROR: the remote shell (see above).
>>> [ansci.iastate.edu:03916] ERROR: The daemon exited unexpectedly with
>>> status 127.
>>
>> If you don't see orted on the remote system, you might need to contact
>> your systems administrator to make sure the right path is mounted on the
>> remote node.
>>
>> If you built OpenMPI yourself, you need to make sure your path variable
>> includes the $openmpi/bin  directory.
>>
>> Basically this looks like OpenMPI is not in your path, which is why it
>> can't find orted, and why mpiblast isn't starting up on the node.
> _______________________________________________
> Bioclusters maillist  -  Bioclusters at bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/bioclusters


-- 

Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
        http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 866 888 3112
cell : +1 734 612 4615


