Bug 1858522 - Returned value Unable to start a daemon on the local node (-127) instead of ORTE_SUCCESS
Summary: Returned value Unable to start a daemon on the local node (-127) instead of ORTE_SUCCESS
Keywords:
Status: CLOSED RAWHIDE
Alias: None
Product: Fedora
Classification: Fedora
Component: hwloc
Version: rawhide
Hardware: s390x
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Jiri Hladky
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: FedoraTracker 1863077
 
Reported: 2020-07-18 18:07 UTC by Antonio T. sagitter
Modified: 2020-08-04 02:09 UTC
CC List: 6 users

Fixed In Version: hwloc-2.2.0-1.fc33
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-08-04 02:09:42 UTC
Type: Bug
Embargoed:


Attachments

Description Antonio T. sagitter 2020-07-18 18:07:04 UTC
Description of problem:
OpenMPI tests of MUMPS are failing on Rawhide s390x only:

+ export OMPI_MCA_rmaps_base_oversubscribe=1
+ OMPI_MCA_rmaps_base_oversubscribe=1
+ ./ssimpletest
[buildvm-s390x-09:2509570] *** Process received signal ***
[buildvm-s390x-09:2509570] Signal: Segmentation fault (11)
[buildvm-s390x-09:2509570] Signal code: Address not mapped (1)
[buildvm-s390x-09:2509570] Failing at address: 0xfffffffffffff000
[buildvm-s390x-09:2509570] [ 0] [0x3fffdafcee0]
[buildvm-s390x-09:2509570] [ 1] /lib64/libhwloc.so.15(+0x44870)[0x3ff831c4870]
[buildvm-s390x-09:2509570] [ 2] /lib64/libhwloc.so.15(hwloc_topology_load+0xe6)[0x3ff83196ae6]
[buildvm-s390x-09:2509570] [ 3] /usr/lib64/openmpi/lib/libopen-pal.so.40(opal_hwloc_base_get_topology+0xfe2)[0x3ff836040d2]
[buildvm-s390x-09:2509570] [ 4] /usr/lib64/openmpi/lib/openmpi/mca_ess_hnp.so(+0x508c)[0x3ff82a0508c]
[buildvm-s390x-09:2509570] [ 5] /usr/lib64/openmpi/lib/libopen-rte.so.40(orte_init+0x2d2)[0x3ff83a112d2]
[buildvm-s390x-09:2509570] [ 6] /usr/lib64/openmpi/lib/libopen-rte.so.40(orte_daemon+0x26a)[0x3ff839bc72a]
[buildvm-s390x-09:2509570] [ 7] /lib64/libc.so.6(__libc_start_main+0x10a)[0x3ff836abb7a]
[buildvm-s390x-09:2509570] [ 8] orted(+0x954)[0x2aa11300954]
[buildvm-s390x-09:2509570] *** End of error message ***
[buildvm-s390x-09.s390.fedoraproject.org:2509569] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 716
[buildvm-s390x-09.s390.fedoraproject.org:2509569] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 172
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
  orte_ess_init failed
  --> Returned value Unable to start a daemon on the local node (-127) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
  ompi_mpi_init: ompi_rte_init failed
  --> Returned "Unable to start a daemon on the local node" (-127) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[buildvm-s390x-09.s390.fedoraproject.org:2509569] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
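
(Note added for context, not part of the original report: the backtrace puts the segfault inside libhwloc's topology discovery, i.e. the hwloc_topology_load() call in frame 2. Below is a minimal standalone probe of that same call, sketched under the assumption that hwloc-devel is installed on the builder; the file name and build line are illustrative only, e.g. gcc hwloc_probe.c -o hwloc_probe -lhwloc. Running it on the s390x builder would show whether hwloc crashes on its own or only when driven through Open MPI's opal_hwloc_base_get_topology().)

  /* hwloc_probe.c -- hypothetical reproducer sketch, not taken from this bug.
     It calls hwloc topology discovery directly, which is where the
     backtrace above crashes (libhwloc.so.15, hwloc_topology_load). */
  #include <stdio.h>
  #include <hwloc.h>

  int main(void)
  {
      hwloc_topology_t topology;

      if (hwloc_topology_init(&topology) != 0) {
          fprintf(stderr, "hwloc_topology_init failed\n");
          return 1;
      }
      /* The segfault in the log above happens inside this call. */
      if (hwloc_topology_load(topology) != 0) {
          fprintf(stderr, "hwloc_topology_load failed\n");
          hwloc_topology_destroy(topology);
          return 1;
      }
      printf("topology depth %d, %d processing units\n",
             hwloc_topology_get_depth(topology),
             hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_PU));
      hwloc_topology_destroy(topology);
      return 0;
  }

A clean run of this probe would point back at how Open MPI drives the library rather than at hwloc itself.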

Version-Release number of selected component (if applicable):
MUMPS-5.3.1-3
openmpi-4.0.4-1

How reproducible:
Building MUMPS on Rawhide

Actual results:
https://koji.fedoraproject.org/koji/taskinfo?taskID=47387705

Comment 1 Orion Poplawski 2020-07-18 20:04:29 UTC
This looks to be hwloc-related. I'd like to see if updating to 2.2.0 resolves it. I've filed https://src.fedoraproject.org/rpms/hwloc/pull-request/2

Comment 2 Orion Poplawski 2020-08-04 02:09:42 UTC
Hopefully fixed with hwloc-2.2.0-1.fc33. Please reopen if it is not.

