
Bug 1858522

Summary: Returned value Unable to start a daemon on the local node (-127) instead of ORTE_SUCCESS
Product: Fedora
Component: hwloc
Version: rawhide
Hardware: s390x
OS: Unspecified
Status: CLOSED RAWHIDE
Severity: unspecified
Priority: unspecified
Reporter: Antonio T. sagitter <trpost>
Assignee: Jiri Hladky <hladky.jiri>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: dan, hannsj_uhl, hladky.jiri, mschmidt, orion, trpost
Fixed In Version: hwloc-2.2.0-1.fc33
Last Closed: 2020-08-04 02:09:42 UTC
Type: Bug
Bug Blocks: 467765, 1863077

Description Antonio T. sagitter 2020-07-18 18:07:04 UTC
Description of problem:
The OpenMPI tests of MUMPS are failing on Rawhide, on s390x only:

+ export OMPI_MCA_rmaps_base_oversubscribe=1
+ OMPI_MCA_rmaps_base_oversubscribe=1
+ ./ssimpletest
[buildvm-s390x-09:2509570] *** Process received signal ***
[buildvm-s390x-09:2509570] Signal: Segmentation fault (11)
[buildvm-s390x-09:2509570] Signal code: Address not mapped (1)
[buildvm-s390x-09:2509570] Failing at address: 0xfffffffffffff000
[buildvm-s390x-09:2509570] [ 0] [0x3fffdafcee0]
[buildvm-s390x-09:2509570] [ 1] /lib64/libhwloc.so.15(+0x44870)[0x3ff831c4870]
[buildvm-s390x-09:2509570] [ 2] /lib64/libhwloc.so.15(hwloc_topology_load+0xe6)[0x3ff83196ae6]
[buildvm-s390x-09:2509570] [ 3] /usr/lib64/openmpi/lib/libopen-pal.so.40(opal_hwloc_base_get_topology+0xfe2)[0x3ff836040d2]
[buildvm-s390x-09:2509570] [ 4] /usr/lib64/openmpi/lib/openmpi/mca_ess_hnp.so(+0x508c)[0x3ff82a0508c]
[buildvm-s390x-09:2509570] [ 5] /usr/lib64/openmpi/lib/libopen-rte.so.40(orte_init+0x2d2)[0x3ff83a112d2]
[buildvm-s390x-09:2509570] [ 6] /usr/lib64/openmpi/lib/libopen-rte.so.40(orte_daemon+0x26a)[0x3ff839bc72a]
[buildvm-s390x-09:2509570] [ 7] /lib64/libc.so.6(__libc_start_main+0x10a)[0x3ff836abb7a]
[buildvm-s390x-09:2509570] [ 8] orted(+0x954)[0x2aa11300954]
[buildvm-s390x-09:2509570] *** End of error message ***
[buildvm-s390x-09.s390.fedoraproject.org:2509569] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 716
[buildvm-s390x-09.s390.fedoraproject.org:2509569] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 172
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
  orte_ess_init failed
  --> Returned value Unable to start a daemon on the local node (-127) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
  ompi_mpi_init: ompi_rte_init failed
  --> Returned "Unable to start a daemon on the local node" (-127) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[buildvm-s390x-09.s390.fedoraproject.org:2509569] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
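
The backtrace points at hwloc: frame 2 is hwloc_topology_load() in /lib64/libhwloc.so.15, reached from opal_hwloc_base_get_topology() while orted starts up. As a triage aid (a minimal sketch, not part of the original report; only public hwloc API calls are used), the same code path can likely be exercised without Open MPI at all:

  #include <stdio.h>
  #include <hwloc.h>

  int main(void)
  {
      hwloc_topology_t topology;

      /* Initialize a topology object, then discover the machine topology;
       * these are the libhwloc entry points visible in the backtrace. */
      if (hwloc_topology_init(&topology) != 0) {
          fprintf(stderr, "hwloc_topology_init failed\n");
          return 1;
      }
      /* The backtrace places the segfault inside hwloc_topology_load(). */
      if (hwloc_topology_load(topology) != 0) {
          fprintf(stderr, "hwloc_topology_load failed\n");
          hwloc_topology_destroy(topology);
          return 1;
      }
      printf("topology loaded, %d PU(s)\n",
             hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_PU));
      hwloc_topology_destroy(topology);
      return 0;
  }

Built with e.g. gcc test-hwloc.c -o test-hwloc $(pkg-config --cflags --libs hwloc); if this segfaults on the s390x builder as well, the fault is in libhwloc's topology discovery rather than in Open MPI.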

Version-Release number of selected component (if applicable):
MUMPS-5.3.1-3
openmpi-4.0.4-1

How reproducible:
Building MUMPS on Rawhide

Actual results:
https://koji.fedoraproject.org/koji/taskinfo?taskID=47387705

Comment 1 Orion Poplawski 2020-07-18 20:04:29 UTC
This looks to be hwloc-related. I'd like to see if updating to hwloc 2.2.0 resolves it, so I've filed https://src.fedoraproject.org/rpms/hwloc/pull-request/2
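
Once the update lands, one quick sanity check (a sketch; HWLOC_API_VERSION and hwloc_get_api_version() are part of hwloc's public API) is to compare the compile-time and runtime API versions, to rule out a stale libhwloc.so being picked up:

  #include <stdio.h>
  #include <hwloc.h>

  int main(void)
  {
      /* HWLOC_API_VERSION: API version of the headers used at build time.
       * hwloc_get_api_version(): API version of the libhwloc.so loaded at run time. */
      printf("compiled against hwloc API 0x%x, running against 0x%x\n",
             (unsigned) HWLOC_API_VERSION, hwloc_get_api_version());
      return 0;
  }

A mismatch between the two values would suggest the test is not actually running against hwloc-2.2.0.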

Comment 2 Orion Poplawski 2020-08-04 02:09:42 UTC
Hopefully fixed with hwloc-2.2.0-1.fc33. Reopen if it is not.