Bug 1619074 - openblas: s390x segfaults with fflas-ffpack testsuite
Summary: openblas: s390x segfaults with fflas-ffpack testsuite
Keywords:
Status: CLOSED RAWHIDE
Alias: None
Product: Fedora
Classification: Fedora
Component: openblas
Version: 31
Hardware: s390x
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Susi Lehtola
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: ZedoraTracker 1506952 1618946
Reported: 2018-08-20 02:14 UTC by Jerry James
Modified: 2020-04-06 08:50 UTC (History)
3 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-04-06 08:50:24 UTC
Type: Bug
Embargoed:


Attachments
fflas-ffpack spec file that uses openblas instead of atlas (7.28 KB, text/plain)
2018-08-20 02:14 UTC, Jerry James
test suite log file (1.27 MB, text/plain)
2018-08-24 13:37 UTC, Dan Horák


Links
System ID Private Priority Status Summary Last Updated
Github xianyi OpenBLAS issues 1743 0 None closed Incorrect results on s390x (-march=zEC12 -mtune=z13, ZARCH_GENERIC) 2020-07-21 19:14:46 UTC

Description Jerry James 2018-08-20 02:14:31 UTC
Created attachment 1477022 [details]
fflas-ffpack spec file that uses openblas instead of atlas

Description of problem:
I have been asked to switch packages I maintain from atlas to openblas.  However, I ran into trouble with the fflas-ffpack package.  The testsuite segfaults on s390x.  See https://koji.fedoraproject.org/koji/taskinfo?taskID=29190683 for a scratch build demonstrating the problem.  All tests pass on other architectures, and they pass on s390x with both atlas and the reference blas implementation.

Version-Release number of selected component (if applicable):
openblas-0.3.2-2.fc30

How reproducible:
Always

Steps to Reproduce:
1. fedpkg clone fflas-ffpack
2. Replace fflas-ffpack.spec with the attached version, which uses openblas
3. Build for s390x

Actual results:
Segfaults in the testsuite

Expected results:
Passing testsuite

Additional info:

Comment 1 Dominik 'Rathann' Mierzejewski 2018-08-20 10:16:10 UTC
Recommend following https://fedoraproject.org/wiki/Architectures/s390x#Shell_access_for_debugging to obtain the core file.

Comment 2 Dominik 'Rathann' Mierzejewski 2018-08-20 10:20:33 UTC
https://fedoraproject.org/wiki/Architectures/s390x#Notes_for_application_developers_and_package_maintainers suggests increasing stack size to fix SIGSEGV when running pcre tests. Indeed, the workaround is still there: https://src.fedoraproject.org/rpms/pcre/blob/master/f/pcre.spec#_170

It's worth a try.

Comment 3 Dan Horák 2018-08-20 15:53:11 UTC
Our public s390x guest is out of service at the moment, but it would be interesting to see what happens when openblas-0.3.2-1.fc29 is used (it incorrectly used a z13-based kernel). Adding to my to-do list ...

Comment 4 Jerry James 2018-08-21 03:36:32 UTC
(In reply to Dominik 'Rathann' Mierzejewski from comment #2)
> https://fedoraproject.org/wiki/Architectures/s390x#Notes_for_application_developers_and_package_maintainers
> suggests increasing stack size to fix SIGSEGV when running pcre tests.
> Indeed, the workaround is still there:
> https://src.fedoraproject.org/rpms/pcre/blob/master/f/pcre.spec#_170
> 
> It's worth a try.

Sadly, no, increasing the stack limit did not change the outcome:

https://koji.fedoraproject.org/koji/taskinfo?taskID=29209212

Comment 5 Jerry James 2018-08-23 20:03:00 UTC
Building locally with mock --forcearch s390x, I see this for one of the test programs that dumped core:

[mockbuild@3dd16e3c6a964f7f8f24f6d548e7b042 tests]$ ./test-ftrsm
Checking with Modular<double> mod 523427
terminate called after throwing an instance of 'FailureTrsmCheck'
Aborted (core dumped)

So it isn't a segfault; it's an abort because something computed the wrong value and nothing caught the resulting exception.  Sorry for erroneously calling the failure a segfault.
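To make the segfault/abort distinction concrete: an uncaught C++ exception ends in std::terminate, which by default calls abort() and so kills the process with SIGABRT rather than SIGSEGV. The following is a minimal POSIX sketch in C, not code from fflas-ffpack. The helper names (terminating_signal, child_aborts, child_returns) are made up for illustration. It runs a function in a child process and reports which signal, if any, terminated it.

```c
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run fn in a forked child and return the number of
 * the signal that killed it, 0 if it exited normally, or -1 on error. */
int terminating_signal(void (*fn)(void))
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        fn();
        _exit(0);               /* child finished without being signalled */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}

/* Stand-in for a test program whose exception nobody catches:
 * the default terminate handler ends up in abort(). */
void child_aborts(void)
{
    abort();
}

/* Stand-in for a test program that finishes normally. */
void child_returns(void)
{
}
```

Comparing the return value of terminating_signal(child_aborts) against SIGABRT and SIGSEGV shows which of the two a "core dumped" message actually refers to.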

Looking through the test results, I see several test failures.  Some result in a core file and some don't.  The bottom line is that openblas is computing values that the fflas-ffpack test suite considers incorrect.  I need to see what the nonmatching values are to diagnose the problem.

Sadly, I have now hit the limits of mock --forcearch:

(gdb) run
Starting program: /builddir/build/BUILD/fflas_ffpack-2.3.2/tests/test-ftrsm 
qemu: Unsupported syscall: 26
warning: Could not trace the inferior process.
Error: 
warning: ptrace: Function not implemented
During startup program exited with code 127.

Dan, is there any chance that public s390x guest might come back?  If not, I would appreciate any help those with access to s390x hardware can give.

Footnotes:
[1] Almost.  With ATLAS, I had to disable the test-lu and test-echelon tests on ppc64 and ppc64le, because of bug 1410633.  With openblas, those tests pass on ppc64 and ppc64le, but the following tests fail on s390x:
- test-ftrtri
- test-ftrmv
- test-ftrsm (aborts)
- test-ftrsm-check (aborts)
- test-ftrmm
- test-pluq-check (aborts)
- test-fsytrf
- test-invert-check (aborts)
- test-det-check
- test-echelon
So to get passing tests, I should build with openblas on all arches except s390x, and use atlas on s390x.  But that's ugly and horrible and I don't want to do it if there is any chance at all that the problem with openblas + s390x can be identified and fixed.

Comment 6 Jerry James 2018-08-23 21:57:48 UTC
I commented out the floating point tests for the failing test programs, and sure enough, the integer tests pass.

I thought I would try rebuilding openblas with -ffloat-store or maybe -ffp-contract=off, so I grabbed the SRPM, inserted that everywhere that %{optflags} appears and kicked off an s390x mock build.

I am seeing a large number of files compiled without %{optflags}.  This is probably due to lines 394 through 400 of the spec file:

%if 0%{?rhel} == 5
# Gfortran too old to recognize -frecursive
COMMON="%{optflags} -fPIC"
FCOMMON="%{optflags} -fPIC"
%else
FCOMMON="%{optflags} -fPIC -frecursive"
%endif

Notice the lack of a COMMON definition in the second case.  Could that cause this issue?

Comment 7 Dan Horák 2018-08-24 13:37:02 UTC
Created attachment 1478543 [details]
test suite log file

And this is the info from some of the aborts.

[sharkcz@devel10 fflas-ffpack]$ coredumpctl info 45560
           PID: 45560 (test-invert-che)
           UID: 1000 (sharkcz)
           GID: 1012 (sharkcz)
        Signal: 6 (ABRT)
     Timestamp: Fri 2018-08-24 09:15:27 EDT (13min ago)
  Command Line: ./test-invert-check
    Executable: /home/sharkcz/fflas-ffpack/fflas_ffpack-2.3.2/tests/test-invert-check
 Control Group: /user.slice/user-1000.slice/session-59.scope
          Unit: session-59.scope
         Slice: user-1000.slice
       Session: 59
     Owner UID: 1000 (sharkcz)
       Boot ID: cad3ea6c02cb4ef7aa5c17cbc3bae66f
    Machine ID: 9f494311b8fe4625a05e6f0acd9c4b3f
      Hostname: devel10.s390.bos.redhat.com
       Storage: /var/lib/systemd/coredump/core.test-invert-che.1000.cad3ea6c02cb4ef7aa5c17cbc3bae66f.45560.1535116527000000.lz4
       Message: Process 45560 (test-invert-che) of user 1000 dumped core.
                
                Stack trace of thread 45560:
                #0  0x0000020032cbe454 raise (libc.so.6)
                #1  0x0000020032ca3ce8 abort (libc.so.6)
                #2  0x00000200328ab150 _ZN9__gnu_cxx27__verbose_terminate_handlerEv (libstdc++.so.6)
                #3  0x00000200328a8a5e n/a (libstdc++.so.6)
                #4  0x00000200328a8ac0 _ZSt9terminatev (libstdc++.so.6)
                #5  0x00000200328a8d96 __cxa_throw (libstdc++.so.6)
                #6  0x000002aa284491d2 _ZNK6FFPACK18CheckerImplem_PLUQIN6Givaro7ModularIddEEE5checkEPKdmN5FFLAS10FFLAS_DIAGEmPmS9_ (test-invert-check)
                #7  0x000002aa2844967e _ZN6FFPACK9Protected11GaussJordanIN6Givaro7ModularIddEEEEmRKT_mmNS5_11Element_ptrEmmmmPmS9_NS_13FFPACK_LU_TAGE (test-invert-check)
                #8  0x000002aa2844a396 _ZN6FFPACK21ReducedRowEchelonFormIN6Givaro7ModularIddEEEEmRKT_mmNS4_11Element_ptrEmPmS8_bNS_13FFPACK_LU_TAGE (test-invert-check)
                #9  0x000002aa28409e84 main (test-invert-check)
                #10 0x0000020032ca4172 __libc_start_main (libc.so.6)
                #11 0x000002aa2840a204 _start (test-invert-check)


[sharkcz@devel10 fflas-ffpack]$ coredumpctl info 45367
           PID: 45367 (test-pluq-check)
           UID: 1000 (sharkcz)
           GID: 1012 (sharkcz)
        Signal: 6 (ABRT)
     Timestamp: Fri 2018-08-24 09:14:30 EDT (20min ago)
  Command Line: ./test-pluq-check
    Executable: /home/sharkcz/fflas-ffpack/fflas_ffpack-2.3.2/tests/test-pluq-check
 Control Group: /user.slice/user-1000.slice/session-59.scope
          Unit: session-59.scope
         Slice: user-1000.slice
       Session: 59
     Owner UID: 1000 (sharkcz)
       Boot ID: cad3ea6c02cb4ef7aa5c17cbc3bae66f
    Machine ID: 9f494311b8fe4625a05e6f0acd9c4b3f
      Hostname: devel10.s390.bos.redhat.com
       Storage: /var/lib/systemd/coredump/core.test-pluq-check.1000.cad3ea6c02cb4ef7aa5c17cbc3bae66f.45367.1535116470000000.lz4 (inaccessible)
       Message: Process 45367 (test-pluq-check) of user 1000 dumped core.
                
                Stack trace of thread 45367:
                #0  0x00000200083be454 raise (libc.so.6)
                #1  0x00000200083a3ce8 abort (libc.so.6)
                #2  0x0000020007fab150 _ZN9__gnu_cxx27__verbose_terminate_handlerEv (libstdc++.so.6)
                #3  0x0000020007fa8a5e n/a (libstdc++.so.6)
                #4  0x0000020007fa8ac0 _ZSt9terminatev (libstdc++.so.6)
                #5  0x0000020007fa8d96 __cxa_throw (libstdc++.so.6)
                #6  0x000002aa0a84542e _ZN5FFLAS19CheckerImplem_ftrsmIN6Givaro7ModularIddEEE5checkENS_10FFLAS_SIDEENS_10FFLAS_UPLOENS_15FFLAS_TRANSPOSEENS_10FFLAS_DIAGEmmPKdmSA_m (test-pluq>
                #7  0x000002aa0a845712 _ZN6FFPACK5_PLUQIN6Givaro7ModularIddEEEEmRKT_N5FFLAS10FFLAS_DIAGEmmNS4_11Element_ptrEmPmSA_m (test-pluq-check)
                #8  0x000002aa0a809994 _ZN6FFPACK4PLUQIN6Givaro7ModularIddEEEEmRKT_N5FFLAS10FFLAS_DIAGEmmNS4_11Element_ptrEmPmSA_m (test-pluq-check)
                #9  0x00000200083a4172 __libc_start_main (libc.so.6)
                #10 0x000002aa0a80ab94 _start (test-pluq-check)
                
                Stack trace of thread 45370:
                #0  0x000002000821b2f8 n/a (libgomp.so.1)
                #1  0x00000200082188e2 n/a (libgomp.so.1)
                #2  0x00000200083080fe start_thread (libpthread.so.0)
                #3  0x0000020008479f96 thread_start (libc.so.6)
                
                Stack trace of thread 45368:
                #0  0x000002000821b2f8 n/a (libgomp.so.1)
                #1  0x00000200082188e2 n/a (libgomp.so.1)
                #2  0x00000200083080fe start_thread (libpthread.so.0)
                #3  0x0000020008479f96 thread_start (libc.so.6)
                
                Stack trace of thread 45369:
                #0  0x000002000821b2f8 n/a (libgomp.so.1)
                #1  0x00000200082188e2 n/a (libgomp.so.1)
                #2  0x00000200083080fe start_thread (libpthread.so.0)
                #3  0x0000020008479f96 thread_start (libc.so.6)

Comment 8 Dan Horák 2018-08-24 13:38:50 UTC
(In reply to Jerry James from comment #5)
> 
> Dan, is there any chance that public s390x guest might come back?  If not, I
> would appreciate any help those with access to s390x hardware can give.

The plan from the Marist people is/was to have the hypervisor ready again today, so there is hope it won't take long to have the public guest back.

Comment 9 Dan Horák 2018-08-24 15:30:58 UTC
backtrace from gdb for the "45560" abort

(gdb) where
#0  0x0000020032cbe454 in raise () from /lib64/libc.so.6
#1  0x0000020032ca3ce8 in abort () from /lib64/libc.so.6
#2  0x00000200328ab150 in __gnu_cxx::__verbose_terminate_handler() () from /lib64/libstdc++.so.6
#3  0x00000200328a8a5e in ?? () from /lib64/libstdc++.so.6
#4  0x00000200328a8ac0 in std::terminate() () from /lib64/libstdc++.so.6
#5  0x00000200328a8d96 in __cxa_throw () from /lib64/libstdc++.so.6
#6  0x000002aa284491d2 in FFPACK::CheckerImplem_PLUQ<Givaro::Modular<double, double> >::check (Q=0x2aa605c3920, P=0x2aa605c3330, r=189, Diag=FFLAS::FflasUnit, lda=378, A=0x2003309a010, 
    this=<synthetic pointer>) at ../fflas-ffpack/utils/fflas_memory.h:90
#7  FFPACK::PLUQ<Givaro::Modular<double, double> > (BCThreshold=256, Q=0x2aa605c3920, P=0x2aa605c3330, lda=<optimized out>, A=0x2003309a010, N=189, M=189, Diag=FFLAS::FflasUnit, Fi=...)
    at ../fflas-ffpack/ffpack/ffpack_pluq.inl:662
#8  FFPACK::RowEchelonForm<Givaro::Modular<double, double> > (LuTag=FFPACK::FfpackTileRecursive, transform=true, Qt=0x2aa605c3920, P=0x2aa605c3330, lda=<optimized out>, A=0x2003309a010, 
    N=189, M=189, F=...) at ../fflas-ffpack/ffpack/ffpack_echelonforms.inl:67
#9  FFPACK::ReducedRowEchelonForm<Givaro::Modular<double, double> > (F=..., M=189, N=189, A=0x2003309a010, lda=<optimized out>, P=0x2aa605c3330, Qt=0x2aa605c3920, transform=true, 
    LuTag=FFPACK::FfpackTileRecursive) at ../fflas-ffpack/ffpack/ffpack_echelonforms.inl:121
#10 0x000002aa2844967e in FFPACK::Protected::GaussJordan<Givaro::Modular<double, double> > (F=..., M=189, N=189, A=0x2003309a010, lda=378, colbeg=0, rowbeg=0, colsize=189, P=0x2aa605c3330, 
    Q=0x2aa605c3920, LuTag=FFPACK::FfpackGaussJordanTile) at ../fflas-ffpack/ffpack/ffpack_echelonforms.inl:144
#11 0x000002aa2844a396 in FFPACK::ReducedRowEchelonForm<Givaro::Modular<double, double> > (LuTag=FFPACK::FfpackGaussJordanTile, transform=true, Qt=0x2aa605c3920, P=0x2aa605c3330, lda=378, 
    A=0x2003309a010, N=<optimized out>, M=<optimized out>, F=...) at ../fflas-ffpack/ffpack/ffpack_echelonforms.inl:111
#12 FFPACK::Invert<Givaro::Modular<double, double> > (F=..., M=<optimized out>, A=0x2003309a010, lda=378, nullity=@0x3ffcb87dfb4: 833249088) at ../fflas-ffpack/ffpack/ffpack_invert.inl:51
#13 0x000002aa28409e84 in main (argc=<optimized out>, argv=<optimized out>) at test-invert-check.C:80

Comment 10 Susi Lehtola 2018-08-24 16:29:14 UTC
(In reply to Jerry James from comment #6)
> I am seeing a large number of files compiled without %{optflags}.  This is
> probably due to lines 394 through 400 of the spec file:
> 
> %if 0%{?rhel} == 5
> # Gfortran too old to recognize -frecursive
> COMMON="%{optflags} -fPIC"
> FCOMMON="%{optflags} -fPIC"
> %else
> FCOMMON="%{optflags} -fPIC -frecursive"
> %endif
> 
> Notice the lack of a COMMON definition in the second case.  Could that cause
> this issue?

Good catch.

Comment 11 Jerry James 2018-08-27 19:22:07 UTC
Unfortunately, it didn't cause this issue.  So here is what I am noticing now.  Take a look in build.log for the latest build.  On s390x only, no other architecture, the test suite reports test failures, over 100 failures, in fact.  So there are two bugs here:
(1) test failures don't cause %check to fail the build; and
(2) tests are failing on s390x.

Here is an example test failure:
 SGEMM  PASSED THE TESTS OF ERROR-EXITS
 SGEMM  PASSED THE COMPUTATIONAL TESTS ( 17496 CALLS)
 SSYMM  PASSED THE TESTS OF ERROR-EXITS
 SSYMM  PASSED THE COMPUTATIONAL TESTS (  1296 CALLS)
 STRMM  PASSED THE TESTS OF ERROR-EXITS
 ******* FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE *******
           EXPECTED RESULT   COMPUTED RESULT
       1      0.186813          0.373626    
 ******* STRMM  FAILED ON CALL NUMBER:
    506: STRMM ('L','U','N','U',  1,  1, 1.0, A,  2, B,  2)        .

This strongly suggests to me that the fflas-ffpack test suite is right: openblas is computing incorrect results on s390x.  Looking through the failures, I see something interesting: the computed result is exactly two times the expected result in every failure I have looked at so far.  There is probably an off-by-one bit shift error somewhere in the s390x support code.
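For reference, the failing call has SIDE='L', UPLO='U', TRANS='N', DIAG='U', M=N=1 and alpha=1.0. With a unit diagonal and a 1x1 problem, B should come back unchanged, which is why the expected value is simply the input. The sketch below is a plain C reference loop for that one parameter combination, written for illustration only. It is not OpenBLAS code, and the name ref_strmm_lunu is made up.

```c
#include <stddef.h>

/* Reference computation of B := alpha * A * B for SIDE='L', UPLO='U',
 * TRANS='N', DIAG='U', with column-major storage.
 * With DIAG='U' the diagonal of A is taken to be 1 and is never read. */
void ref_strmm_lunu(size_t m, size_t n, float alpha,
                    const float *a, size_t lda,
                    float *b, size_t ldb)
{
    for (size_t j = 0; j < n; j++) {
        /* Rows are processed top to bottom, so the rows below row i
         * still hold their original values when row i is computed. */
        for (size_t i = 0; i < m; i++) {
            /* Implicit unit diagonal: start from B(i,j) itself. */
            float sum = b[i + j * ldb];
            /* Add the strictly upper part of row i of A. */
            for (size_t k = i + 1; k < m; k++)
                sum += a[i + k * lda] * b[k + j * ldb];
            b[i + j * ldb] = alpha * sum;
        }
    }
}
```

Calling it with m = n = 1, alpha = 1.0f, lda = ldb = 2 and B(0,0) = 0.186813f leaves B(0,0) at 0.186813f regardless of what A contains, matching the EXPECTED RESULT column above.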

Might I also suggest the use of %ldconfig_scriptlets in place of explicit invocations of ldconfig?

Comment 12 Jerry James 2018-08-27 19:30:57 UTC
Sorry, I was imprecise: I'm talking about the openblas build.log, and test failures in the openblas test suite.

Comment 13 Jerry James 2018-08-27 19:44:55 UTC
I also notice a lot of warnings like this:

BUILDSTDERR: xerbla.c: In function 'cblas_xerbla':
BUILDSTDERR: xerbla.c:16:35: warning: format '%d' expects argument of type 'int', but argument 3 has type 'blasint' {aka 'long int'} [-Wformat=]
BUILDSTDERR:        fprintf(stderr, "Parameter %d to routine %s was incorrect\n", p, rout);
BUILDSTDERR:                                   ~^                                 ~
BUILDSTDERR:                                   %ld

That means that fprintf is only accessing 32 bits of the 64-bit value passed to it.  On little endian architectures, you can often get away with this, as the upper 32 bits are often zero, and you fortuitously get the lower 32 bits.  On a big endian architecture like s390x, though, you get the upper 32 bits (which are often zero).  For error messages, maybe we don't care, but there may be non-error messages in the code base where this does matter.  These warnings should be fixed (e.g., by specifying %ld and casting the argument to long, in case blasint is shorter than a long on some architectures).
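A rough sketch of the suggested fix follows, together with a small illustration of the byte-order effect. The names blasint_demo, format_param_error, first_four_bytes and is_little_endian are hypothetical and not taken from OpenBLAS. The illustration reads the first four bytes of a 64-bit value in memory. What a mismatched varargs conversion really does is undefined behaviour and depends on the ABI, so this only shows the general idea.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for blasint in a build where it is 'long int'. */
typedef long blasint_demo;

/* Format the message with a conversion that matches the argument:
 * %ld plus an explicit cast to long, so it stays correct even if the
 * integer type is narrower than long on some architecture. */
int format_param_error(char *buf, size_t len, blasint_demo p, const char *rout)
{
    return snprintf(buf, len, "Parameter %ld to routine %s was incorrect\n",
                    (long)p, rout);
}

/* Return the first four bytes (in memory order) of a 64-bit value,
 * reinterpreted as a 32-bit integer. */
uint32_t first_four_bytes(uint64_t v)
{
    uint32_t out;
    memcpy(&out, &v, sizeof out);
    return out;
}

/* Report whether the machine stores the low-order byte first. */
int is_little_endian(void)
{
    uint16_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 1;
}
```

On a little endian machine first_four_bytes(7) yields 7 (the lower half), while on a big endian machine such as s390x it yields 0 (the upper half).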

Comment 14 Susi Lehtola 2018-08-27 20:39:25 UTC
(In reply to Jerry James from comment #11)
> Unfortunately, it didn't cause this issue.  So here is what I am noticing
> now.  Take a look in build.log for the latest build.  On s390x only, no
> other architecture, the test suite reports test failures, over 100 failures,
> in fact.  So there are two bugs here:
> (1) test failures don't cause %check to fail the build; and
> (2) tests are failing on s390x.

Yay...

Would you mind reporting the issues to OpenBLAS upstream? You appear to know much more about the problem than I.

Comment 15 Jerry James 2018-08-28 02:29:28 UTC
Reported upstream.  I don't speak Fortran or s390x assembly, and I don't have access to any real s390x systems at the moment, so I probably won't be much help debugging this.

Comment 16 Dan Horák 2018-08-28 06:25:07 UTC
(In reply to Dan Horák from comment #8)
> (In reply to Jerry James from comment #5)
> > 
> > Dan, is there any chance that public s390x guest might come back?  If not, I
> > would appreciate any help those with access to s390x hardware can give.
> 
> The plan from the Marist people is/was to have the hypervisor ready again
> today, so there is hope it won't take long to have the public guest back.

And it is up again. Beware it's a z13 machine, so openblas needs to be built with TARGET= to enable the generic backend to match the HW Fedora supports (zEC12 and newer)

Comment 17 Susi Lehtola 2018-08-28 10:42:11 UTC
(In reply to Dan Horák from comment #16)
> And it is up again. Beware it's a z13 machine, so openblas needs to be built
> with TARGET= to enable the generic backend to match the HW Fedora supports
> (zEC12 and newer)

That's already been done.

Comment 18 Dan Horák 2018-08-28 11:13:26 UTC
(In reply to Susi Lehtola from comment #17)
> (In reply to Dan Horák from comment #16)
> > And it is up again. Beware it's a z13 machine, so openblas needs to be built
> > with TARGET= to enable the generic backend to match the HW Fedora supports
> > (zEC12 and newer)
> 
> That's already been done.

right, but it's needed when building openblas from sources directly on the public guest

Comment 19 Dominik 'Rathann' Mierzejewski 2018-08-29 20:12:40 UTC
It looks like it's fixed by https://github.com/martin-frbg/OpenBLAS/commit/f3fd44a731c1997b1d79d4d16abc25d78dce88a7 and the fix will be included in 0.3.3.

Comment 20 Susi Lehtola 2018-08-29 20:19:53 UTC
(In reply to Dominik 'Rathann' Mierzejewski from comment #19)
> It looks like it's fixed by
> https://github.com/martin-frbg/OpenBLAS/commit/f3fd44a731c1997b1d79d4d16abc25d78dce88a7
> and the fix will be included in 0.3.3.

Dan's already building fixed packages
https://koji.fedoraproject.org/koji/buildinfo?buildID=1140405

Comment 21 Dan Horák 2018-08-29 20:26:14 UTC
I'm building openblas-0.3.2-5.fc30, which includes the fix, right now. We should know in a while whether it fixes some of the issues appearing on s390x.

Comment 22 Dan Horák 2018-08-30 07:27:16 UTC
And all fflas-ffpack tests pass when the new openblas build is in the buildroot. Going to test some other packages too.

Comment 23 Dan Horák 2018-09-27 17:11:17 UTC
with current rawhide buildroot I see only
FAIL: test-fgemm on aarch64
when using the spec file from attachment

https://koji.fedoraproject.org/koji/taskinfo?taskID=29919285

Comment 24 Ben Cotton 2019-08-13 16:50:04 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 31 development cycle.
Changing version to '31'.

Comment 25 Ben Cotton 2019-08-13 19:43:40 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 31 development cycle.
Changing version to 31.

Comment 26 Dominik 'Rathann' Mierzejewski 2020-04-05 21:08:10 UTC
I guess this can be closed, Susi?

Comment 27 Susi Lehtola 2020-04-06 08:50:24 UTC
I think so.

