Bug 2247319

Summary: sporadic sigill with python3.12 in livemedia composes
Product: [Fedora] Fedora
Component: libblockdev
Version: 39
Hardware: aarch64
OS: Linux
Status: NEW
Severity: medium
Priority: unspecified
Reporter: Kevin Fenzi <kevin>
Assignee: Vojtech Trefny <vtrefny>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: awilliam, cstratak, dan, dennis, fweimer, mhroncok, python-packagers-sig, quantum.analyst, robatino, tbzatek, thrnciar, v.podzimek+fedora, vtrefny, zbyszek
Target Milestone: ---
Target Release: ---
Whiteboard: AcceptedFreezeException
Doc Type: If docs needed, set a value
Bug Blocks: 245418, 2187795
Attachments:
- updated coredump with libblockdev-3.1.0-6.fc41 (flags: none)
- another updated coredump with libblockdev 3.1.1-4 (with -fsanitize=undefined) (flags: none)

Description Kevin Fenzi 2023-10-31 20:32:31 UTC
We have been seeing this for a while, finally filing a bug on it to get it better tracked. :( 

Sporadically we are seeing livemedia composes in both f39 and rawhide fail.
anaconda reports that it couldn't connect to dbus.
On the builders there's a SIGILL from python3.12:

           PID: 1535555 (python3)
           UID: 0 (root)
           GID: 0 (root)
        Signal: 4 (ILL)
     Timestamp: Tue 2023-10-31 08:00:49 UTC (12h ago)
  Command Line: python3 -m pyanaconda.modules.storage
    Executable: /usr/bin/python3.12
 Control Group: /machine.slice/machine-810eb7856fe6435d848985a92b99d9b9.scope/payload
          Unit: machine-810eb7856fe6435d848985a92b99d9b9.scope
         Slice: machine.slice
       Boot ID: 4a08a17dadbd421d9b3a274841c5ad95
    Machine ID: 56794a87462f49be8a21501b922b4dbd
      Hostname: 810eb7856fe6435d848985a92b99d9b9
       Storage: none
       Message: Process 1535555 (python3) of user 0 dumped core.

https://koji.fedoraproject.org/koji/taskinfo?taskID=108358818 is an example of a failed task.

This could be an anaconda bug, or a problem with our builders, but I thought I would start with python3.12.

It's only on aarch64 and only livemedia tasks.
It's sporadic: not the same builder or image every time.
It's also not the same command line every time.

     15   Command Line: python3 -m pyanaconda.modules.localization
     15   Command Line: python3 -m pyanaconda.modules.network
     13   Command Line: python3 -m pyanaconda.modules.payloads
      3   Command Line: python3 -m pyanaconda.modules.runtime
      5   Command Line: python3 -m pyanaconda.modules.security
      6   Command Line: python3 -m pyanaconda.modules.services
     12   Command Line: python3 -m pyanaconda.modules.storage
      4   Command Line: python3 -m pyanaconda.modules.subscription
      6   Command Line: python3 -m pyanaconda.modules.timezone
      4   Command Line: python3 -m pyanaconda.modules.users

Happy to try and gather more info or track this down. Thanks!

Reproducible: Sometimes

Steps to Reproduce:
1. compose fedora
2. see some livemedia failures

Comment 1 Miro Hrončok 2023-10-31 21:00:34 UTC
I have no idea where or how I would start investigating this. Is it possible to download the filesystem of the builder and boot it up in some VM?

Comment 2 Kevin Fenzi 2023-11-01 15:24:39 UTC
Yeah, I am really not sure how to debug this either. :( 

Perhaps we could first cc the anaconda folks and see if they see anything like it on their aarch64 testing/ci?

All of these are on bare metal aarch64 builders, with mock / systemd-nspawn and livemedia-creator running in that. 

I could also try and duplicate it in staging, but probably not this week...

Comment 3 Kevin Fenzi 2023-11-01 15:25:27 UTC
Oh, I forgot: another thing I'd like to do is update/reboot those builders... but we are in the f39 freeze and not making any changes there. Hopefully we can soon...

Comment 4 Adam Williamson 2023-11-01 16:30:05 UTC
can we get the coredump and backtrace it at least?

Comment 5 Kevin Fenzi 2023-11-01 17:00:38 UTC
There isn't one. At least nothing in coredumpctl. ;( 

I am not sure why.

Comment 6 Adam Williamson 2023-11-01 17:25:42 UTC
it's ulimit, I think:

Oct 31 09:49:01 buildhw-a64-02.iad2.fedoraproject.org systemd-coredump[1773132]: Resource limits disable core dumping for process 1773084 (python3).

we should be able to change the limits and get a coredump.

Comment 7 Victor Stinner 2023-11-02 17:35:41 UTC
If you can get a coredump, I may be able to help you to debug it.

"python3 -m pyanaconda.modules.storage": is it easy to reproduce the issue?

Do you think it would be possible to create a special image containing python3-debug and run the anaconda programs (such as "python3 -m pyanaconda.modules.storage") with it instead, i.e. run "python3-debug -X dev -m pyanaconda.modules.storage"? Adding "-X dev" enables more debug checks:

* -X dev: https://docs.python.org/dev/library/devmode.html
* python3-debug is a Python debug build: https://docs.python.org/dev/using/configure.html#python-debug-build

Comment 8 Adam Williamson 2023-11-02 18:19:21 UTC
we're working on getting a coredump - see https://pagure.io/fedora-infra/ansible/pull-request/1618 . we might be able to do a custom image also, I can look into that later.

Comment 9 Miro Hrončok 2023-11-13 10:16:54 UTC
Any luck getting a coredump since the PR has been merged?

Comment 10 Kevin Fenzi 2023-11-13 15:15:09 UTC
Sadly no. It didn't work. ;( 

I'll try and figure out another way...

Comment 11 Kevin Fenzi 2023-11-14 19:26:59 UTC
ok. Got a core finally...

           PID: 3558463 (python3)
           UID: 0 (root)
           GID: 0 (root)
        Signal: 4 (ILL)
     Timestamp: Tue 2023-11-14 19:12:16 UTC (4min 15s ago)
  Command Line: python3 -m pyanaconda.modules.payloads
    Executable: /usr/bin/python3.12
 Control Group: /machine.slice/machine-a419a95f068a4ded988be5153ba67666.scope/payload
          Unit: machine-a419a95f068a4ded988be5153ba67666.scope
         Slice: machine.slice
       Boot ID: 45eb1ecb91fe45fc9c7bda79e918763c
    Machine ID: 5d8add178f344e378788cb2550e6a521
      Hostname: a419a95f068a4ded988be5153ba67666
       Storage: /var/lib/systemd/coredump/core.python3.0.45eb1ecb91fe45fc9c7bda79e918763c.3558463.1699989136000000.zst (present)
  Size on Disk: 2.7M
       Package: python3.12/3.12.0-2.fc40
      build-id: b9bd80861442b62dc97210c2bd0334084f6c734c
       Message: Process 3558463 (python3) of user 0 dumped core.
                
                Module /usr/bin/python3.12 from rpm python3.12-3.12.0-2.fc40.aarch64
                Module /usr/lib64/libdevmapper.so.1.02 from rpm lvm2-2.03.22-1.fc39.aarch64
                Module /usr/lib64/libbd_lvm-dbus.so.3.0.0 from rpm libblockdev-3.0.4-1.fc40.aarch64
                Module /usr/lib64/python3.12/site-packages/selinux/_selinux.cpython-312-aarch64-linux-gnu.so from rpm libselinux-3.5-5.fc39.aarch64
                Module /usr/lib64/libzstd.so.1.5.5 from rpm zstd-1.5.5-4.fc39.aarch64
                Module /usr/lib64/libcap.so.2.48 from rpm libcap-2.48-7.fc39.aarch64
                Module /usr/lib64/libkmod.so.2.4.1 from rpm kmod-31-2.fc40.aarch64
                Module /usr/lib64/libudev.so.1.7.7 from rpm systemd-254.5-2.fc40.aarch64
                Module /usr/lib64/libbd_utils.so.3.0.0 from rpm libblockdev-3.0.4-1.fc40.aarch64
                Module /usr/lib64/libblockdev.so.3.0.0 from rpm libblockdev-3.0.4-1.fc40.aarch64
                Module /usr/lib64/libgmp.so.10.4.1 from rpm gmp-6.2.1-5.fc39.aarch64
                Module /usr/lib64/libmpfr.so.6.2.1 from rpm mpfr-4.2.1-1.fc40.aarch64
                Module /usr/lib64/libbytesize.so.1.0.0 from rpm libbytesize-2.10-1.fc40.aarch64
                Module /usr/lib64/python3.12/lib-dynload/_ctypes.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64
                Module /usr/lib64/python3.12/lib-dynload/_opcode.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64
                Module /usr/lib64/python3.12/lib-dynload/array.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64
                Module /usr/lib64/python3.12/lib-dynload/_socket.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64
                Module /usr/lib64/libblkid.so.1.1.0 from rpm util-linux-2.39.2-1.fc40.aarch64
                Module /usr/lib64/libselinux.so.1 from rpm libselinux-3.5-5.fc39.aarch64
                Module /usr/lib64/libmount.so.1.1.0 from rpm util-linux-2.39.2-1.fc40.aarch64
                Module /usr/lib64/libgio-2.0.so.0.7800.1 from rpm glib2-2.78.1-1.fc40.aarch64
                Module /usr/lib64/libgmodule-2.0.so.0.7800.1 from rpm glib2-2.78.1-1.fc40.aarch64
                Module /usr/lib64/libpcre2-8.so.0.11.2 from rpm pcre2-10.42-2.fc40.aarch64
                Module /usr/lib64/libffi.so.8.1.2 from rpm libffi-3.4.4-4.fc39.aarch64
                Module /usr/lib64/libgobject-2.0.so.0.7800.1 from rpm glib2-2.78.1-1.fc40.aarch64
                Module /usr/lib64/libgirepository-1.0.so.1.0.0 from rpm gobject-introspection-1.78.1-1.fc40.aarch64
                Module /usr/lib64/libglib-2.0.so.0.7800.1 from rpm glib2-2.78.1-1.fc40.aarch64
                Module /usr/lib64/python3.12/site-packages/gi/_gi.cpython-312-aarch64-linux-gnu.so from rpm pygobject3-3.46.0-1.fc40.aarch64
                Module /usr/lib64/python3.12/lib-dynload/_struct.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64
                Module /usr/lib64/libmpdec.so.2.5.1 from rpm mpdecimal-2.5.1-7.fc39.aarch64
                Module /usr/lib64/python3.12/lib-dynload/_decimal.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64
                Module /usr/lib64/libuuid.so.1.3.0 from rpm util-linux-2.39.2-1.fc40.aarch64
                Module /usr/lib64/python3.12/lib-dynload/_uuid.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64
                Module /usr/lib64/libb2.so.1.0.4 from rpm libb2-0.98.1-9.fc39.aarch64
                Module /usr/lib64/python3.12/lib-dynload/_blake2.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64
                Module /usr/lib64/libcrypto.so.3.1.4 from rpm openssl-3.1.4-1.fc40.aarch64
                Module /usr/lib64/python3.12/lib-dynload/_hashlib.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64
                Module /usr/lib64/python3.12/lib-dynload/_random.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64
                Module /usr/lib64/python3.12/lib-dynload/_bisect.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64
                Module /usr/lib64/liblzma.so.5.4.5 from rpm xz-5.4.5-1.fc40.aarch64
                Module /usr/lib64/python3.12/lib-dynload/_lzma.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64
                Module /usr/lib64/libbz2.so.1.0.8 from rpm bzip2-1.0.8-16.fc39.aarch64
                Module /usr/lib64/python3.12/lib-dynload/_bz2.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64
                Module /usr/lib64/libz.so.1.2.13 from rpm zlib-1.2.13-5.fc40.aarch64
                Module /usr/lib64/python3.12/lib-dynload/zlib.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64
                Module /usr/lib64/python3.12/lib-dynload/math.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64
                Module /usr/lib64/python3.12/lib-dynload/select.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64
                Module /usr/lib64/python3.12/lib-dynload/_posixsubprocess.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64
                Module /usr/lib64/python3.12/lib-dynload/fcntl.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64
                Module /usr/lib64/python3.12/lib-dynload/syslog.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64
                Module /usr/lib64/libpython3.12.so.1.0 from rpm python3.12-3.12.0-2.fc40.aarch64
                Stack trace of thread 53:
                #0  0x0000ffffa2df1c80 n/a (/usr/lib64/libc.so.6 + 0x91c80)
                #1  0x0000ffffa2df1c6c n/a (/usr/lib64/libc.so.6 + 0x91c6c)
                #2  0x0000ffffa2da5800 n/a (/usr/lib64/libc.so.6 + 0x45800)
                #3  0x0000ffffa35de7f0 n/a (ld-linux-aarch64.so.1 + 0x4d7f0)
                #4  0x0000ffff93372cd8 n/a (/usr/lib64/libbd_utils.so.3.0.0 + 0x12cd8)
                #5  0x0000ffff930b4438 n/a (/usr/lib64/libbd_lvm-dbus.so.3.0.0 + 0x14438)
                #6  0x0000ffff933c1680 n/a (/usr/lib64/libblockdev.so.3.0.0 + 0x11680)
                #7  0x0000ffff933ccba0 n/a (/usr/lib64/libblockdev.so.3.0.0 + 0x1cba0)
                #8  0x0000ffff933cf958 n/a (/usr/lib64/libblockdev.so.3.0.0 + 0x1f958)
                #9  0x0000ffff933d6ee8 n/a (/usr/lib64/libblockdev.so.3.0.0 + 0x26ee8)
                #10 0x0000ffff93ecc050 n/a (/usr/lib64/libffi.so.8.1.2 + 0x1c050)
                #11 0x0000ffff93ec63ac n/a (/usr/lib64/libffi.so.8.1.2 + 0x163ac)
                #12 0x0000ffff941c1344 n/a (/usr/lib64/python3.12/site-packages/gi/_gi.cpython-312-aarch64-linux-gnu.so + 0x31344)
                #13 0x0000ffff941bfe4c n/a (/usr/lib64/python3.12/site-packages/gi/_gi.cpython-312-aarch64-linux-gnu.so + 0x2fe4c)
                #14 0x0000ffffa30aa008 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x17a008)
                #15 0x0000ffffa30b6518 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x186518)
                #16 0x0000ffffa316bf48 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x23bf48)
                #17 0x0000ffffa3193c04 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x263c04)
                #18 0x0000ffffa30d0bd0 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a0bd0)
                #19 0x0000ffffa30b7868 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x187868)
                #20 0x0000ffffa30d9ed0 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a9ed0)
                #21 0x0000ffffa3105b98 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d5b98)
                #22 0x0000ffffa310516c n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d516c)
                #23 0x0000ffffa3113fc4 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1e3fc4)
                #24 0x0000ffffa30d0bd0 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a0bd0)
                #25 0x0000ffffa30b7868 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x187868)
                #26 0x0000ffffa30d9ed0 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a9ed0)
                #27 0x0000ffffa3105b98 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d5b98)
                #28 0x0000ffffa310516c n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d516c)
                #29 0x0000ffffa30ba964 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x18a964)
                #30 0x0000ffffa316bf48 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x23bf48)
                #31 0x0000ffffa3193c04 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x263c04)
                #32 0x0000ffffa30d0bd0 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a0bd0)
                #33 0x0000ffffa30b7868 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x187868)
                #34 0x0000ffffa30d9ed0 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a9ed0)
                #35 0x0000ffffa3105b98 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d5b98)
                #36 0x0000ffffa310516c n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d516c)
                #37 0x0000ffffa30ba964 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x18a964)
                #38 0x0000ffffa316bf48 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x23bf48)
                #39 0x0000ffffa3193c04 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x263c04)
                #40 0x0000ffffa30d0bd0 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a0bd0)
                #41 0x0000ffffa30b7868 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x187868)
                #42 0x0000ffffa30d9ed0 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a9ed0)
                #43 0x0000ffffa3105b98 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d5b98)
                #44 0x0000ffffa310516c n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d516c)
                #45 0x0000ffffa30ba964 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x18a964)
                #46 0x0000ffffa316bf48 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x23bf48)
                #47 0x0000ffffa3193c04 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x263c04)
                #48 0x0000ffffa30d0bd0 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a0bd0)
                #49 0x0000ffffa30d0678 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a0678)
                #50 0x0000ffffa30b6518 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x186518)
                #51 0x0000ffffa31acc40 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x27cc40)
                #52 0x0000ffffa31abefc n/a (/usr/lib64/libpython3.12.so.1.0 + 0x27befc)
                #53 0x0000ffffa3151c34 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x221c34)
                #54 0x0000ffffa2d909dc n/a (/usr/lib64/libc.so.6 + 0x309dc)
                #55 0x0000ffffa2d90ab0 n/a (/usr/lib64/libc.so.6 + 0x30ab0)
                #56 0x0000aaaad11f00f0 n/a (/usr/bin/python3.12 + 0x100f0)
                ELF object binary architecture: AARCH64

The file is at https://people.fedoraproject.org/~kevin/core.python3.0.45eb1ecb91fe45fc9c7bda79e918763c.3558463.1699989136000000.zst

Let me know if I can add anything more.

Comment 12 Dan Horák 2023-11-14 19:39:31 UTC
Kevin, could you get an output of "coredumpctl gdb"?

Comment 13 Kevin Fenzi 2023-11-14 19:56:14 UTC
ok. Here's what I got...

Also, FYI, we are doing an update/reboot cycle tomorrow, so we can see if this even persists after updates...

Core was generated by `python3 -m pyanaconda.modules.payloads'.                               
Program terminated with signal SIGILL, Illegal instruction.                            
#0  __pthread_kill_implementation (threadid=281473422540832, signo=signo@entry=4,             
    no_tid=no_tid@entry=0) at pthread_kill.c:44
Downloading source file /usr/src/debug/glibc-2.38.9000-19.fc40.aarch64/nptl/pthread_kill.c
44            return INTERNAL_SYSCALL_ERROR_P (ret) ? INTERNAL_SYSCALL_ERRNO (ret) : 0;       
[Current thread is 1 (LWP 53)]
Missing separate debuginfos, use: dnf debuginfo-install bzip2-libs-1.0.8-13.fc38.aarch64 device-mapper-libs-1.02.189-2.fc38.aarch64 glib2-2.76.5-2.fc38.aarch64 glibc-2.37-10.fc38.aarch64 gmp-6.2.1-4.fc38.aarch64 gobject-introspection-1.76.1-1.fc38.aarch64 kmod-libs-30-4.fc38.aarch64 libb2-0.98.1-8.fc38.aarch64 libblkid-2.38.1-4.fc38.aarch64 libbytesize-2.9-1.fc38.aarch64 libcap-2.48-6.fc38.aarch64 libffi-3.4.4-2.fc38.aarch64 libgcc-13.2.1-1.fc38.aarch64 libgomp-13.2.1-1.fc38.aarch64 libmount-2.38.1-4.fc38.aarch64 libselinux-3.5-1.fc38.aarch64 libuuid-2.38.1-4.fc38.aarch64 libzstd-1.5.5-1.fc38.aarch64 mpdecimal-2.5.1-6.fc38.aarch64 mpfr-4.1.1-3.fc38.aarch64 openssl-libs-3.0.9-2.fc38.aarch64 pcre2-10.42-1.fc38.1.aarch64 systemd-libs-253.10-1.fc38.aarch64 xz-libs-5.4.1-1.fc38.aarch64 zlib-1.2.13-3.fc38.aarch64

(gdb) thread apply all bt full
                       
Thread 3 (LWP 69):                                                                            
#0  0x0000ffffa2e4d6e4 in __GI___poll (fds=0xaaaaf7fdf790, nfds=1, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:41
        _x3tmp = 0                                                                            
        _x0tmp = 187651281778576
        _x0 = 187651281778576
        _x3 = 0                                
        _x4tmp = 0 


        _x1tmp = 1                                                                            
        _x1 = 1                                
        _x4 = 0                                
        _x2tmp = <optimized out>                                                              
        _x2 = 0                                                                               
        _x8 = 73
        _sys_result = <optimized out>      
        sc_ret = <optimized out>
        sc_cancel_oldtype = 0
        sc_ret = <optimized out> 
        timeout_ts = {tv_sec = 187651281778064, tv_nsec = 187651281778064}
        timeout_ts_p = <optimized out> 
#1  0x0000ffff940be5b0 in g_main_context_poll_unlocked (priority=2147483647, n_fds=1, fds=0xaaaaf7fdf790, timeout=<optimized out>, context=0xaaaaf7fdf590) at ../glib/gmain.c:4653
        ret = <optimized out>                                                                 
        errsv = <optimized out>
        poll_func = 0xffff9406e2c0 <g_poll>
        poll_func = <optimized out>
        ret = <optimized out>                                                                 
        errsv = <optimized out>                                                               
#2  g_main_context_iterate_unlocked.isra.0 (context=context@entry=0xaaaaf7fdf590, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at ../glib/gmain.c:4344
        max_priority = 2147483647
        timeout = -1                           
        some_ready = <optimized out>                                                          
        nfds = 1
        allocated_nfds = 1
        fds = 0xaaaaf7fdf790
        begin_time_nsec = <optimized out>                                                     
#3  0x0000ffff9405db18 in g_main_context_iteration (context=0xaaaaf7fdf590, may_block=may_block@entry=1) at ../glib/gmain.c:4414
        retval = <optimized out>
#4  0x0000ffff9405db74 in glib_worker_main (data=<optimized out>) at ../glib/gmain.c:6574
No locals.                                     
#5  0x0000ffff940944c4 in g_thread_proxy (data=0xaaaaf7fdf730) at ../glib/gthread.c:831
        thread = 0xaaaaf7fdf730                                                               
        __func__ = "g_thread_proxy"
#6  0x0000ffffa2deff84 in start_thread (arg=0xffffa35d4760) at pthread_create.c:444
        ret = <optimized out>                                                                 
        pd = 0xffffa35d4760  
        out = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {281473139601792, 281473139601792, 48, 281473422542688, 281473414266048, 0, 281473131151360, 281473415512064, 281473422540832, 281473131151360, 281473139599888, 11276355813591266238, 0, 11276355813333378930, 0, 0, 0, 0, 0, 0, 0, 0}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = 0                                                                    
#7  0x0000ffffa2e58dcc in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone3.S:76
No locals.

Thread 2 (LWP 66):
--Type <RET> for more, q to quit, c to continue without paging--c
#0  syscall () at ../sysdeps/unix/sysv/linux/aarch64/syscall.S:38
No locals.
#1  0x0000ffff940badcc in g_cond_wait (cond=<optimized out>, mutex=0xaaaaf7fd54d0) at ../glib/gthread-posix.c:1552
        saved_errno = 0
        res = <optimized out>
        sampled = 0
#2  0x0000ffff940293e8 in g_async_queue_pop_intern_unlocked (queue=0xaaaaf7fd54d0, wait=1, end_time=-1) at ../glib/gasyncqueue.c:425
        retval = <optimized out>
        __func__ = "g_async_queue_pop_intern_unlocked"
#3  0x0000ffff9409669c in g_thread_pool_spawn_thread (data=<optimized out>) at ../glib/gthreadpool.c:311
        spawn_thread_data = <optimized out>
        thread = 0x0
        error = 0x0
        prgname = <optimized out>
        name = "pool", '\000' <repeats 11 times>
#4  0x0000ffff940944c4 in g_thread_proxy (data=0xaaaaf7e34c20) at ../glib/gthread.c:831
        thread = 0xaaaaf7e34c20
        __func__ = "g_thread_proxy"
#5  0x0000ffffa2deff84 in start_thread (arg=0xffffa35d4760) at pthread_create.c:444
        ret = <optimized out>
        pd = 0xffffa35d4760
        out = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {281473148055936, 281473148055936, 48, 281473422542688, 281473414266048, 0, 281473139605504, 281473415512064, 281473422540832, 281473139605504, 281473148054032, 11276355813591266238, 0, 11276355813353629554, 0, 0, 0, 0, 0, 0, 0, 0}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = 0
#6  0x0000ffffa2e58dcc in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone3.S:76
No locals.

Thread 1 (LWP 53):
#0  __pthread_kill_implementation (threadid=281473422540832, signo=signo@entry=4, no_tid=no_tid@entry=0) at pthread_kill.c:44
        tid = 53
        ret = 0
        pd = 0xffffa35d4020
        old_mask = {__val = {0}}
        ret = <optimized out>
#1  0x0000ffffa2df1cf8 in __pthread_kill_internal (signo=4, threadid=<optimized out>) at pthread_kill.c:78
No locals.
#2  0x0000ffffa2da5800 in __GI_raise (sig=4) at ../sysdeps/posix/raise.c:26
        ret = <optimized out>
#3  <signal handler called>
No symbol table info available.
#4  0x0000ffffa359cfe0 in ?? ()
No symbol table info available.
#5  0x0000aaaaf7e28e80 in ?? ()
No symbol table info available.
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Comment 14 Victor Stinner 2023-11-15 00:03:20 UTC
> Missing separate debuginfos, use: dnf debuginfo-install (...)

Oh, the debuginfod server was not enabled? https://fedoraproject.org/wiki/Debuginfod

You may install these debug symbols to get a more useful backtrace.

Sadly, the most interesting thread has an unusable backtrace:

Thread 1 (LWP 53):
...
#3  <signal handler called>
No symbol table info available.
#4  0x0000ffffa359cfe0 in ?? ()
No symbol table info available.
...

Comment 15 Victor Stinner 2023-11-15 00:06:23 UTC
Apparently, the crash is related to libblockdev, libbd_lvm-dbus and libbd_utils:

                Stack trace of thread 53:
                #0  0x0000ffffa2df1c80 n/a (/usr/lib64/libc.so.6 + 0x91c80)
                #1  0x0000ffffa2df1c6c n/a (/usr/lib64/libc.so.6 + 0x91c6c)
                #2  0x0000ffffa2da5800 n/a (/usr/lib64/libc.so.6 + 0x45800)
                #3  0x0000ffffa35de7f0 n/a (ld-linux-aarch64.so.1 + 0x4d7f0)
                #4  0x0000ffff93372cd8 n/a (/usr/lib64/libbd_utils.so.3.0.0 + 0x12cd8)
                #5  0x0000ffff930b4438 n/a (/usr/lib64/libbd_lvm-dbus.so.3.0.0 + 0x14438)
                #6  0x0000ffff933c1680 n/a (/usr/lib64/libblockdev.so.3.0.0 + 0x11680)
                #7  0x0000ffff933ccba0 n/a (/usr/lib64/libblockdev.so.3.0.0 + 0x1cba0)
                #8  0x0000ffff933cf958 n/a (/usr/lib64/libblockdev.so.3.0.0 + 0x1f958)
                #9  0x0000ffff933d6ee8 n/a (/usr/lib64/libblockdev.so.3.0.0 + 0x26ee8)
                #10 0x0000ffff93ecc050 n/a (/usr/lib64/libffi.so.8.1.2 + 0x1c050)
                #11 0x0000ffff93ec63ac n/a (/usr/lib64/libffi.so.8.1.2 + 0x163ac)
                #12 0x0000ffff941c1344 n/a (/usr/lib64/python3.12/site-packages/gi/_gi.cpython-312-aarch64-linux-gnu.so + 0x31344)
                #13 0x0000ffff941bfe4c n/a (/usr/lib64/python3.12/site-packages/gi/_gi.cpython-312-aarch64-linux-gnu.so + 0x2fe4c)

Comment 16 Victor Stinner 2023-11-15 00:10:02 UTC
From what I see, the problem more likely comes from libblockdev than from Python. I'm reassigning the issue to libblockdev.

Comment 17 Kevin Fenzi 2023-11-15 00:14:59 UTC
> debuginfod server was not enabled?

It was. This builder hasn't had updates applied in a while (because we were in the freeze before the f39 release), so likely all those things have been updated and the debuginfo for the older packages it has is no longer around? In any case, it did download a bunch of things, just not everything. ;(

Comment 18 Dan Horák 2023-11-15 09:08:18 UTC
If someone needs access to an aarch64 system with rawhide, let us know.

Below is a gdb session showing the disassembly of the function containing 0x0000ffff93372cd8, where the crash likely happens. The missing backtrace points to some memory/stack corruption (e.g. data treated as an instruction)...


[root@hpe-apollo-cn99xx-15-vm-30 ~]# gdb --quiet /usr/bin/python3.12 core.python3.0.45eb1ecb91fe45fc9c7bda79e918763c.3558463.1699989136000000
Reading symbols from /usr/bin/python3.12...

This GDB supports auto-downloading debuginfo from the following URLs:
  <https://debuginfod.fedoraproject.org/>
Enable debuginfod for this session? (y or [n]) y
Debuginfod has been enabled.
To make this setting permanent, add 'set debuginfod enabled on' to .gdbinit.
Reading symbols from /root/.cache/debuginfod_client/b9bd80861442b62dc97210c2bd0334084f6c734c/debuginfo...                                                                                     
                                                                                                                                                                                              
warning: Can't open file /usr/lib/locale/locale-archive during file-backed mapping note processing
[New LWP 53]
[New LWP 66]
[New LWP 69]
                                                                                                                                                                                              
warning: Could not load shared library symbols for /usr/lib64/python3.12/site-packages/selinux/_selinux.cpython-312-aarch64-linux-gnu.so.
Do you need "set solib-search-path" or "set sysroot"?
                                                                                                                                                                                              [Thread debugging using libthread_db enabled]                                                                                                                                                 
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `python3 -m pyanaconda.modules.payloads'.
Program terminated with signal SIGILL, Illegal instruction.
#0  __pthread_kill_implementation (threadid=281473422540832, signo=signo@entry=4, no_tid=no_tid@entry=0) at pthread_kill.c:44
44            return INTERNAL_SYSCALL_ERROR_P (ret) ? INTERNAL_SYSCALL_ERRNO (ret) : 0;                                                                                                       
[Current thread is 1 (Thread 0xffffa35d4020 (LWP 53))]
(gdb) disass 0x0000ffff93372cd8
Dump of assembler code for function bd_utils_log_format:
   0x0000ffff93372be0 <+0>:	paciasp
   0x0000ffff93372be4 <+4>:	sub	sp, sp, #0x120
   0x0000ffff93372be8 <+8>:	adrp	x8, 0xffff9339f000
   0x0000ffff93372bec <+12>:	ldr	x8, [x8, #4056]
   0x0000ffff93372bf0 <+16>:	stp	x29, x30, [sp, #80]
   0x0000ffff93372bf4 <+20>:	add	x29, sp, #0x50
   0x0000ffff93372bf8 <+24>:	stp	x19, x20, [sp, #96]
   0x0000ffff93372bfc <+28>:	adrp	x20, 0xffff933a0000 <log_level>
   0x0000ffff93372c00 <+32>:	mov	w19, w0
   0x0000ffff93372c04 <+36>:	str	q0, [sp, #112]
   0x0000ffff93372c08 <+40>:	str	q1, [sp, #128]
   0x0000ffff93372c0c <+44>:	str	q2, [sp, #144]
   0x0000ffff93372c10 <+48>:	str	q3, [sp, #160]
   0x0000ffff93372c14 <+52>:	str	q4, [sp, #176]
   0x0000ffff93372c18 <+56>:	str	q5, [sp, #192]
   0x0000ffff93372c1c <+60>:	str	q6, [sp, #208]
   0x0000ffff93372c20 <+64>:	str	q7, [sp, #224]
   0x0000ffff93372c24 <+68>:	stp	x2, x3, [sp, #240]
   0x0000ffff93372c28 <+72>:	stp	x4, x5, [sp, #256]
   0x0000ffff93372c2c <+76>:	stp	x6, x7, [sp, #272]
   0x0000ffff93372c30 <+80>:	ldr	x0, [x20, #8]
   0x0000ffff93372c34 <+84>:	ldr	x2, [x8]
   0x0000ffff93372c38 <+88>:	str	x2, [sp, #72]
   0x0000ffff93372c3c <+92>:	mov	x2, #0x0                   	// #0
   0x0000ffff93372c40 <+96>:	str	xzr, [sp, #32]
   0x0000ffff93372c44 <+100>:	cbz	x0, 0xffff93372ce0 <bd_utils_log_format+256>
   0x0000ffff93372c48 <+104>:	adrp	x2, 0xffff933a0000 <log_level>
   0x0000ffff93372c4c <+108>:	mov	x0, #0x0                   	// #0
   0x0000ffff93372c50 <+112>:	ldr	w2, [x2]
   0x0000ffff93372c54 <+116>:	cmp	w2, w19
   0x0000ffff93372c58 <+120>:	b.ge	0xffff93372c90 <bd_utils_log_format+176>  // b.tcont
   0x0000ffff93372c5c <+124>:	bl	0xffff93370060 <g_free@plt>
   0x0000ffff93372c60 <+128>:	adrp	x0, 0xffff9339f000
   0x0000ffff93372c64 <+132>:	ldr	x0, [x0, #4056]
   0x0000ffff93372c68 <+136>:	ldr	x2, [sp, #72]
   0x0000ffff93372c6c <+140>:	ldr	x1, [x0]
   0x0000ffff93372c70 <+144>:	subs	x2, x2, x1
   0x0000ffff93372c74 <+148>:	mov	x1, #0x0                   	// #0
   0x0000ffff93372c78 <+152>:	b.ne	0xffff93372cf4 <bd_utils_log_format+276>  // b.any
   0x0000ffff93372c7c <+156>:	ldp	x29, x30, [sp, #80]
   0x0000ffff93372c80 <+160>:	ldp	x19, x20, [sp, #96]
   0x0000ffff93372c84 <+164>:	add	sp, sp, #0x120
   0x0000ffff93372c88 <+168>:	autiasp
   0x0000ffff93372c8c <+172>:	ret
   0x0000ffff93372c90 <+176>:	mov	w3, #0xffffff80            	// #-128
   0x0000ffff93372c94 <+180>:	mov	w0, #0xffffffd0            	// #-48
   0x0000ffff93372c98 <+184>:	stp	w0, w3, [sp, #64]
   0x0000ffff93372c9c <+188>:	add	x3, sp, #0x28
   0x0000ffff93372ca0 <+192>:	add	x2, sp, #0x120
   0x0000ffff93372ca4 <+196>:	add	x4, sp, #0xf0
   0x0000ffff93372ca8 <+200>:	stp	x2, x2, [sp, #40]
   0x0000ffff93372cac <+204>:	mov	x2, sp
   0x0000ffff93372cb0 <+208>:	add	x0, sp, #0x20
   0x0000ffff93372cb4 <+212>:	str	x4, [sp, #56]
   0x0000ffff93372cb8 <+216>:	ldp	q0, q1, [x3]
   0x0000ffff93372cbc <+220>:	stp	q0, q1, [x2]
   0x0000ffff93372cc0 <+224>:	bl	0xffff93370760 <g_vasprintf@plt>
   0x0000ffff93372cc4 <+228>:	tbnz	w0, #31, 0xffff93372ce8 <bd_utils_log_format+264>
   0x0000ffff93372cc8 <+232>:	ldr	x1, [sp, #32]
   0x0000ffff93372ccc <+236>:	mov	w0, w19
   0x0000ffff93372cd0 <+240>:	ldr	x2, [x20, #8]
   0x0000ffff93372cd4 <+244>:	blr	x2
   0x0000ffff93372cd8 <+248>:	ldr	x0, [sp, #32]
   0x0000ffff93372cdc <+252>:	b	0xffff93372c5c <bd_utils_log_format+124>
   0x0000ffff93372ce0 <+256>:	mov	x0, #0x0                   	// #0
   0x0000ffff93372ce4 <+260>:	b	0xffff93372c5c <bd_utils_log_format+124>
   0x0000ffff93372ce8 <+264>:	ldr	x0, [sp, #32]
   0x0000ffff93372cec <+268>:	bl	0xffff93370060 <g_free@plt>
   0x0000ffff93372cf0 <+272>:	b	0xffff93372c60 <bd_utils_log_format+128>
   0x0000ffff93372cf4 <+276>:	bl	0xffff93370470 <__stack_chk_fail@plt>
End of assembler dump.

Comment 19 Zbigniew Jędrzejewski-Szmek 2023-11-15 11:59:36 UTC
Is bd_utils_log_format called with an invalid format string and/or args?
It is a wrapper around vasprintf, so with a bad format string it could scribble over memory.
Does anyone know how to figure out where bd_utils_log_format was called?
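(For context, a minimal sketch, not libblockdev's actual code, of a vasprintf-based variadic log wrapper in the spirit of bd_utils_log_format; a format/argument mismatch in such a call is undefined behaviour and can read or scribble over unrelated memory, which is the concern raised above. Assumes GNU vasprintf.)

#define _GNU_SOURCE     /* for vasprintf */
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical logger, mirroring the wrapper pattern described above. */
static void log_format (int level, const char *format, ...)
{
    char *msg = NULL;
    va_list args;

    va_start (args, format);
    if (vasprintf (&msg, format, args) >= 0) {
        fprintf (stderr, "[%d] %s\n", level, msg);
        free (msg);
    }
    va_end (args);
}

int main (void)
{
    log_format (1, "Failed to open %s", "/dev/sda");  /* format and argument match: fine */
    /* log_format (1, "Failed to open %s", 42); */    /* mismatch: undefined behaviour,
                                                         vasprintf dereferences 42 as a pointer */
    return 0;
}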

Comment 20 Adam Williamson 2023-11-17 00:05:10 UTC
There are only 33 places in libblockdev that call bd_utils_log_format with a format specifier (always either %s or %d). Well, on one line, anyway. So...that might not be too many to just go through them and look for suspicious ones...

[adamw@xps13a libblockdev (master)]$ grep -R bd_utils_log_format | grep %
scripts/boilerplate_generator.py:    ret += '        bd_utils_log_format (BD_UTILS_LOG_WARNING, "failed to load module {0}: %s", dlerror());\n'.format(module_name)
scripts/boilerplate_generator.py:    ret += '        bd_utils_log_format (BD_UTILS_LOG_DEBUG, "failed to load the init() function for {0}: %s", error);\n'.format(module_name)
scripts/boilerplate_generator.py:        ret += '        bd_utils_log_format (BD_UTILS_LOG_WARNING, "failed to load {0.name}: %s", error);\n\n'.format(info)
scripts/boilerplate_generator.py:    ret += '        bd_utils_log_format (BD_UTILS_LOG_DEBUG, "failed to load the close_plugin() function for {0}: %s", error);\n'.format(module_name)
src/lib/blockdev.c.in:            bd_utils_log_format (BD_UTILS_LOG_WARNING, "Cannot process the config file '%s': %s. Skipping.",
src/lib/blockdev.c.in:            bd_utils_log_format (BD_UTILS_LOG_WARNING, "Failed to load config files: %s. Using the built-in config",
src/lib/blockdev.c.in:        bd_utils_log_format (BD_UTILS_LOG_INFO, "Failed to load config files: %s. Using the built-in config",
src/plugins/btrfs.c:            bd_utils_log_format (BD_UTILS_LOG_WARNING, "%s", error->msg);
src/plugins/btrfs.c:            bd_utils_log_format (BD_UTILS_LOG_WARNING, "%s", error->msg);
src/plugins/btrfs.c:            bd_utils_log_format (BD_UTILS_LOG_WARNING, "%s", error->msg);
src/plugins/crypto.c:            bd_utils_log_format (BD_UTILS_LOG_WARNING, "Unknown cryptsetup log level %d.", level);
src/plugins/crypto.c:                bd_utils_log_format (BD_UTILS_LOG_ERR, "Failed to deactivate temporary device %s", tmp_name);
src/plugins/crypto.c:            bd_utils_log_format (BD_UTILS_LOG_ERR, "Failed to deactivate temporary device %s", tmp_name);
src/plugins/lvm-dbus.c:        bd_utils_log_format (BD_UTILS_LOG_CRIT, "Failed to get system bus address: %s\n", (*error)->message);
src/plugins/lvm-dbus.c:        bd_utils_log_format (BD_UTILS_LOG_CRIT, "Failed to create a new connection for the system bus: %s\n", (*error)->message);
src/plugins/lvm-dbus.c:        bd_utils_log_format (BD_UTILS_LOG_CRIT, "Failed to setup DBus connection: %s", error->message);
src/plugins/lvm-dbus.c:        bd_utils_log_format (BD_UTILS_LOG_CRIT, "Failed to flush DBus connection: %s", error->message);
src/plugins/lvm-dbus.c:        bd_utils_log_format (BD_UTILS_LOG_CRIT, "Failed to close DBus connection: %s", error->message);
src/plugins/lvm-dbus.c:                bd_utils_log_format (BD_UTILS_LOG_DEBUG, "Got error when getting progress: %s", l_error->message);
src/plugins/lvm-dbus.c:            bd_utils_log_format (BD_UTILS_LOG_DEBUG, "Failed to get VG properties for PV %s: %s",
src/plugins/lvm-dbus.c:        bd_utils_log_format (BD_UTILS_LOG_DEBUG, "Unknown VDO operating mode: %s", value);
src/plugins/lvm-dbus.c:        bd_utils_log_format (BD_UTILS_LOG_DEBUG, "Unknown VDO compression state: %s", value);
src/plugins/lvm-dbus.c:        bd_utils_log_format (BD_UTILS_LOG_DEBUG, "Unknown VDO index state: %s", value);
src/plugins/lvm-dbus.c:        bd_utils_log_format (BD_UTILS_LOG_DEBUG, "Unknown VDO write policy: %s", value);
src/plugins/lvm.c:        bd_utils_log_format (BD_UTILS_LOG_DEBUG, "Unknown VDO operating mode: %s", value);
src/plugins/lvm.c:        bd_utils_log_format (BD_UTILS_LOG_DEBUG, "Unknown VDO compression state: %s", value);
src/plugins/lvm.c:        bd_utils_log_format (BD_UTILS_LOG_DEBUG, "Unknown VDO index state: %s", value);
src/plugins/lvm.c:        bd_utils_log_format (BD_UTILS_LOG_DEBUG, "Unknown VDO write policy: %s", value);
src/plugins/nvme/nvme-info.c:                bd_utils_log_format (BD_UTILS_LOG_WARNING, "Unhandled self-test log entry result code: %d", dsts);
src/plugins/nvme/nvme-info.c:                bd_utils_log_format (BD_UTILS_LOG_WARNING, "Unhandled self-test log entry action code: %d", code);
src/plugins/part.c:            bd_utils_log_format (BD_UTILS_LOG_INFO, "Not resizing, partition '%s' is already at its maximum size.", part);
src/plugins/s390.c:        bd_utils_log_format (BD_UTILS_LOG_WARNING, "Device %s status is %s, needs dasdfmt.", dasd, status);
src/plugins/s390.c:        bd_utils_log_format (BD_UTILS_LOG_WARNING, "Device %s is already online", devno);

Comment 21 Adam Williamson 2023-11-18 02:06:42 UTC
another dumb idea: do a custom libblockdev build that just adds a silly "XXX I AM LINE 123 OF FILE.C ABOUT TO LOG SOMETHING OH BOY!" printf before each of those, then run some scratch koji tasks with that build of libblockdev, and hope that tells us which bd_utils_log_format call is the problematic one...

Comment 22 Tomáš Bžatek 2023-11-20 16:58:27 UTC
Hmm, isn't the compiler supposed to warn on invalid types supplied to printf-like funcs? Perhaps the 
>  void bd_utils_log_format (gint level, const gchar *format, ...) {
definition is too vague and should be transformed to something like
>  void bd_utils_log_format (gint level, const gchar *format) G_GNUC_FORMAT (2) {
(which expands to '__attribute__ ((format_arg (2)))') to get proper annotations. However that requires a string return value which is not the case here.

Anyway, looking at lvm-dbus.c, the worst case of a bad value being passed in is NULL, which for "%s" should be gracefully translated to "(null)".

Except... https://github.com/storaged-project/libblockdev/pull/990
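(A hedged sketch of the annotation idea above, using GLib's printf-style G_GNUC_PRINTF(format_idx, arg_idx) macro rather than G_GNUC_FORMAT; the function name here is illustrative, not libblockdev's actual declaration. With the attribute on the declaration, gcc/clang's -Wformat type-checks every call site.)

#include <glib.h>
#include <glib/gprintf.h>
#include <stdarg.h>

/* Annotated declaration: the compiler now checks format vs. argument types. */
static void my_log_format (gint level, const gchar *format, ...) G_GNUC_PRINTF (2, 3);

static void my_log_format (gint level, const gchar *format, ...)
{
    gchar *msg = NULL;
    va_list args;

    va_start (args, format);
    g_vasprintf (&msg, format, args);
    va_end (args);
    g_printerr ("[%d] %s\n", level, msg);
    g_free (msg);
}

int main (void)
{
    my_log_format (0, "level is %d", 42);        /* OK */
    /* my_log_format (0, "level is %s", 42); */  /* -Wformat: '%s' expects a string, not an int */
    return 0;
}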

Comment 23 Dan Horák 2023-11-20 17:03:26 UTC
(In reply to Tomáš Bžatek from comment #22)
> Hmm, isn't the compiler supposed to warn on invalid types supplied to
> printf-like funcs? Perhaps the 

it should and I haven't seen any such warnings

> >  void bd_utils_log_format (gint level, const gchar *format, ...) {
> definition is too vague and should be transformed to something like
> >  void bd_utils_log_format (gint level, const gchar *format) G_GNUC_FORMAT (2) {
> (which expands to '__attribute__ ((format_arg (2)))') to get proper
> annotations. However that requires a string return value which is not the
> case here.
> 
> Anyway, looking at lvm-dbus.c, the worst case of a bad value is passed in is
> NULL which for "%s" should be gracefully translated to "(null)".
> 
> Except... https://github.com/storaged-project/libblockdev/pull/990

Comment 24 Dennis Gilmore 2024-03-15 22:22:29 UTC
This is currently happening on almost every build; it's been over 10 days since we had a compose where it did not happen.

Comment 25 Katerina Koukiou 2024-03-19 13:57:22 UTC
*** Bug 2265238 has been marked as a duplicate of this bug. ***

Comment 26 Adam Williamson 2024-03-21 23:15:47 UTC
We signed off Beta today, so there's no point in Beta FE status any more. Clearing that nomination.

Comment 27 Adam Williamson 2024-03-21 23:17:59 UTC
The fixes from comment #22 were in 3.1.0, which is stable for F39, F40 and Rawhide. So I guess those did not fix this.

Comment 28 Adam Williamson 2024-03-22 01:16:56 UTC
So it's a bit of a shot in the dark, but some research into when this seems to have started combined with the history of libblockdev made me want to try this: https://src.fedoraproject.org/rpms/libblockdev/c/ffea9acbba2af77ed6b7eaa500b6388addf05eea?branch=rawhide . I've sent that build to Rawhide. I figure we can give it a day or two and see if the images clear up, or not. If they do, we know it's one of those few that causes the problem.

Comment 29 Adam Williamson 2024-03-22 15:33:37 UTC
Unfortunately, that didn't work. We're stratergerizing new dumb ways to try and pin this down now. :D

Comment 30 Dan Horák 2024-03-22 15:45:26 UTC
Is there a way to reproduce the crash outside koji? Then we could experiment with instrumented libblockdev builds and similar ...

Comment 31 Adam Williamson 2024-03-22 16:11:49 UTC
That's one of the things I'm intending to look into. Though given the bug is a SIGILL, it's *possible* it's specific to the type of hardware these jobs run on. I'm going to fiddle around on a Jetson Nano I have here and see if I can reproduce it on that.

Even if we can't, we're thinking it might be possible to write a little test script that runs enough of anaconda on the koji builders to hit the bug, or maybe use scratch builds.

Comment 32 Adam Williamson 2024-03-22 23:34:23 UTC
So I spent quite a lot of time on this today, but with inconclusive results :/

It's not that hard to reproduce what pungi/koji do - ultimately, run livemedia-creator in a mock chroot. So I've been doing that all day, tried it on my local aarch64 box first, then an openQA worker host I control, then finally on one of the actual Koji builders that hits the bug.

I managed to reproduce the failure twice on the Koji builder, but that was twice out of like 50 attempts. I didn't reproduce it at all on the other two hosts. It seems like it happens much less often when I'm trying to reproduce it manually than it does when we run the real composes, for some reason. That low level of reproducibility makes it hard to test anything with confidence :|

Comment 33 Adam Williamson 2024-03-22 23:43:31 UTC
One thing I did notice: when this bug happens, it seems to happen right after the "anaconda 41.2-2.fc41 for Fedora-i3-Live Rawhide (pre-release) started." message is printed. The thing anaconda does right after that is... a bunch of fiddling with signal handling. The 'print_startup_note' here prints that message; then see what happens next...

    startup_utils.print_startup_note(options=opts)

    from pyanaconda.anaconda import Anaconda
    anaconda = Anaconda()

    # reset python's default SIGINT handler
    signal.signal(signal.SIGINT, signal.SIG_IGN)
    signal.signal(signal.SIGTERM, lambda num, frame: sys.exit(1))

    # synchronously-delivered signals such as SIGSEGV and SIGILL cannot be handled properly from
    # Python, so install signal handlers from the faulthandler stdlib module.
    import faulthandler
    faulthandler.enable()

not sure if it's really relevant, but it seems like an interesting coincidence...

Comment 34 Adam Williamson 2024-03-23 16:44:51 UTC
Just to save anyone duplicating work while we confirm this - my last dumb diagnosis patch may have hit paydirt. I added a pre-log message before every bd_utils_log_format in lvm-dbus.c, and comparing the counts of pre-log messages and actual log messages in the resulting jobs, we found a discrepancy:

[adamw@xps13a tmp]$ grep "FAILED TO CREATE" anaconda-dbus.log | wc -l
11
[adamw@xps13a tmp]$ grep "Failed to create" anaconda-dbus.log  | wc -l
9

in one case we got a very nice smoking gun in the livemedia-out.log:

2024-03-23 09:04:49,305 INFO pylorax: Running anaconda.
2024-03-23 09:04:49,813 INFO pylorax: 
2024-03-23 09:04:49,814 INFO pylorax: ** (process:1): CRITICAL **: 09:04:49.668: XXX ABOUT TO LOG 'FAILED TO CREATE A NEW CONNECTION'
2024-03-23 09:04:50,815 ERROR pylorax: Running anaconda failed: process '['unshare', '--pid', '--kill-child', '--mount', '--propagation', 'unchanged', 'anaconda', '--kickstart', '/chroot_tmpdir/koji-image-f41-build-115326908.ks', '--cmdline', '--dirinstall', '--remotelog', '127.0.0.1:39053']' was killed by signal 11

Note that there we see my 'pre-log' message but *not* the 'real' log message that should have come immediately after it, and then anaconda dies. So, we think the offending log line is likely this one:

bd_utils_log_format (BD_UTILS_LOG_CRIT, "Failed to create a new connection for the system bus: %s\n", (*error)->message);

I suspect the issue is that the 'error' there is not always populated. There are several cases of this pattern in the file: we initialize an error variable, pass it to some function or other, and expect that if that function call fails, the error will be populated when it gets back to us. So here we're doing pretty much this, condensed and simplified (in fact the error creation is in one function and the rest is in another which is passed 'error') - please excuse any mistakes in C pointer syntax while I was condensing this, I am not a licensed C coder :>):

GError *error = NULL;
bus = g_dbus_connection_new_for_address_sync (addr, G_DBUS_CONNECTION_FLAGS_AUTHENTICATION_CLIENT|G_DBUS_CONNECTION_FLAGS_MESSAGE_BUS_CONNECTION,NULL, NULL, &error);
if (!bus || g_dbus_connection_is_closed (bus)) {
    bd_utils_log_format (BD_UTILS_LOG_CRIT, "Failed to create a new connection for the system bus: %s\n", (*error)->message);
}

So what I suspect is happening is that it's not safe to assume 'error' will always have been populated by the g_dbus_connection_new_for_address_sync call (perhaps especially if we have a 'bus', but the connection to it is closed for some reason). The fix is probably something like this:

        if (error) {
            bd_utils_log_format (BD_UTILS_LOG_CRIT, "Failed to create a new connection for the system bus: %s\n", (*error)->message);
        }
        else {
            bd_utils_log (BD_UTILS_LOG_CRIT, "Failed to create a new connection for the system bus: unknown reason");
        }

but for now we're testing a build which takes the log message out entirely, to see if that 'fixes' it. That build is going through tests currently, once it's stable we'll run a new compose and see what happens.

Comment 35 Adam Williamson 2024-03-24 00:33:00 UTC
`bd_lvm_init`, the function that calls `setup_dbus_connection` where the above logic is, similarly tries to log using the same error variable, so that call also needs guarding. I've sent a build with that changed also and will see how the next nightly compose fares.

Comment 36 Elliott Sales de Andrade 2024-03-24 01:03:50 UTC
(In reply to Adam Williamson from comment #34)
> GError *error = NULL;
> bus = g_dbus_connection_new_for_address_sync (addr,
> G_DBUS_CONNECTION_FLAGS_AUTHENTICATION_CLIENT|G_DBUS_CONNECTION_FLAGS_MESSAGE
> _BUS_CONNECTION,NULL, NULL, &error);
> if (!bus || g_dbus_connection_is_closed (bus)) {
>     bd_utils_log_format (BD_UTILS_LOG_CRIT, "Failed to create a new
> connection for the system bus: %s\n", (*error)->message);
> }

This is definitely suspect; on the one hand, it's checking that `bus` is NULL, and on the other it's checking if it's closed. If `bus` is NULL, then the `g_dbus_connection_new_for_address_sync` call must have failed and set `error`, but if it's not NULL, then there's no reason for `error` to be set.

It might be easier to handle both this function and its caller(s?) by setting an error if necessary:

bus = g_dbus_connection_new_for_address_sync (addr, G_DBUS_CONNECTION_FLAGS_AUTHENTICATION_CLIENT|G_DBUS_CONNECTION_FLAGS_MESSAGE_BUS_CONNECTION,NULL, NULL, &error);
if (!bus || g_dbus_connection_is_closed (bus)) {
    if (error == NULL) {
        g_set_error_literal (&error, G_IO_ERROR, G_IO_ERROR_CLOSED, "connection was closed immediately");
    }
    bd_utils_log_format (BD_UTILS_LOG_CRIT, "Failed to create a new connection for the system bus: %s\n", (*error)->message);
    return;
}

Comment 37 Adam Williamson 2024-03-24 06:27:02 UTC
"This is definitely suspect; on the one hand, it's checking that `bus` is NULL, and on the other it's checking if it's closed. If `bus` is NULL, then the `g_dbus_connection_new_for_address_sync` call must have failed and set `error`, but if it's not NULL, then there's no reason for `error` to be set."

Yep, that's pretty much what I thought. For now I just want to check the theory; I didn't want to bother too much about writing an upstreamable patch immediately, so I've converted both log messages to `bd_utils_log` calls with no templating. We'll see if that helps in the next compose.

Comment 38 Adam Williamson 2024-03-24 15:33:27 UTC
Unfortunately that didn't fix it :( And it was such a good-looking theory, too. I am at a bit of a loss, now. I do wonder if we can get some *more* coredumps and compare them, maybe that would shed some light?

Comment 39 Adam Williamson 2024-03-24 16:13:45 UTC
I suppose one thing that would help with trying to figure out what's going on is if each anaconda dbus module could prefix all its log messages in anaconda-dbus.log with its name.

AIUI, we have 10 different modules starting up there, and one or more of them is crashing on this illegal instruction. But all the log messages from all of them are just thrown in that log together, which makes it difficult or impossible to distinguish them. If the messages were all prefixed we might be able to at least figure out how far each one that crashes gets before it crashes...

Comment 40 Dan Horák 2024-03-25 08:23:19 UTC
Thanks, Adam, for your debugging efforts. I can only hope I haven't made a mistake in the initial analysis.

Comment 41 Dan Horák 2024-03-25 17:03:32 UTC
So far I haven't been successful in reproducing the issue (on Ampere Mt.Snow machine). My steps to reproduce it are

- grab mock config from koji (regular rawhide might work too)
- init chroot with "mock -r ./f41-build.cfg init
- mock -r ./f41-build.cfg install bash coreutils glibc-all-langpacks lorax-lmc-novirt  selinux-policy-targeted shadow-utils util-linux
- go into the mock shell
- grab the kickstart and run livemedia-creator, e.g. "/sbin/livemedia-creator --ks ./koji-image-f41-build-115361236.ks --logfile /chroot_tmpdir/lmc-logs/livemedia-out.log --no-virt --resultdir /chroot_tmpdir/lmc --project Fedora-Workstation-Live --make-iso --volid Fedora-WS-Live-rawh-20240324.n.0 --iso-only --iso-name Fedora-Workstation-Live-aarch64-Rawhide-20240324.n.0.iso --releasever Rawhide --nomacboot" (taken from mock_output.log)
- repeat previous step

FWIW the last i3 Live images fail on some dependency error I think

Comment 42 Adam Williamson 2024-03-25 17:22:42 UTC
yeah, it was very difficult to reproduce for me too. I only managed it twice, trying to do it on the actual builder hardware (never reproduced it anywhere else), retrying over and over and over in a mock chroot.

There's clearly something about the environment of an actual compose which makes it happen way more often. Kevin and I suspect it might be multiple tasks, including multiple live image builds, running on the same host at once. That's a bit awkward to try and reproduce manually, of course.

It definitely happens very frequently in real composes, we rarely get a compose without at least one image affected by the bug.

Comment 43 Adam Williamson 2024-03-25 21:44:33 UTC
So, hum. Looking at the two failures on this in today's Rawhide compose - https://koji.fedoraproject.org/koji/taskinfo?taskID=115400251 and https://koji.fedoraproject.org/koji/taskinfo?taskID=115400246 - the anaconda-dbus.log for each logs *ten* instances each of "XXX Failed to create a new connection for the system bus" and "XXX Failed to setup DBus connection" (which are the two messages we should see when hitting the path I modified).

That's one short. When we don't see this problem, we log *eleven* instances each of those messages. I think that's one pair for the main process and one pair each for the ten modules that get started.

The interesting thing is, if I'm following it right, the codepath we actually follow through lvm-dbus.c is very short. `bd_lvm_init` is, as its comment says, "called automatically by the library's initialization functions." It calls `setup_dbus_connection`, which I believe in the weird environment we are running in, is always expected to fail. It should then log "XXX Failed to setup DBus connection" and return FALSE. That's the whole codepath.

That makes it look like the only other place it could really be failing is the other templated log call in setup_dbus_connection:

        bd_utils_log_format (BD_UTILS_LOG_CRIT, "Failed to get system bus address: %s\n", (*error)->message);

but...the fly in the ointment there is that, when I did the build with "pre-log" lines, we never hit the "pre-log" message for that log message, on any of the logs I checked. We always hit the "Failed to create a new connection for the system bus" one, not that earlier one.

I'm really baffled as to how or where this is still blowing up. It's pretty weird.

Comment 44 Adam Williamson 2024-03-25 22:05:46 UTC
So, I poked at a recent coredump with gdb and tried to follow what Danny did. The stack trace for this crash is:

                Stack trace of thread 38:
                #0  0x0000ffffb8f91fe0 n/a (ld-linux-aarch64.so.1 + 0xbfe0)
                #1  0x0000ffffa85e443c n/a (/usr/lib64/libbd_lvm-dbus.so.3.0.0 + 0x1443c)
                #2  0x0000ffffa8971648 n/a (/usr/lib64/libblockdev.so.3.0.0 + 0x11648)
                #3  0x0000ffffa89809ec n/a (/usr/lib64/libblockdev.so.3.0.0 + 0x209ec)
                #4  0x0000ffffa89884a0 n/a (/usr/lib64/libblockdev.so.3.0.0 + 0x284a0)
                #5  0x0000ffffa8985d70 n/a (/usr/lib64/libblockdev.so.3.0.0 + 0x25d70)
                #6  0x0000ffffa925c054 n/a (/usr/lib64/libffi.so.8.1.4 + 0x1c054)
                #7  0x0000ffffa9256404 n/a (/usr/lib64/libffi.so.8.1.4 + 0x16404)
                #8  0x0000ffffa95512d4 n/a (/usr/lib64/python3.12/site-packages/gi/_gi.cpython-312-aarch64-linux-gnu.so + 0x312d4)
                #9  0x0000ffffa95508b8 n/a (/usr/lib64/python3.12/site-packages/gi/_gi.cpython-312-aarch64-linux-gnu.so + 0x308b8)
                #10 0x0000ffffb897b1f8 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x17b1f8)
                #11 0x0000ffffb8987340 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x187340)
                #12 0x0000ffffb8a40f98 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x240f98)
                #13 0x0000ffffb8a67f60 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x267f60)
                #14 0x0000ffffb89a24ac n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a24ac)
                #15 0x0000ffffb898afd4 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x18afd4)
                #16 0x0000ffffb89abaf8 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1abaf8)
                #17 0x0000ffffb89d8298 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d8298)
                #18 0x0000ffffb89d7838 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d7838)
                #19 0x0000ffffb89e6be0 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1e6be0)
                #20 0x0000ffffb89a24ac n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a24ac)
                #21 0x0000ffffb898afd4 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x18afd4)
                #22 0x0000ffffb89abaf8 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1abaf8)
                #23 0x0000ffffb89d8298 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d8298)
                #24 0x0000ffffb89d7838 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d7838)
                #25 0x0000ffffb898c154 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x18c154)
                #26 0x0000ffffb8a40f98 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x240f98)
                #27 0x0000ffffb8a67f60 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x267f60)
                #28 0x0000ffffb89a24ac n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a24ac)
                #29 0x0000ffffb898afd4 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x18afd4)
                #30 0x0000ffffb89abaf8 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1abaf8)
                #31 0x0000ffffb89d8298 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d8298)
                #32 0x0000ffffb89d7838 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d7838)
                #33 0x0000ffffb898c154 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x18c154)
                #34 0x0000ffffb8a40f98 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x240f98)
                #35 0x0000ffffb8a67f60 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x267f60)
                #36 0x0000ffffb89a24ac n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a24ac)
                #37 0x0000ffffb898afd4 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x18afd4)
                #38 0x0000ffffb89abaf8 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1abaf8)
                #39 0x0000ffffb89d8298 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d8298)
                #40 0x0000ffffb89d7838 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d7838)
                #41 0x0000ffffb898c154 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x18c154)
                #42 0x0000ffffb8a40f98 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x240f98)
                #43 0x0000ffffb8a67f60 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x267f60)
                #44 0x0000ffffb89a24ac n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a24ac)
                #45 0x0000ffffb898afd4 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x18afd4)
                #46 0x0000ffffb89abaf8 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1abaf8)
                #47 0x0000ffffb89d8298 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d8298)
                #48 0x0000ffffb89d7838 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d7838)
                #49 0x0000ffffb898c154 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x18c154)
                #50 0x0000ffffb8a40f98 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x240f98)
                #51 0x0000ffffb8a67f60 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x267f60)
                #52 0x0000ffffb89a24ac n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a24ac)
                #53 0x0000ffffb898afd4 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x18afd4)
                #54 0x0000ffffb89abaf8 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1abaf8)
                #55 0x0000ffffb89d8298 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d8298)
                #56 0x0000ffffb89d7838 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d7838)
                #57 0x0000ffffb898c154 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x18c154)
                #58 0x0000ffffb8a40f98 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x240f98)
                #59 0x0000ffffb8a67f60 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x267f60)
                #60 0x0000ffffb89a24ac n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a24ac)
                #61 0x0000ffffb89a1f38 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a1f38)
                #62 0x0000ffffb8987340 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x187340)
                #63 0x0000ffffb8a816e4 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x2816e4)

So in gdb I did this (note gdb is having issues with the trace, but never mind that, I guess):

(gdb) bt
#0  0x0000ffffb8f91fe0 in ?? ()
#1  0x0000ffffa85e4360 in ?? () at /usr/src/debug/libblockdev-3.1.0-6.fc41.aarch64/src/plugins/lvm-dbus.c:252 from /lib64/libbd_lvm-dbus.so.3
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) disass 0x0000ffffa85e443c

and this is what I got:

Dump of assembler code for function bd_lvm_init:
   0x0000ffffa85e4360 <+0>:	paciasp
   0x0000ffffa85e4364 <+4>:	sub	sp, sp, #0x40
   0x0000ffffa85e4368 <+8>:	adrp	x0, 0xffffa860f000
   0x0000ffffa85e436c <+12>:	ldr	x0, [x0, #4056]
   0x0000ffffa85e4370 <+16>:	stp	x29, x30, [sp, #16]
   0x0000ffffa85e4374 <+20>:	add	x29, sp, #0x10
   0x0000ffffa85e4378 <+24>:	stp	x19, x20, [sp, #32]
   0x0000ffffa85e437c <+28>:	adrp	x19, 0xffffa8610000
   0x0000ffffa85e4380 <+32>:	ldr	x1, [x19, #64]
   0x0000ffffa85e4384 <+36>:	ldr	x2, [x0]
   0x0000ffffa85e4388 <+40>:	str	x2, [sp, #8]
   0x0000ffffa85e438c <+44>:	mov	x2, #0x0                   	// #0
   0x0000ffffa85e4390 <+48>:	str	xzr, [sp]
   0x0000ffffa85e4394 <+52>:	cbz	x1, 0xffffa85e43e0 <bd_lvm_init+128>
   0x0000ffffa85e4398 <+56>:	adrp	x0, 0xffffa85e0000 <_init>
   0x0000ffffa85e439c <+60>:	add	x0, x0, #0xac0
   0x0000ffffa85e43a0 <+64>:	bl	0xffffa85e0840 <dm_log_with_errno_init@plt>
   0x0000ffffa85e43a4 <+68>:	mov	w0, #0x6                   	// #6
   0x0000ffffa85e43a8 <+72>:	bl	0xffffa85e0060 <dm_log_init_verbose@plt>
   0x0000ffffa85e43ac <+76>:	mov	w0, #0x1                   	// #1
   0x0000ffffa85e43b0 <+80>:	adrp	x1, 0xffffa860f000
   0x0000ffffa85e43b4 <+84>:	ldr	x1, [x1, #4056]
   0x0000ffffa85e43b8 <+88>:	ldr	x3, [sp, #8]
   0x0000ffffa85e43bc <+92>:	ldr	x2, [x1]
   0x0000ffffa85e43c0 <+96>:	subs	x3, x3, x2
   0x0000ffffa85e43c4 <+100>:	mov	x2, #0x0                   	// #0
   0x0000ffffa85e43c8 <+104>:	b.ne	0xffffa85e4488 <bd_lvm_init+296>  // b.any
   0x0000ffffa85e43cc <+108>:	ldp	x29, x30, [sp, #16]
   0x0000ffffa85e43d0 <+112>:	ldp	x19, x20, [sp, #32]
   0x0000ffffa85e43d4 <+116>:	add	sp, sp, #0x40
   0x0000ffffa85e43d8 <+120>:	autiasp
   0x0000ffffa85e43dc <+124>:	ret
   0x0000ffffa85e43e0 <+128>:	mov	x2, sp
   0x0000ffffa85e43e4 <+132>:	mov	w0, #0x1                   	// #1
   0x0000ffffa85e43e8 <+136>:	str	x21, [sp, #48]
   0x0000ffffa85e43ec <+140>:	bl	0xffffa85e06b0 <g_dbus_address_get_for_bus_sync@plt>
   0x0000ffffa85e43f0 <+144>:	mov	x20, x0
   0x0000ffffa85e43f4 <+148>:	cbz	x0, 0xffffa85e446c <bd_lvm_init+268>
   0x0000ffffa85e43f8 <+152>:	mov	w1, #0x9                   	// #9
   0x0000ffffa85e43fc <+156>:	mov	x4, sp
   0x0000ffffa85e4400 <+160>:	mov	x3, #0x0                   	// #0
   0x0000ffffa85e4404 <+164>:	mov	x2, #0x0                   	// #0
   0x0000ffffa85e4408 <+168>:	bl	0xffffa85e0490 <g_dbus_connection_new_for_address_sync@plt>
   0x0000ffffa85e440c <+172>:	mov	x1, x0
   0x0000ffffa85e4410 <+176>:	mov	x0, x20
   0x0000ffffa85e4414 <+180>:	str	x1, [x19, #64]
   0x0000ffffa85e4418 <+184>:	bl	0xffffa85e0080 <g_free@plt>
   0x0000ffffa85e441c <+188>:	ldr	x0, [x19, #64]
   0x0000ffffa85e4420 <+192>:	cbz	x0, 0xffffa85e442c <bd_lvm_init+204>
   0x0000ffffa85e4424 <+196>:	bl	0xffffa85e0900 <g_dbus_connection_is_closed@plt>
   0x0000ffffa85e4428 <+200>:	cbz	w0, 0xffffa85e4458 <bd_lvm_init+248>
   0x0000ffffa85e442c <+204>:	adrp	x1, 0xffffa85f0000
   0x0000ffffa85e4430 <+208>:	mov	w0, #0x2                   	// #2
   0x0000ffffa85e4434 <+212>:	add	x1, x1, #0xa88
   0x0000ffffa85e4438 <+216>:	bl	0xffffa85e04a0 <bd_utils_log@plt>
   0x0000ffffa85e443c <+220>:	mov	w0, #0x2                   	// #2
   0x0000ffffa85e4440 <+224>:	adrp	x1, 0xffffa85f0000
   0x0000ffffa85e4444 <+228>:	add	x1, x1, #0xac8
   0x0000ffffa85e4448 <+232>:	bl	0xffffa85e04a0 <bd_utils_log@plt>
   0x0000ffffa85e444c <+236>:	ldr	x21, [sp, #48]
   0x0000ffffa85e4450 <+240>:	mov	w0, #0x0                   	// #0
   0x0000ffffa85e4454 <+244>:	b	0xffffa85e43b0 <bd_lvm_init+80>
   0x0000ffffa85e4458 <+248>:	ldr	x0, [x19, #64]
   0x0000ffffa85e445c <+252>:	mov	w1, #0x0                   	// #0
   0x0000ffffa85e4460 <+256>:	bl	0xffffa85e0990 <g_dbus_connection_set_exit_on_close@plt>
   0x0000ffffa85e4464 <+260>:	ldr	x21, [sp, #48]
   0x0000ffffa85e4468 <+264>:	b	0xffffa85e4398 <bd_lvm_init+56>
   0x0000ffffa85e446c <+268>:	ldr	x2, [sp]
   0x0000ffffa85e4470 <+272>:	adrp	x1, 0xffffa85f0000
   0x0000ffffa85e4474 <+276>:	mov	w0, #0x2                   	// #2
   0x0000ffffa85e4478 <+280>:	add	x1, x1, #0xa60
   0x0000ffffa85e447c <+284>:	ldr	x2, [x2, #8]
   0x0000ffffa85e4480 <+288>:	bl	0xffffa85e0260 <bd_utils_log_format@plt>
   0x0000ffffa85e4484 <+292>:	b	0xffffa85e443c <bd_lvm_init+220>
   0x0000ffffa85e4488 <+296>:	str	x21, [sp, #48]
   0x0000ffffa85e448c <+300>:	bl	0xffffa85e0510 <__stack_chk_fail@plt>

So... the top part there looks different from before, but the bottom part looks much the same: we've still got bd_lvm_init, g_dbus_connection_set_exit_on_close, and bd_utils_log_format. I'm really not sure where we'd still be getting a bd_utils_log_format, though, aside from the other one in setup_dbus_connection that it didn't look like we were hitting... I guess we can *try* modifying that one too, but it doesn't feel right...

Comment 45 Adam Williamson 2024-03-25 22:09:14 UTC
Oh, or does the fact that it now says "Dump of assembler code for function bd_lvm_init:" mean we're crashing in bd_lvm_init rather than in bd_utils_log_format now?

Comment 46 Adam Williamson 2024-03-25 22:13:09 UTC
Created attachment 2023553 [details]
updated coredump with libblockdev-3.1.0-6.fc41

Here is the updated coredump, in case it tells anyone else anything more than it told me.

Comment 47 Dan Horák 2024-03-26 09:17:52 UTC
What hw platform are the compose builders? I could try reserving the same in beaker for further debugging ...

Comment 48 Elliott Sales de Andrade 2024-03-26 11:45:39 UTC
(In reply to Adam Williamson from comment #45)
> Oh, or does the fact that it now says "Dump of assembler code for function
> bd_lvm_init:" mean we're crashing in bd_lvm_init rather than in
> bd_utils_log_format now?

Indeed we are (from your coredump):

(gdb) list *0x0000ffffa85e443c
0xffffa85e443c is in bd_lvm_init (/usr/src/debug/libblockdev-3.1.0-6.fc41.aarch64/src/plugins/lvm-dbus.c:339).
Downloading source file /usr/src/debug/libblockdev-3.1.0-6.fc41.aarch64/src/plugins/lvm-dbus.c
334         GError *error = NULL;                                                                                                                                                                                  
335	
336	    /* the check() call should create the DBus connection for us, but let's not
337	       completely rely on it */
338	    if (G_UNLIKELY (!bus) && !setup_dbus_connection (&error)) {
339	        bd_utils_log (BD_UTILS_LOG_CRIT, "XXX Failed to setup DBus connection");
340	        return FALSE;
341	    }
342	
343	    dm_log_with_errno_init ((dm_log_with_errno_fn) redirect_dm_log);

However, I think `setup_dbus_connection` was inlined: if you look up the previous call's address from the disassembly, you get the inner debug line:

(gdb) list *0x0000ffffa85e4438
0xffffa85e4438 is in bd_lvm_init (/usr/src/debug/libblockdev-3.1.0-6.fc41.aarch64/src/plugins/lvm-dbus.c:272).
267	                                                  NULL, NULL, error);
268	
269	    g_free (addr);
270	
271	    if (!bus || g_dbus_connection_is_closed (bus)) {
272	        bd_utils_log (BD_UTILS_LOG_CRIT, "XXX Failed to create a new connection for the system bus");
273	        return FALSE;
274	    }
275	
276	    g_dbus_connection_set_exit_on_close (bus, FALSE);
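
(Side note, in case it helps anyone following along: that mismatch is typical of inlining. With a static helper like the hypothetical one below, GCC at -O2 will usually fold it into its caller, so gdb's `list *ADDR` reports the helper's source line while the enclosing function name stays the caller's.)

/* hypothetical example, not libblockdev code */
static int helper (int value)
{
    return value / 2;   /* once inlined, addresses from this body get reported
                           as "in caller_fn (file.c:<this line>)" */
}

int caller_fn (int value)
{
    return helper (value) + 1;
}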

Comment 49 Dan Horák 2024-03-26 13:11:58 UTC
I am thinking about building libblockdev with -O0 to avoid "over-optimization" and similar issues, but the library doesn't want to give up easily :-) But this

diff --git a/libblockdev.spec b/libblockdev.spec
index c5c4b71..dcfca81 100644
--- a/libblockdev.spec
+++ b/libblockdev.spec
@@ -576,7 +576,13 @@ A meta-package that pulls all the libblockdev plugins as dependencies.
 %autosetup -n %{name}-%{version} -p1
 
 %build
+find . -name Makefile.am | xargs sed -i -e 's/-Werror//g'
+
 autoreconf -ivf
+
+%global _fortify_level 0
+%global optflags $(echo %optflags | sed -e 's/-O2/-O0/g') -Wp,-D_FORTIFY_SOURCE=0
+
 %configure %{?configure_opts}
 %{__make} %{?_smp_mflags}
 

seems to do it.

Comment 50 Vojtech Trefny 2024-03-26 13:56:37 UTC
Big thanks for investigating this! I've created this PR for upstream based on your findings: https://github.com/storaged-project/libblockdev/pull/1012

Comment 51 Kevin Fenzi 2024-03-26 14:17:16 UTC
(In reply to Dan Horák from comment #47)
> What hw platform are the compose builders? I could try reserving the same in
> beaker for further debugging ...

The builders in the aarch64 livemedia channel are mostly eMAGs, plus 2 Cavium ThunderX2s.
I have some Mt. Snow machines I want to add, but haven't yet.

This issue appears on both the eMAGs and the ThunderX2s.

Comment 52 Adam Williamson 2024-03-26 16:51:08 UTC
So...apologies for dumb questions, but what does the info from Elliott's comment tell us exactly? Are the line numbers (339 and 272) significant - are we really somehow still failing on the log lines, even though there's no templating any more? Or is it more complicated than that? Sorry, I am used to Python traces, where at least I pretty much always know exactly where it's breaking. :P

Other than that, I guess I'll do a build with Dan's patch and we can see how the coredumps look after the next compose?

Comment 53 Adam Williamson 2024-03-26 17:21:53 UTC
OK, I'm running a build now that both replaces my dumb XXX patch with Vojtech's upstream PR, and implements Dan's de-optimization changes: https://koji.fedoraproject.org/koji/taskinfo?taskID=115471730 (libblockdev-3.1.1-2.fc41). We'll see how things go with the next compose.

Comment 54 Tomáš Bžatek 2024-03-26 17:28:18 UTC
Another hint: try clang?

Comment 55 Dan Horák 2024-03-26 17:51:55 UTC
(In reply to Tomáš Bžatek from comment #54)
> Another hint: try clang?

Yup, and also an instrumented build with ubsan and/or asan.

Comment 56 Adam Williamson 2024-03-26 22:25:09 UTC
Ooo. It looks like the latest build might have fixed it (or we got really lucky). Workstation and KDE both built OK this time. lxqt and i3 failed, but they did not fail on the SIGILL - they failed much later, due to lack of disk space.

I'm not sure if Vojtech's better fix for the logging issues did the trick, or if somehow the bug goes away with the de-optimization changes, or we just got lucky. I'm going to do another build *with* Vojtech's patch but *without* the de-optimization changes and we can watch the next compose. I also sent a couple of PRs to bump the space for i3 and lxqt, so those two should pass if they don't hit the SIGILL (I hope).

Comment 57 Dan Horák 2024-03-27 12:53:28 UTC
I suspect it passed with -2 due to the de-optimization, as it fails again with -3 with the de-op removed ...

Comment 58 Adam Williamson 2024-03-27 15:02:26 UTC
Ah, foo, indeed. So, how do you want to proceed? Try with -O1?

Comment 59 Dan Horák 2024-03-27 16:23:55 UTC
Good question. I would keep Vojtech's patch (it looks right to me) and add the GCC sanitizer options: start with "undefined behaviour", then try "address", and only for aarch64, to be on the safe side with regard to the other arches.

--- a/libblockdev.spec
+++ b/libblockdev.spec
@@ -576,7 +576,14 @@ A meta-package that pulls all the libblockdev plugins as dependencies.
 %autosetup -n %{name}-%{version} -p1
 
 %build
 autoreconf -ivf
+
+%ifarch aarch64
+%global optflags %optflags -fsanitize=undefined
+%endif
+
 %configure %{?configure_opts}
 %{__make} %{?_smp_mflags}
 

Worst case, we might be hitting some CPU design issue (cache not invalidated or such)... and thus it might be useful to know whether moving the jobs to the much newer Mt. Snows would resolve it. But let's start with the sanitizers and maybe better-targeted de-optimizing.
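
For reference, the kind of runtime report we'd hope to get out of an -fsanitize=undefined build looks something like this (a made-up example, not output from the compose):

/* made-up example: build with -fsanitize=undefined and run it */
#include <limits.h>

int main (int argc, char **argv)
{
    (void) argv;
    int i = INT_MAX;
    /* with argc >= 1 this overflows; ubsan reports something like
       "runtime error: signed integer overflow: 2147483647 + 1 cannot be
       represented in type 'int'" instead of silently wrapping */
    return (i + argc) & 1;
}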

Comment 60 Adam Williamson 2024-03-27 16:37:15 UTC
OK, can do. Is the goal here to find a minimal level of de-optimization that makes the bug go away, or to debug it further, or both? Anyway, will apply that change and run another build.

Comment 61 Dan Horák 2024-03-27 16:46:45 UTC
The de-optimization can be done at the source-file or even function level (and we have some suspects), so it would serve both, I think. The goal should still be to get to the root cause, but this "remote" debugging doesn't make it easy :-)
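
For the record, per-function de-optimization could look roughly like this with GCC attributes (the function names below are placeholders, not an actual libblockdev change):

/* hypothetical illustration only */
#include <stdio.h>

/* force -O0 for this one function, regardless of the global optflags */
__attribute__ ((optimize ("O0")))
static int suspect_helper (int x)
{
    return x * 2;
}

/* or just keep it out-of-line, if over-aggressive inlining is the suspect */
__attribute__ ((noinline))
static int another_suspect (int x)
{
    return x + 1;
}

int main (void)
{
    printf ("%d %d\n", suspect_helper (21), another_suspect (20));
    return 0;
}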

Comment 62 Adam Williamson 2024-03-27 17:00:55 UTC
Yeah, unfortunately I can't find an easy way to tighten the loop :/ It's not really the remoteness so much as the cycle time; it takes about five hours for us to get from the start of a compose to success/failure of the aarch64 lives, I think.

Comment 63 Adam Williamson 2024-03-28 15:13:26 UTC
Created attachment 2024016 [details]
another updated coredump with libblockdev 3.1.1-4 (with -fsanitize=undefined)

Comment 64 Adam Williamson 2024-04-04 18:47:55 UTC
Dan: ping? I know we all got distracted by xz :D but did that last coredump help any?

Comment 65 Dan Horák 2024-04-05 08:59:38 UTC
(In reply to Adam Williamson from comment #64)
> Dan: ping? I know we all got distracted by xz :D but did that last coredump
> help any?

Not really :-( I was hoping for some output from the "undefined behaviour" sanitizer; my next step is the address sanitizer in libblockdev-3.1.1-5.fc41.

Comment 66 Kevin Fenzi 2024-04-06 01:16:23 UTC
So... that didn't work at all. ;) 

https://koji.fedoraproject.org/koji/taskinfo?taskID=115944106

and the screenshot has a dead anaconda with: 

"==2499==ASan runtime does not come first in initial library list; you should either link runtime to your application or manually preload it with LD_PRELOAD"

I have untagged it for now so we can get a compose...

Comment 67 Dan Horák 2024-04-08 07:59:17 UTC
Hmm, that's unfortunate :-( I will revert the spec back to a "normal" build before any next steps.

Comment 68 Adam Williamson 2024-04-11 19:14:53 UTC
I'm thinking that for the next Fedora 40 Final RC, at least, we might want to put in a 'detuned' (per comment #49) libblockdev build so we get all the images. It's not the correct fix, of course, but it *does* avoid the bug in practice.

Doing this would mean we'd have to ship the detuned libblockdev build as part of the release, though, and of course it would be in *all* images in the compose, not just used to build the affected aarch64 live images. How bad is the performance impact of losing those optimizations? Do you think it's bad enough that this would be a bad plan, Dan?

Comment 69 Adam Williamson 2024-04-11 19:15:39 UTC
proposing as FinalFE for that purpose.

Comment 70 Adam Williamson 2024-04-11 20:11:20 UTC
+3 in https://pagure.io/fedora-qa/blocker-review/issue/1588, marking accepted FE. This means we *can* include a detuned build in the next compose, but we'll still wait for Dan's opinion on whether we *should*.

Comment 71 Dan Horák 2024-04-11 20:35:26 UTC
I believe libblockdev isn't performance-critical, so it makes sense to do a de-optimized (on aarch64 only) build. Let me prepare a Rawhide build ASAP. Adam, feel free to do an F-40 build too; it seems it can't be fast-forwarded, but a cherry-pick into the f40 branch should work.

Unfortunately I am mostly out of ideas about the next steps to get to the root cause :-(

Comment 72 Adam Williamson 2024-04-11 20:41:58 UTC
Oh, yeah, I forgot we can apply the de-optimization only on aarch64, which limits the impact. Cool. Go ahead and do a Rawhide build and I'll take care of F40.

Comment 73 Dan Horák 2024-04-11 20:56:29 UTC
rawhide build is https://koji.fedoraproject.org/koji/taskinfo?taskID=116239319

And for the next step I will try to de-optimize just the "utils" internal library ...

Comment 74 Kevin Fenzi 2024-04-12 17:59:30 UTC
So, 3 failed aarch64 lives with that build. ;( 

https://koji.fedoraproject.org/koji/taskinfo?taskID=116257542
https://koji.fedoraproject.org/koji/taskinfo?taskID=116257533
https://koji.fedoraproject.org/koji/taskinfo?taskID=116257552

Adding Florian here in case he has ideas or knows anyone in the tools space who might be able to see the problem more clearly than we can...

Comment 75 Florian Weimer 2024-04-12 18:26:39 UTC
Most of the disassembly here is for pthread_kill, which is explicitly called for signal 4, which is SIGILL. This must come from the (Anaconda?) crash handler. These in-process crash handlers are best avoided because they may obfuscate what's going on. A coredump without a crash handler (such as captured by systemd-coredump) may have more information.

The real crash is in the frame that comes in a GDB backtrace right after the line with <signal handler called>. The fault address 0x0000ffffa85e443c seems bogus because that instruction cannot fault. So unless something rewrites the file in place, I don't see how this can crash.

Comment 76 Dan Horák 2024-04-12 18:30:32 UTC
(In reply to Kevin Fenzi from comment #74)
> So, 3 failed aarch64 live's with that build. ;( 
> 
> https://koji.fedoraproject.org/koji/taskinfo?taskID=116257542
> https://koji.fedoraproject.org/koji/taskinfo?taskID=116257533
> https://koji.fedoraproject.org/koji/taskinfo?taskID=116257552
> 
> Adding Florian here in case he has ideas or knows anyone in tools space that
> might be able to see the problem more clearly than we...

FWIW only the latter 2 failed due to the SIGILL, if I see it right ...

Comment 77 Adam Williamson 2024-04-12 19:19:21 UTC
Florian: I'm pretty sure I got the coredumps I attached by using `coredumpctl dump`, which is from systemd-coredump, isn't it?

Comment 78 Florian Weimer 2024-04-12 19:41:30 UTC
(In reply to Adam Williamson from comment #77)
> Florian: I'm pretty sure I got the coredumps I attached by using
> `coredumpctl dump`, which is from systemd-coredump, isn't it?

Yes, but some of the backtraces seem to have the crash handler on the stack, rather than the actual stack contents at the crash site.

Comment 79 Kevin Fenzi 2024-04-13 17:28:02 UTC
So, the branched compose failed last night... the Workstation live failed, but this time it was not only the aarch64 one, but also the x86_64 one. ;( 

https://koji.fedoraproject.org/koji/taskinfo?taskID=116308658

Sadly, we don't have the core there. We never enabled it for x86_64 builders, only aarch64 buildhw's.

Comment 80 Florian Weimer 2024-04-13 18:18:30 UTC
This is really odd. The circumstances for SIGILL are not exactly the same between AArch64 and x86-64. For example, a GCC-generated trap results in SIGILL on x86-64, but on AArch64 it's SIGTRAP.
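
A minimal illustration (my own toy example, nothing from the compose): GCC's __builtin_trap() compiles to ud2 on x86-64, which faults with SIGILL, while on AArch64 it emits a brk instruction, which is delivered as SIGTRAP:

#include <stdio.h>

int main (void)
{
    puts ("about to trap");
    /* never returns: the process dies with SIGILL on x86-64 (ud2)
       and SIGTRAP on aarch64 (brk), matching the difference above */
    __builtin_trap ();
}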