Bug 2247319

Summary:      sporadic sigill with python3.12 in livemedia composes
Product:      Fedora
Component:    libblockdev
Version:      39
Hardware:     aarch64
OS:           Linux
Status:       NEW
Severity:     medium
Priority:     unspecified
Whiteboard:   AcceptedFreezeException
Reporter:     Kevin Fenzi <kevin>
Assignee:     Vojtech Trefny <vtrefny>
QA Contact:   Fedora Extras Quality Assurance <extras-qa>
CC:           awilliam, cstratak, dan, dennis, fweimer, mhroncok, python-packagers-sig, quantum.analyst, robatino, tbzatek, thrnciar, v.podzimek+fedora, vtrefny, zbyszek
Bug Blocks:   245418, 2187795
Description (Kevin Fenzi, 2023-10-31 20:32:31 UTC)
I have no idea where I would start investigating this or how. Is it possible to download the filesystem of the builder and boot it up in some VM?

Yeah, I am really not sure how to debug this either. :( Perhaps we could first cc the anaconda folks and see if they see anything like it on their aarch64 testing/ci? All of these are on bare metal aarch64 builders, with mock / systemd-nspawn and livemedia-creator running in that. I could also try and duplicate it in staging, but probably not this week... Oh, I forgot, another thing I'd like to do is update/reboot those builders... but we are in the f39 freeze and not making any changes there, but can hopefully soon...

Can we get the coredump and backtrace it at least?

There isn't one. At least nothing in coredumpctl. ;( I am not sure why.

It's ulimit, I think:

Oct 31 09:49:01 buildhw-a64-02.iad2.fedoraproject.org systemd-coredump[1773132]: Resource limits disable core dumping for process 1773084 (python3).

We should be able to change the limits and get a coredump.

If you can get a coredump, I may be able to help you to debug it. Is it easy to reproduce the issue? Do you think that it would be possible to create a special image containing python3-debug and run anaconda programs (such as "python3 -m pyanaconda.modules.storage") with that interpreter instead? Like running "python3-debug -X dev -m pyanaconda.modules.storage" instead. Adding "-X dev" enables more debug checks:

* -X dev: https://docs.python.org/dev/library/devmode.html
* python3-debug is a Python debug build: https://docs.python.org/dev/using/configure.html#python-debug-build

We're working on getting a coredump - see https://pagure.io/fedora-infra/ansible/pull-request/1618 . We might be able to do a custom image also; I can look into that later.

Any luck getting a coredump since the PR has been merged?

Sadly no. It didn't work. ;( I'll try and figure out another way...

OK. Got a core finally...

           PID: 3558463 (python3)
           UID: 0 (root)
           GID: 0 (root)
        Signal: 4 (ILL)
     Timestamp: Tue 2023-11-14 19:12:16 UTC (4min 15s ago)
  Command Line: python3 -m pyanaconda.modules.payloads
    Executable: /usr/bin/python3.12
 Control Group: /machine.slice/machine-a419a95f068a4ded988be5153ba67666.scope/payload
          Unit: machine-a419a95f068a4ded988be5153ba67666.scope
         Slice: machine.slice
       Boot ID: 45eb1ecb91fe45fc9c7bda79e918763c
    Machine ID: 5d8add178f344e378788cb2550e6a521
      Hostname: a419a95f068a4ded988be5153ba67666
       Storage: /var/lib/systemd/coredump/core.python3.0.45eb1ecb91fe45fc9c7bda79e918763c.3558463.1699989136000000.zst (present)
  Size on Disk: 2.7M
       Package: python3.12/3.12.0-2.fc40
      build-id: b9bd80861442b62dc97210c2bd0334084f6c734c
       Message: Process 3558463 (python3) of user 0 dumped core.
Module /usr/bin/python3.12 from rpm python3.12-3.12.0-2.fc40.aarch64 Module /usr/lib64/libdevmapper.so.1.02 from rpm lvm2-2.03.22-1.fc39.aarch64 Module /usr/lib64/libbd_lvm-dbus.so.3.0.0 from rpm libblockdev-3.0.4-1.fc40.aarch64 Module /usr/lib64/python3.12/site-packages/selinux/_selinux.cpython-312-aarch64-linux-gnu.so from rpm libselinux-3.5-5.fc39.aarch64 Module /usr/lib64/libzstd.so.1.5.5 from rpm zstd-1.5.5-4.fc39.aarch64 Module /usr/lib64/libcap.so.2.48 from rpm libcap-2.48-7.fc39.aarch64 Module /usr/lib64/libkmod.so.2.4.1 from rpm kmod-31-2.fc40.aarch64 Module /usr/lib64/libudev.so.1.7.7 from rpm systemd-254.5-2.fc40.aarch64 Module /usr/lib64/libbd_utils.so.3.0.0 from rpm libblockdev-3.0.4-1.fc40.aarch64 Module /usr/lib64/libblockdev.so.3.0.0 from rpm libblockdev-3.0.4-1.fc40.aarch64 Module /usr/lib64/libgmp.so.10.4.1 from rpm gmp-6.2.1-5.fc39.aarch64 Module /usr/lib64/libmpfr.so.6.2.1 from rpm mpfr-4.2.1-1.fc40.aarch64 Module /usr/lib64/libbytesize.so.1.0.0 from rpm libbytesize-2.10-1.fc40.aarch64 Module /usr/lib64/python3.12/lib-dynload/_ctypes.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64 Module /usr/lib64/python3.12/lib-dynload/_opcode.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64 Module /usr/lib64/python3.12/lib-dynload/array.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64 Module /usr/lib64/python3.12/lib-dynload/_socket.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64 Module /usr/lib64/libblkid.so.1.1.0 from rpm util-linux-2.39.2-1.fc40.aarch64 Module /usr/lib64/libselinux.so.1 from rpm libselinux-3.5-5.fc39.aarch64 Module /usr/lib64/libmount.so.1.1.0 from rpm util-linux-2.39.2-1.fc40.aarch64 Module /usr/lib64/libgio-2.0.so.0.7800.1 from rpm glib2-2.78.1-1.fc40.aarch64 Module /usr/lib64/libgmodule-2.0.so.0.7800.1 from rpm glib2-2.78.1-1.fc40.aarch64 Module /usr/lib64/libpcre2-8.so.0.11.2 from rpm pcre2-10.42-2.fc40.aarch64 Module /usr/lib64/libffi.so.8.1.2 from rpm libffi-3.4.4-4.fc39.aarch64 Module /usr/lib64/libgobject-2.0.so.0.7800.1 from rpm glib2-2.78.1-1.fc40.aarch64 Module /usr/lib64/libgirepository-1.0.so.1.0.0 from rpm gobject-introspection-1.78.1-1.fc40.aarch64 Module /usr/lib64/libglib-2.0.so.0.7800.1 from rpm glib2-2.78.1-1.fc40.aarch64 Module /usr/lib64/python3.12/site-packages/gi/_gi.cpython-312-aarch64-linux-gnu.so from rpm pygobject3-3.46.0-1.fc40.aarch64 Module /usr/lib64/python3.12/lib-dynload/_struct.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64 Module /usr/lib64/libmpdec.so.2.5.1 from rpm mpdecimal-2.5.1-7.fc39.aarch64 Module /usr/lib64/python3.12/lib-dynload/_decimal.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64 Module /usr/lib64/libuuid.so.1.3.0 from rpm util-linux-2.39.2-1.fc40.aarch64 Module /usr/lib64/python3.12/lib-dynload/_uuid.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64 Module /usr/lib64/libb2.so.1.0.4 from rpm libb2-0.98.1-9.fc39.aarch64 Module /usr/lib64/python3.12/lib-dynload/_blake2.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64 Module /usr/lib64/libcrypto.so.3.1.4 from rpm openssl-3.1.4-1.fc40.aarch64 Module /usr/lib64/python3.12/lib-dynload/_hashlib.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64 Module /usr/lib64/python3.12/lib-dynload/_random.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64 Module 
/usr/lib64/python3.12/lib-dynload/_bisect.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64 Module /usr/lib64/liblzma.so.5.4.5 from rpm xz-5.4.5-1.fc40.aarch64 Module /usr/lib64/python3.12/lib-dynload/_lzma.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64 Module /usr/lib64/libbz2.so.1.0.8 from rpm bzip2-1.0.8-16.fc39.aarch64 Module /usr/lib64/python3.12/lib-dynload/_bz2.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64 Module /usr/lib64/libz.so.1.2.13 from rpm zlib-1.2.13-5.fc40.aarch64 Module /usr/lib64/python3.12/lib-dynload/zlib.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64 Module /usr/lib64/python3.12/lib-dynload/math.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64 Module /usr/lib64/python3.12/lib-dynload/select.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64 Module /usr/lib64/python3.12/lib-dynload/_posixsubprocess.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64 Module /usr/lib64/python3.12/lib-dynload/fcntl.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64 Module /usr/lib64/python3.12/lib-dynload/syslog.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64 Module /usr/lib64/libpython3.12.so.1.0 from rpm python3.12-3.12.0-2.fc40.aarch64 Stack trace of thread 53: #0 0x0000ffffa2df1c80 n/a (/usr/lib64/libc.so.6 + 0x91c80) #1 0x0000ffffa2df1c6c n/a (/usr/lib64/libc.so.6 + 0x91c6c) #2 0x0000ffffa2da5800 n/a (/usr/lib64/libc.so.6 + 0x45800) #3 0x0000ffffa35de7f0 n/a (ld-linux-aarch64.so.1 + 0x4d7f0) #4 0x0000ffff93372cd8 n/a (/usr/lib64/libbd_utils.so.3.0.0 + 0x12cd8) #5 0x0000ffff930b4438 n/a (/usr/lib64/libbd_lvm-dbus.so.3.0.0 + 0x14438) #6 0x0000ffff933c1680 n/a (/usr/lib64/libblockdev.so.3.0.0 + 0x11680) #7 0x0000ffff933ccba0 n/a (/usr/lib64/libblockdev.so.3.0.0 + 0x1cba0) #8 0x0000ffff933cf958 n/a (/usr/lib64/libblockdev.so.3.0.0 + 0x1f958) #9 0x0000ffff933d6ee8 n/a (/usr/lib64/libblockdev.so.3.0.0 + 0x26ee8) #10 0x0000ffff93ecc050 n/a (/usr/lib64/libffi.so.8.1.2 + 0x1c050) #11 0x0000ffff93ec63ac n/a (/usr/lib64/libffi.so.8.1.2 + 0x163ac) #12 0x0000ffff941c1344 n/a (/usr/lib64/python3.12/site-packages/gi/_gi.cpython-312-aarch64-linux-gnu.so + 0x31344) #13 0x0000ffff941bfe4c n/a (/usr/lib64/python3.12/site-packages/gi/_gi.cpython-312-aarch64-linux-gnu.so + 0x2fe4c) #14 0x0000ffffa30aa008 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x17a008) #15 0x0000ffffa30b6518 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x186518) #16 0x0000ffffa316bf48 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x23bf48) #17 0x0000ffffa3193c04 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x263c04) #18 0x0000ffffa30d0bd0 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a0bd0) #19 0x0000ffffa30b7868 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x187868) #20 0x0000ffffa30d9ed0 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a9ed0) #21 0x0000ffffa3105b98 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d5b98) #22 0x0000ffffa310516c n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d516c) #23 0x0000ffffa3113fc4 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1e3fc4) #24 0x0000ffffa30d0bd0 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a0bd0) #25 0x0000ffffa30b7868 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x187868) #26 0x0000ffffa30d9ed0 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a9ed0) #27 0x0000ffffa3105b98 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d5b98) #28 0x0000ffffa310516c n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d516c) #29 0x0000ffffa30ba964 n/a 
(/usr/lib64/libpython3.12.so.1.0 + 0x18a964) #30 0x0000ffffa316bf48 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x23bf48) #31 0x0000ffffa3193c04 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x263c04) #32 0x0000ffffa30d0bd0 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a0bd0) #33 0x0000ffffa30b7868 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x187868) #34 0x0000ffffa30d9ed0 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a9ed0) #35 0x0000ffffa3105b98 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d5b98) #36 0x0000ffffa310516c n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d516c) #37 0x0000ffffa30ba964 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x18a964) #38 0x0000ffffa316bf48 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x23bf48) #39 0x0000ffffa3193c04 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x263c04) #40 0x0000ffffa30d0bd0 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a0bd0) #41 0x0000ffffa30b7868 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x187868) #42 0x0000ffffa30d9ed0 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a9ed0) #43 0x0000ffffa3105b98 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d5b98) #44 0x0000ffffa310516c n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d516c) #45 0x0000ffffa30ba964 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x18a964) #46 0x0000ffffa316bf48 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x23bf48) #47 0x0000ffffa3193c04 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x263c04) #48 0x0000ffffa30d0bd0 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a0bd0) #49 0x0000ffffa30d0678 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a0678) #50 0x0000ffffa30b6518 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x186518) #51 0x0000ffffa31acc40 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x27cc40) #52 0x0000ffffa31abefc n/a (/usr/lib64/libpython3.12.so.1.0 + 0x27befc) #53 0x0000ffffa3151c34 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x221c34) #54 0x0000ffffa2d909dc n/a (/usr/lib64/libc.so.6 + 0x309dc) #55 0x0000ffffa2d90ab0 n/a (/usr/lib64/libc.so.6 + 0x30ab0) #56 0x0000aaaad11f00f0 n/a (/usr/bin/python3.12 + 0x100f0) ELF object binary architecture: AARCH64 The file is at https://people.fedoraproject.org/~kevin/core.python3.0.45eb1ecb91fe45fc9c7bda79e918763c.3558463.1699989136000000.zst let me know if I can add anymore. Kevin, could you get an output of "coredumpctl gdb"? ok. Here's what I got... Also, FYI, we are doing a update/reboot cycle tomorrow, so we can see if this even persists after updates... Core was generated by `python3 -m pyanaconda.modules.payloads'. Program terminated with signal SIGILL, Illegal instruction. #0 __pthread_kill_implementation (threadid=281473422540832, signo=signo@entry=4, no_tid=no_tid@entry=0) at pthread_kill.c:44 Downloading source file /usr/src/debug/glibc-2.38.9000-19.fc40.aarch64/nptl/pthread_kill.c 44 return INTERNAL_SYSCALL_ERROR_P (ret) ? 
INTERNAL_SYSCALL_ERRNO (ret) : 0; [Current thread is 1 (LWP 53)] Missing separate debuginfos, use: dnf debuginfo-install bzip2-libs-1.0.8-13.fc38.aarch64 devic e-mapper-libs-1.02.189-2.fc38.aarch64 glib2-2.76.5-2.fc38.aarch64 glibc-2.37-10.fc38.aarch64 g mp-6.2.1-4.fc38.aarch64 gobject-introspection-1.76.1-1.fc38.aarch64 kmod-libs-30-4.fc38.aarch6 4 libb2-0.98.1-8.fc38.aarch64 libblkid-2.38.1-4.fc38.aarch64 libbytesize-2.9-1.fc38.aarch64 li bcap-2.48-6.fc38.aarch64 libffi-3.4.4-2.fc38.aarch64 libgcc-13.2.1-1.fc38.aarch64 libgomp-13.2 .1-1.fc38.aarch64 libmount-2.38.1-4.fc38.aarch64 libselinux-3.5-1.fc38.aarch64 libuuid-2.38.1- 4.fc38.aarch64 libzstd-1.5.5-1.fc38.aarch64 mpdecimal-2.5.1-6.fc38.aarch64 mpfr-4.1.1-3.fc38.a arch64 openssl-libs-3.0.9-2.fc38.aarch64 pcre2-10.42-1.fc38.1.aarch64 systemd-libs-253.10-1.fc 38.aarch64 xz-libs-5.4.1-1.fc38.aarch64 zlib-1.2.13-3.fc38.aarch64 (gdb) thread apply all bt full Thread 3 (LWP 69): #0 0x0000ffffa2e4d6e4 in __GI___poll (fds=0xaaaaf7fdf790, nfds=1, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:41 _x3tmp = 0 _x0tmp = 187651281778576 _x0 = 187651281778576 _x3 = 0 _x4tmp = 0 _x1tmp = 1 _x1 = 1 _x4 = 0 _x2tmp = <optimized out> _x2 = 0 _x8 = 73 _sys_result = <optimized out> sc_ret = <optimized out> sc_cancel_oldtype = 0 sc_ret = <optimized out> timeout_ts = {tv_sec = 187651281778064, tv_nsec = 187651281778064} timeout_ts_p = <optimized out> #1 0x0000ffff940be5b0 in g_main_context_poll_unlocked (priority=2147483647, n_fds=1, fds=0xaa aaf7fdf790, timeout=<optimized out>, context=0xaaaaf7fdf590) at ../glib/gmain.c:4653 ret = <optimized out> errsv = <optimized out> poll_func = 0xffff9406e2c0 <g_poll> poll_func = <optimized out> ret = <optimized out> errsv = <optimized out> #2 g_main_context_iterate_unlocked.isra.0 (context=context@entry=0xaaaaf7fdf590, block=block@ entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at ../glib/gmain.c:4344 max_priority = 2147483647 timeout = -1 some_ready = <optimized out> nfds = 1 allocated_nfds = 1 fds = 0xaaaaf7fdf790 begin_time_nsec = <optimized out> #3 0x0000ffff9405db18 in g_main_context_iteration (context=0xaaaaf7fdf590, may_block=may_bloc k@entry=1) at ../glib/gmain.c:4414 retval = <optimized out> #4 0x0000ffff9405db74 in glib_worker_main (data=<optimized out>) at ../glib/gmain.c:6574 No locals. #5 0x0000ffff940944c4 in g_thread_proxy (data=0xaaaaf7fdf730) at ../glib/gthread.c:831 thread = 0xaaaaf7fdf730 __func__ = "g_thread_proxy" #6 0x0000ffffa2deff84 in start_thread (arg=0xffffa35d4760) at pthread_create.c:444 ret = <optimized out> pd = 0xffffa35d4760 out = <optimized out> unwind_buf = {cancel_jmp_buf = {{jmp_buf = {281473139601792, 281473139601792, 48, 2814 73422542688, 281473414266048, 0, 281473131151360, 281473415512064, 281473422540832, 2814731311 51360, 281473139599888, 11276355813591266238, 0, 11276355813333378930, 0, 0, 0, 0, 0, 0, 0, 0} , mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}} not_first_call = 0 #7 0x0000ffffa2e58dcc in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone3.S:76 No locals. Thread 2 (LWP 66): --Type <RET> for more, q to quit, c to continue without paging--c #0 syscall () at ../sysdeps/unix/sysv/linux/aarch64/syscall.S:38 No locals. 
#1 0x0000ffff940badcc in g_cond_wait (cond=<optimized out>, mutex=0xaaaaf7fd54d0) at ../glib/ gthread-posix.c:1552 saved_errno = 0 res = <optimized out> sampled = 0 #2 0x0000ffff940293e8 in g_async_queue_pop_intern_unlocked (queue=0xaaaaf7fd54d0, wait=1, end _time=-1) at ../glib/gasyncqueue.c:425 retval = <optimized out> __func__ = "g_async_queue_pop_intern_unlocked" #3 0x0000ffff9409669c in g_thread_pool_spawn_thread (data=<optimized out>) at ../glib/gthread pool.c:311 spawn_thread_data = <optimized out> thread = 0x0 error = 0x0 prgname = <optimized out> name = "pool", '\000' <repeats 11 times> #4 0x0000ffff940944c4 in g_thread_proxy (data=0xaaaaf7e34c20) at ../glib/gthread.c:831 thread = 0xaaaaf7e34c20 __func__ = "g_thread_proxy" #5 0x0000ffffa2deff84 in start_thread (arg=0xffffa35d4760) at pthread_create.c:444 ret = <optimized out> pd = 0xffffa35d4760 out = <optimized out> unwind_buf = {cancel_jmp_buf = {{jmp_buf = {281473148055936, 281473148055936, 48, 2814 73422542688, 281473414266048, 0, 281473139605504, 281473415512064, 281473422540832, 2814731396 05504, 281473148054032, 11276355813591266238, 0, 11276355813353629554, 0, 0, 0, 0, 0, 0, 0, 0} , mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}} not_first_call = 0 #6 0x0000ffffa2e58dcc in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone3.S:76 No locals. Thread 1 (LWP 53): #0 __pthread_kill_implementation (threadid=281473422540832, signo=signo@entry=4, no_tid=no_ti d@entry=0) at pthread_kill.c:44 tid = 53 ret = 0 pd = 0xffffa35d4020 old_mask = {__val = {0}} ret = <optimized out> #1 0x0000ffffa2df1cf8 in __pthread_kill_internal (signo=4, threadid=<optimized out>) at pthre ad_kill.c:78 No locals. #2 0x0000ffffa2da5800 in __GI_raise (sig=4) at ../sysdeps/posix/raise.c:26 ret = <optimized out> #3 <signal handler called> No symbol table info available. #4 0x0000ffffa359cfe0 in ?? () No symbol table info available. #5 0x0000aaaaf7e28e80 in ?? () No symbol table info available. Backtrace stopped: previous frame inner to this frame (corrupt stack?) > Missing separate debuginfos, use: dnf debuginfo-install (...) Oh, the debuginfod server was not enabled? https://fedoraproject.org/wiki/Debuginfod You may install these debug symbols to get a more useful backtrace. Sadly, the most interesting thread has an unusable bactrace: Thread 1 (LWP 53): ... #3 <signal handler called> No symbol table info available. #4 0x0000ffffa359cfe0 in ?? () No symbol table info available. ... 
Apparently, the crash is related to libblockdev, libbd_lvm-dbus and libbd_utils: Stack trace of thread 53: #0 0x0000ffffa2df1c80 n/a (/usr/lib64/libc.so.6 + 0x91c80) #1 0x0000ffffa2df1c6c n/a (/usr/lib64/libc.so.6 + 0x91c6c) #2 0x0000ffffa2da5800 n/a (/usr/lib64/libc.so.6 + 0x45800) #3 0x0000ffffa35de7f0 n/a (ld-linux-aarch64.so.1 + 0x4d7f0) #4 0x0000ffff93372cd8 n/a (/usr/lib64/libbd_utils.so.3.0.0 + 0x12cd8) #5 0x0000ffff930b4438 n/a (/usr/lib64/libbd_lvm-dbus.so.3.0.0 + 0x14438) #6 0x0000ffff933c1680 n/a (/usr/lib64/libblockdev.so.3.0.0 + 0x11680) #7 0x0000ffff933ccba0 n/a (/usr/lib64/libblockdev.so.3.0.0 + 0x1cba0) #8 0x0000ffff933cf958 n/a (/usr/lib64/libblockdev.so.3.0.0 + 0x1f958) #9 0x0000ffff933d6ee8 n/a (/usr/lib64/libblockdev.so.3.0.0 + 0x26ee8) #10 0x0000ffff93ecc050 n/a (/usr/lib64/libffi.so.8.1.2 + 0x1c050) #11 0x0000ffff93ec63ac n/a (/usr/lib64/libffi.so.8.1.2 + 0x163ac) #12 0x0000ffff941c1344 n/a (/usr/lib64/python3.12/site-packages/gi/_gi.cpython-312-aarch64-linux-gnu.so + 0x31344) #13 0x0000ffff941bfe4c n/a (/usr/lib64/python3.12/site-packages/gi/_gi.cpython-312-aarch64-linux-gnu.so + 0x2fe4c) From what I see, the problem comes more likely from libblockdev than from Python. I reassign the issue to libblockdev. > debuginfod server was not enabled?
It was. This builder hasn't had updates applied in a while (because we were in freeze before the f39 release), so likely all those things have since been updated and the debuginfo for the older packages it has is no longer around? In any case it did download a bunch of things, just not everything. ;(
If someone needs access to an aarch64 system with rawhide, let us know. Bellow is a gdb session showing the disassembly of the function containing 0x0000ffff93372cd8 where the crash likely happens. The missing backtrace points to some memory/stack corruption (eg. data treated as an instruction) ... [root@hpe-apollo-cn99xx-15-vm-30 ~]# gdb --quiet /usr/bin/python3.12 core.python3.0.45eb1ecb91fe45fc9c7bda79e918763c.3558463.1699989136000000 Reading symbols from /usr/bin/python3.12... This GDB supports auto-downloading debuginfo from the following URLs: <https://debuginfod.fedoraproject.org/> Enable debuginfod for this session? (y or [n]) y Debuginfod has been enabled. To make this setting permanent, add 'set debuginfod enabled on' to .gdbinit. Reading symbols from /root/.cache/debuginfod_client/b9bd80861442b62dc97210c2bd0334084f6c734c/debuginfo... warning: Can't open file /usr/lib/locale/locale-archive during file-backed mapping note processing [New LWP 53] [New LWP 66] [New LWP 69] warning: Could not load shared library symbols for /usr/lib64/python3.12/site-packages/selinux/_selinux.cpython-312-aarch64-linux-gnu.so. Do you need "set solib-search-path" or "set sysroot"? [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib64/libthread_db.so.1". Core was generated by `python3 -m pyanaconda.modules.payloads'. Program terminated with signal SIGILL, Illegal instruction. #0 __pthread_kill_implementation (threadid=281473422540832, signo=signo@entry=4, no_tid=no_tid@entry=0) at pthread_kill.c:44 44 return INTERNAL_SYSCALL_ERROR_P (ret) ? INTERNAL_SYSCALL_ERRNO (ret) : 0; [Current thread is 1 (Thread 0xffffa35d4020 (LWP 53))] (gdb) disass 0x0000ffff93372cd8 Dump of assembler code for function bd_utils_log_format: 0x0000ffff93372be0 <+0>: paciasp 0x0000ffff93372be4 <+4>: sub sp, sp, #0x120 0x0000ffff93372be8 <+8>: adrp x8, 0xffff9339f000 0x0000ffff93372bec <+12>: ldr x8, [x8, #4056] 0x0000ffff93372bf0 <+16>: stp x29, x30, [sp, #80] 0x0000ffff93372bf4 <+20>: add x29, sp, #0x50 0x0000ffff93372bf8 <+24>: stp x19, x20, [sp, #96] 0x0000ffff93372bfc <+28>: adrp x20, 0xffff933a0000 <log_level> 0x0000ffff93372c00 <+32>: mov w19, w0 0x0000ffff93372c04 <+36>: str q0, [sp, #112] 0x0000ffff93372c08 <+40>: str q1, [sp, #128] 0x0000ffff93372c0c <+44>: str q2, [sp, #144] 0x0000ffff93372c10 <+48>: str q3, [sp, #160] 0x0000ffff93372c14 <+52>: str q4, [sp, #176] 0x0000ffff93372c18 <+56>: str q5, [sp, #192] 0x0000ffff93372c1c <+60>: str q6, [sp, #208] 0x0000ffff93372c20 <+64>: str q7, [sp, #224] 0x0000ffff93372c24 <+68>: stp x2, x3, [sp, #240] 0x0000ffff93372c28 <+72>: stp x4, x5, [sp, #256] 0x0000ffff93372c2c <+76>: stp x6, x7, [sp, #272] 0x0000ffff93372c30 <+80>: ldr x0, [x20, #8] 0x0000ffff93372c34 <+84>: ldr x2, [x8] 0x0000ffff93372c38 <+88>: str x2, [sp, #72] 0x0000ffff93372c3c <+92>: mov x2, #0x0 // #0 0x0000ffff93372c40 <+96>: str xzr, [sp, #32] 0x0000ffff93372c44 <+100>: cbz x0, 0xffff93372ce0 <bd_utils_log_format+256> 0x0000ffff93372c48 <+104>: adrp x2, 0xffff933a0000 <log_level> 0x0000ffff93372c4c <+108>: mov x0, #0x0 // #0 0x0000ffff93372c50 <+112>: ldr w2, [x2] 0x0000ffff93372c54 <+116>: cmp w2, w19 0x0000ffff93372c58 <+120>: b.ge 0xffff93372c90 <bd_utils_log_format+176> // b.tcont 0x0000ffff93372c5c <+124>: bl 0xffff93370060 <g_free@plt> 0x0000ffff93372c60 <+128>: adrp x0, 0xffff9339f000 0x0000ffff93372c64 <+132>: ldr x0, [x0, #4056] 0x0000ffff93372c68 <+136>: ldr x2, [sp, #72] 0x0000ffff93372c6c <+140>: ldr x1, [x0] 0x0000ffff93372c70 <+144>: subs x2, x2, x1 
0x0000ffff93372c74 <+148>: mov x1, #0x0 // #0 0x0000ffff93372c78 <+152>: b.ne 0xffff93372cf4 <bd_utils_log_format+276> // b.any 0x0000ffff93372c7c <+156>: ldp x29, x30, [sp, #80] 0x0000ffff93372c80 <+160>: ldp x19, x20, [sp, #96] 0x0000ffff93372c84 <+164>: add sp, sp, #0x120 0x0000ffff93372c88 <+168>: autiasp 0x0000ffff93372c8c <+172>: ret 0x0000ffff93372c90 <+176>: mov w3, #0xffffff80 // #-128 0x0000ffff93372c94 <+180>: mov w0, #0xffffffd0 // #-48 0x0000ffff93372c98 <+184>: stp w0, w3, [sp, #64] 0x0000ffff93372c9c <+188>: add x3, sp, #0x28 0x0000ffff93372ca0 <+192>: add x2, sp, #0x120 0x0000ffff93372ca4 <+196>: add x4, sp, #0xf0 0x0000ffff93372ca8 <+200>: stp x2, x2, [sp, #40] 0x0000ffff93372cac <+204>: mov x2, sp 0x0000ffff93372cb0 <+208>: add x0, sp, #0x20 0x0000ffff93372cb4 <+212>: str x4, [sp, #56] 0x0000ffff93372cb8 <+216>: ldp q0, q1, [x3] 0x0000ffff93372cbc <+220>: stp q0, q1, [x2] 0x0000ffff93372cc0 <+224>: bl 0xffff93370760 <g_vasprintf@plt> 0x0000ffff93372cc4 <+228>: tbnz w0, #31, 0xffff93372ce8 <bd_utils_log_format+264> 0x0000ffff93372cc8 <+232>: ldr x1, [sp, #32] 0x0000ffff93372ccc <+236>: mov w0, w19 0x0000ffff93372cd0 <+240>: ldr x2, [x20, #8] 0x0000ffff93372cd4 <+244>: blr x2 0x0000ffff93372cd8 <+248>: ldr x0, [sp, #32] 0x0000ffff93372cdc <+252>: b 0xffff93372c5c <bd_utils_log_format+124> 0x0000ffff93372ce0 <+256>: mov x0, #0x0 // #0 0x0000ffff93372ce4 <+260>: b 0xffff93372c5c <bd_utils_log_format+124> 0x0000ffff93372ce8 <+264>: ldr x0, [sp, #32] 0x0000ffff93372cec <+268>: bl 0xffff93370060 <g_free@plt> 0x0000ffff93372cf0 <+272>: b 0xffff93372c60 <bd_utils_log_format+128> 0x0000ffff93372cf4 <+276>: bl 0xffff93370470 <__stack_chk_fail@plt> End of assembler dump. Is bd_utils_log_format called with an invalid format string and/or args? It is a wrapper around vasprintf, so with a bad format string it could scribble over memory. Does anyone know how to figure out where bd_utils_log_format was called? There are only 33 places in libblockdev that call bd_utils_log_format with a format specifier (always either %s or %d). Well, on one line, anyway. So...that might not be too many to just go through them and look for suspicious ones... [adamw@xps13a libblockdev (master)]$ grep -R bd_utils_log_format | grep % scripts/boilerplate_generator.py: ret += ' bd_utils_log_format (BD_UTILS_LOG_WARNING, "failed to load module {0}: %s", dlerror());\n'.format(module_name) scripts/boilerplate_generator.py: ret += ' bd_utils_log_format (BD_UTILS_LOG_DEBUG, "failed to load the init() function for {0}: %s", error);\n'.format(module_name) scripts/boilerplate_generator.py: ret += ' bd_utils_log_format (BD_UTILS_LOG_WARNING, "failed to load {0.name}: %s", error);\n\n'.format(info) scripts/boilerplate_generator.py: ret += ' bd_utils_log_format (BD_UTILS_LOG_DEBUG, "failed to load the close_plugin() function for {0}: %s", error);\n'.format(module_name) src/lib/blockdev.c.in: bd_utils_log_format (BD_UTILS_LOG_WARNING, "Cannot process the config file '%s': %s. Skipping.", src/lib/blockdev.c.in: bd_utils_log_format (BD_UTILS_LOG_WARNING, "Failed to load config files: %s. Using the built-in config", src/lib/blockdev.c.in: bd_utils_log_format (BD_UTILS_LOG_INFO, "Failed to load config files: %s. 
Using the built-in config", src/plugins/btrfs.c: bd_utils_log_format (BD_UTILS_LOG_WARNING, "%s", error->msg); src/plugins/btrfs.c: bd_utils_log_format (BD_UTILS_LOG_WARNING, "%s", error->msg); src/plugins/btrfs.c: bd_utils_log_format (BD_UTILS_LOG_WARNING, "%s", error->msg); src/plugins/crypto.c: bd_utils_log_format (BD_UTILS_LOG_WARNING, "Unknown cryptsetup log level %d.", level); src/plugins/crypto.c: bd_utils_log_format (BD_UTILS_LOG_ERR, "Failed to deactivate temporary device %s", tmp_name); src/plugins/crypto.c: bd_utils_log_format (BD_UTILS_LOG_ERR, "Failed to deactivate temporary device %s", tmp_name); src/plugins/lvm-dbus.c: bd_utils_log_format (BD_UTILS_LOG_CRIT, "Failed to get system bus address: %s\n", (*error)->message); src/plugins/lvm-dbus.c: bd_utils_log_format (BD_UTILS_LOG_CRIT, "Failed to create a new connection for the system bus: %s\n", (*error)->message); src/plugins/lvm-dbus.c: bd_utils_log_format (BD_UTILS_LOG_CRIT, "Failed to setup DBus connection: %s", error->message); src/plugins/lvm-dbus.c: bd_utils_log_format (BD_UTILS_LOG_CRIT, "Failed to flush DBus connection: %s", error->message); src/plugins/lvm-dbus.c: bd_utils_log_format (BD_UTILS_LOG_CRIT, "Failed to close DBus connection: %s", error->message); src/plugins/lvm-dbus.c: bd_utils_log_format (BD_UTILS_LOG_DEBUG, "Got error when getting progress: %s", l_error->message); src/plugins/lvm-dbus.c: bd_utils_log_format (BD_UTILS_LOG_DEBUG, "Failed to get VG properties for PV %s: %s", src/plugins/lvm-dbus.c: bd_utils_log_format (BD_UTILS_LOG_DEBUG, "Unknown VDO operating mode: %s", value); src/plugins/lvm-dbus.c: bd_utils_log_format (BD_UTILS_LOG_DEBUG, "Unknown VDO compression state: %s", value); src/plugins/lvm-dbus.c: bd_utils_log_format (BD_UTILS_LOG_DEBUG, "Unknown VDO index state: %s", value); src/plugins/lvm-dbus.c: bd_utils_log_format (BD_UTILS_LOG_DEBUG, "Unknown VDO write policy: %s", value); src/plugins/lvm.c: bd_utils_log_format (BD_UTILS_LOG_DEBUG, "Unknown VDO operating mode: %s", value); src/plugins/lvm.c: bd_utils_log_format (BD_UTILS_LOG_DEBUG, "Unknown VDO compression state: %s", value); src/plugins/lvm.c: bd_utils_log_format (BD_UTILS_LOG_DEBUG, "Unknown VDO index state: %s", value); src/plugins/lvm.c: bd_utils_log_format (BD_UTILS_LOG_DEBUG, "Unknown VDO write policy: %s", value); src/plugins/nvme/nvme-info.c: bd_utils_log_format (BD_UTILS_LOG_WARNING, "Unhandled self-test log entry result code: %d", dsts); src/plugins/nvme/nvme-info.c: bd_utils_log_format (BD_UTILS_LOG_WARNING, "Unhandled self-test log entry action code: %d", code); src/plugins/part.c: bd_utils_log_format (BD_UTILS_LOG_INFO, "Not resizing, partition '%s' is already at its maximum size.", part); src/plugins/s390.c: bd_utils_log_format (BD_UTILS_LOG_WARNING, "Device %s status is %s, needs dasdfmt.", dasd, status); src/plugins/s390.c: bd_utils_log_format (BD_UTILS_LOG_WARNING, "Device %s is already online", devno); another dumb idea: do a custom libblockdev build that just adds a silly "XXX I AM LINE 123 OF FILE.C ABOUT TO LOG SOMETHING OH BOY!" printf before each of those, then run some scratch koji tasks with that build of libblockdev, and hope that tells us which bd_utils_log_format call is the problematic one... Hmm, isn't the compiler supposed to warn on invalid types supplied to printf-like funcs? Perhaps the > void bd_utils_log_format (gint level, const gchar *format, ...) 
{ definition is too vague and should be transformed to something like > void bd_utils_log_format (gint level, const gchar *format) G_GNUC_FORMAT (2) { (which expands to '__attribute__ ((format_arg (2)))') to get proper annotations. However that requires a string return value which is not the case here. Anyway, looking at lvm-dbus.c, the worst case of a bad value is passed in is NULL which for "%s" should be gracefully translated to "(null)". Except... https://github.com/storaged-project/libblockdev/pull/990 (In reply to Tomáš Bžatek from comment #22) > Hmm, isn't the compiler supposed to warn on invalid types supplied to > printf-like funcs? Perhaps the it should and I haven't seen any such warnings > > void bd_utils_log_format (gint level, const gchar *format, ...) { > definition is too vague and should be transformed to something like > > void bd_utils_log_format (gint level, const gchar *format) G_GNUC_FORMAT (2) { > (which expands to '__attribute__ ((format_arg (2)))') to get proper > annotations. However that requires a string return value which is not the > case here. > > Anyway, looking at lvm-dbus.c, the worst case of a bad value is passed in is > NULL which for "%s" should be gracefully translated to "(null)". > > Except... https://github.com/storaged-project/libblockdev/pull/990 This is currently happening on almost every build. its been over 10 days since it has not happened. *** Bug 2265238 has been marked as a duplicate of this bug. *** We signed off Beta today, so there's no point in Beta FE status any more. Clearing that nomination. The fixes from comment #22 were in 3.1.0, which is stable for F39, F40 and Rawhide. So I guess those did not fix this. So it's a bit of a shot in the dark, but some research into when this seems to have started combined with the history of libblockdev made me want to try this: https://src.fedoraproject.org/rpms/libblockdev/c/ffea9acbba2af77ed6b7eaa500b6388addf05eea?branch=rawhide . I've sent that build to Rawhide. I figure we can give it a day or two and see if the images clear up, or not. If they do, we know it's one of those few that causes the problem. Unfortunately, that didn't work. We're stratergerizing new dumb ways to try and pin this down now. :D Is there a way to reproduce the crash outside koji? Then we could experiment with instrumented libblockdev builds and similar ... That's one of the things I'm intending to look into. Though given the bug is a SIGILL, it's *possible* it's specific to the type of hardware these jobs run on. I'm going to fiddle around on a Jetson Nano I have here and see if I can reproduce it on that. Even if we can't, we're thinking it might be possible to write a little test script that runs enough of anaconda on the koji builders to hit the bug, or maybe use scratch builds. So I spent quite a lot of time on this today, but with inconclusive results :/ It's not that hard to reproduce what pungi/koji do - ultimately, run livemedia-creator in a mock chroot. So I've been doing that all day, tried it on my local aarch64 box first, then an openQA worker host I control, then finally on one of the actual Koji builders that hits the bug. I managed to reproduce the failure twice on the Koji builder, but that was twice out of like 50 attempts. Didn't reproduce it all on the other two hosts. It seems like it happens much less often when I'm trying to reproduce it manually than it does when we run the real composes, for some reason. 
That low level of reproducibility makes it hard to test anything with confidence :|

One thing I did notice: when this bug happens, it seems to happen right after the "anaconda 41.2-2.fc41 for Fedora-i3-Live Rawhide (pre-release) started." message is printed. The thing anaconda does right after that is...a bunch of fiddling with signal handling. The 'print_startup_note' here prints that message, then see what happens next...

    startup_utils.print_startup_note(options=opts)

    from pyanaconda.anaconda import Anaconda
    anaconda = Anaconda()

    # reset python's default SIGINT handler
    signal.signal(signal.SIGINT, signal.SIG_IGN)
    signal.signal(signal.SIGTERM, lambda num, frame: sys.exit(1))

    # synchronously-delivered signals such as SIGSEGV and SIGILL cannot be handled properly from
    # Python, so install signal handlers from the faulthandler stdlib module.
    import faulthandler
    faulthandler.enable()

not sure if it's really relevant, but it seems like an interesting coincidence...

Just to save anyone duplicating work while we confirm this - my last dumb diagnosis patch may have hit paydirt. I added a pre-log log before every bd_utils_log_format in lvm-dbus.c, and comparing the counts of pre-log messages and actual log messages in the resulting jobs, we found a discrepancy:

    [adamw@xps13a tmp]$ grep "FAILED TO CREATE" anaconda-dbus.log | wc -l
    11
    [adamw@xps13a tmp]$ grep "Failed to create" anaconda-dbus.log | wc -l
    9

In one case we got a very nice smoking gun in the livemedia-out.log:

    2024-03-23 09:04:49,305 INFO pylorax: Running anaconda.
    2024-03-23 09:04:49,813 INFO pylorax:
    2024-03-23 09:04:49,814 INFO pylorax: ** (process:1): CRITICAL **: 09:04:49.668: XXX ABOUT TO LOG 'FAILED TO CREATE A NEW CONNECTION'
    2024-03-23 09:04:50,815 ERROR pylorax: Running anaconda failed: process '['unshare', '--pid', '--kill-child', '--mount', '--propagation', 'unchanged', 'anaconda', '--kickstart', '/chroot_tmpdir/koji-image-f41-build-115326908.ks', '--cmdline', '--dirinstall', '--remotelog', '127.0.0.1:39053']' was killed by signal 11

Note that there we see my 'pre-log' message but *not* the 'real' log message that should have come immediately after it, and then anaconda dies. So, we think the offending log line is likely this one:

    bd_utils_log_format (BD_UTILS_LOG_CRIT, "Failed to create a new connection for the system bus: %s\n", (*error)->message);

I suspect the issue is that the 'error' there is not always populated. There are several cases of this pattern in the file: we initialize an error variable, pass it to some function or other, and expect that if that function call fails, the error will be populated when it gets back to us. So here we're doing pretty much this, condensed and simplified (in fact the error creation is in one function and the rest is in another which is passed 'error') - please excuse any mistakes in C pointer syntax while I was condensing this, I am not a licensed C coder :>):

    GError *error = NULL;
    bus = g_dbus_connection_new_for_address_sync (addr,
        G_DBUS_CONNECTION_FLAGS_AUTHENTICATION_CLIENT|G_DBUS_CONNECTION_FLAGS_MESSAGE_BUS_CONNECTION,
        NULL, NULL, &error);
    if (!bus || g_dbus_connection_is_closed (bus)) {
        bd_utils_log_format (BD_UTILS_LOG_CRIT, "Failed to create a new connection for the system bus: %s\n", (*error)->message);
    }

so what I suspect is happening is that it's not safe to assume 'error' will always have been populated by the g_dbus_connection_new_for_address_sync call (perhaps especially if we have a 'bus', but the connection to it is closed for some reason).
The fix is probably something like this:

    if (error) {
        bd_utils_log_format (BD_UTILS_LOG_CRIT, "Failed to create a new connection for the system bus: %s\n", (*error)->message);
    } else {
        bd_utils_log (BD_UTILS_LOG_CRIT, "Failed to create a new connection for the system bus: unknown reason");
    }

but for now we're testing a build which takes the log message out entirely, to see if that 'fixes' it. That build is going through tests currently; once it's stable we'll run a new compose and see what happens.

`bd_lvm_init`, the function that calls `setup_dbus_connection` where the above logic is, similarly tries to log using the same error variable, so that call also needs guarding. I've sent a build with that changed also and will see how the next nightly compose fares.

(In reply to Adam Williamson from comment #34)
> GError *error = NULL;
> bus = g_dbus_connection_new_for_address_sync (addr,
>     G_DBUS_CONNECTION_FLAGS_AUTHENTICATION_CLIENT|G_DBUS_CONNECTION_FLAGS_MESSAGE_BUS_CONNECTION,
>     NULL, NULL, &error);
> if (!bus || g_dbus_connection_is_closed (bus)) {
>     bd_utils_log_format (BD_UTILS_LOG_CRIT, "Failed to create a new connection for the system bus: %s\n", (*error)->message);
> }

This is definitely suspect; on the one hand, it's checking that `bus` is NULL, and on the other it's checking if it's closed. If `bus` is NULL, then the `g_dbus_connection_new_for_address_sync` call must have failed and set `error`, but if it's not NULL, then there's no reason for `error` to be set. It might be easier to handle both this function and its caller(s?) by setting an error if necessary:

    bus = g_dbus_connection_new_for_address_sync (addr,
        G_DBUS_CONNECTION_FLAGS_AUTHENTICATION_CLIENT|G_DBUS_CONNECTION_FLAGS_MESSAGE_BUS_CONNECTION,
        NULL, NULL, &error);
    if (!bus || g_dbus_connection_is_closed (bus)) {
        if (error == NULL) {
            g_set_error_literal (&error, G_IO_ERROR, G_IO_ERROR_CLOSED, "connection was closed immediately");
        }
        bd_utils_log_format (BD_UTILS_LOG_CRIT, "Failed to create a new connection for the system bus: %s\n", (*error)->message);
        return;
    }

"This is definitely suspect; on the one hand, it's checking that `bus` is NULL, and on the other it's checking if it's closed. If `bus` is NULL, then the `g_dbus_connection_new_for_address_sync` call must have failed and set `error`, but if it's not NULL, then there's no reason for `error` to be set."

Yep, that's pretty much what I thought. For now I just want to check the theory, and I didn't want to bother too much about writing an upstreamable patch immediately, so I've converted both log messages to `bd_utils_log` calls with no templating. We'll see if that helps in the next compose.

Unfortunately that didn't fix it :( And it was such a good-looking theory, too. I am at a bit of a loss, now. I do wonder if we can get some *more* coredumps and compare them, maybe that would shed some light? I suppose one thing that would help with trying to figure out what's going on is if each anaconda dbus module could prefix all its log messages in anaconda-dbus.log with its name. AIUI, we have 10 different modules starting up there, and one or more of them is crashing on this illegal instruction. But all the log messages from all of them are just thrown in that log together, which makes it difficult or impossible to distinguish them. If the messages were all prefixed we might be able to at least figure out how far each one that crashes gets before it crashes...

Thanks, Adam, for your debugging efforts. I can only hope I haven't made a mistake in the initial analysis.
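[Editor's note: for readers following along, here is a small, self-contained sketch of the defensive pattern discussed above, together with the printf-style format annotation mentioned earlier in the thread. It is not the actual libblockdev code: demo_log_format and demo_setup_connection are placeholder names, and the sketch only assumes stock GLib/GIO APIs (g_dbus_connection_new_for_address_sync, g_set_error_literal, G_GNUC_PRINTF).]

/* Compile with: gcc demo.c $(pkg-config --cflags --libs gio-2.0) */
#include <gio/gio.h>
#include <stdarg.h>
#include <stdio.h>

/* Stand-in for bd_utils_log_format(); G_GNUC_PRINTF(2, 3) lets the
 * compiler warn when the arguments do not match the format string. */
static void demo_log_format (int level, const char *format, ...) G_GNUC_PRINTF (2, 3);

static void
demo_log_format (int level, const char *format, ...)
{
    va_list args;
    va_start (args, format);
    fprintf (stderr, "[%d] ", level);
    vfprintf (stderr, format, args);
    fputc ('\n', stderr);
    va_end (args);
}

/* Defensive variant of the connection setup: if the call fails without
 * setting 'error' (or returns a connection that is already closed),
 * fill in a fallback error instead of dereferencing NULL. */
static gboolean
demo_setup_connection (const char *addr, GError **error)
{
    GDBusConnection *bus = g_dbus_connection_new_for_address_sync (
        addr,
        G_DBUS_CONNECTION_FLAGS_AUTHENTICATION_CLIENT |
        G_DBUS_CONNECTION_FLAGS_MESSAGE_BUS_CONNECTION,
        NULL, NULL, error);

    if (!bus || g_dbus_connection_is_closed (bus)) {
        if (error && *error == NULL)
            g_set_error_literal (error, G_IO_ERROR, G_IO_ERROR_CLOSED,
                                 "connection was closed immediately");
        demo_log_format (2, "Failed to create a new connection for the system bus: %s",
                         (error && *error) ? (*error)->message : "unknown reason");
        g_clear_object (&bus);
        return FALSE;
    }

    g_dbus_connection_set_exit_on_close (bus, FALSE);
    g_object_unref (bus);
    return TRUE;
}

int
main (void)
{
    GError *error = NULL;

    if (!demo_setup_connection ("unix:path=/nonexistent/bus", &error)) {
        demo_log_format (2, "setup failed: %s", error ? error->message : "no error set");
        g_clear_error (&error);
    }
    return 0;
}

The key point is the fallback: if the call fails without setting the error, a caller that blindly dereferences (*error)->message reads through a NULL pointer.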
So far I haven't been successful in reproducing the issue (on an Ampere Mt.Snow machine). My steps to reproduce it are:

- grab the mock config from koji (regular rawhide might work too)
- init the chroot with "mock -r ./f41-build.cfg init"
- mock -r ./f41-build.cfg install bash coreutils glibc-all-langpacks lorax-lmc-novirt selinux-policy-targeted shadow-utils util-linux
- go into the mock shell
- grab the kickstart and run e.g. "/sbin/livemedia-creator --ks ./koji-image-f41-build-115361236.ks --logfile /chroot_tmpdir/lmc-logs/livemedia-out.log --no-virt --resultdir /chroot_tmpdir/lmc --project Fedora-Workstation-Live --make-iso --volid Fedora-WS-Live-rawh-20240324.n.0 --iso-only --iso-name Fedora-Workstation-Live-aarch64-Rawhide-20240324.n.0.iso --releasever Rawhide --nomacboot" (taken from mock_output.log)
- repeat the previous step

FWIW the last i3 Live images fail on some dependency error, I think.

Yeah, it was very difficult to reproduce for me too. I only managed it twice, trying to do it on the actual builder hardware (never reproduced it anywhere else), retrying over and over and over in a mock chroot. There's clearly something about the environment of an actual compose which makes it happen way more often. Kevin and I suspect it might be having multiple tasks, including multiple live image builds, running on the same host at once. That's a bit awkward to try and reproduce manually, of course. It definitely happens very frequently in real composes; we rarely get a compose without at least one image affected by the bug.

So, hum. Looking at the two failures on this in today's Rawhide compose - https://koji.fedoraproject.org/koji/taskinfo?taskID=115400251 and https://koji.fedoraproject.org/koji/taskinfo?taskID=115400246 - the anaconda-dbus.log for each logs *ten* instances each of "XXX Failed to create a new connection for the system bus" and "XXX Failed to setup DBus connection" (which are the two messages we should see when hitting the path I modified). That's one short. When we don't see this problem, we log *eleven* instances each of those messages. I think that's one pair for the main process and one pair each for the ten modules that get started.

The interesting thing is, if I'm following it right, the codepath we actually follow through lvm-dbus.c is very short. `bd_lvm_init` is, as its comment says, "called automatically by the library's initialization functions." It calls `setup_dbus_connection`, which I believe, in the weird environment we are running in, is always expected to fail. It should then log "XXX Failed to setup DBus connection" and return FALSE. That's the whole codepath. That makes it look like the only other place it could really be failing is the other templated log call in setup_dbus_connection:

    bd_utils_log_format (BD_UTILS_LOG_CRIT, "Failed to get system bus address: %s\n", (*error)->message);

but...the fly in the ointment there is that, when I did the build with "pre-log" lines, we never hit the "pre-log" message for that log message, on any of the logs I checked. We always hit the "Failed to create a new connection for the system bus" one, not that earlier one. I'm really baffled as to how or where this is still blowing up. It's pretty weird.

So, I poked at a recent coredump with gdb. I tried to follow what Danny did.
the stack trace for this crash is: Stack trace of thread 38: #0 0x0000ffffb8f91fe0 n/a (ld-linux-aarch64.so.1 + 0xbfe0) #1 0x0000ffffa85e443c n/a (/usr/lib64/libbd_lvm-dbus.so.3.0.0 + 0x1443c) #2 0x0000ffffa8971648 n/a (/usr/lib64/libblockdev.so.3.0.0 + 0x11648) #3 0x0000ffffa89809ec n/a (/usr/lib64/libblockdev.so.3.0.0 + 0x209ec) #4 0x0000ffffa89884a0 n/a (/usr/lib64/libblockdev.so.3.0.0 + 0x284a0) #5 0x0000ffffa8985d70 n/a (/usr/lib64/libblockdev.so.3.0.0 + 0x25d70) #6 0x0000ffffa925c054 n/a (/usr/lib64/libffi.so.8.1.4 + 0x1c054) #7 0x0000ffffa9256404 n/a (/usr/lib64/libffi.so.8.1.4 + 0x16404) #8 0x0000ffffa95512d4 n/a (/usr/lib64/python3.12/site-packages/gi/_gi.cpython-312-aarch64-linux-gnu.so + 0x312d4) #9 0x0000ffffa95508b8 n/a (/usr/lib64/python3.12/site-packages/gi/_gi.cpython-312-aarch64-linux-gnu.so + 0x308b8) #10 0x0000ffffb897b1f8 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x17b1f8) #11 0x0000ffffb8987340 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x187340) #12 0x0000ffffb8a40f98 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x240f98) #13 0x0000ffffb8a67f60 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x267f60) #14 0x0000ffffb89a24ac n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a24ac) #15 0x0000ffffb898afd4 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x18afd4) #16 0x0000ffffb89abaf8 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1abaf8) #17 0x0000ffffb89d8298 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d8298) #18 0x0000ffffb89d7838 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d7838) #19 0x0000ffffb89e6be0 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1e6be0) #20 0x0000ffffb89a24ac n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a24ac) #21 0x0000ffffb898afd4 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x18afd4) #22 0x0000ffffb89abaf8 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1abaf8) #23 0x0000ffffb89d8298 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d8298) #24 0x0000ffffb89d7838 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d7838) #25 0x0000ffffb898c154 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x18c154) #26 0x0000ffffb8a40f98 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x240f98) #27 0x0000ffffb8a67f60 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x267f60) #28 0x0000ffffb89a24ac n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a24ac) #29 0x0000ffffb898afd4 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x18afd4) #30 0x0000ffffb89abaf8 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1abaf8) #31 0x0000ffffb89d8298 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d8298) #32 0x0000ffffb89d7838 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d7838) #33 0x0000ffffb898c154 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x18c154) #34 0x0000ffffb8a40f98 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x240f98) #35 0x0000ffffb8a67f60 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x267f60) #36 0x0000ffffb89a24ac n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a24ac) #37 0x0000ffffb898afd4 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x18afd4) #38 0x0000ffffb89abaf8 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1abaf8) #39 0x0000ffffb89d8298 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d8298) #40 0x0000ffffb89d7838 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d7838) #41 0x0000ffffb898c154 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x18c154) #42 0x0000ffffb8a40f98 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x240f98) #43 0x0000ffffb8a67f60 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x267f60) #44 0x0000ffffb89a24ac n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a24ac) #45 0x0000ffffb898afd4 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x18afd4) #46 0x0000ffffb89abaf8 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1abaf8) #47 0x0000ffffb89d8298 n/a (/usr/lib64/libpython3.12.so.1.0 
+ 0x1d8298) #48 0x0000ffffb89d7838 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d7838) #49 0x0000ffffb898c154 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x18c154) #50 0x0000ffffb8a40f98 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x240f98) #51 0x0000ffffb8a67f60 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x267f60) #52 0x0000ffffb89a24ac n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a24ac) #53 0x0000ffffb898afd4 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x18afd4) #54 0x0000ffffb89abaf8 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1abaf8) #55 0x0000ffffb89d8298 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d8298) #56 0x0000ffffb89d7838 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d7838) #57 0x0000ffffb898c154 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x18c154) #58 0x0000ffffb8a40f98 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x240f98) #59 0x0000ffffb8a67f60 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x267f60) #60 0x0000ffffb89a24ac n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a24ac) #61 0x0000ffffb89a1f38 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a1f38) #62 0x0000ffffb8987340 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x187340) #63 0x0000ffffb8a816e4 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x2816e4) so in gdb I did this (note gdb is having issues with the trace, but never mind that I guess?): (gdb) bt #0 0x0000ffffb8f91fe0 in ?? () #1 0x0000ffffa85e4360 in ?? () at /usr/src/debug/libblockdev-3.1.0-6.fc41.aarch64/src/plugins/lvm-dbus.c:252 from /lib64/libbd_lvm-dbus.so.3 Backtrace stopped: previous frame inner to this frame (corrupt stack?) (gdb) disass 0x0000ffffa85e443c and this is what I got: Dump of assembler code for function bd_lvm_init: 0x0000ffffa85e4360 <+0>: paciasp 0x0000ffffa85e4364 <+4>: sub sp, sp, #0x40 0x0000ffffa85e4368 <+8>: adrp x0, 0xffffa860f000 0x0000ffffa85e436c <+12>: ldr x0, [x0, #4056] 0x0000ffffa85e4370 <+16>: stp x29, x30, [sp, #16] 0x0000ffffa85e4374 <+20>: add x29, sp, #0x10 0x0000ffffa85e4378 <+24>: stp x19, x20, [sp, #32] 0x0000ffffa85e437c <+28>: adrp x19, 0xffffa8610000 0x0000ffffa85e4380 <+32>: ldr x1, [x19, #64] 0x0000ffffa85e4384 <+36>: ldr x2, [x0] 0x0000ffffa85e4388 <+40>: str x2, [sp, #8] 0x0000ffffa85e438c <+44>: mov x2, #0x0 // #0 0x0000ffffa85e4390 <+48>: str xzr, [sp] 0x0000ffffa85e4394 <+52>: cbz x1, 0xffffa85e43e0 <bd_lvm_init+128> 0x0000ffffa85e4398 <+56>: adrp x0, 0xffffa85e0000 <_init> 0x0000ffffa85e439c <+60>: add x0, x0, #0xac0 0x0000ffffa85e43a0 <+64>: bl 0xffffa85e0840 <dm_log_with_errno_init@plt> 0x0000ffffa85e43a4 <+68>: mov w0, #0x6 // #6 0x0000ffffa85e43a8 <+72>: bl 0xffffa85e0060 <dm_log_init_verbose@plt> 0x0000ffffa85e43ac <+76>: mov w0, #0x1 // #1 0x0000ffffa85e43b0 <+80>: adrp x1, 0xffffa860f000 0x0000ffffa85e43b4 <+84>: ldr x1, [x1, #4056] 0x0000ffffa85e43b8 <+88>: ldr x3, [sp, #8] 0x0000ffffa85e43bc <+92>: ldr x2, [x1] 0x0000ffffa85e43c0 <+96>: subs x3, x3, x2 0x0000ffffa85e43c4 <+100>: mov x2, #0x0 // #0 0x0000ffffa85e43c8 <+104>: b.ne 0xffffa85e4488 <bd_lvm_init+296> // b.any 0x0000ffffa85e43cc <+108>: ldp x29, x30, [sp, #16] 0x0000ffffa85e43d0 <+112>: ldp x19, x20, [sp, #32] 0x0000ffffa85e43d4 <+116>: add sp, sp, #0x40 0x0000ffffa85e43d8 <+120>: autiasp 0x0000ffffa85e43dc <+124>: ret 0x0000ffffa85e43e0 <+128>: mov x2, sp 0x0000ffffa85e43e4 <+132>: mov w0, #0x1 // #1 0x0000ffffa85e43e8 <+136>: str x21, [sp, #48] 0x0000ffffa85e43ec <+140>: bl 0xffffa85e06b0 <g_dbus_address_get_for_bus_sync@plt> 0x0000ffffa85e43f0 <+144>: mov x20, x0 0x0000ffffa85e43f4 <+148>: cbz x0, 0xffffa85e446c <bd_lvm_init+268> 0x0000ffffa85e43f8 <+152>: mov w1, #0x9 // #9 0x0000ffffa85e43fc <+156>: mov x4, sp 
0x0000ffffa85e4400 <+160>: mov x3, #0x0 // #0 0x0000ffffa85e4404 <+164>: mov x2, #0x0 // #0 0x0000ffffa85e4408 <+168>: bl 0xffffa85e0490 <g_dbus_connection_new_for_address_sync@plt> 0x0000ffffa85e440c <+172>: mov x1, x0 0x0000ffffa85e4410 <+176>: mov x0, x20 0x0000ffffa85e4414 <+180>: str x1, [x19, #64] 0x0000ffffa85e4418 <+184>: bl 0xffffa85e0080 <g_free@plt> 0x0000ffffa85e441c <+188>: ldr x0, [x19, #64] 0x0000ffffa85e4420 <+192>: cbz x0, 0xffffa85e442c <bd_lvm_init+204> 0x0000ffffa85e4424 <+196>: bl 0xffffa85e0900 <g_dbus_connection_is_closed@plt> 0x0000ffffa85e4428 <+200>: cbz w0, 0xffffa85e4458 <bd_lvm_init+248> 0x0000ffffa85e442c <+204>: adrp x1, 0xffffa85f0000 0x0000ffffa85e4430 <+208>: mov w0, #0x2 // #2 0x0000ffffa85e4434 <+212>: add x1, x1, #0xa88 0x0000ffffa85e4438 <+216>: bl 0xffffa85e04a0 <bd_utils_log@plt> 0x0000ffffa85e443c <+220>: mov w0, #0x2 // #2 0x0000ffffa85e4440 <+224>: adrp x1, 0xffffa85f0000 0x0000ffffa85e4444 <+228>: add x1, x1, #0xac8 0x0000ffffa85e4448 <+232>: bl 0xffffa85e04a0 <bd_utils_log@plt> 0x0000ffffa85e444c <+236>: ldr x21, [sp, #48] 0x0000ffffa85e4450 <+240>: mov w0, #0x0 // #0 0x0000ffffa85e4454 <+244>: b 0xffffa85e43b0 <bd_lvm_init+80> 0x0000ffffa85e4458 <+248>: ldr x0, [x19, #64] 0x0000ffffa85e445c <+252>: mov w1, #0x0 // #0 0x0000ffffa85e4460 <+256>: bl 0xffffa85e0990 <g_dbus_connection_set_exit_on_close@plt> 0x0000ffffa85e4464 <+260>: ldr x21, [sp, #48] 0x0000ffffa85e4468 <+264>: b 0xffffa85e4398 <bd_lvm_init+56> 0x0000ffffa85e446c <+268>: ldr x2, [sp] 0x0000ffffa85e4470 <+272>: adrp x1, 0xffffa85f0000 0x0000ffffa85e4474 <+276>: mov w0, #0x2 // #2 0x0000ffffa85e4478 <+280>: add x1, x1, #0xa60 0x0000ffffa85e447c <+284>: ldr x2, [x2, #8] 0x0000ffffa85e4480 <+288>: bl 0xffffa85e0260 <bd_utils_log_format@plt> 0x0000ffffa85e4484 <+292>: b 0xffffa85e443c <bd_lvm_init+220> 0x0000ffffa85e4488 <+296>: str x21, [sp, #48] 0x0000ffffa85e448c <+300>: bl 0xffffa85e0510 <__stack_chk_fail@plt> so...the top part there looks different from before, but the bottom part looks much the same...we've still got bd_lvm_init , g_dbus_connection_set_exit_on_close , and bd_utils_log_format ... I'm really not sure where we'd still be getting a bd_utils_log_format , though, aside from the other one in setup_dbus_connection that it didn't look like we were hitting...I guess we can *try* modifying that one too, but it doesn't feel right... Oh, or does the fact that it now says "Dump of assembler code for function bd_lvm_init:" mean we're crashing in bd_lvm_init rather than in bd_utils_log_format now? Created attachment 2023553 [details]
updated coredump with libblockdev-3.1.0-6.fc41
Here is the updated coredump, in case it tells anyone else anything more than it told me.
What hw platform are the compose builders? I could try reserving the same in beaker for further debugging ... (In reply to Adam Williamson from comment #45) > Oh, or does the fact that it now says "Dump of assembler code for function > bd_lvm_init:" mean we're crashing in bd_lvm_init rather than in > bd_utils_log_format now? Indeed we are (from your coredump): (gdb) list *0x0000ffffa85e443c 0xffffa85e443c is in bd_lvm_init (/usr/src/debug/libblockdev-3.1.0-6.fc41.aarch64/src/plugins/lvm-dbus.c:339). Downloading source file /usr/src/debug/libblockdev-3.1.0-6.fc41.aarch64/src/plugins/lvm-dbus.c 334 GError *error = NULL; 335 336 /* the check() call should create the DBus connection for us, but let's not 337 completely rely on it */ 338 if (G_UNLIKELY (!bus) && !setup_dbus_connection (&error)) { 339 bd_utils_log (BD_UTILS_LOG_CRIT, "XXX Failed to setup DBus connection"); 340 return FALSE; 341 } 342 343 dm_log_with_errno_init ((dm_log_with_errno_fn) redirect_dm_log); However, I think `setup_dbus_connection` was inlined, as if you do the previous line from the disassembly, you get the inner debug line: (gdb) list *0x0000ffffa85e4438 0xffffa85e4438 is in bd_lvm_init (/usr/src/debug/libblockdev-3.1.0-6.fc41.aarch64/src/plugins/lvm-dbus.c:272). 267 NULL, NULL, error); 268 269 g_free (addr); 270 271 if (!bus || g_dbus_connection_is_closed (bus)) { 272 bd_utils_log (BD_UTILS_LOG_CRIT, "XXX Failed to create a new connection for the system bus"); 273 return FALSE; 274 } 275 276 g_dbus_connection_set_exit_on_close (bus, FALSE); I am thinking about building libblockdev with -O0 to avoid "over-optimization" and similar issues, but the library doesn't want to give up :-) But diff --git a/libblockdev.spec b/libblockdev.spec index c5c4b71..dcfca81 100644 --- a/libblockdev.spec +++ b/libblockdev.spec @@ -576,7 +576,13 @@ A meta-package that pulls all the libblockdev plugins as dependencies. %autosetup -n %{name}-%{version} -p1 %build +find . -name Makefile.am | xargs sed -i -e 's/-Werror//g' + autoreconf -ivf + +%global _fortify_level 0 +%global optflags $(echo %optflags | sed -e 's/-O2/-O0/g') -Wp,-D_FORTIFY_SOURCE=0 + %configure %{?configure_opts} %{__make} %{?_smp_mflags} seems to do it. Big thanks for investigating this! I've created this PR for upstream based on your findings: https://github.com/storaged-project/libblockdev/pull/1012 (In reply to Dan Horák from comment #47) > What hw platform are the compose builders? I could try reserving the same in > beaker for further debugging ... The builders in the aarch64 livemedia channel are mostly emags, but 2 cavium thunderX2's. I have some mt snow I want to add, but haven't yet. This issue appears on both the emags and the thunderx2's. So...apologies for dumb questions, but what does the info from Elliott's comment tell us exactly? Are the line numbers (339 and 272) significant - are we really somehow still failing on the log lines, even though there's no templating any more? Or is it more complicated than that? Sorry, I am used to Python traces, where at least I pretty much always know exactly where it's breaking. :P Other than that, I guess I'll do a build with Dan's patch and we can see how the coredumps look after the next compose? OK, I'm running a build now that both replaces my dumb XXX patch with Vojtech's upstream PR, and implements Dan's de-optimization changes: https://koji.fedoraproject.org/koji/taskinfo?taskID=115471730 (libblockdev-3.1.1-2.fc41). We'll see how things go with the next compose. Another hint: try clang? 
(In reply to Tomáš Bžatek from comment #54)
> Another hint: try clang?

yup, and also an instrumented build with ubsan and/or asan

Ooo. It looks like the latest build might have fixed it (or we got really lucky). Workstation and KDE both built OK this time. lxqt and i3 failed, but they did not fail on the SIGILL - they failed much later, due to lack of disk space.

I'm not sure if Vojtech's better fix for the logging issues did the trick, or if somehow the bug goes away with the de-optimization changes, or we just got lucky. I'm going to do another build *with* Vojtech's patch but *without* the de-optimization changes and we can watch the next compose. I also sent a couple of PRs to bump the space for i3 and lxqt, so those two should pass if they don't hit the SIGILL (I hope).

I suspect it passed with -2 due to the de-optimization, as it fails again with -3 when the de-op has been removed ...

Ah, foo, indeed. So, how do you want to proceed? Try with -O1?

Good question. I would keep Vojtech's patch (it looks right to me) and I would add the gcc sanitizer options, starting with "undefined behaviour", then trying "address", and just for aarch64 to be on the safe side with regard to the other arches:

--- a/libblockdev.spec
+++ b/libblockdev.spec
@@ -576,7 +576,14 @@ A meta-package that pulls all the libblockdev plugins as dependencies.
 %autosetup -n %{name}-%{version} -p1

 %build
 autoreconf -ivf
+
+%ifarch aarch64
+%global optflags %optflags -fsanitize=undefined
+%endif
+
 %configure %{?configure_opts}
 %{__make} %{?_smp_mflags}

Worst case we might be hitting some CPU design issue (cache not invalidated or such) ... And thus it might be useful to know whether moving the jobs to the much newer Mt. Snows would resolve it. But let's start with the sanitizers and maybe better targeted de-optimizing.

OK, can do. Is the goal here to find a minimal level of de-optimization that makes the bug go away, or to debug it further, or both? Anyway, will apply that change and run another build.

The de-optimization can be done on source file or even on function level (and we have some suspects), so it would serve both, I think. But the goal should be to get to the root cause, and this "remote" debugging doesn't make it easy :-)

yeah, unfortunately I can't find an easy way to tighten the loop :/ it's not really the remoteness so much as the cycle; it takes about five hours for us to get from start of a compose to success/failure of the aarch64 lives, I think.

Created attachment 2024016 [details]
another updated coredump with libblockdev 3.1.1-4 (with -fsanitize=undefined)
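If another -fsanitize=undefined build goes out, it may also help to make sure any UBSan report actually survives the compose. A hedged sketch, assuming one of the crashing anaconda module services can be run by hand inside the image (the option names are the standard libubsan runtime flags; the path is a placeholder):

# hypothetical manual run with UBSan reports redirected to files
export UBSAN_OPTIONS="print_stacktrace=1:log_path=/tmp/ubsan"
python3 -m pyanaconda.modules.storage
# any undefined-behaviour findings would then be in /tmp/ubsan.<pid>

Without log_path the runtime only writes to stderr, which may be hard to recover from a livemedia compose.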
Dan: ping? I know we all got distracted by xz :D but did that last coredump help any?

(In reply to Adam Williamson from comment #64)
> Dan: ping? I know we all got distracted by xz :D but did that last coredump
> help any?

not really :-( I was hoping for some output from the "undefined behaviour" sanitizer; my next step is the address sanitizer in libblockdev-3.1.1-5.fc41

So... that didn't work at all. ;)

https://koji.fedoraproject.org/koji/taskinfo?taskID=115944106

and the screenshot has a dead anaconda with:

"==2499==ASan runtime does not come first in initial library list; you should either link runtime to your application or manually preload it with LD_PRELOAD"

I have untagged it for now so we can get a compose...

Hmm, that's unfortunate :-( I will revert the spec back to a "normal" build before any next steps.

I'm thinking that for the next Fedora 40 Final RC, at least, we might want to put in a 'detuned' (per comment #49) libblockdev build so we get all the images. It's not the correct fix, of course, but it *does* avoid the bug in practice. Doing this would mean we'd have to ship the detuned libblockdev build as part of the release, though, and of course it would be in *all* images in the compose, not just used to build the affected aarch64 live images. How bad is the performance impact of losing those optimizations? Do you think it's bad enough that this would be a bad plan, Dan?

proposing as FinalFE for that purpose.

+3 in https://pagure.io/fedora-qa/blocker-review/issue/1588 , marking accepted FE. this means we *can* include a detuned build in the next compose, but we'll still wait for Dan's opinion on whether we *should*.

I believe libblockdev isn't performance critical, so it makes sense to do a de-optimized (aarch64-only) build. Let me prepare a rawhide build ASAP. Adam, feel free to do an F-40 build too; it seems it can't be fast-forwarded, but a cherry-pick into the f40 branch should work. Unfortunately I am mostly out of ideas about the next steps to get to the root cause :-(

oh, yeah, I forgot we can apply the de-optimization only on aarch64, which limits the impact. cool. Yeah, go ahead and do a Rawhide build and I'll take care of F40.

rawhide build is https://koji.fedoraproject.org/koji/taskinfo?taskID=116239319

And for the next step I will try to de-optimize just the "utils" internal library ...

So, 3 failed aarch64 lives with that build. ;(

https://koji.fedoraproject.org/koji/taskinfo?taskID=116257542
https://koji.fedoraproject.org/koji/taskinfo?taskID=116257533
https://koji.fedoraproject.org/koji/taskinfo?taskID=116257552

Adding Florian here in case he has ideas or knows anyone in tools space that might be able to see the problem more clearly than we...

Most of the disassembly here is for pthread_kill, which is explicitly called for signal 4, which is SIGILL. This must come from the (Anaconda?) crash handler. These in-process crash handlers are best avoided because they may obfuscate what's going on. A coredump without a crash handler (such as one captured by systemd-coredump) may have more information. The real crash is in the frame that comes right after the line with <signal handler called> in a GDB backtrace.

The fault address 0x0000ffffa85e443c seems bogus because that instruction cannot fault. So unless something rewrites the file in place, I don't see how this can crash.

(In reply to Kevin Fenzi from comment #74)
> So, 3 failed aarch64 lives with that build.
> ;(
>
> https://koji.fedoraproject.org/koji/taskinfo?taskID=116257542
> https://koji.fedoraproject.org/koji/taskinfo?taskID=116257533
> https://koji.fedoraproject.org/koji/taskinfo?taskID=116257552
>
> Adding Florian here in case he has ideas or knows anyone in tools space that
> might be able to see the problem more clearly than we...

FWIW only the latter two failed due to the SIGILL, if I see it right ...

Florian: I'm pretty sure I got the coredumps I attached by using `coredumpctl dump`, which is from systemd-coredump, isn't it?

(In reply to Adam Williamson from comment #77)
> Florian: I'm pretty sure I got the coredumps I attached by using
> `coredumpctl dump`, which is from systemd-coredump, isn't it?

Yes, but some of the backtraces seem to have the crash handler on the stack, rather than the actual stack contents at the crash site.

So, the branched compose failed last night... the Workstation live failed, but this time it was not only the aarch64 one, but also the x86_64 one. ;(

https://koji.fedoraproject.org/koji/taskinfo?taskID=116308658

Sadly, we don't have the core there. We never enabled it for the x86_64 builders, only the aarch64 buildhw's.

This is really odd. The circumstances for SIGILL are not exactly the same between AArch64 and x86-64. For example, a GCC-generated trap results in SIGILL on x86-64, but on AArch64 it's SIGTRAP.
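For reference, the signal difference described in the last comment is easy to see with a trivial test program (hypothetical, not from this bug) that lets GCC generate a trap:

/* trap.c - shows which signal a GCC-generated trap raises.
 * GCC lowers __builtin_trap() to a ud2 instruction on x86-64 and
 * to a brk instruction on aarch64. */
#include <stdio.h>

int main(void)
{
    puts("about to trap");
    fflush(stdout);
    __builtin_trap();   /* never returns */
    return 0;
}

Built with plain "gcc trap.c -o trap", the run is reported by the shell as "Illegal instruction" (SIGILL) on x86_64 and as "Trace/breakpoint trap" (SIGTRAP) on aarch64, matching the distinction above.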