Bug 1748145
| Field | Value |
|---|---|
| Summary | process segfaults but systemd-coredump does not capture it |
| Product | [Fedora] Fedora |
| Component | gnome-shell |
| Version | 31 |
| Status | CLOSED ERRATA |
| Severity | unspecified |
| Priority | unspecified |
| Hardware | Unspecified |
| OS | Unspecified |
| Reporter | Chris Murphy <bugzilla> |
| Assignee | Florian Müllner <fmuellner> |
| QA Contact | Fedora Extras Quality Assurance <extras-qa> |
| CC | awilliam, fmuellner, gnome-sig, jadahl, lnykryn, mcatanzaro+wrong-account-do-not-cc, msekleta, otaylor, philip.wyett, ssahani, s, systemd-maint, zbyszek |
| Last Closed | 2019-09-30 00:01:48 UTC |
| Type | Bug |
| Bug Blocks | 1747845 |

Description
Chris Murphy
2019-09-03 01:45:42 UTC
Created attachment 1611015
journalctl -b -o short-monotonic, with log_level debug

Yes, I can reproduce the same effect by simply killing gnome-shell.

1. Kill a normal program:

$ bash -c 'kill -SEGV $$'
Segmentation fault (core dumped)

Sep 03 09:20:20 workstation-uefi audit[2337]: ANOM_ABEND auid=1000 uid=1000 gid=1000 ses=6 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 pid=2337 comm="bash" exe="/usr/bin/bash" sig=11 res=1
Sep 03 09:20:20 workstation-uefi systemd[1]: Started Process Core Dump (PID 2338/UID 0).
Sep 03 09:20:20 workstation-uefi audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-coredump@3-2338-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Sep 03 09:20:20 workstation-uefi systemd-coredump[2339]: Process 2337 (bash) of user 1000 dumped core.

    Stack trace of thread 2337:
    #0 0x00007fdf633c591b kill (libc.so.6)
    #1 0x000055bb1633fd2f kill_pid (bash)
    #2 0x000055bb16380a12 kill_builtin (bash)
    #3 0x000055bb16329f0e execute_builtin.isra.0 (bash)
    #4 0x000055bb1632e6f9 execute_command_internal (bash)
    #5 0x000055bb1637c46b parse_and_execute (bash)
    #6 0x000055bb16315adb run_one_command (bash)
    #7 0x000055bb16314711 main (bash)
    #8 0x00007fdf633b0193 __libc_start_main (libc.so.6)
    #9 0x000055bb1631549e _start (bash)

Sep 03 09:20:20 workstation-uefi systemd[1]: systemd-coredump: Succeeded.

2. Kill gnome-shell:

$ ps 2643
PID TTY STAT TIME COMMAND
2643 tty1 Sl+ 0:03 /usr/bin/gnome-shell
$ sudo kill -SEGV 2643

Nothing in the logs!

$ sudo kill -SEGV 2643

Sep 03 09:18:08 workstation-uefi sudo[2327]: fedora : TTY=pts/0 ; PWD=/home/fedora ; USER=root ; COMMAND=/usr/bin/kill -SEGV 2643
...
Sep 03 09:18:08 workstation-uefi gsd-wacom[1757]: Error reading events from display: Broken pipe
Sep 03 09:18:08 workstation-uefi gnome-session[1637]: gnome-session-binary[1637]: WARNING: App 'org.gnome.SettingsDaemon.Wacom.desktop' exited with code 1
Sep 03 09:18:08 workstation-uefi gnome-session-binary[1637]: WARNING: App 'org.gnome.SettingsDaemon.Wacom.desktop' exited with code 1
Sep 03 09:18:08 workstation-uefi org.gnome.Shell.desktop[1646]: (EE) failed to read Wayland events: Connection reset by peer
Sep 03 09:18:08 workstation-uefi polkitd[719]: Unregistered Authentication Agent for unix-session:c2 (system bus name :1.343, object path /org/freedesktop/PolicyKit1/AuthenticationAgent, locale en_US.UTF-8) (disconnected from bus)
Sep 03 09:18:08 workstation-uefi gnome-session[1637]: gnome-session-binary[1637]: WARNING: Application 'org.gnome.Shell.desktop' killed by signal 11
Sep 03 09:18:08 workstation-uefi gnome-session-binary[1637]: WARNING: Application 'org.gnome.Shell.desktop' killed by signal 11
Sep 03 09:18:08 workstation-uefi ibus-daemon[1693]: GChildWatchSource: Exit status of a child process was requested but ECHILD was received by waitpid(). See the documentation of g_child_watch_source_new() for possible causes.
Sep 03 09:18:08 workstation-uefi gnome-session-binary[1637]: Unrecoverable failure in required component org.gnome.Shell.desktop

Other processes I tested get coredumps normally. My guess is that gnome-shell installs some special handler for SEGV, and that it screws things up somehow. Reassigning to gnome-shell for feedback.

gnome-shell will only catch SIGSEGV if SHELL_DEBUG is set to "backtrace-segfaults". With that said, in that case, it will still forward the signal after having printed a gjs backtrace to stderr.

It's clearly not working ;(

$ sudo kill -SEGV 2957
$ sudo strace -p 2957
strace: Process 2957 attached
restart_syscall(<... resuming interrupted read ...>) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGSEGV {si_signo=SIGSEGV, si_code=SI_USER, si_pid=3174, si_uid=0} ---
rt_sigaction(SIGSEGV, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fafd2f436a0}, NULL, 8) = 0
rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call)
clock_gettime(CLOCK_MONOTONIC, {tv_sec=56522, tv_nsec=526666023}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=56522, tv_nsec=526898469}) = 0
recvmsg(32, {msg_namelen=0}, 0) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=9, events=POLLIN}, {fd=11, events=POLLIN}, {fd=15, events=POLLIN}, {fd=22, events=POLLIN}, {fd=30, events=POLLIN}, {fd=32, events=POLLIN}, {fd=35, events=POLLIN}, {fd=36, events=POLLIN}, {fd=39, events=POLLIN}, {fd=41, events=POLLIN}, {fd=42, events=0}, {fd=46, events=POLLIN}, {fd=48, events=POLLIN}, {fd=49, events=POLLIN}], 16, 248382) = 1 ([{fd=4, revents=POLLIN}])
read(4, "\2\0\0\0\0\0\0\0", 16) = 8
clock_gettime(CLOCK_MONOTONIC, {tv_sec=56532, tv_nsec=16851710}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=56532, tv_nsec=17612220}) = 0
...

$ sudo grep SHELL_DEBUG /proc/2957/environ
(nothing)

$ sudo kill -SEGV 2957
$ sudo strace -p 2957
strace: Process 2957 attached
restart_syscall(<... resuming interrupted read ...>) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGSEGV {si_signo=SIGSEGV, si_code=SI_USER, si_pid=3224, si_uid=0} ---
+++ killed by SIGSEGV +++

Seems to be caught by some signal handler in libmozjs: https://github.com/ptomato/mozjs/blob/mozjs60/js/src/wasm/WasmSignalHandlers.cpp#L1733

I think it's a F31 regression though. I was able to report many gnome-shell crashes in F30 with no problems.

If the mozjs signal handler is to blame, please coordinate with them to sort this out ASAP. We can't ship a non-debuggable desktop....

(In reply to Michael Catanzaro from comment #6)
> If the mozjs signal handler is to blame

It's hard to read what it's doing, but it looks like it's designed to only catch wasm faults and nicely reraise the fatal signal for normal crashes. Also, we've used mozjs60 since Fedora 29 so it seems unlikely that anything has changed here recently.

I wonder if we're getting coredumps from *anything* that crashes? Try SIGSEGVing something that has nothing to do with GNOME, maybe...

I'm getting coredumps if I 'sudo kill -s 11 <gnomemapspid>'. I also get a massive pile of AVCs:

Sep 25 21:37:51 flap.local audit[8217]: AVC avc: denied { write } for pid=8217 comm="abrt-action-lis" name=".dbenv.lock" dev="nvme0n1p7" ino=773614 scontext=system_u:system_r:abrt_t:s0-s0:c0.c1023 tcontext=unconfined_u:object_r:var_lib_t:s0 tclass=file permissive=0

coredumpctl lists it; and abrt lists it as well, but abrt says:

The auto-reporting feature is disabled. Please consider enabling it by issuing “abrt-auto-reporting enabled” as a user with root privileges.

And also I stumbled on what seems to be a significant issue with the retrace server... https://github.com/abrt/retrace-server/issues/258

(In reply to Adam Williamson from comment #8)
> I wonder if we're getting coredumps from *anything* that crashes? Try
> SIGSEGVing something that has nothing to do with GNOME, maybe...

coredumpctl is working fine. Sadly I have no shortage of crashes to prove this. :P And yes, the retrace server is broken currently, but that is not related to this issue.

Seems sudo setcap -r `which gnome-shell` makes them come back. Whose fault it is that cap_sys_nice+ep set during install eats core dumps, I don't know.

Incredible. So that capability is there to test the real-time scheduler, which is a mutter experimental feature off by default. Anyone enabling the experimental feature needs to edit a gsettings key.
As a short-term solution, we might as well remove the capability and let people testing the real-time scheduler add it manually. But as a long-term solution, I don't know. We probably want the real-time scheduler, but surely not at the cost of core dumps. I don't know.

Jonas: any objection to removing the capability from the RPM spec (until we find a better answer)?

Jonas, kudos!
Unfortunately, this is intentional. core(5) says:
> There are various circumstances in which a core dump file is not produced:
> ...
> * The process is executing a set-user-ID (set-group-ID) program that is owned
> by a user (group) other than the real user (group) ID of the process, or the
> process is executing a program that has file capabilities (see
> capabilities(7)). (However, see the description of the prctl(2) PR_SET_DUMPABLE
> operation, and the description of the /proc/sys/fs/suid_dumpable file in
> proc(5).)
gnome-shell should call prctl(PR_SET_DUMPABLE, 1);
We know that it is OK for the user to have access to all capabilities/information of that process.
This will have the additional advantage that gnome-shell will be debuggable by the user.
Right now 'gdb -p $(pidof gnome-shell)' fails with EPERM.
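
A minimal sketch of the suggested call, for illustration only. The helper names `make_process_dumpable` and `get_dumpable` are hypothetical, and this is not the code from the mutter merge request, which is not shown in this thread. It assumes Linux with glibc's `<sys/prctl.h>`.

```c
#include <sys/prctl.h>

/* Hypothetical helper: mark the calling process as dumpable again, so
 * the kernel will produce core dumps for it and a same-user ptrace
 * attach (for example gdb -p) is not refused because of the dumpable
 * flag. Returns 0 on success, -1 on failure with errno set by prctl. */
int make_process_dumpable(void)
{
    return prctl(PR_SET_DUMPABLE, 1, 0, 0, 0);
}

/* Hypothetical helper: return the current value of the dumpable flag
 * for the calling process, or -1 on error. */
int get_dumpable(void)
{
    return prctl(PR_GET_DUMPABLE, 0, 0, 0, 0);
}
```

A program installed with file capabilities would presumably call something like `make_process_dumpable()` early in startup, since the kernel clears the flag when such a binary is executed. Whether that is acceptable depends on the user being allowed to see everything in the process, which is the argument made in the comment above.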

(In reply to Zbigniew Jędrzejewski-Szmek from comment #13)
> Jonas, kudos!

It was pointed out by someone on IRC, I just verified :)

> Unfortunately, this is intentional. core(5) says:
> > There are various circumstances in which a core dump file is not produced:
> > ...
> > * The process is executing a set-user-ID (set-group-ID) program that is owned
> > by a user (group) other than the real user (group) ID of the process, or the
> > process is executing a program that has file capabilities (see
> > capabilities(7)). (However, see the description of the prctl(2) PR_SET_DUMPABLE
> > operation, and the description of the /proc/sys/fs/suid_dumpable file in
> > proc(5).)
> gnome-shell should call prctl(PR_SET_DUMPABLE, 1);
> We know that it is OK for the user to have access to all
> capabilities/information of that process.
> This will have the additional advantage that gnome-shell will be debuggable
> by the user.
> Right now 'gdb -p $(pidof gnome-shell)' fails with EPERM.

Seems to do the trick indeed. Created https://gitlab.gnome.org/GNOME/mutter/merge_requests/811.

FEDORA-2019-94130905d5 has been submitted as an update to Fedora 31. https://bodhi.fedoraproject.org/updates/FEDORA-2019-94130905d5

mutter-3.34.0-5.fc31 has been pushed to the Fedora 31 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2019-94130905d5

mutter-3.34.0-5.fc31 has been pushed to the Fedora 31 stable repository. If problems still persist, please make note of it in this bug report.