Bug 1748145 - process segfaults but systemd-coredump does not capture it
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: gnome-shell
Version: 31
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Florian Müllner
QA Contact: Fedora Extras Quality Assurance
Blocks: 1747845
Reported: 2019-09-03 01:45 UTC by Chris Murphy
Modified: 2019-09-30 00:01 UTC

Last Closed: 2019-09-30 00:01:48 UTC
Type: Bug


Attachments
journalctl -b -o short-monotonic, with log_level debug (6.99 MB, text/plain)
2019-09-03 01:46 UTC, Chris Murphy


Links
System: GNOME Gitlab
ID: GNOME gnome-shell issues 1705
Private: 0
Priority: None
Status: None
Summary: None
Last Updated: 2019-09-26 15:49:46 UTC

Description Chris Murphy 2019-09-03 01:45:42 UTC
systemd-243~rc2-2.fc31.x86_64

- gnome-shell is crashing, but there are no coredumps anywhere
- Booted with systemd.log_level=debug
- Attaching the journal; its monotonic times match this description:

At the GNOME login screen, the timer to blank the screen kicks in here and the backlight goes off (expected):
[  324.870776] fmac.local gnome-session-binary[1092]: gnome-session-binary[1092]: DEBUG(+): GsmPresence: setting idle: 1


I tap the trackpad and the backlight comes on, but no graphical login appears. Instead I see only text, and discover that gnome-shell has crashed but there's no coredump.
[  794.590264] fmac.local gnome-session-binary[1092]: gnome-session-binary[1092]: DEBUG(+): GsmPresence: setting idle: 0

Crash but no systemd-coredump capture?
[  795.294913] fmac.local kernel: gnome-shell[1109]: segfault at 58 ip 00007f4dd45daa5a sp 00007fff99ea6e90 error 4 in libmutter-5.so.0.0.0[7f4dd44eb000+fa000]


[chris@fmac ~]$ sudo coredumpctl
No coredumps found.
[chris@fmac ~]$ sudo abrt-cli list
No problems

The auto-reporting feature is disabled. Please consider enabling it by issuing “abrt-auto-reporting enabled” as a user with root privileges.
[chris@fmac ~]$ 

Why was no coredump collected?

[chris@fmac ~]$ systemctl list-unit-files | grep core
abrt-journal-core.service                    enabled        
abrt-vmcore.service                          enabled        
systemd-coredump@.service                    static         
systemd-coredump.socket                      static         
[chris@fmac ~]$ systemctl list-unit-files | grep abrt
abrt-ccpp.service                            disabled       
abrt-journal-core.service                    enabled        
abrt-oops.service                            enabled        
abrt-pstoreoops.service                      disabled       
abrt-vmcore.service                          enabled        
abrt-xorg.service                            enabled        
abrtd.service                                enabled        
[chris@fmac ~]$

Comment 1 Chris Murphy 2019-09-03 01:46:39 UTC
Created attachment 1611015
journalctl -b -o short-monotonic, with log_level debug

Comment 2 Zbigniew Jędrzejewski-Szmek 2019-09-03 07:38:07 UTC
Yes, I can reproduce the same effect by simply killing gnome-shell.

1. Kill a normal program:
$ bash -c 'kill -SEGV $$'
Segmentation fault (core dumped)

Sep 03 09:20:20 workstation-uefi audit[2337]: ANOM_ABEND auid=1000 uid=1000 gid=1000 ses=6 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 pid=2337 comm="bash" exe="/usr/bin/bash" sig=11 res=1
Sep 03 09:20:20 workstation-uefi systemd[1]: Started Process Core Dump (PID 2338/UID 0).
Sep 03 09:20:20 workstation-uefi audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-coredump@3-2338-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Sep 03 09:20:20 workstation-uefi systemd-coredump[2339]: Process 2337 (bash) of user 1000 dumped core.
                                                         
                                                         Stack trace of thread 2337:
                                                         #0  0x00007fdf633c591b kill (libc.so.6)
                                                         #1  0x000055bb1633fd2f kill_pid (bash)
                                                         #2  0x000055bb16380a12 kill_builtin (bash)
                                                         #3  0x000055bb16329f0e execute_builtin.isra.0 (bash)
                                                         #4  0x000055bb1632e6f9 execute_command_internal (bash)
                                                         #5  0x000055bb1637c46b parse_and_execute (bash)
                                                         #6  0x000055bb16315adb run_one_command (bash)
                                                         #7  0x000055bb16314711 main (bash)
                                                         #8  0x00007fdf633b0193 __libc_start_main (libc.so.6)
                                                         #9  0x000055bb1631549e _start (bash)
Sep 03 09:20:20 workstation-uefi systemd[1]: systemd-coredump: Succeeded.

2. Kill gnome-shell:

$ ps 2643
    PID TTY      STAT   TIME COMMAND
   2643 tty1     Sl+    0:03 /usr/bin/gnome-shell

$ sudo kill -SEGV 2643
Nothing in the logs!

$ sudo kill -SEGV 2643

Sep 03 09:18:08 workstation-uefi sudo[2327]:   fedora : TTY=pts/0 ; PWD=/home/fedora ; USER=root ; COMMAND=/usr/bin/kill -SEGV 2643
...
Sep 03 09:18:08 workstation-uefi gsd-wacom[1757]: Error reading events from display: Broken pipe
Sep 03 09:18:08 workstation-uefi gnome-session[1637]: gnome-session-binary[1637]: WARNING: App 'org.gnome.SettingsDaemon.Wacom.desktop' exited with code 1
Sep 03 09:18:08 workstation-uefi gnome-session-binary[1637]: WARNING: App 'org.gnome.SettingsDaemon.Wacom.desktop' exited with code 1
Sep 03 09:18:08 workstation-uefi org.gnome.Shell.desktop[1646]: (EE) failed to read Wayland events: Connection reset by peer
Sep 03 09:18:08 workstation-uefi polkitd[719]: Unregistered Authentication Agent for unix-session:c2 (system bus name :1.343, object path /org/freedesktop/PolicyKit1/AuthenticationAgent, locale en_US.UTF-8) (disconnected from bus)
Sep 03 09:18:08 workstation-uefi gnome-session[1637]: gnome-session-binary[1637]: WARNING: Application 'org.gnome.Shell.desktop' killed by signal 11
Sep 03 09:18:08 workstation-uefi gnome-session-binary[1637]: WARNING: Application 'org.gnome.Shell.desktop' killed by signal 11
Sep 03 09:18:08 workstation-uefi ibus-daemon[1693]: GChildWatchSource: Exit status of a child process was requested but ECHILD was received by waitpid(). See the documentation of g_child_watch_source_new() for possible causes.
Sep 03 09:18:08 workstation-uefi gnome-session-binary[1637]: Unrecoverable failure in required component org.gnome.Shell.desktop

Other processes I tested produce coredumps normally.
My guess is that gnome-shell installs some special handler for SIGSEGV, and that it somehow
interferes with the dump. Reassigning to gnome-shell for feedback.
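
The `bash -c 'kill -SEGV $$'` check from step 1 can also be scripted. The following is a minimal Python sketch, not part of the original report: it forks a child that sends itself SIGSEGV and reports whether the kernel flagged the exit as "core dumped". The function name `crash_child` is hypothetical, and the result depends on `RLIMIT_CORE` and the `core_pattern` setting of the machine it runs on.

```python
import os
import signal


def crash_child():
    """Fork a child that kills itself with SIGSEGV.

    Returns (terminating_signal, core_dumped) as seen by the parent.
    """
    pid = os.fork()
    if pid == 0:
        # Child: make sure the default action (terminate, dump core) applies.
        signal.signal(signal.SIGSEGV, signal.SIG_DFL)
        os.kill(os.getpid(), signal.SIGSEGV)
        os._exit(1)  # only reached if the signal was not delivered

    _, status = os.waitpid(pid, 0)
    if not os.WIFSIGNALED(status):
        raise RuntimeError("child exited without being killed by a signal")
    return os.WTERMSIG(status), os.WCOREDUMP(status)


if __name__ == "__main__":
    sig, dumped = crash_child()
    print(f"killed by {signal.Signals(sig).name}, core dumped: {dumped}")
```

The "core dumped" flag is the same bit that bash reports in step 1, so a `False` here on a system where other programs dump normally points at the process rather than at systemd-coredump.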

Comment 3 Jonas Ådahl 2019-09-03 07:41:33 UTC
gnome-shell will only catch SIGSEGV if SHELL_DEBUG is set to "backtrace-segfaults". Even in that case, it still forwards the signal after printing a gjs backtrace to stderr.

Comment 4 Zbigniew Jędrzejewski-Szmek 2019-09-03 07:57:21 UTC
It's clearly not working ;(

$ sudo kill -SEGV 2957
$ sudo strace -p 2957
strace: Process 2957 attached
restart_syscall(<... resuming interrupted read ...>) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGSEGV {si_signo=SIGSEGV, si_code=SI_USER, si_pid=3174, si_uid=0} ---
rt_sigaction(SIGSEGV, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fafd2f436a0}, NULL, 8) = 0
rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
clock_gettime(CLOCK_MONOTONIC, {tv_sec=56522, tv_nsec=526666023}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=56522, tv_nsec=526898469}) = 0
recvmsg(32, {msg_namelen=0}, 0)         = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=9, events=POLLIN}, {fd=11, events=POLLIN}, {fd=15, events=POLLIN}, {fd=22, events=POLLIN}, {fd=30, events=POLLIN}, {fd=32, events=POLLIN}, {fd=35, events=POLLIN}, {fd=36, events=POLLIN}, {fd=39, events=POLLIN}, {fd=41, events=POLLIN}, {fd=42, events=0}, {fd=46, events=POLLIN}, {fd=48, events=POLLIN}, {fd=49, events=POLLIN}], 16, 248382) = 1 ([{fd=4, revents=POLLIN}])
read(4, "\2\0\0\0\0\0\0\0", 16)         = 8
clock_gettime(CLOCK_MONOTONIC, {tv_sec=56532, tv_nsec=16851710}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=56532, tv_nsec=17612220}) = 0
...

$ sudo grep SHELL_DEBUG /proc/2957/environ
(nothing)

$ sudo kill -SEGV 2957
$ sudo strace -p 2957
strace: Process 2957 attached
restart_syscall(<... resuming interrupted read ...>) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGSEGV {si_signo=SIGSEGV, si_code=SI_USER, si_pid=3224, si_uid=0} ---
+++ killed by SIGSEGV +++
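
One plausible reading of the two traces above, offered as an interpretation rather than something stated in this report: the installed handler resets SIGSEGV to `SIG_DFL` and returns (the `rt_sigaction` call in the first trace), expecting a genuine fault to recur immediately under the default action. A signal sent with `kill` does not recur, so the first one is swallowed and only the second is fatal. A minimal Python sketch of that pattern, with the hypothetical names `reset_and_return` and `demo`:

```python
import os
import signal


def reset_and_return(signum, frame):
    # Restore the default action and simply return. A real fault would hit
    # the faulting instruction again; a kill()-sent signal does not recur.
    signal.signal(signum, signal.SIG_DFL)


def demo():
    """Return (survived_first_signal, terminating_signal) for a child process."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        try:
            os.close(read_fd)
            signal.signal(signal.SIGSEGV, reset_and_return)
            os.kill(os.getpid(), signal.SIGSEGV)  # swallowed by the handler
            os.write(write_fd, b"alive")
            os.kill(os.getpid(), signal.SIGSEGV)  # default action: terminate
        finally:
            os._exit(1)

    os.close(write_fd)
    survived = os.read(read_fd, 16) == b"alive"
    os.close(read_fd)
    _, status = os.waitpid(pid, 0)
    term_sig = os.WTERMSIG(status) if os.WIFSIGNALED(status) else None
    return survived, term_sig
```

In this sketch the child survives the first signal and is killed by the second, which mirrors the behaviour seen with the two `sudo kill -SEGV` invocations.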

Comment 5 Jonas Ådahl 2019-09-03 08:52:02 UTC
Seems to be caught by some signal handler in libmozjs: https://github.com/ptomato/mozjs/blob/mozjs60/js/src/wasm/WasmSignalHandlers.cpp#L1733

Comment 6 Michael Catanzaro 2019-09-25 17:11:48 UTC
I think it's an F31 regression though. I was able to report many gnome-shell crashes in F30 with no problems.

If the mozjs signal handler is to blame, please coordinate with them to sort this out ASAP. We can't ship a non-debuggable desktop....

Comment 7 Michael Catanzaro 2019-09-25 21:02:38 UTC
(In reply to Michael Catanzaro from comment #6)
> If the mozjs signal handler is to blame

It's hard to read what it's doing, but it looks like it's designed to only catch wasm faults and nicely reraise the fatal signal for normal crashes.

Also, we've used mozjs60 since Fedora 29 so it seems unlikely that anything has changed here recently.

Comment 8 Adam Williamson 2019-09-26 01:09:28 UTC
I wonder if we're getting coredumps from *anything* that crashes? Try SIGSEGVing something that has nothing to do with GNOME, maybe...

Comment 9 Chris Murphy 2019-09-26 03:44:09 UTC
I'm getting coredumps if I 'sudo kill -s 11 <gnomemapspid>'. I also get a massive pile of AVCs:

Sep 25 21:37:51 flap.local audit[8217]: AVC avc:  denied  { write } for  pid=8217 comm="abrt-action-lis" name=".dbenv.lock" dev="nvme0n1p7" ino=773614 scontext=system_u:system_r:abrt_t:s0-s0:c0.c1023 tcontext=unconfined_u:object_r:var_lib_t:s0 tclass=file permissive=0

coredumpctl lists it; and abrt lists it as well, but abrt says:
The auto-reporting feature is disabled. Please consider enabling it by issuing “abrt-auto-reporting enabled” as a user with root privileges.

And also I stumbled on what seems to be a significant issue with the retrace server...
https://github.com/abrt/retrace-server/issues/258

Comment 10 Michael Catanzaro 2019-09-26 15:45:18 UTC
(In reply to Adam Williamson from comment #8)
> I wonder if we're getting coredumps from *anything* that crashes? Try
> SIGSEGVing something that has nothing to do with GNOME, maybe...

coredumpctl is working fine. Sadly I have no shortage of crashes to prove this. :P

And yes, the retrace server is broken currently, but that is not related to this issue.

Comment 11 Jonas Ådahl 2019-09-26 19:18:40 UTC
Seems 

sudo setcap -r `which gnome-shell`

makes them come back.

Whose fault it is that cap_sys_nice+ep, set during install, eats core dumps, I don't know.
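
To confirm that a binary carries a file capability before stripping it, one can look for the `security.capability` extended attribute, which is where Linux stores file capabilities and what `getcap` reads. The Python sketch below is illustrative only; the helper name `has_file_capabilities` is hypothetical, and it reports only whether the attribute exists, without decoding which capabilities are set.

```python
import errno
import os
import shutil


def has_file_capabilities(path):
    """Return True if `path` carries a security.capability xattr (Linux only)."""
    try:
        os.getxattr(path, "security.capability")
    except OSError as exc:
        # ENODATA: attribute absent; ENOTSUP: filesystem has no xattr support.
        if exc.errno in (errno.ENODATA, errno.ENOTSUP):
            return False
        raise
    return True


if __name__ == "__main__":
    binary = shutil.which("gnome-shell")  # None if gnome-shell is not installed
    if binary:
        print(binary, has_file_capabilities(binary))
```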

Comment 12 Michael Catanzaro 2019-09-26 21:09:38 UTC
Incredible.

So that capability is there to test the real-time scheduler, which is a mutter experimental feature off by default. Anyone enabling the experimental feature needs to edit a gsettings key. As a short-term solution, we might as well remove the capability and let people testing the real-time scheduler add it manually.

As for a long-term solution, I'm not sure. We probably want the real-time scheduler, but surely not at the cost of core dumps.

Jonas: any objection to removing the capability from the RPM spec (until we find a better answer)?

Comment 13 Zbigniew Jędrzejewski-Szmek 2019-09-27 06:54:40 UTC
Jonas, kudos!

Unfortunately, this is intentional. core(5) says:
> There are various circumstances in which a core dump file is not produced:
> ...
> *  The process is executing a set-user-ID (set-group-ID) program that  is  owned
>    by  a user (group) other than the real user (group) ID of the process, or the
>    process is executing a program that  has  file  capabilities  (see  capabili‐
>    ties(7)).   (However,  see  the  description  of the prctl(2) PR_SET_DUMPABLE
>    operation, and the description  of  the  /proc/sys/fs/suid_dumpable  file  in
>    proc(5).)

gnome-shell should call prctl(PR_SET_DUMPABLE, 1);
We know that it is OK for the user to have access to all capabilities/information of that process.
This will have the additional advantage that gnome-shell will be debuggable by the user.
Right now 'gdb -p $(pidof gnome-shell)' fails with EPERM.
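
In gnome-shell's C code this is a single `prctl` call. Purely to illustrate the mechanism, here is a minimal Python sketch using `ctypes` that reads and sets the dumpable flag of the current process. The helper names `get_dumpable` and `set_dumpable` are hypothetical, the option values are those from `<linux/prctl.h>`, and the sketch assumes Linux.

```python
import ctypes
import os

# Option values from <linux/prctl.h>.
PR_GET_DUMPABLE = 3
PR_SET_DUMPABLE = 4

# Handle to the C library already loaded into this process (Linux only).
_libc = ctypes.CDLL(None, use_errno=True)
_libc.prctl.restype = ctypes.c_int
_libc.prctl.argtypes = [ctypes.c_int] + [ctypes.c_ulong] * 4


def _prctl(option, arg2=0):
    result = _libc.prctl(option, arg2, 0, 0, 0)
    if result < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return result


def get_dumpable():
    """Return the dumpable flag: 0 (no core dumps), 1 (normal), or 2 (suidsafe)."""
    return _prctl(PR_GET_DUMPABLE)


def set_dumpable(enabled):
    """Set the dumpable flag of the calling process to 1 or 0."""
    _prctl(PR_SET_DUMPABLE, 1 if enabled else 0)


if __name__ == "__main__":
    print("dumpable before:", get_dumpable())
    set_dumpable(True)
    print("dumpable after:", get_dumpable())
```

Because the kernel clears the flag when a program with file capabilities is executed, the call has to be made by the process itself after it starts.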

Comment 14 Jonas Ådahl 2019-09-27 09:17:03 UTC
(In reply to Zbigniew Jędrzejewski-Szmek from comment #13)
> Jonas, kudos!

It was pointed out by someone on IRC, I just verified :)

> 
> Unfortunately, this is intentional. core(5) says:
> > There are various circumstances in which a core dump file is not produced:
> > ...
> > *  The process is executing a set-user-ID (set-group-ID) program that  is  owned
> >    by  a user (group) other than the real user (group) ID of the process, or the
> >    process is executing a program that  has  file  capabilities  (see  capabili‐
> >    ties(7)).   (However,  see  the  description  of the prctl(2) PR_SET_DUMPABLE
> >    operation, and the description  of  the  /proc/sys/fs/suid_dumpable  file  in
> >    proc(5).)
> 
> gnome-shell should call prctl(PR_SET_DUMPABLE, 1);
> We know that it is OK for the user to have access to all
> capabilities/information of that process.
> This will have the additional advantage that gnome-shell will be debuggable
> by the user.
> Right now 'gdb -p $(pidof gnome-shell)' fails with EPERM.

Seems to do the trick indeed. Created https://gitlab.gnome.org/GNOME/mutter/merge_requests/811.

Comment 15 Fedora Update System 2019-09-28 17:14:19 UTC
FEDORA-2019-94130905d5 has been submitted as an update to Fedora 31. https://bodhi.fedoraproject.org/updates/FEDORA-2019-94130905d5

Comment 16 Fedora Update System 2019-09-29 01:11:40 UTC
mutter-3.34.0-5.fc31 has been pushed to the Fedora 31 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2019-94130905d5

Comment 17 Fedora Update System 2019-09-30 00:01:48 UTC
mutter-3.34.0-5.fc31 has been pushed to the Fedora 31 stable repository. If problems still persist, please make note of it in this bug report.

