1556831 – dies *with SIGTRAP*, when XWayland dies

Summary: dies *with SIGTRAP*, when XWayland dies

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	gnome-shell
Sub Component:
Version:	27
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Owen Taylor
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2018-03-15 10:41 UTC by Alan Jenkins
Modified:	2018-11-27 20:19 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2018-11-27 16:42:59 UTC
Type:	Bug
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1566193	0	unspecified	CLOSED	Any time GNOME Shell crashes because XWayland crashed, bug is erroneously marked as a dupe (1510059...)	2022-05-16 11:32:56 UTC

Internal Links: 1566193

Description Alan Jenkins 2018-03-15 10:41:08 UTC

Description of problem: gnome-shell dies *with SIGABRT*, when XWayland dies

Version-Release number of selected component (if applicable):
gnome-shell-3.26.2-4.fc27.x86_64
mutter-3.26.2-2.fc27.x86_64

How reproducible: always

Steps to Reproduce: killall /usr/bin/Xwayland

Actual results: gnome-shell dies with SIGABRT

Expected results:

Abort represents a programming error in the current program.  Therefore, since I did not subvert the gnome-shell process, there is a programming error in gnome-shell.

It is possible that the death of XWayland is known fatal to gnome-shell.  In this case, the correct response is to log this fatal error and exit with EXIT_FAILURE or so, not SIGABRT.

For a known fatal error, there would be no reason to raise SIGABRT and invoke e.g. coredump handling infrastructure.  In this case there is no reason for a developer to analyze the core.

Cores cost disk IO and disk space, at a time when the system is already having behavioural issues.  (Aka "My desktop!  Nooo!", at the least).
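
The distinction argued above is visible to whatever supervises the process: a child that calls exit() is reported as a normal exit with a status, while a child that calls abort() is reported as killed by SIGABRT, which is the case that invokes core-dump handling. The following is a minimal sketch in C, assuming a POSIX system. The names run_in_child, die_by_exit and die_by_abort are hypothetical and are not taken from gnome-shell or mutter.

```c
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* How a child process ended, as reported by waitpid(). */
struct child_result {
    int exited;     /* nonzero if the child called exit() */
    int exit_code;  /* valid when exited is nonzero */
    int signaled;   /* nonzero if a signal killed the child */
    int signal_no;  /* valid when signaled is nonzero */
};

/* Hypothetical helper: fork, run fn() in the child, report how it ended. */
struct child_result run_in_child(void (*fn)(void))
{
    struct child_result r = {0, 0, 0, 0};
    struct rlimit no_core = {0, 0};
    int status = 0;
    pid_t pid, waited;

    fflush(NULL);               /* avoid duplicating buffered output */
    pid = fork();
    assert(pid >= 0);
    if (pid == 0) {
        /* Keep this demo from writing a real core file. */
        setrlimit(RLIMIT_CORE, &no_core);
        fn();
        _exit(0);               /* only reached if fn() returns */
    }

    waited = waitpid(pid, &status, 0);
    assert(waited == pid);
    if (WIFEXITED(status)) {
        r.exited = 1;
        r.exit_code = WEXITSTATUS(status);
    } else if (WIFSIGNALED(status)) {
        r.signaled = 1;
        r.signal_no = WTERMSIG(status);
    }
    return r;
}

/* "Known fatal error": report failure through the exit status. */
void die_by_exit(void)  { exit(EXIT_FAILURE); }

/* "Programming error": terminate with SIGABRT. */
void die_by_abort(void) { abort(); }
```

With these helpers, run_in_child(die_by_exit) reports a normal exit with status EXIT_FAILURE, and run_in_child(die_by_abort) reports death by SIGABRT.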


Additional info:

I don't kill XWayland willy-nilly - the real issue was an XWayland crash.

Adding insult to injury, Fedora's ABRT decided to completely ignore the core generated when gnome-shell crashes.  So systemd-coredump has done 50MB of disk IO during a period of stress, and I can't even report the crash automatically.

Backtrace including some debuginfos:

#0  0x00007ff49ae4c050 in raise () at /lib64/libpthread.so.0
#1  0x00005634bd914a0b in dump_gjs_stack_on_signal_handler (signo=5) at ../src/main.c:372
#2  0x00007ff49ae4c1b0 in <signal handler called> () at /lib64/libpthread.so.0
#3  0x00007ff49cbed771 in _g_log_abort () at /lib64/libglib-2.0.so.0
#4  0x00007ff49cbee7ac in g_log_default_handler () at /lib64/libglib-2.0.so.0
#5  0x00005634bd914ae5 in default_log_handler (log_domain=0x7ff49b1732d3 "mutter", log_level=6, message=0x5634beeefa50 "Connection to xwayland lost", data=0x0) at ../src/main.c:315
#6  0x00007ff49cbeea3d in g_logv () at /lib64/libglib-2.0.so.0
#7  0x00007ff49cbeebaf in g_log () at /lib64/libglib-2.0.so.0
#8  0x00007ff49b13403e in x_io_error () at /lib64/libmutter-1.so.0
#9  0x00007ff499844ede in _XIOError () at /lib64/libX11.so.6
#10 0x00007ff49984276d in _XEventsQueued () at /lib64/libX11.so.6
#11 0x00007ff4998342bd in XPending () at /lib64/libX11.so.6
#12 0x00007ff49a399c2e in gdk_event_source_prepare () at /lib64/libgdk-3.so.0
#13 0x00007ff49cbe73f9 in g_main_context_prepare () at /lib64/libglib-2.0.so.0
#14 0x00007ff49cbe7dcb in g_main_context_iterate.isra () at /lib64/libglib-2.0.so.0
#15 0x00007ff49cbe8232 in g_main_loop_run () at /lib64/libglib-2.0.so.0
#16 0x00007ff49b0fc7bc in meta_run () at /lib64/libmutter-1.so.0
#17 0x00005634bd91442c in main (argc=<optimized out>, argv=<optimized out>) at ../src/main.c:530

Comment 1 Alan Jenkins 2018-03-15 18:04:28 UTC

Actually it's SIGTRAP, sorry.  I think the reasoning above is equally valid though.  5 other instances of this so far on FAF.

https://retrace.fedoraproject.org/faf/reports/2077342/

Comment 2 Alan Jenkins 2018-03-15 18:12:39 UTC

https://gitlab.gnome.org/GNOME/mutter/blob/7e17dd00/src/wayland/meta-xwayland.c#L417

static int
x_io_error (Display *display)
{
  g_error ("Connection to xwayland lost");

  return 0;
}

> g_error()
> 
> #define             g_error(...)
> 
> A convenience function/macro to log an error message. The message should typically *not* be translated to the user's language.
>
> This is not intended for end user error reporting. Use of GError is preferred for that instead, as it allows calling functions to perform actions conditional on the type of error.
>
> Error messages are always fatal, resulting in a call to abort() to terminate the application. This function will result in a core dump; don't use it for errors you expect. Using this function indicates a bug in your program, i.e. an assertion failure.

Comment 3 Adam Williamson 2018-04-12 18:10:54 UTC

Oh, we have far more than 5! https://bugzilla.redhat.com/show_bug.cgi?id=1510059 is the bug which just about *every* occurrence of this on F27 gets marked as a dupe of by libreport. There are hundreds. https://bugzilla.redhat.com/show_bug.cgi?id=1469813 was the same thing for F26 (when this went down a slightly different codepath because dump_gjs_stack_on_signal_handler wasn't around).

I've been filing bugs about this for a while, looking at the issue from the libreport/satyr end mainly:

https://bugzilla.redhat.com/show_bug.cgi?id=1509086
https://bugzilla.redhat.com/show_bug.cgi?id=1566193
https://github.com/abrt/satyr/issues/271
https://github.com/abrt/satyr/issues/272

However, I think you have an excellent point that it seems at least prima facie reasonable for Shell to avoid dumping core at all when this happens, because it doesn't do anything much for us. When Shell is going down because it *knows* XWayland went away, there's very little point in producing, analyzing or examining a Shell core dump, AFAICS. What we need is the XWayland core dump. Assuming that's actually getting produced and showing up in libreport/abrt for reporting, I don't see what benefit at all we get from having Shell also dump core, and it seems like it'd make much more sense for mutter to just exit() there.

I've poked a few of the desktop team for comment on this, we had an IRC chat about it:

<adamw> mcatanzaro: mclasen: halfline: what do you guys think of https://bugzilla.redhat.com/show_bug.cgi?id=1556831 ? the reasoning kinda makes sense to me. is there a considered reason why shell explicitly aborts when it loses touch with wayland? could we change that so we don't get these fairly useless tracebacks?
<adamw> (assuming we'd get an xwayland crash report filed instead, which would likely be more useful)
<halfline> yea i personally think it just adds noise
<halfline> same story on the other side
<mclasen> adamw: if it was easy to run without xwayland we would already do it. not sure it makes much of a difference which way we die
<halfline> the problem is whenever one side crashes both sides crash
<adamw> mclasen: the argument in the bug report is that shell should die in a way which doesn't cause abrt to kick in, basically
<halfline> and it takes effort to figure out which side crashed first
<halfline> we should suppress knock on crashes, since they're just noise not signal
<adamw> right, but this is the specific path where shell knows it lost connection to wayland...it's actually *intentionally written to abort* in that case
<ajax> adamw: the "considered" reason is that libX11's I/O error handler calls exit() and always has.
<adamw> it calls g_error("lost connection to xwayland") or whatever the message is, that's where we get all these abrt reports for "lost connection to xwayland" from
<ajax> (and that the x11 wm part of gnome-shell is the same process as the wayland server part of it)
<mcatanzaro> hmmm
<adamw> there's a direct link to the line in the bug: https://gitlab.gnome.org/GNOME/mutter/blob/7e17dd00/src/wayland/meta-xwayland.c#L417
<ajax> remarkably hard set of assumptions to unwind
<adamw> that's what he's suggesting changing
<mcatanzaro> We had a WebKit bug recently where the web process intentionally aborted if it lost connection to the network process
<mcatanzaro> Which should only happen when the network process crashes
<mcatanzaro> But the network process was not crashing
<mcatanzaro> This bug has caused something like 2000 crashes in the past couple days
<mcatanzaro> We would never have known if we removed the web process abort
<ajax> adamw: removing that i/o error handler won't help though
<adamw> ajax: what'd happen instead?
<mcatanzaro> The bug reporter was not impressed when I said the crash was intentional, and tried to convince me to change it to an exit() instead, but then we would have zero crash reports for this issue.
<adamw> mcatanzaro: the expectation here is we'd get reports for the *xwayland* crash
<ajax> even if g_error weren't fatal, if the handler returns to libX11, libX11 exits.
<halfline> but exit isn't a crash
<adamw> right
<halfline> exit is fine
<adamw> we're not expecting shell to magically *keep running* here
<adamw> this is just about not filing hundreds of not-very-useful bugs for shell "crashing" on this path
<ajax> ah, gotcha
<mcatanzaro> adamw: Yes of course that's the expectation... that was the expectation in the WebKit case too, that we'd get reports for the network process crash
<adamw> where i'm coming from here is https://bugzilla.redhat.com/show_bug.cgi?id=1510059#c303
<adamw> that is the bug which *every single crash of this kind in f27* is currently considered a duplicate of by libreport
<halfline> mcatanzaro: i'd almost rather miss an occasional bug than get flooded with noise
<halfline> s/almost//
<mcatanzaro> Clearly something needs to change, but it could just as easily be handled by ABRT
<halfline> doing what ?
<mcatanzaro> I guess making any changes to ABRT is probably too much to expect, though
<halfline> what change would you propose to make to abrt ?
<mcatanzaro> halfline: ABRT has logic somewhere to ignore expected crashes like this
<halfline> why would that be better?
<adamw> i have filed a satyr issue on this too
<halfline> if it's ignoring them
<halfline> versus them not happening ?
<mcatanzaro> I assume it could still count them, but not open a bunch of bugzilla bugs.
<adamw> but yeah, i agree with halfline, it doesn't seem obviously better to abort and then make libreport ignore the abort, versus just exiting
<mcatanzaro> Then if the count goes way up, we can say: hmmm, problem.
<halfline> mcatanzaro: what would the count tell you?
<halfline> yea but what problem?
<halfline> more likely the problem is Xwayland is crashing
<halfline> or something
<halfline> the count doesn't really help you
<halfline> since the Xwayland crash will get shown separately
<halfline> unless you're saying you look at the number of xwayland crashes and the count and see if there's a big discrepancy ?
<halfline> we had a similar issue with gtk a while back btw
<adamw> yeah, the only problem i can see is if we for some reason *don't* get the xwayland crashes reported
<mcatanzaro> If Xwayland ever dies without leaving a core dump, or ABRT refuses to report the crash for whatever reason ("this backtrace is unusable" being a common culprit), then the XWayland crash won't be reported... anyway, it's fine either way, I'm just observing that we would have had a ton of trouble with this recent WebKit issue had we disabled the client process crash
<halfline> every time the display server went down every application would spam the log with a message saying as much
* adamw goes to look at xwayland crash reports, for that matter.
<halfline> totally not useful to see 50 apps all say "session is over" at the same time
<adamw> oh, yeah, we still get that with gnome :P
<adamw> but that's "just" logspam, at least it doesn't affect bugzilla.
<mcatanzaro> Ah good point, I forgot this happened once for every single application....
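
ajax's point in the chat is that a non-fatal handler would not keep the process alive: if the I/O error handler returns, the library exits anyway. The following is a small model of that control flow in C, assuming a POSIX system. None of these functions are the libX11 or GLib API. The names model_set_io_error_handler, model_io_error, handler_log_only, handler_log_and_exit and exit_status_with_handler are hypothetical, and the library-side exit status of 1 is an assumption of the model.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical handler type: NOT the Xlib handler signature. */
typedef int (*io_error_handler)(void);

static io_error_handler current_handler;

void model_set_io_error_handler(io_error_handler h)
{
    current_handler = h;
}

/* Model of the library side: call the handler, and if it returns,
   exit anyway (status 1 is assumed for this model). */
void model_io_error(void)
{
    if (current_handler)
        current_handler();
    exit(1);
}

/* Handler that only logs and returns: the library side then exits. */
int handler_log_only(void)
{
    fputs("Connection to xwayland lost\n", stderr);
    return 0;
}

/* Handler that logs and exits itself, without aborting. */
int handler_log_and_exit(void)
{
    fputs("Connection to xwayland lost\n", stderr);
    exit(EXIT_FAILURE);
}

/* Fork a child, trigger the modelled I/O error with handler h, and
   return the child's exit status, or -1 if it did not exit normally. */
int exit_status_with_handler(io_error_handler h)
{
    int status = 0;
    pid_t pid, waited;

    fflush(NULL);
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        model_set_io_error_handler(h);
        model_io_error();
        _exit(99);      /* not reached: model_io_error() never returns */
    }

    waited = waitpid(pid, &status, 0);
    if (waited != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

In this model both handlers end in an ordinary exit rather than a signal, so neither produces a core dump. The difference from the g_error() path is only that no abort is raised before the process goes away.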

Comment 4 Adam Williamson 2018-04-12 18:30:46 UTC

https://gitlab.gnome.org/GNOME/mutter/merge_requests/76/diffs

Comment 5 Adam Williamson 2018-04-12 20:15:16 UTC

I do have a concern about whether XWayland crash reporting actually works reliably, per https://gitlab.gnome.org/GNOME/mutter/merge_requests/76#note_96597 .

Comment 6 Alan Jenkins 2018-04-12 20:54:56 UTC

The chat makes an interesting point, this can be argued the other way.

For me, it was quite an annoying process to click through reporting the backtrace, only to find out the real crash was in Xwayland...

> I do have a concern about whether XWayland crash reporting actually works reliably

...and I think in the series of ~10 crashes I was having recently, sometimes ABRT said the backtrace was bad and couldn't be reported as a bugzilla, or didn't come up with any Xwayland crash at all, or something :(.  I've been suspicious it was due to the secondary SIGBUS under xorg_backtrace(), but I don't honestly know.

I think the first time I tried ABRT to submit this Xwayland backtrace to bugzilla, the internet backtrace server failed.  And then ABRT succeeded when I asked it to download debuginfos and do it locally instead.  I noted this failure in the bugzilla it created[1].  No idea what that means, but it could explain some reduction in reports.  For the subsequent variant backtraces I stuck with that local option.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1557682

I did have one suggestion.  xorg_backtrace() seems completely pointless in Xwayland... I don't think the backtrace generation should be needed on Fedora _anywhere_.  But the primary reason for X handling fatal signals is to rescue the machine's display, so you don't have to reboot.  That's not needed if you have kernel mode setting... and KMS is an absolute requirement for Wayland.

I actually tried to turn off xorg_backtrace() in Xwayland - Xorg has an option to do that - but Xwayland doesn't actually read any user config :).

It's a suspiciously easy solution for a problem that's probably much harder.  But ISTM there's no reason to think it would hurt?  It would be one less thing in the way, one less special case that a new victim will encounter when they try to debug an Xwayland crash.

Comment 7 Alan Jenkins 2018-04-20 22:10:06 UTC

Hi Adam

Do you have an idea about the missing line numbers in most Xwayland traces on FAF?  Do you think it could cause ABRT not to want to create Bugzilla reports?

There seem to be missing line numbers in most Xwayland FAF traces I looked at.  E.g.

https://retrace.fedoraproject.org/faf/reports/2076637/
https://retrace.fedoraproject.org/faf/reports/2058692/

Alan

Comment 8 Ben Cotton 2018-11-27 16:05:21 UTC

This message is a reminder that Fedora 27 is nearing its end of life.
On 2018-Nov-30 Fedora will stop maintaining and issuing updates for
Fedora 27. It is Fedora's policy to close all bug reports from releases
that are no longer maintained. At that time this bug will be closed as
EOL if it remains open with a Fedora 'version' of '27'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 27 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 9 Adam Williamson 2018-11-27 16:42:59 UTC

This got merged upstream, so it's probably 'fixed' as of 28 or 29.

Comment 10 Alan Jenkins 2018-11-27 20:19:43 UTC

Thanks for your update Adam.

I tried to reproduce this on Fedora 28, and I got SIGSEGV instead, not quite what we hoped for :-).  Hopefully I will also have a Fedora 29 VM soon...ish, to test a more recent version.

I opened the segfault as https://bugzilla.redhat.com/show_bug.cgi?id=1654009 .  Might not be what you wanted to hear, it was just the easiest thing for me to do with the backtrace.
