Note: This is a public test instance of Red Hat Bugzilla. The data contained within is a snapshot of the live data so any changes you make will not be reflected in the production Bugzilla. Email is disabled so feel free to test any aspect of the site that you want. File any problems you find or give feedback at bugzilla.redhat.com.
Bug 1933378 - Bootsplash doesn't always fully clear on boot to console
Summary: Bootsplash doesn't always fully clear on boot to console
Keywords:
Status: ASSIGNED
Alias: None
Product: Fedora
Classification: Fedora
Component: plymouth
Version: rawhide
Hardware: All
OS: Linux
unspecified
medium
Target Milestone: ---
Assignee: Hans de Goede
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-02-26 23:46 UTC by Adam Williamson
Modified: 2021-05-28 22:26 UTC (History)
4 users (show)

Fixed In Version: plymouth-0.9.5-2.20210331git1ea1020.fc34
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-04-05 00:17:32 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
screenshot of the bug (6.85 KB, image/png)
2021-02-26 23:47 UTC, Adam Williamson
no flags Details

Description Adam Williamson 2021-02-26 23:46:57 UTC
In F34+ openQA tests, I'm sometimes seeing failures because the bootsplash isn't going away entirely when booting to console. These are minimal systems with just the "three white dots on a grey background" bootsplash; sometimes when we reach the login screen, the three white dots and the grey background are still there, instead of clearing to a blank black background. See attached screenshot.

This is a problem for our openQA tests as it breaks several needles that expect text on a black background, not a grey one.

Comment 1 Adam Williamson 2021-02-26 23:47:34 UTC
Created attachment 1759616 [details]
screenshot of the bug

Comment 2 Adam Williamson 2021-03-04 16:10:15 UTC
This does not seem to happen if the configured Plymouth theme (e.g. the default theme) is installed. It only happens if the configured theme is not installed, and so the fallback-y 'grey screen with two blobs on it' thing is used.

Comment 3 Hans de Goede 2021-03-04 16:24:02 UTC
(In reply to Adam Williamson from comment #2)
> This does not seem to happen if the configured Plymouth theme (e.g. the
> default theme) is installed. It only happens if the configured theme is not
> installed, and so the fallback-y 'grey screen with two blobs on it' thing is
> used.

Right, so I guess this has to do with the flavor of Fedora which is being tested? I would expect e.g. Fedora Workstation to always have the default theme installed.

The fallback "text" theme has not seen any serious work done on it for ages; and I personally have never touched that code.

I think it would be better for spins / flavors which don't install the default-theme to just disable plymouth altogether, either by removing "rhgb" from the kernel commandline, or by just not installing any parts of plymouth at all...

Comment 4 Adam Williamson 2021-03-04 17:46:25 UTC
Default Server install doesn't install the default theme. So I'm having lots of fun because the Server tests in openQA fail unpredictably due to this. I can hack around that in openQA, but people just running a default Server install are going to run into this in real life.

In comps, plymouth is 'mandatory' in the 'core' group, but 'plymouth-system-theme' is only in the 'base-x' group. I guess this is because installing a graphical theme pulls in quite a lot more dependencies than just the core of plymouth and the text theme, but if it's going to cause problems like this, I guess we can maybe move plymouth itself to base-x (so only graphical installs that likely include all the deps anyway will pull it in)?

Comment 5 Adam Williamson 2021-03-04 17:47:24 UTC
Proposing this as a BetaFE as it looks ugly and unpolished when it happens, and it'll happen in many important cases (Server, minimal...)

Comment 6 Adam Williamson 2021-03-04 17:56:58 UTC
Ah, hmm, well, there'd potentially be a problem with not installing plymouth: we may not get a decryption prompt on boot, if a system partition is encrypted. That is actually why we put plymouth in core (base at the time) way back in F17:

https://bugzilla.redhat.com/show_bug.cgi?id=794899

I guess we'd need to test an encrypted install without plymouth and see how it behaves.

Comment 7 Hans de Goede 2021-03-04 18:30:08 UTC
(In reply to Adam Williamson from comment #6)
> I guess we'd need to test an encrypted install without plymouth and see how
> it behaves.

Yes, I believe that systemd should handle asking for a password just fine without having plymouth installed.

So if you can test that and then move plymouth out of the core group that would seem best. It is sort of weird to have plymouth in the core group anyways. It is only supposed to be a bootsplash, not a core component.

Comment 8 Adam Williamson 2021-03-04 20:18:32 UTC
I've sent a mail to devel@ and server@ summarizing the situation and asking for opinions. I gotta be honest, I would be happiest if we could just fix the actual bug...

Comment 9 Adam Williamson 2021-03-04 20:19:16 UTC
CCing Bill to see if he remembers any more detail about why specifically we wanted plymouth in @core back then.

Comment 10 Bill Nottingham 2021-03-04 22:21:05 UTC
I don't really remember; the logic referenced (it was used for passphrase entering, and this was before systemd handled this more natively) certainly sounds correct.

I think at the time there was a base text screen that had a text progress bar, and this may predate FB/KMS/etc everywhere?

Comment 11 Adam Williamson 2021-03-05 01:34:28 UTC
https://pagure.io/fedora-comps/pull-request/631 would address this by dropping plymouth from @core and @standard (though the bug would still *exist*).

Comment 12 Hans de Goede 2021-03-05 09:15:37 UTC
(In reply to Adam Williamson from comment #11)
> https://pagure.io/fedora-comps/pull-request/631 would address this by
> dropping plymouth from @core and @standard (though the bug would still
> *exist*).

I agree that the bug would still exist, can you perhaps file an issue for this upstream:
https://gitlab.freedesktop.org/plymouth/plymouth/-/issues/

And also attach the screenshot of the issue there ?

Comment 13 Ray Strode [halfline] 2021-03-08 15:04:01 UTC
plymouth serves three purposes:

1) boot messages logger
2) disk encryption password
3) splash screen

the boot messages logger is a bit silly at this point, the journal is so much more useful than a bunch of "Starting foo.... [OK]" messages or whatever ends up in there now.
system.

systemd did indeed grow support for asking for a password without plymouth involved, so it's not too useful either, if the message is going to show up as a text prompt anyway.
(of course plymouth is still useful for showing a "pretty" disk encryption password screen, but that's not relevant for users using the fallback splash)

The splash screen will be graphical by default, but it can fall back to "text" even if graphical is enabled.  Say the user puts "nomodeset" on the kernel command line or something.

plymouth probably should clear the screen when its quit without --retain-splash, I suppose, but on the other handy, the agetty should be getting started with VTDisallocate=TRUE in 
the service file so it should get cleared on its own.  Did the service file change?

Comment 14 Ray Strode [halfline] 2021-03-08 15:12:57 UTC
The other possibility is the "plymouth --wait" call in plymouth-quit-wait service is returning early for some reason so plymouth is still running when agetty is starting up.

Comment 15 Ray Strode [halfline] 2021-03-08 15:15:16 UTC
adamw, can you try adding a 

ExecStartPost=/usr/bin/sleep 5

to plymouth-quit-wait.service and see if the problem vanishes ?

Comment 16 Ray Strode [halfline] 2021-03-08 15:20:57 UTC
hmm, the text splash has this:

static void•
destroy_plugin (ply_boot_splash_plugin_t *plugin)•
{•
...
        /* It doesn't ever make sense to keep this plugin on screen•
         * after exit•
         */•
        hide_splash_screen (plugin, plugin->loop);•
...
}• 

where hide_splash_screen calls view_hide which does this:

static void•
view_hide (view_t *view)•
{•              
        if (view->display != NULL) {•
                ply_terminal_t *terminal;•
•       
                terminal = ply_text_display_get_terminal (view->display);•
...
                ply_text_display_clear_screen (view->display);•
...
        }•
}•      

So things should be working, unless it's a case of getty starting before plymouth quits.

Comment 17 Ray Strode [halfline] 2021-03-08 15:26:22 UTC
so i think what we need is to see the boot with with verbose logging turned on to see the order of things.  I think the kernel command line should be something like:

systemd.log_target=kmsg systemd.log_level=debug plymouth.debug=stream:/dev/kmsg log_buf_len=1M printk.devkmsg=on

to get full logs to dmesg.

This will hopefully show us what systemd is doing, what plymouth is doing, and show if things are happening out of order.

adamw is this something you can inject into your reproduction environment?

Comment 18 Adam Williamson 2021-03-08 16:03:26 UTC
+3 in https://pagure.io/fedora-qa/blocker-review/issue/280 , marking accepted.

Comment 19 Adam Williamson 2021-03-08 16:03:47 UTC
Ray, I'll get to your notes later today, thanks!

Comment 20 Adam Williamson 2021-03-10 17:57:04 UTC
I've merged the comps PR now, which should address the FE element of this. Let's keep the report open for the actual bug, though.

Comment 21 Adam Williamson 2021-03-17 17:43:57 UTC
Note, changing this has (further) exposed another issue on aarch64: without plymouth installed, we don't get boot messages (and the decryption prompt) on the monitor on aarch64 systems, even booting with console=tty0. See https://bugzilla.redhat.com/show_bug.cgi?id=1940163 .

Comment 22 Hans de Goede 2021-03-26 13:19:23 UTC
While working on another kernel/plymouth integration issue (bug 1941329) I think I have figured out the problem here.

The tribar and text plugins both clear the screen from hide_views (), which calls ply_text_display_clear_screen () for each registered text-terminal/view (the console and a serial console can be 2 separate terminals/views).

This should be called on exit through this call tree:

quit_splash () -> ply_boot_splash_free () -> ply_boot_splash_unload () -> destroy_plugin() -> hide_splash_screen () -> hide_views ()

But ply_boot_splash_free () calls remove_pixel_displays () which unregisters all the terminals/views before calling ply_boot_splash_unload (), so when hide_views () runs and iterates over all the registered terminals to clear them, there are no registered terminals to clear.

Fixing this should be easy. I'll prepare a patch for this.

Comment 23 Hans de Goede 2021-03-26 22:46:53 UTC
Nevermind, main.c always calls hide_splash () before calling quit_splash () so that is not it.

One interesting observation, to reproduce I had to not just do:

sudo rm /lib64/plymouth/renderers/*
sudo dracut -f

But I also had to edit /lib/systemd/system/getty@.service and change the following 2 lines:

TTYReset=yes
TTYVTDisallocate=yes

From yes to no. I wonder if the server spin is somehow using a different getty configuration then workstation?

Comment 24 Adam Williamson 2021-03-26 23:28:30 UTC
Note, the problem doesn't happen every time, it's not even 50% of the time, IIRC. We got a handful of failures on this out of dozens of tests.

So unless you tried many times after each change, you may just have been seeing randomness and the changes may not have mattered.

Sorry, I never did get back to Ray's notes on debugging this after the comps change, but we should really fix this properly. I'll try and do them soon.

Comment 25 Hans de Goede 2021-03-27 15:32:03 UTC
(In reply to Adam Williamson from comment #24)
> Note, the problem doesn't happen every time, it's not even 50% of the time,
> IIRC. We got a handful of failures on this out of dozens of tests.

Interesting, I guess that the tty-clearing done by systemd sometimes failed. After I edited the getty service to disable the tty clearing done by systemd I can 100% reproduce this issue.

I've also found an easier reproducer which is to switch between the "text" splash and detailed mode while a diskcrypt password is shown, in this case the details mode will also show the background with plymouths "grey" background color except for the parts which are overwritten with actual details text, which do get a black background.

This reproduces has also allowed me to prepare a fix, which fixes both the detailed-screen and the login/getty screen grey background issue for me.

> Sorry, I never did get back to Ray's notes on debugging this after the comps
> change, but we should really fix this properly. I'll try and do them soon.

Actually specifying plymouth.debug=stream:... on the kernel commandline turns ply_text_display_clear_screen () into a no-op for debugging, breaking both the "text" splash, as well as details <-> text splash switching and also makes a mess of the login/getty screen (with systemd's tty clearing disabled).

Fixing the plymouth.debug=stream:... issue is the first thing which I ended up doing, as I was using that for debugging myself too.

I'll post a merge-req upstream with a bunch of fixes for this soon.

Comment 26 Hans de Goede 2021-03-27 15:58:11 UTC
Here is the promised merge-request which fixes both my reproducers for this:

https://gitlab.freedesktop.org/plymouth/plymouth/-/merge_requests/137

Adam, do you want a plymouth build to check if this also fixes the openQA tests issue you were seeing?  I'm pretty sure that I've got this fixed, but if you want to double check let me know.

Comment 27 Adam Williamson 2021-03-28 00:55:48 UTC
If you're pretty sure about it, I'm fine with just pulling the fix in. Then we might want to put plymouth back into at least ARM base installs, because of https://bugzilla.redhat.com/show_bug.cgi?id=1940163 ...

Comment 28 Hans de Goede 2021-03-31 19:03:29 UTC
A quick status update on this:

I've been also working on another plymouth fix for bug 1941329. I wanted to also get that bug fixed in the next Fedora plymouth build because the root-cause of it worries me a bit:

What I believe is happening here is that the "gracefully shutdown" of displays which the kernel's i915 driver recently introduced in commit fe0f1e3bfdfe ("drm/i915: Shut down displays gracefully on reboot") is getting undone. Because of being "unkillable" plymouthd keeps running and showing the spinner animation to the very end, this results in a drmModeSetCrtc () call after the i915 display driver has turned off the displays. This causes 2 issues:

1. This causes the screen to go black for 1-2 seconds and then show the plymouth screen again for 1-2 seconds on poweroff/reboot which looks ugly, this is bug 1941329.

2. This may cause issues with the attached monitors on reboot, since it undoes the gracefull shutdown which the i915 driver does, this is the bit with has me worried a bit and why I've been working on a fix for this.

I've now completed the fix for this, based on earlier work on a related shutdown issue from Ray:
https://gitlab.freedesktop.org/plymouth/plymouth/-/merge_requests/138

This related shutdown issue is the rootfs sometimes not being cleanly unmounted on shutdown/reboot and this is also resolved by this new merge-request. Which is another reason why I wanted to this fix included in the next Fedora plymouth build, the system sometimes not cleanly unmounting the rootfs is also something which really needs (needed) to be fixed.

I'm currently preparing a new plymouth build for F34+ which includes both fixes.

The second fix is not really trivial (but it should be ok / safe), so I will disable auto-pushing to stable for the F34 update because I'm not 100% sure if we want this in F34 final, or if it would be better as a 0 day update. I expect this to be fine for F34 final, but I want to wait till at least coming Sunday before pushing this to stable.

Comment 29 Hans de Goede 2021-04-01 06:02:13 UTC
Ugh, the build is stuck waiting for a backlog on the s390x builders.

Adam, if you want to get a head-start with testing this, you can find rpms for all arches except s390x here:

https://koji.fedoraproject.org/koji/taskinfo?taskID=64933897

(assuming you can somehow inject these in OpenQA through some local repo)

> Then we might want to put plymouth back into at least ARM base installs, because of https://bugzilla.redhat.com/show_bug.cgi?id=1940163 ...

I've no strong preference for pulling plymouth back in, or not. AFAIK plymouth does have the advantage of asking for the LUKS password on all available text displays (so both console + a serial tty) which the systemd replacement seems to not do.

Comment 30 Fedora Update System 2021-04-01 07:00:19 UTC
FEDORA-2021-cbe477064d has been submitted as an update to Fedora 34. https://bodhi.fedoraproject.org/updates/FEDORA-2021-cbe477064d

Comment 31 Hans de Goede 2021-04-01 11:29:05 UTC
Unfortunately the fix for bug 1941329 does not fully work when selinux is in enforcing mode, I've filed a bug against selinux-polciy for this, see bug 1945585.

selinux policy blocks plymouthd from executing the new plymouthd-drm-escrow helper which is there to keep the /dev/dri/card# fds open during shutdown. This means that when the system is ready to shutdown (or transfers control back to the initramfs if the rootfs storage needs some extra work for a clean shutdown), the text-console will be shown with the last couple of messages which is not the smooth shutdown/reboot look we want, but it is not really a big problem either.

Note the 2 initial issues which the fix for bug 1941329 targets are still fixed even with selinux in enforcing mode, all the selinux issue does is make the shutdown somewhat less pretty to look at.

IMHO the bugs fixed in the FEDORA-2021-cbe477064d plymouth update are more important to fix then this not-pretty but otherwise completely harmless shutdown/reboot problem; and hopefully we will be able to get a selinux fix for this as a 0-day update.

Comment 32 Adam Williamson 2021-04-01 16:51:19 UTC
I definitely want to get a new selinux-policy build in before release, so we may get it fixed for the release if Zdenek picks it up quickly.

Thanks for this, I'll see what I can do to test it before we push it stable; right now plymouth is still removed from comps so openQA will not really test the fix for the issue, but I should be able to hack things up a bit and test it out.

Comment 33 Hans de Goede 2021-04-01 17:38:23 UTC
(In reply to Adam Williamson from comment #32)
> I definitely want to get a new selinux-policy build in before release, so we
> may get it fixed for the release if Zdenek picks it up quickly.

Good news, I've been working on this with Zdenek and a fix is available now,
this should make the last pre-GA selinux-policy build.

> Thanks for this, I'll see what I can do to test it before we push it stable;
> right now plymouth is still removed from comps so openQA will not really
> test the fix for the issue, but I should be able to hack things up a bit and
> test it out.

Great, thank you.

Comment 34 Fedora Update System 2021-04-02 01:34:48 UTC
FEDORA-2021-cbe477064d has been pushed to the Fedora 34 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2021-cbe477064d`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2021-cbe477064d

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 35 Hans de Goede 2021-04-04 11:02:49 UTC
The update just got its first positive karma and in my own somewhat intensive testing on multiple machines everything works well too. So I would like to push this to stable before the freeze.

But before I hit the push-to-stable button I wanted to double check if that is ok here. Adam is pushing the plymouth update for this to stable ok with you ?

Comment 36 Adam Williamson 2021-04-04 15:11:09 UTC
sorry, I did not get to this before the weekend, and I'm actually supposed to be off monday and tuesday. um. but honestly I can't see any reason why not - there's no indication it breaks anything. it gives us the option to put plymouth back in by default, or we can just not. so, I'm gonna say fine by me.

Comment 37 Hans de Goede 2021-04-04 18:11:38 UTC
Pushed.

Thank you for the quick answer during the weekend and enjoy your time off.

Comment 38 Hans de Goede 2021-04-04 18:17:10 UTC
p.s. The selinux fix is now available here: https://bodhi.fedoraproject.org/updates/FEDORA-2021-e221a38cfe and I can confirm that that fixes the mentioned selinux issue.

Comment 39 Fedora Update System 2021-04-05 00:17:32 UTC
FEDORA-2021-cbe477064d has been pushed to the Fedora 34 stable repository.
If problem still persists, please make note of it in this bug report.

Comment 40 Adam Williamson 2021-04-06 21:14:02 UTC
https://pagure.io/fedora-comps/pull-request/642 is a PR to put plymouth back in core now this is (we hope) fixed.

Comment 41 Adam Williamson 2021-04-30 22:06:38 UTC
Well, bad news:

https://openqa.fedoraproject.org/tests/874504#step/_console_wait_login/6

still happening. That's from today's Rawhide.

Comment 42 Hans de Goede 2021-05-01 14:48:08 UTC
(In reply to Adam Williamson from comment #41)
> Well, bad news:
> 
> https://openqa.fedoraproject.org/tests/874504#step/_console_wait_login/6
> 
> still happening. That's from today's Rawhide.

So looking at the screenshot, this is different from the case which I've fixed. The case which I could reproduce and which I fixed had the entire background set to gray, which is the 3-dot text splash background color.

But on your screenshot the 3 text-dots are also still being shown, suggesting that plymouth either crashes or never quits.

Can you ssh into a test-vm running this test once it has booted (regardless of this happening) and check the journal for plymouth crashing; and also do a "ps aux | grep plymouth" and see if any plymouth processes are still running. I would also like to see the output of "systemctl status plymouth-quit.service"

Comment 43 Adam Williamson 2021-05-03 18:53:12 UTC
oh, yes, you're right. I'll try and grab those logs for you today.

Comment 44 Adam Williamson 2021-05-04 22:58:31 UTC
https://openqa.fedoraproject.org/tests/877856/file/role_deploy_domain_controller_check-var_log.tar.gz is a /var/log tarball from a run of another test in the same group (that starts out identically) where the bug didn't happen; journal shows no indication of Plymouth crashing that I can see. I'll tweak the test to also upload the other two things you wanted and run it again...

Comment 45 Adam Williamson 2021-05-05 01:20:35 UTC
This is what we get from `systemctl status plymouth-quit.service` (first) and `ps aux | grep plymouth` (second) run shortly after boot (this is from a test where the bug didn't happen). Nothing looks out of line. The service exited successfully and there's no plymouth process hanging around.

● plymouth-quit.service - Terminate Plymouth Boot Screen
     Loaded: loaded (/usr/lib/systemd/system/plymouth-quit.service; static)
     Active: active (exited) since Tue 2021-05-04 19:12:02 EDT; 14s ago
    Process: 754 ExecStart=/usr/bin/plymouth quit (code=exited, status=0/SUCCESS)
   Main PID: 754 (code=exited, status=0/SUCCESS)
        CPU: 8ms

May 04 19:12:02 fedora systemd[1]: Starting Terminate Plymouth Boot Screen...
May 04 19:12:02 fedora systemd[1]: Finished Terminate Plymouth Boot Screen.



root         809  0.0  0.0   6132   844 tty1     S+   19:12   0:00 grep --color=auto plymouth

Comment 46 Hans de Goede 2021-05-05 08:57:52 UTC
Adam, thank you for the logs. Unfortunately I still have no clue what is going on here.

This is esp. weird because the systemd getty handling will also clear the console itself, as I mentioned earlier to reproduce the problem at all I had to edit /lib/systemd/system/getty@.service and change the following 2 lines:

TTYReset=yes
TTYVTDisallocate=yes

From yes to no.

So for some reason the clearing by systemd also does not seem to happen, maybe because plymouth is hanging and has some level of ownership which systemd tries to honor ?  If plymouth was doing a draw after the systemd-clear then its background should not be gray but black, since plymouth does not reset the palette to its own palette/colors on each draw; and the background of the getty drawn text is black so something has reset the palette to the defaults before getty outputs the login banner.

All in all this is very weird.

Comment 47 Adam Williamson 2021-05-05 18:29:12 UTC
Gotcha. I don't think it's happening as often as the previous case; that one from #c41 is I think the only case I've seen so far, actually. I'll try to keep an eye out for it happening again, but the previous case we were seeing it happen several times for each compose tested.

Comment 48 Adam Williamson 2021-05-28 22:26:11 UTC
it has happened a few more times :/ here's one from today's Rawhide:

https://openqa.fedoraproject.org/tests/898115#step/_console_wait_login/6


Note You need to log in before you can comment on or make changes to this bug.