Bug 1878892 - Unattended reboots broken with latest clevis/dracut
Summary: Unattended reboots broken with latest clevis/dracut
Keywords:
Status: ASSIGNED
Alias: None
Product: Fedora
Classification: Fedora
Component: clevis
Version: 34
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Sergio Correia
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-09-14 19:41 UTC by Ben Webb
Modified: 2021-07-12 23:52 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-25 16:44:28 UTC
Type: Bug
Embargoed:


Attachments
log from journalctl -u clevis-luks-askpass (1.63 KB, text/plain)
2020-09-14 20:08 UTC, Ben Webb
Output of "journalctl -b0 | grep 'Password Requests'" (1.97 KB, text/plain)
2020-09-21 22:11 UTC, Ben Webb
Full initrd journal (369.65 KB, text/plain)
2021-06-07 19:18 UTC, Ben Webb
Image at the point where the boot gets stuck (564.38 KB, image/jpeg)
2021-06-29 20:54 UTC, Ben Webb
Image at the point where the boot gets stuck, with rd.debug (853.00 KB, image/jpeg)
2021-06-29 20:54 UTC, Ben Webb
initrd journal with rd_NO_LUKS set (9.14 MB, text/plain)
2021-07-03 00:00 UTC, Ben Webb
initrd journal using rd.luks.name (8.37 MB, text/plain)
2021-07-03 00:01 UTC, Ben Webb
Image at the point where the boot gets stuck, with rd.auto (850.99 KB, image/jpeg)
2021-07-07 21:24 UTC, Ben Webb
Kickstart for basic F34 Server environment to reproduce (1021 bytes, text/plain)
2021-07-12 23:25 UTC, Ben Webb

Description Ben Webb 2020-09-14 19:41:28 UTC
Description of problem:
Since applying the latest clevis update, unattended reboots hang (at least some of the time) at unlocking the root device, until the user hits Enter.

Version-Release number of selected component (if applicable):
clevis-14-1.fc32.x86_64


How reproducible:
About 30% of the time, perhaps? When applying kernel-5.8.7-200.fc32.x86_64 in combination with clevis-14-1.fc32.x86_64, four of my ~12 or so desktops failed in this fashion. (The others booted normally.) I don't know at this point whether the issue is specific to these machines (seems unlikely; all machines are similar although not identical Dell desktops) or just occurs randomly (e.g. race condition).

Steps to Reproduce:
1. Set up clevis/dracut to unlock the encrypted root device on boot:
dnf install clevis clevis-luks clevis-dracut
cfg='{"t":1,"pins":{"tang":[{"url":"http://a.b.c.d"},{"url":"http://e.f.g.h"}]}}'
clevis luks bind -d /dev/md127 sss $cfg
dracut -f

2. Reboot the machine.
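
The sss configuration in step 1 is a JSON object holding a threshold ("t") and a list of tang URLs. A minimal sketch of building it from arguments, assuming a POSIX shell; `make_sss_cfg` is a hypothetical helper and not part of clevis, and it does no JSON escaping, so it only suits plain URLs like the ones above:

```shell
# Build an sss pin config for `clevis luks bind` from a threshold and a list
# of tang URLs. make_sss_cfg is a hypothetical helper, not part of clevis.
make_sss_cfg() {
    t=$1
    shift
    pins=
    for url in "$@"; do
        # Separate entries with commas; ${pins:+,} is empty on the first pass.
        pins="${pins}${pins:+,}{\"url\":\"${url}\"}"
    done
    printf '{"t":%s,"pins":{"tang":[%s]}}\n' "$t" "$pins"
}

cfg=$(make_sss_cfg 1 http://a.b.c.d http://e.f.g.h)
echo "$cfg"
# Quoting "$cfg" keeps the JSON as a single argument:
#   clevis luks bind -d /dev/md127 sss "$cfg"
```

The unquoted `$cfg` in step 1 works because the JSON contains no whitespace; quoting it is simply the safer habit.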

Actual results:
1. "Please enter passphrase for disk luks-xxx" prompt pops up.
2. Network is brought up (I boot without rhgb quiet so see the module output here).
3. Following prompts:
       Starting Forward Password Requests to Clevis...
[ OK ] Finished Forward Password Requests to Clevis.
4. Another "Please enter passphrase for disk luks-xxx" prompt, which hangs (I left this for over 8 hours).

If I hit Enter (empty passphrase) at this point boot proceeds normally, so the device is clearly being unlocked via clevis. But obviously this isn't ideal as it breaks an unattended reboot.

Expected results:
The "Please enter passphrase for disk luks-xxx" prompt pops up, then the disk is unlocked via clevis, then boot proceeds normally without user intervention. This happened with the previous clevis release, i.e. before applying the latest clevis-14-1.fc32.x86_64 package.

Comment 1 Sergio Correia 2020-09-14 19:58:46 UTC
(In reply to Ben Webb from comment #0)
> Description of problem:
> Since applying the latest clevis update, unattended reboots hang (at least
> some of the time) at unlocking the root device, until the user hits Enter.
> 
> Version-Release number of selected component (if applicable):
> clevis-14-1.fc32.x86_64

Would you be able to check if clevis-14-4.fc32 changes anything? It's currently in [testing]: https://koji.fedoraproject.org/koji/buildinfo?buildID=1607396

> 
> 
> How reproducible:
> About 30% of the time, perhaps? When applying kernel-5.8.7-200.fc32.x86_64
> in combination with clevis-14-1.fc32.x86_64, four of my ~12 or so desktops
> failed in this fashion. (The others booted normally.) I don't know at this
> point whether the issue is specific to these machines (seems unlikely; all
> machines are similar although not identical Dell desktops) or just occurs
> randomly (e.g. race condition).
> 
> Steps to Reproduce:
> 1. Set up clevis/dracut to unlock the encrypted root device on boot:
> dnf install clevis clevis-luks clevis-dracut
> cfg='{"t":1,"pins":{"tang":[{"url":"http://a.b.c.d"},{"url":"http://e.f.g.
> h"}]}}'
> clevis luks bind -d /dev/md127 sss $cfg
> dracut -f
> 

Are you unlocking a single device? In this case the root device?


> 2. Reboot the machine.
> 
> Actual results:
> 1. "Please enter passphrase for disk luks-xxx" prompt pops up.
> 2. Network is brought up (I boot without rhgb quiet so see the module output
> here).
> 3. Following prompts:
>        Starting Forward Password Requests to Clevis...
> [ OK ] Finished Forward Password Requests to Clevis.

Could you provide the log from journalctl -u clevis-luks-askpass?

Comment 2 Ben Webb 2020-09-14 20:08:42 UTC
Created attachment 1714849 [details]
log from journalctl -u clevis-luks-askpass

Comment 3 Ben Webb 2020-09-14 20:12:37 UTC
(In reply to Sergio Correia from comment #1)
> Would you be able to check if clevis-14-4.fc32 changes anything?

Sure, I can try that with the next kernel update... but it won't be until next week as we're not on site that often due to covid/work from home.

> Are you unlocking a single device? In this case the root device?

Yes to both.

> Could you provide the log from journalctl -u clevis-luks-askpass?

Sure, I added that as an attachment. I rebooted the machine at 21:31 and it got stuck until I came in the next day at 10:44 to hit Enter.

Comment 4 Sergio Correia 2020-09-17 12:52:41 UTC
(In reply to Ben Webb from comment #3)
> (In reply to Sergio Correia from comment #1)
> > Would you be able to check if clevis-14-4.fc32 changes anything?
> 
> Sure, I can try that with the next kernel update... but it won't be until
> next week as we're not on site that often due to covid/work from home.
> 
> > Are you unlocking a single device? In this case the root device?
> 
> Yes to both.
> 
> > Could you provide the log from journalctl -u clevis-luks-askpass?
> 
> Sure, I added that as an attachment. I rebooted the machine at 21:31 and it
> got stuck until I came in the next day at 10:44 to hit Enter.


Thanks. I believe I reproduced the issue here and I am testing a tentative fix.

If you could test this scratch build and report back the results, I would appreciate it: https://koji.fedoraproject.org/koji/taskinfo?taskID=51609430
In case you try this package, please rebuild your initramfs with dracut -f before rebooting.

Comment 5 Ben Webb 2020-09-17 17:57:24 UTC
(In reply to Sergio Correia from comment #4)
> Thanks. I believe I reproduced the issue here and I am testing a tentative
> fix.
> 
> If you could test this scratch build and report back the results I would
> appreciate

Will do!

Comment 6 Renaud Métrich 2020-09-17 20:00:43 UTC
It seems to fix the issue, but I can see the Passphrase prompt remains there forever (until GDM starts).

Comment 7 Sergio Correia 2020-09-17 20:10:51 UTC
(In reply to Renaud Métrich from comment #6)
> It seems to fix the issue, 

Thanks for testing!

> but I can see the Passphrase prompt remains there
> forever (until GDM starts).

Do you mean the plymouth prompt? If so, there is this BZ to track this issue: https://bugzilla.redhat.com/show_bug.cgi?id=1672369
Reported upstream here as well: https://gitlab.freedesktop.org/plymouth/plymouth/-/issues/126

Comment 8 Renaud Métrich 2020-09-17 20:21:27 UTC
Plymouth prompt but also "Please enter passphrase for unlocking ..." in text mode.
I need to dig more if it's related, since after switching to "network-legacy" dracut module (due to a NetworkManager issue), the prompt was gone on next reboot.

Comment 9 Renaud Métrich 2020-09-17 20:24:01 UTC
Sorry, also happens with "network-legacy", so may be related to the fix.

Comment 10 Ben Webb 2020-09-21 21:00:59 UTC
(In reply to Sergio Correia from comment #4)
> If you could test this scratch build and report back the results I would
> appreciate: https://koji.fedoraproject.org/koji/taskinfo?taskID=51609430

Unfortunately that build doesn't work either - same issue, some of my machines get stuck during boot but will continue to boot normally if someone physically hits "Enter". Applied via:

$ rpm -q clevis
clevis-14-4.fc32.bz1878892.x86_64
$ rpm -q kernel
kernel-5.8.7-200.fc32.x86_64
kernel-5.8.8-200.fc32.x86_64
kernel-5.8.9-200.fc32.x86_64
$ sudo dracut -f 5.8.9-200.fc32.x86_64
$ sudo reboot

Comment 11 Sergio Correia 2020-09-21 21:26:39 UTC
(In reply to Ben Webb from comment #10)
> (In reply to Sergio Correia from comment #4)
> > If you could test this scratch build and report back the results I would
> > appreciate: https://koji.fedoraproject.org/koji/taskinfo?taskID=51609430
> 
> Unfortunately that build doesn't work either - same issue, some of my
> machines get stuck during boot but will continue to boot normally if someone
> physically hits "Enter". Applied via:
> 
> $ rpm -q clevis
> clevis-14-4.fc32.bz1878892.x86_64

Did you update all clevis packages?
$ rpm -qa | grep clevis

Comment 12 Ben Webb 2020-09-21 21:28:33 UTC
(In reply to Sergio Correia from comment #11)
> Did you update all clevis packages?
> $ rpm -qa | grep clevis

Yes.

$ rpm -qa|grep clevis
clevis-systemd-14-4.fc32.bz1878892.x86_64
clevis-dracut-14-4.fc32.bz1878892.x86_64
clevis-14-4.fc32.bz1878892.x86_64
clevis-luks-14-4.fc32.bz1878892.x86_64

Comment 13 Sergio Correia 2020-09-21 21:35:57 UTC
(In reply to Ben Webb from comment #12)
> (In reply to Sergio Correia from comment #11)
> > Did you update all clevis packages?
> > $ rpm -qa | grep clevis
> 
> Yes.
> 
> $ rpm -qa|grep clevis
> clevis-systemd-14-4.fc32.bz1878892.x86_64
> clevis-dracut-14-4.fc32.bz1878892.x86_64
> clevis-14-4.fc32.bz1878892.x86_64
> clevis-luks-14-4.fc32.bz1878892.x86_64

Alright. Could you please share the output of "journalctl -b0 | grep 'Password Requests'" from one of the machines that presented this issue?

Comment 14 Ben Webb 2020-09-21 22:11:19 UTC
Created attachment 1715605 [details]
Output of "journalctl -b0 | grep 'Password Requests'"

Comment 15 Fedora Program Management 2021-04-29 16:37:44 UTC
This message is a reminder that Fedora 32 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora 32 on 2021-05-25.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
Fedora 'version' of '32'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 32 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora 
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 16 Ben Cotton 2021-05-25 16:44:28 UTC
Fedora 32 changed to end-of-life (EOL) status on 2021-05-25. Fedora 32 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

Comment 17 Ben Webb 2021-05-25 17:14:11 UTC
FWIW, the issue (or a very similar one) is still present in F33 and F34 (clevis-18-1.fc34.x86_64) - about 20-30% of my machines do not boot.  Everything works very reliably though if I pin clevis to 13-1 (although BTW recently I've had to pin jose to 10-9 too; older clevis does not work with jose-11-1.fc34.x86_64, failing at boot with "JWE missing required 'clevis.tang.adv' header parameter!" suggesting that newer jose is missing a Conflicts: or similar with older clevis).

Happy to provide more information if you need it to help track down this issue - although I am rarely on site.

Comment 18 Sergio Correia 2021-05-25 19:11:35 UTC
Thanks for reopening this, Ben. Unfortunately I have been unable to reproduce it, but I have a suspicion about what could have caused the issue, so I will prepare a scratch build and it would be much appreciated if you could test it out. Good point about jose and "Conflicts", also.

Comment 19 Sergio Correia 2021-05-26 02:30:56 UTC
@Ben: It would be great if you could test this build and report back whether it helps with the issue at hand: https://koji.fedoraproject.org/koji/taskinfo?taskID=68750126 -- please rebuild the initramfs after installing the updated packages.

Comment 20 Ben Webb 2021-05-26 20:06:15 UTC
Fortuitous timing! I was on site today. Was able to apply this scratch build to all but one of my machines (the one holdout is still on F33), rebuild the initrd, and every machine now boots normally. So I think you got it. Thanks!

For reference here's the package load on one machine:

$ rpm -qa|grep 'clevis\|jose'
clevis-pin-tpm2-0.3.0-1.fc34.x86_64
libjose-11-1.fc34.x86_64
jose-11-1.fc34.x86_64
clevis-18-1.fc34.bz1878892.01.x86_64
clevis-luks-18-1.fc34.bz1878892.01.x86_64
clevis-systemd-18-1.fc34.bz1878892.01.x86_64
clevis-dracut-18-1.fc34.bz1878892.01.x86_64
$ uname -r
5.12.6-300.fc34.x86_64

Comment 21 Sergio Correia 2021-05-27 17:34:45 UTC
Good news, thanks for testing!

For reference, in this scratch build I reverted upstream commit cbb64c4 ("dracut: favour systemd units over dracut hooks")[1].
@Jonathan, would you have any insights here?

[1] https://github.com/latchset/clevis/commit/cbb64c4efb66b57a07fe1afbee10b012523ab71c

Comment 22 Jonathan Lebon 2021-06-02 15:52:14 UTC
Just to make sure I understand correctly, is the latest issue you're hitting still that the systemd clevis service *does* unlock the device, but for whatever reason, you still have to press Enter before it goes forward?

Can you provide the full journal logs of the initrd when that happens?

Can you also provide the versions of dracut and systemd? Can you verify that your initrd contains all of: `cryptsetup.target`, `cryptsetup-pre.target`, and `remote-cryptsetup.target`? (You can do e.g. `lsinitrd | grep cryptsetup`).
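
The check for the three targets can be scripted rather than read off by eye. A minimal sketch, assuming a POSIX shell and that the unit files sit under a `systemd/system/` directory inside the initrd; `missing_cryptsetup_targets` is a hypothetical helper name, not a dracut tool:

```shell
# Read `lsinitrd` output on stdin and print each required unit that is absent.
# missing_cryptsetup_targets is a hypothetical helper name.
missing_cryptsetup_targets() {
    listing=$(cat)
    for unit in cryptsetup.target cryptsetup-pre.target remote-cryptsetup.target; do
        # Match the unit file itself (a path ending in systemd/system/<unit>),
        # not symlink lines ("... -> ../cryptsetup.target") or *.wants dirs.
        if ! printf '%s\n' "$listing" | grep -q "/systemd/system/${unit}\$"; then
            echo "missing: ${unit}"
        fi
    done
}

# Usage (prints nothing when all three units are present):
#   lsinitrd | missing_cryptsetup_targets
```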

Comment 23 Ben Webb 2021-06-04 19:13:20 UTC
(In reply to Jonathan Lebon from comment #22)
> Just to make sure I understand correctly, is the latest issue you're hitting
> still that the systemd clevis service *does* unlock the device, but for
> whatever reason, you still have to press Enter before it goes forward?

That was certainly the issue with F32, which was never resolved. I can confirm that the system still doesn't boot with F34, but can't be 100% sure it was the same issue. I can check next time I'm on site and get the logs/versions you're asking for (the machines all boot right now because they are using the scratch build Sergio provided; I will need to 'break' one first).

> Can you provide the full journal logs of the initrd when that happens?

journalctl --system ? Or something else?

Comment 24 Jonathan Lebon 2021-06-04 19:21:59 UTC
(In reply to Ben Webb from comment #23)
> (In reply to Jonathan Lebon from comment #22)
> > Just to make sure I understand correctly, is the latest issue you're hitting
> > still that the systemd clevis service *does* unlock the device, but for
> > whatever reason, you still have to press Enter before it goes forward?
> 
> That was certainly the issue with F32, which was never resolved. I can
> confirm that the system still doesn't boot with F34, but can't be 100% sure
> it was the same issue. I can check next time I'm on site and get the
> logs/versions you're asking for (the machines all boot right now because
> they are using the scratch build Sergio provided; I will need to 'break' one
> first).

Gotcha. Yeah, would be good to make sure it's the same failure or something else happening here.

> > Can you provide the full journal logs of the initrd when that happens?
> 
> journalctl --system ? Or something else?

Yes, something like `journalctl --system -b 0`. If you're in the initramfs emergency prompt, it might be hard to get the logs off the machine unless you can connect to it from another machine via serial. Alternatively, since you should have networking in the initramfs (you're using Tang pinning), you might be able to POST the logs somewhere if those machines have public access (see example curl invocations in https://paste.centos.org/api).

Comment 25 Jonathan Lebon 2021-06-04 19:29:14 UTC
(In reply to Jonathan Lebon from comment #24)
> (see example curl invocations in https://paste.centos.org/api).

Ahh too bad, that actually requires an API key and at least this CentOS instance doesn't let you request your own key AFAICT. There are probably other pastebins out there which do if you really need to do this.

Comment 26 Ben Webb 2021-06-07 19:18:03 UTC
Created attachment 1789276 [details]
Full initrd journal

Comment 27 Ben Webb 2021-06-07 19:30:07 UTC
(In reply to Jonathan Lebon from comment #22)
> Can you provide the full journal logs of the initrd when that happens?

See attachment https://bugzilla.redhat.com/attachment.cgi?id=1789276

> Can you also provide the versions of dracut and systemd? Can you verify that
> your initrd contains all of: `cryptsetup.target`, `cryptsetup-pre.target`,
> and `remote-cryptsetup.target`? (You can do e.g. `lsinitrd | grep
> cryptsetup`).

This is what I ran on one of my affected systems:

# dnf distro-sync --refresh
# rpm -q dracut
dracut-054-12.git20210521.fc34.x86_64
# rpm -q clevis
clevis-18-1.fc34.x86_64
# rpm -q systemd
systemd-248.3-1.fc34.x86_64
# uname -r
5.12.8-300.fc34.x86_64
# dracut -f
# lsinitrd /boot/initramfs-5.12.8-300.fc34.x86_64.img | grep cryptsetup
drwxr-xr-x   2 root     root            0 May 21 05:25 etc/systemd/system/cryptsetup.target.wants
lrwxrwxrwx   1 root     root           48 May 21 05:25 etc/systemd/system/cryptsetup.target.wants/clevis-luks-askpass.path -> /usr/lib/systemd/system/clevis-luks-askpass.path
-rwxr-xr-x   1 root     root       491256 May 21 05:25 usr/lib64/libcryptsetup.so.12.6.0
lrwxrwxrwx   1 root     root           35 May 21 05:25 usr/lib64/libcryptsetup.so.12 -> ../../lib64/libcryptsetup.so.12.6.0
-rw-r--r--   1 root     root          473 May 15 09:33 usr/lib/systemd/system/cryptsetup-pre.target
-rw-r--r--   1 root     root          420 May 15 09:33 usr/lib/systemd/system/cryptsetup.target
-rwxr-xr-x   1 root     root        70168 May 15 10:14 usr/lib/systemd/systemd-cryptsetup
-rwxr-xr-x   1 root     root        41280 May 15 10:14 usr/lib/systemd/system-generators/systemd-cryptsetup-generator
lrwxrwxrwx   1 root     root           27 May 21 05:25 usr/lib/systemd/system/initrd-root-device.target.wants/remote-cryptsetup.target -> ../remote-cryptsetup.target
-rw-r--r--   1 root     root          557 May 15 09:33 usr/lib/systemd/system/remote-cryptsetup.target
lrwxrwxrwx   1 root     root           20 May 21 05:25 usr/lib/systemd/system/sysinit.target.wants/cryptsetup.target -> ../cryptsetup.target
-rw-r--r--   1 root     root           35 May 21 05:25 usr/lib/tmpfiles.d/cryptsetup.conf
-rwxr-xr-x   1 root     root       142104 May 21 05:25 usr/sbin/cryptsetup
# reboot

The boot hangs at the prompt for the password for the luks volume. On this occasion, hitting Enter doesn't do anything (I just get another identical prompt). Eventually it times out, and after entering the root password I get the emergency shell. I ran journalctl --system -b 0 in that shell and uploaded the result (with mild redaction of hostname/IP) here.

I can't claim to know what's going on here, but it certainly looks to me as if NetworkManager is being called too early (or with too short a timeout) and it gives up trying to configure the network before the NIC gets a link. I've previously hacked around that (see https://bugzilla.redhat.com/show_bug.cgi?id=1702524#c23) but that hack only worked with older clevis (perhaps because it only "fixed" the dracut hook, not the systemd service?)

Comment 28 Jonathan Lebon 2021-06-08 20:27:39 UTC
Hmm, I think you're missing `rd.luks.options=_netdev`. This ensures that networking is up before systemd tries to unlock the LUKS device. Can you try adding that to the kernel command-line?
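
Whether a given boot actually carried the argument can be confirmed from /proc/cmdline. A minimal sketch, assuming a POSIX shell on Linux; `has_karg` is a hypothetical helper, not part of dracut or systemd:

```shell
# Succeed if the space-separated kernel command line in $1 contains the exact
# argument $2. has_karg is a hypothetical helper name.
has_karg() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

cmdline=$(cat /proc/cmdline 2>/dev/null)
for karg in rd.neednet=1 rd.luks.options=_netdev; do
    if has_karg "$cmdline" "$karg"; then
        echo "present: $karg"
    else
        echo "absent:  $karg"
    fi
done
# One way to add the argument on Fedora (run as root):
#   grubby --update-kernel=ALL --args="rd.luks.options=_netdev"
```

Padding both sides with spaces makes the match exact, so `rd.neednet=10` would not be mistaken for `rd.neednet=1`.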

Comment 29 Ben Webb 2021-06-08 21:29:23 UTC
(In reply to Jonathan Lebon from comment #28)
> Hmm, I think you're missing `rd.luks.options=_netdev`.

I believe that I tried adding `_netdev` to the root filesystem in `/etc/fstab` in the past with no effect - is this doing something different? (You no doubt saw that I also have `rd.neednet=1`.) But I'll try this next time I'm on site and report back, thanks!

Comment 30 Jonathan Lebon 2021-06-09 15:13:16 UTC
(In reply to Ben Webb from comment #29)
> (In reply to Jonathan Lebon from comment #28)
> > Hmm, I think you're missing `rd.luks.options=_netdev`.
> 
> I believe that I tried adding `_netdev` to the root filesystem in
> `/etc/fstab` in the past with no effect - is this doing something different?

Yes, they're related but different. You'll want to put it on the block device itself (it can be added as an option in /etc/crypttab, but you can also use rd.luks.options). Otherwise, systemd won't know that unlocking requires networking, even if the subsequent filesystem mount itself is marked as such.
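
For the /etc/crypttab route, the option goes in the fourth field of the device's entry. A minimal sketch of checking for it, assuming a POSIX shell with awk; the entry name and UUID below are placeholders and `crypttab_has_netdev` is a hypothetical helper:

```shell
# Succeed if the crypttab-style file $1 has an entry named $2 whose options
# field (4th column) includes _netdev. Hypothetical helper name.
crypttab_has_netdev() {
    awk -v name="$2" '
        $1 == name {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++) if (opts[i] == "_netdev") found = 1
        }
        END { exit !found }
    ' "$1"
}

tab=$(mktemp)
# Example entry with placeholder values; the real file is /etc/crypttab.
echo 'luks-root UUID=00000000-0000-4000-8000-000000000000 none _netdev' > "$tab"
crypttab_has_netdev "$tab" luks-root && echo "luks-root is marked _netdev"
rm -f "$tab"
```

After editing /etc/crypttab the initramfs would need rebuilding with `dracut -f`, as with the other changes in this report.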

Comment 31 Ben Webb 2021-06-11 23:55:05 UTC
(In reply to Jonathan Lebon from comment #28)
> Hmm, I think you're missing `rd.luks.options=_netdev`. This ensures that
> networking is up before systemd tries to unlock the LUKS device. Can you try
> adding that to the kernel command-line?

Unfortunately while that made a difference, it made things worse, not better! With that option, I don't see any output related to clevis unlocking, and no LUKS password prompt; it just gets stuck after bringing up the network (which incidentally has never been an issue - I can always ping these machines when they get stuck). I left the machine for an hour or so and it didn't fail over to the emergency shell, so no journalctl output, sorry (although I do have a photo of the screen if that helps!)

FWIW, the last output I see on boot is NetworkManager starting up (looks OK, ends up in CONNECTED_GLOBAL state) followed by "Finished nm-wait-online-initrd.service" and "Starting dracut initqueue hook..."

This is with the same clevis/dracut/systemd as comment #27, although now with kernel 5.12.9-300.fc34.x86_64.

Comment 32 Jonathan Lebon 2021-06-28 18:13:30 UTC
(In reply to Ben Webb from comment #31)
> (In reply to Jonathan Lebon from comment #28)
> > Hmm, I think you're missing `rd.luks.options=_netdev`. This ensures that
> > networking is up before systemd tries to unlock the LUKS device. Can you try
> > adding that to the kernel command-line?
> 
> Unfortunately while that made a difference, it made things worse, not
> better! With that option, I don't see any output related to clevis
> unlocking, and no LUKS password prompt; it just gets stuck after bringing up
> the network (which incidentally has never been an issue - I can always ping
> these machines when they get stuck). I left the machine for an hour or so
> and it didn't fail over to the emergency shell, so no journalctl output,
> sorry (although I do have a photo of the screen if that helps!)

Yeah, picture would help if you still have it!

> FWIW, the last output I see on boot is NetworkManager starting up (looks OK,
> ends up in CONNECTED_GLOBAL state) followed by "Finished
> nm-wait-online-initrd.service" and "Starting dracut initqueue hook..."

Hmm, can you try booting with rd.debug so we can try to have a peek into what dracut is waiting on?

Comment 33 Ben Webb 2021-06-29 20:54:13 UTC
Created attachment 1796022 [details]
Image at the point where the boot gets stuck

Comment 34 Ben Webb 2021-06-29 20:54:41 UTC
Created attachment 1796023 [details]
Image at the point where the boot gets stuck, with rd.debug

Comment 35 Ben Webb 2021-06-29 21:00:23 UTC
(In reply to Jonathan Lebon from comment #32)
> Yeah, picture would help if you still have it!

See attachment 1796022 [details].

> Hmm, can you try booting with rd.debug so we can try to have a peek into
> what dracut is waiting on?

I added `rd.shell rd.debug log_buf_len=1M rd.luks.options=_netdev` to the command line and it still refused to give me an emergency shell (although I only left it for ~10 minutes this time) so you get a low-tech photo again. See attachment 1796023 [details]. This is with dracut-055-2.fc34.x86_64, clevis-18-1.fc34.x86_64, systemd-248.3-1.fc34.x86_64, kernel 5.12.11-300.fc34.x86_64.

Comment 36 Jonathan Lebon 2021-06-30 14:27:45 UTC
Hmm OK yes, I think I see what's going on. I think the 90crypt dracut module is conflicting with systemd-cryptsetup-generator.

As a test, can you try using rd_NO_LUKS? 90crypt respects this but not systemd-cryptsetup-generator.

Are you using `rd.luks.uuid=<UUID>` by chance? As a second independent test, can you instead try using `rd.luks.name=$UUID=$NAME`? 90crypt doesn't handle that karg, but systemd-cryptsetup-generator does.
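
The second test amounts to rewriting one kernel argument. A minimal sketch of that rewrite on a command-line string, assuming a POSIX shell with `sed -E`, and assuming `luks-<UUID>` is the wanted mapping name (any `$NAME` would do); `uuid_to_name_karg` is a hypothetical helper and the UUID is a placeholder:

```shell
# Rewrite every rd.luks.uuid=<UUID> karg as rd.luks.name=<UUID>=luks-<UUID>.
# An optional "luks-" prefix on the UUID is dropped before it is reused.
# uuid_to_name_karg is a hypothetical helper name.
uuid_to_name_karg() {
    printf '%s\n' "$1" |
        sed -E 's/(^| )rd\.luks\.uuid=(luks-)?([0-9a-fA-F-]+)/\1rd.luks.name=\3=luks-\3/g'
}

uuid_to_name_karg 'ro rd.luks.uuid=luks-0123abcd-0000-4000-8000-0123456789ab rd.neednet=1'
```

The function only transforms a string; applying the result to the boot entry is a separate step.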

Comment 37 Ben Webb 2021-07-03 00:00:39 UTC
Created attachment 1797373 [details]
initrd journal with rd_NO_LUKS set

Comment 38 Ben Webb 2021-07-03 00:01:26 UTC
Created attachment 1797374 [details]
initrd journal using rd.luks.name

Comment 39 Ben Webb 2021-07-03 00:05:39 UTC
(In reply to Jonathan Lebon from comment #36)
> As a test, can you try using rd_NO_LUKS? 90crypt respects this but not
> systemd-cryptsetup-generator.

With that it looks like dracut can't find the root device. Eventually it gives up and gives me an emergency shell. Contents of `journalctl --system -b 0` as attachment 1797373 [details].

> Are you using `rd.luks.uuid=<UUID>` by chance? As a second independent test,
> can you instead try using `rd.luks.name=$UUID=$NAME`? 90crypt doesn't handle
> that karg, but systemd-cryptsetup-generator does.

I am, so I tried rd.luks.name as you suggest. This also results in not being able to find the root device, attachment 1797374 [details].

Comment 40 Jonathan Lebon 2021-07-05 19:12:24 UTC
Can you describe your setup? If I understand correctly from your logs, it seems like you're doing: root and swap on LVM on LUKS on RAID. Is that right?

I think the problem now is that we're hitting a dependency cycle very similar to https://github.com/dracutdevs/dracut/pull/931:
- systemd-cryptsetup is waiting for remote-fs-pre.target
- remote-fs-pre.target is waiting for dracut-initqueue.service
- dracut-initqueue.service is waiting for /dev/vg0/root to appear
- /dev/vg0/root is on LUKS, so won't appear until systemd-cryptsetup runs

As another test, on top of rd_NO_LUKS, can you also add rd.auto and remove rd.lvm.lv=vg0/root and rd.lvm.lv=vg0/swap?
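
The four waits listed above form a closed loop, which can be checked mechanically. A minimal sketch, assuming a POSIX shell with awk and that each unit waits on exactly one other, as in this list; `has_wait_cycle` is a hypothetical helper and not how systemd itself detects ordering cycles:

```shell
# Each input line is "<waiter> <waited-on>". Starting from the first waiter,
# follow the chain and report whether it leads back to the start.
# has_wait_cycle is a hypothetical helper name.
has_wait_cycle() {
    awk '
        NR == 1 { start = $1 }
        { next_of[$1] = $2 }
        END {
            node = start
            for (i = 0; i < NR; i++) {
                node = next_of[node]
                if (node == start) { print "cycle"; exit 0 }
            }
            print "no cycle"
        }
    '
}

printf '%s\n' \
    'systemd-cryptsetup remote-fs-pre.target' \
    'remote-fs-pre.target dracut-initqueue.service' \
    'dracut-initqueue.service /dev/vg0/root' \
    '/dev/vg0/root systemd-cryptsetup' |
    has_wait_cycle
```

Dropping any one of the four lines breaks the loop, which is the idea behind removing the rd.lvm.lv arguments so that the initqueue no longer waits on /dev/vg0/root.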

Comment 41 Ben Webb 2021-07-07 21:24:24 UTC
Created attachment 1799438 [details]
Image at the point where the boot gets stuck, with rd.auto

Comment 42 Ben Webb 2021-07-07 21:30:21 UTC
(In reply to Jonathan Lebon from comment #40)
> Can you describe your setup? If I understand correctly from your logs, it
> seems like you're doing: root and swap on LVM on LUKS on RAID. Is that right?

Yes. Two physical disks each with three partitions. /boot and /boot/efi are each one partition on each disk mirrored with mdraid; the remaining partition is also a RAID 1, with as you said, LVM on LUKS on it. The LUKS device is unlocked with clevis talking to one of a pair of tang servers.

> As another test, on top of rd_NO_LUKS, can you also add rd.auto and remove
> rd.lvm.lv=vg0/root and rd.lvm.lv=vg0/swap?

This also gets stuck at "Remote Encrypted Volumes" but it does look like it's at least trying to unlock the root device. I didn't get a shell after 15 minutes or so, so attachment #1799438 [details] has a photo.

Comment 43 Ben Webb 2021-07-12 23:25:37 UTC
Created attachment 1801000 [details]
Kickstart for basic F34 Server environment to reproduce

Comment 44 Ben Webb 2021-07-12 23:52:29 UTC
(In reply to Jonathan Lebon from comment #40)
> Can you describe your setup? If I understand correctly from your logs, it
> seems like you're doing: root and swap on LVM on LUKS on RAID. Is that right?

I was able to reproduce something pretty similar to my setup in a VM, so I can easily attach a serial console or reboot it without bothering users of my actual hardware (or perhaps you can reproduce at your end). See attachment 1801000 [details]. This is a kickstart for Fedora 34 Server, which I installed via VirtualBox. No RAID in this setup, but I added an encrypted volume group via the installer's custom partitioning. With updates applied this boots with the 5.12.14 kernel, dracut 055-3, clevis-18, and prompts for the encryption password as you would expect.

Next I set up clevis/tang to talk to my two tang servers (at MYIP1 and MYIP2) with

dnf install clevis clevis-luks clevis-dracut
cfg='{"t":1,"pins":{"tang":[{"url":"http://MYIP1"},{"url":"http://MYIP2"}]}}'
clevis luks bind -d /dev/sda2 sss $cfg
dracut -f

Then added `ip=dhcp rd.neednet=1` to the kernel command line and rebooted.

This does not work, same as on my physical F34 desktops. Looks like the network doesn't come up, and I see no output from clevis. (By "does not work", I mean I can still type in the encryption password and boot normally, but clevis doesn't unlock the device automatically.)

It *does* work if I use the clevis scratch build provided in comment #19 (clevis bz1878892.01) and downgrade dracut to the original F34 version (`dnf distro-sync --disablerepo=updates dracut\*` yielding dracut 053-4).

I also tried previous F34 updates dracut versions (054-6, 054-12, and 055-2) and these all fall into the "does not work" category.

