Note: This is a public test instance of Red Hat Bugzilla. The data contained within is a snapshot of the live data so any changes you make will not be reflected in the production Bugzilla. Email is disabled so feel free to test any aspect of the site that you want. File any problems you find or give feedback at bugzilla.redhat.com.
Bug 1945901
Description
Tim Rhodes
2021-04-02 15:35:30 UTC
crypto is just a library and is unrelated to luks. Tentatively reassigning to dracut based on the information provided. >System fails to recovery shell on boot
Can you attach the /run/initramfs/rdsosreport.txt that dracut produced when it ran into this problem? This might be a release blocking bug.
Created attachment 1768646 [details]
rdsosreport.txt
/run/initramfs/rdsosreport.txt
OK so the layout looks like it's LVM+LUKS, and it's getting stuck on the encrypted swap LV. But it's both systemd and dracut getting hung up on not finding it: [ 92.033709] fedora systemd[1]: dev-disk-by\x2duuid-3355902c\x2db6d5\x2d48e0\x2d96b8\x2d00f4f6df8930.device: Job dev-disk-by\x2duuid-3355902c\x2db6d5\x2d48e0\x2d96b8\x2d00f4f6df8930.device/start timed out. [ 92.034574] fedora systemd[1]: Timed out waiting for device /dev/disk/by-uuid/3355902c-b6d5-48e0-96b8-00f4f6df8930. [ 92.034707] fedora systemd[1]: Dependency failed for Cryptography Setup for luks-3355902c-b6d5-48e0-96b8-00f4f6df8930. and [ 222.757973] fedora dracut-initqueue[534]: [ -e /dev/disk/by-id/dm-uuid-CRYPT-LUKS?-*3355902cb6d548e096b800f4f6df8930*-* ] || exit 1" So what does this look like in the *working* case? Can you boot a working menu entry from GRUB, adding the rd.debug boot parameter, boot normally, and then attach a journal? Attach all of it, or snip off anything after Switching root, doesn't matter either way. journalctl -b -o short-monotonic --no-hostname > works-journal.log And then attach the log to this bug. Thanks! Created attachment 1768673 [details]
journalctl -b -o short-monotonic --no-hostname > works-journal.log from good boot
Log from good boot
Created attachment 1768674 [details]
dmesg from good boot
dmesg fron good boot
Missed the rd.debug, collecting that now. Created attachment 1768675 [details]
journalctl -b -o short-monotonic --no-hostname > works-journal.log from good boot with rd.debug
journalctl -b -o short-monotonic --no-hostname > works-journal.log from good boot with rd.debug
Created attachment 1768676 [details]
dmesg from good boot with rd.debug
dmesg from good boot with rd.debug
From the failed boot rdsosreport: This huge time gap is suspicious. [ 4.554627] fedora kernel: sd 0:0:0:0: [sda] Attached SCSI removable disk [ 92.033709] fedora systemd[1]: dev-disk-by\x2duuid-3355902c\x2db6d5\x2d48e0\x2d96b8\x2d00f4f6df8930.device: Job dev-disk-by\x2duuid-3355902c\x2db6d5\x2d48e0\x2d96b8\x2d00f4f6df8930.device/start timed out. None of the LV's are active. Since the LV's are encrypted, cryptsetup fails to unlock because crypto devices aren't found. Somehow though, blkid sees the LV's (as active) and the fact they have LUKS. [ 92.191475] fedora dracut-initqueue[581]: Scanning devices nvme0n1p2 nvme0n1p3 for LVM logical volumes sysvg/root sysvg/swap [ 92.260545] fedora dracut-initqueue[606]: inactive '/dev/sysvg/root' [40.00 GiB] inherit [ 92.260881] fedora dracut-initqueue[606]: inactive '/dev/sysvg/swap' [<7.61 GiB] inherit [ 92.260881] fedora dracut-initqueue[606]: inactive '/dev/homevg/home' [400.00 GiB] inherit [ 222.754905] fedora dracut-initqueue[534]: Warning: dracut-initqueue: timeout, still waiting for following initqueue hooks: [ 222.757973] fedora dracut-initqueue[534]: Warning: /lib/dracut/hooks/initqueue/finished/90-crypt.sh: "[ -e /dev/disk/by-id/dm-uuid-CRYPT-LUKS?-*589424836a714b8b933996c540823811*-* ] || exit 1 On the failed boot, there are stall in the console message output, there a 1m30s start job that times out, then it reaches a message Starting dracut initqueue hook..., another stall, then a recursion of messages (100's of messages) before it fails to the recovery shell. Ok in the working case we see it get activated: [ 5.242944] dracut-initqueue[592]: + lvm lvchange --yes -K -ay --ignorelockingfailure --ignoremonitoring --poll n sysvg/root I'd like to see the failing boot with rd.debug as well because the rdsosreport doesn't have this level of detail. That is, please do journalctl -b -o short-monotonic --no-hostname > fails-journal.log ##from failed boot, with rd.debug parameter. Created attachment 1768678 [details]
rdsosreport.txt from failed boot with rd.debug
rdsosreport.txt from failed boot with rd.debug
I did not note and attempt the journalctl in the recovery shell. I'll try that again. Created attachment 1768690 [details]
journalctl -b -o short-monotonic --no-hostname > fails-journal.log from failed boot in emergency mode shell
journalctl -b -o short-monotonic --no-hostname > fails-journal.log from failed boot in emergency mode shell
Yeah in the fails case it's super late: [ 92.334735] dracut-initqueue[584]: + lvm lvchange --yes -K -ay --ignorelockingfailure --ignoremonitoring --poll n sysvg/root And there is still a huge gap with nothing in the log: [ 12.768554] systemd-udevd[402]: Worker [425] exited [ 91.947152] systemd[1]: dev-disk-by\x2duuid-3355902c\x2db6d5\x2d48e0\x2d96b8\x2d00f4f6df8930.device: Job dev-disk-by\x2duuid-3355902c\x2db6d5\x2d48e0\x2d96b8\x2d00f4f6df8930.device/start timed out. Another possible factor is this is systemd or a negative interaction between systemd(-udev) and dracut because dracut isn't the only thing that changes in the failing case: The working case has an older version of systemd in the initramfs. [ 1.864220] systemd[1]: systemd v248~rc2-6.fc34 running in system mode. The failing case has the release version of systemd in the initramfs. [ 1.882006] systemd[1]: systemd v248-1.fc34 running in system mode. Can you retry the failing case, instead of rd.debug, use both systemd.log_level=debug udev.log-priority=debug And name it something like journalctl -b -o short-monotonic --no-hostname > fails-moredebug-journal.log Created attachment 1768694 [details]
journalctl -b -o short-monotnic --no-hostname with systemd.log_level=debug udev.log-priority=debug
journalctl -b -o short-monotnic --no-hostname with systemd.log_level=debug udev.log-priority=debug
This is a smaller file from the prev data collection. Here's what /proc/cmdline looked like in the emregency shell...
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-5.11.11-300.fc34.x86_64 root=UUID=5ae50f89-3dda-4dee-a085-b7307a61e67f ro resume=/dev/mapper/luks-3355902c-b6d5-48e0-96b8-00f4f6df8930 rd.luks.uuid=luks-58942483-6a71-4b8b-9339-96c540823811 rd.lvm.lv=sysvg/root rd.luks.uuid=luks-3355902c-b6d5-48e0-96b8-00f4f6df8930 rd.lvm.lv=sysvg/swap rhgb quiet systemd.log_level=debug udev.log-priority=debug
Need to break for the night, I'll check back tomorrow AM. [ 9.091335] systemd-udevd[406]: Worker [445] exited [ 92.118585] systemd[1]: dev-disk-by\x2duuid-3355902c\x2db6d5\x2d48e0\x2d96b8\x2d00f4f6df8930.device: Job dev-disk-by\x2duuid-3355902c\x2db6d5\x2d48e0\x2d96b8\x2d00f4f6df8930.device/start timed out. I'm beginning to think this may be a systemd bug. What happens if you pick the failing boot entry but modify to remove both 'rhgb quiet' parameters? Do you get a 'please enter passphrase for disk luks...." line? And if you enter the luks passphrase here, does boot proceed? Removed rhgb and quiet, but it does not prompt for the decryption key. I unpacked the working and failing initramfs and have been reviewing contents. It seems the password prompt is in ./usr/lib/dracut-crypt-lib.sh. It's not prompting for password, so not calling the function in the script or not reaching/running this script. I started wondering about blkid determining the devices. While in the energency shell I ran a blkid -k and it supports crypto_LUKS in the dracut version of this executable. While in the emergency shell I collected an 'ls -l /dev' and 'blkid -k' output. These files are uploaded to the case. I ran 'blkid' by itself and found that it reports both the device parition and the device mapper device. I tried mounting the /dev/mapper device and it fails. The output is also loaded to this case. I did not think at the time and do not have a 'ls -l /dev/mapper'. I'm tied up, away from the system until 10AM EDT or so, but will check back when I return. Created attachment 1768791 [details]
ls -l /dev in emergency shell
ls -l /dev in emergency shell
Created attachment 1768792 [details]
blkid -k in emergency shell
blkid -k in emergency shell
Created attachment 1768793 [details]
blkid in emergency shell
blkid in emergency shell
Created attachment 1768794 [details]
mount attempt in emergency shell
mount attempt in emergency shell
A few mote notes, I got back into the failed boot emergency shell. The ls -l /dev/mapper looks like the correct entries, like my system booted with the original kernel. lsmod shows that dm_crypt is not loaded. I modprobe'd dm_crypt and it loaded and attempted to mount /dev/mapper/sysvg-root and still got mount: /mnt: unknown filesystem type 'crypto_LUKS'. On my running system, lsmod shows dm_crypt used by count is 3, but in the emergency shell it's 0. Googling, showed various approaches of use cryptsetup to manually mount a LUKS encrypted volume, but cryptsetup is not in the emergency shell environment. OK I'm thinking this bug is the same as 1945596. The layout is LVM->LUKS in both cases, and it looks to me like none of the LVM LV's get activated, thus crypto device(s) never appear and then we end up with: [ 91.643732] systemd[1]: Dependency failed for Cryptography Setup for luks-f4a5c59c-b893-4b89-af80-20c2f4360103. Sorry, bad paste in c27 [ 92.120078] systemd[1]: Dependency failed for Cryptography Setup for luks-3355902c-b6d5-48e0-96b8-00f4f6df8930. If you get a chance and are willing to take the risk, can you try making the changes in https://github.com/dracutdevs/dracut/commit/7fcc4955884355c3943d6c41e459b4b983a818f4 And see if that fixes the problem? You would need to rebuild+replace the *bad* initramfs for this change to get incorporated. OK never mind, as figured out in bug 1945596, this is the fix. https://github.com/dracutdevs/dracut/commit/ba4bcf5f4f11ad624c647ddf4f566997186135e7 That is the fix. I modified /usr/lib/dracut/modules.d/35network-manager/nm-run.service with the change in the dracutdevs project and regnerated my initramfs using dracut --regenerate-all --force and the current/latest kernel (5.11.11-300.fc34.x86_64) works. Thank you for you effort and attention. Comment 30 fixed booting for me as well. Same bug. Fixed with comment 30. FEDORA-2021-50707f8501 has been submitted as an update to Fedora 34. https://bodhi.fedoraproject.org/updates/FEDORA-2021-50707f8501 FEDORA-2021-50707f8501 has been pushed to the Fedora 34 stable repository. If problem still persists, please make note of it in this bug report. |