Bug 2117859 - Resolving dnssec-enabled domains fails with openssl-pkcs11-0.4.12-2.fc37 on server and dnssec enabled (breaks FreeIPA openQA tests)
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: openssl-pkcs11
Version: 37
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Jakub Jelen
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard: openqa
Depends On:
Blocks: 2120605 2122841
Reported: 2022-08-12 08:10 UTC by Adam Williamson
Modified: 2023-01-04 07:22 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-01-04 07:22:24 UTC
Type: Bug
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker FC-568 0 None None None 2022-08-12 08:17:59 UTC

Description Adam Williamson 2022-08-12 08:10:28 UTC
In openQA testing on Fedora 37/Rawhide, when openssl-pkcs11-0.4.12-2.fc37 is installed on the server, FreeIPA client enrolment via realmd or cockpit fails because the client cannot resolve the hostname kojipkgs.fedoraproject.org:

https://openqa.fedoraproject.org/tests/1358301#step/realmd_join_sssd/19

This seems to be dnssec-related (fedoraproject.org is dnssec-enabled, at least as seen from within Fedora infra).

If the server has openssl-pkcs11-0.4.12-1.fc37, the test passes.

Fedora 36 is not affected: things are fine there with openssl-pkcs11-0.4.12-2.fc36.

I'm not sure why this is, but it's fully reproducible. We first saw the failure in update testing of the openssl-pkcs11-0.4.12-2.fc37 update; the tests did not fail in tests of other F37 updates around the same time. The first F37 Branched compose - Fedora-37-20220811.n.0 - included openssl-pkcs11-0.4.12-2.fc37, and the tests failed there. For the second compose, we untagged that build, so it includes openssl-pkcs11-0.4.12-1.fc37, and the tests passed on that compose.

Proposing as a Beta blocker. Though this problem may only affect dnssec-enabled domains, it's at least a big problem for our automated tests, which can make it a blocker under the "Bug hinders execution of required Beta test plans or dramatically reduces test coverage" provision - https://fedoraproject.org/wiki/Fedora_37_Beta_Release_Criteria#Beta_Blocker_Bugs .

Comment 1 Adam Williamson 2022-08-16 12:43:42 UTC
Jakub posted on the other bug:

"Thank you for digging further. I actually see some weird issues also in the Fedora 36 container images with the latest two builds after rebase [1], but I was not able to reproduce them locally. Similarly, all the low-level tests with openssl-pkcs11 (both upstream and downstream) worked just fine, so I would need a bit more information on what IPA is doing at the time of failure and how the softhsm? is set up.

[1] https://gitlab.com/jjelen/build-images/-/jobs/2883523060"

What IPA is doing at the time of failure: not a lot, really. This is how the affected tests work. We start from a clean, fresh install of Server. The FreeIPA server and client tests start simultaneously, and configure static networking and modify the DNF repository config a bit (see below). The client tests configure themselves to use the server test's IP address as their DNS server. The client tests then wait for the server test to complete FreeIPA deployment. That test sets some debug options for FreeIPA and bind, installs the freeipa-server package group, opens some firewall ports, and runs:

ipa-server-install -U --auto-forwarders --realm=TEST.OPENQA.FEDORAPROJECT.ORG --domain=test.openqa.fedoraproject.org --ds-password=monkeys123 --admin-password=monkeys123 --setup-dns --reverse-zone=2.16.172.in-addr.arpa --allow-zone-overlap

It then kinits as admin, sets an OTP for one of the client enrolment tests, creates a couple of user accounts, adds some HBAC rules, and does this:

ipa dnszone-mod test.openqa.fedoraproject.org. --allow-sync-ptr=TRUE

because of https://docs.pagure.org/bind-dyndb-ldap/BIND9/SyncPTR.html . Then it just sends a signal to the client tests telling them to enrol now, and waits. The client tests then attempt to enrol. The error happens immediately when they try this, when the enrolment process tries to download the required packages. We configure the tests to use the repo from the compose under test - e.g. https://kojipkgs.fedoraproject.org/compose/rawhide/Fedora-Rawhide-20220815.n.0/compose/Everything/x86_64/os/ for yesterday's Rawhide; this is to make sure tests use the exact packages from the compose being tested, not earlier ones (if they run before the compose is synced to mirrors) or later ones (if we re-run them after another compose has run and been synced). It's resolving the host 'kojipkgs.fedoraproject.org' that fails.
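To check the failing lookup by hand, something like the following could be run on the client against the FreeIPA server. This is only a sketch: `check_resolution` is a made-up helper name, and the server address has to be filled in with whatever the server test's static IP is (it is not given in this report).

```shell
#!/bin/sh
# Ask one specific DNS server for the A records of a host and report
# whether any answer came back.
# check_resolution is a hypothetical helper, not part of the openQA tests.
check_resolution() {
    host=$1
    server=$2
    # "dig +short" prints only the answer data; empty output means the
    # server returned no usable answer (for example SERVFAIL).
    answer=$(dig +short "$host" A "@$server" 2>/dev/null)
    if [ -n "$answer" ]; then
        echo "OK $host"
    else
        echo "FAIL $host"
        return 1
    fi
}

# Example (replace SERVER_IP with the FreeIPA server's address):
#   check_resolution kojipkgs.fedoraproject.org SERVER_IP
```

An empty answer here only shows that the server could not resolve the name; it does not say why, so the server-side named logs are still needed.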

As for "how the softhsm is set up", what info do you need exactly, and how would I get it? Thanks!

Comment 2 Adam Williamson 2022-08-16 12:46:20 UTC
Looking at the latest Rawhide failure, I do actually have logs from the server in that case, including some messages from bind that look interesting:

Aug 15 09:05:59 ipa001.test.openqa.fedoraproject.org named[7581]: EVP_PKEY_fromdata_init failed (crypto failure)
Aug 15 09:05:59 ipa001.test.openqa.fedoraproject.org named[7581]: error:03000096:digital envelope routines::operation not supported for this keytype:crypto/evp/pmeth_gn.c:354:
Aug 15 09:05:59 ipa001.test.openqa.fedoraproject.org named[7581]: EVP_PKEY_fromdata_init failed (crypto failure)
Aug 15 09:05:59 ipa001.test.openqa.fedoraproject.org named[7581]: error:03000096:digital envelope routines::operation not supported for this keytype:crypto/evp/pmeth_gn.c:354:
Aug 15 09:05:59 ipa001.test.openqa.fedoraproject.org named[7581]: EVP_PKEY_fromdata_init failed (crypto failure)
Aug 15 09:05:59 ipa001.test.openqa.fedoraproject.org named[7581]: error:03000096:digital envelope routines::operation not supported for this keytype:crypto/evp/pmeth_gn.c:354:
Aug 15 09:05:59 ipa001.test.openqa.fedoraproject.org named[7581]: EVP_PKEY_fromdata_init failed (crypto failure)
Aug 15 09:05:59 ipa001.test.openqa.fedoraproject.org named[7581]: error:03000096:digital envelope routines::operation not supported for this keytype:crypto/evp/pmeth_gn.c:354:
Aug 15 09:08:52 ipa001.test.openqa.fedoraproject.org named[7581]: EVP_PKEY_fromdata_init failed (crypto failure)

There are several more repetitions of that error, and I believe they occur around the time the client tests would be doing name resolution.
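For anyone checking whether another server is hitting the same thing, a rough way to count these messages in saved log output is sketched below. `count_evp_failures` is a made-up helper name, and the `named.log` file in the usage comment is just an example.

```shell
#!/bin/sh
# Count log lines on stdin that contain the EVP_PKEY_fromdata_init failure.
# count_evp_failures is a hypothetical helper name for illustration.
count_evp_failures() {
    # grep -c prints the number of matching lines. It exits non-zero when
    # the count is 0, so "|| true" keeps callers using "set -e" alive.
    grep -c -F 'EVP_PKEY_fromdata_init failed' || true
}

# Example usage (the syslog identifier "named" matches the lines above):
#   journalctl -t named --no-pager > named.log
#   count_evp_failures < named.log
```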

Comment 3 Jakub Jelen 2022-08-17 16:27:33 UTC
OK, on another try (with a fresher head) I can reproduce the libssh failure on Fedora 36 as well, with https://bodhi.fedoraproject.org/updates/FEDORA-2022-2f6e9a0b6c

From libssh I was able to debug this down to the following patch, which I would like to test some more before I update the package in Fedora:

https://github.com/OpenSC/libp11/pull/470

Comment 4 Adam Williamson 2022-08-17 17:06:27 UTC
Thanks a lot for that!

I'd like to have it in Rawhide at least as soon as possible. We blocked the broken build from going into F37, but it's in Rawhide already, and it makes the tests fail on every Rawhide compose and every Rawhide update, which really messes up my openQA dashboard overview :P

Thanks again!

Comment 5 Jakub Jelen 2022-08-17 17:34:36 UTC
The upstream PR is merged now, but libssh now fails in another test. I am going to check if there is something we can do about that and update Rawhide depending on the results.

Comment 6 Adam Williamson 2022-08-18 02:51:30 UTC
So I did a scratch build downstream with the PR applied as a patch and ran the openQA tests on it, but unfortunately they still fail :| One of the tests got an interesting error on the server end that I never saw before. I'm re-running them now to see if the failures are consistent.

Tests are still passing on F37 with openssl-pkcs11-0.4.12-1.fc37.

Comment 7 Adam Williamson 2022-08-18 17:55:37 UTC
The failures with the scratch build seem to be consistent. I also confirmed that the server is still showing the same error messages from comment 2. I also did a scratch build reverting to the -1 state, and as expected, the tests passed with that build.

For openQA purposes I'm gonna test disabling dnssec in the test entirely. It's a big problem for these tests to fail on every Rawhide compose and update; it makes it much harder to keep on top of other problems when a large chunk of updates always have a pile of failures.

Comment 8 Adam Williamson 2022-08-18 18:43:51 UTC
It does look like disabling dnssec on the server end avoids the problem for openQA's tests, so we can live with that for now. Obviously it'd be better if openQA was testing dnssec, though.
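One way to do this at deployment time is the `--no-dnssec-validation` option of ipa-server-install. The sketch below just prints the command line with or without it; `build_install_cmd` and `DISABLE_DNSSEC` are made-up names, and the realm, domain and password arguments from the real test are left out for brevity.

```shell
#!/bin/sh
# Print an ipa-server-install command line, optionally with DNSSEC
# validation turned off.
# build_install_cmd and DISABLE_DNSSEC are hypothetical names; the real
# openQA test code is not shown in this report.
build_install_cmd() {
    cmd="ipa-server-install -U --auto-forwarders --setup-dns --allow-zone-overlap"
    if [ "${DISABLE_DNSSEC:-0}" = "1" ]; then
        # Tells the installer to configure named without DNSSEC validation.
        cmd="$cmd --no-dnssec-validation"
    fi
    echo "$cmd"
}

# Example:
#   DISABLE_DNSSEC=1
#   build_install_cmd
```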

Comment 9 Luna Jernberg 2022-08-22 16:36:14 UTC
BetaBlocker +1

Comment 10 Adam Williamson 2022-08-22 16:46:56 UTC
Luna: you have to vote in the Pagure ticket - https://pagure.io/fedora-qa/blocker-review/issue/852 - not the bug report. The pagure tickets exist so blocker discussion doesn't clutter up the bug report.

Comment 11 Geoffrey Marr 2022-08-22 21:22:04 UTC
Discussed during the 2022-08-22 blocker review meeting: [0]

We delayed classifying this as a blocker bug so adamw can do some more research and sort out exactly what is affected and what needs doing.

[0] https://meetbot.fedoraproject.org/fedora-blocker-review/2022-08-22/f37-blocker-review.2022-08-22-16.01.txt

Comment 12 Adam Williamson 2022-08-22 21:56:24 UTC
I tested and confirmed this affects upgrades too - I ran an upgrade test where the server and client were deployed as F37, then the server upgraded to Rawhide; the client should then have upgraded to Rawhide too, but the upgrade process failed on this bug.

I guess technically as long as we hold the -2 update out of F37 *this* bug is not an F37 Beta blocker, but it's a problem that the underlying https://bugzilla.redhat.com/show_bug.cgi?id=2115865 is still present. Not sure if it's a release-blocking problem. I guess I can move the nomination over there for now.

Comment 13 Jakub Jelen 2022-08-24 14:22:16 UTC
Adam, I have another version of the patch that reworks a lot of the referencing and dereferencing. Can you run some of your tests to check whether it works for you?

https://github.com/OpenSC/libp11/pull/471

I have a scratch build here:

https://koji.fedoraproject.org/koji/taskinfo?taskID=91211042

Comment 14 Adam Williamson 2022-08-25 06:53:39 UTC
Nope, sorry, still looks bad. Same error messages.

Comment 15 Jakub Jelen 2022-08-25 12:04:15 UTC
In libssh, I managed to solve the issue with changes to libssh only, as it was doing weird things with ENGINE initialization and deinitialization, which left the whole engine and openssl in a weird state where the engine worked but no deinitialization was invoked:

https://gitlab.com/libssh/libssh-mirror/-/merge_requests/278

This was already a problem on Fedora 36 with the new openssl-pkcs11 update.

The problem you describe happens only on Fedora 37, so let me try to start a VM or container with Fedora 37 and give it a try.

Comment 16 Jakub Jelen 2023-01-03 14:23:08 UTC
Adam, what is the status of this bug?

Comment 17 Adam Williamson 2023-01-04 07:22:24 UTC
I'm pretty sure it's fine now. We still have the test do `--no-dnssec-validation` on upgrade tests, but that's for #1999321. Non-upgrade tests of 36, 37 and Rawhide are all working fine with dnssec enabled.

