Bug 2117859
Summary: | Resolving dnssec-enabled domains fails with openssl-pkcs11-0.4.12-2.fc37 on server and dnssec enabled (breaks FreeIPA openQA tests) | | |
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Adam Williamson <awilliam> |
Component: | openssl-pkcs11 | Assignee: | Jakub Jelen <jjelen> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
Severity: | high | Docs Contact: | |
Priority: | high | | |
Version: | 37 | CC: | ansasaki, crypto-team, droidbittin, gmarr, jjelen, pemensik, robatino |
Target Milestone: | --- | Keywords: | Triaged |
Target Release: | --- | | |
Hardware: | All | | |
OS: | Linux | | |
Whiteboard: | openqa | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2023-01-04 07:22:24 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Bug Depends On: | | | |
Bug Blocks: | 2120605, 2122841 | | |
Description
Adam Williamson
2022-08-12 08:10:28 UTC
Jakub posted on the other bug:

"Thank you for digging further. I actually see some weird issues also in the Fedora 36 container images with the latest two builds after rebase [1], but I was not able to reproduce them locally. Similarly all the low-level tests with openssl-pkcs11 (both upstream and downstream) worked just ok so I would need to have a bit more information on what is IPA doing at the time of failure and how the softhsm? is set up.

[1] https://gitlab.com/jjelen/build-images/-/jobs/2883523060"

What IPA is doing at the time of failure: not a lot, really. This is how the affected tests work. We start from a clean, fresh install of Server. The FreeIPA server and client tests start simultaneously, and configure static networking and modify the DNF repository config a bit (see below). The client tests configure themselves to use the server test's IP address as their DNS server. The client tests then wait for the server test to complete FreeIPA deployment. That test sets some debug options for FreeIPA and bind, installs the freeipa-server package group, opens some firewall ports, and runs:

    ipa-server-install -U --auto-forwarders --realm=TEST.OPENQA.FEDORAPROJECT.ORG --domain=test.openqa.fedoraproject.org --ds-password=monkeys123 --admin-password=monkeys123 --setup-dns --reverse-zone=2.16.172.in-addr.arpa --allow-zone-overlap

It then kinits as admin, sets an OTP for one of the client enrolment tests, creates a couple of user accounts, adds some HBAC rules, and does this:

    ipa dnszone-mod test.openqa.fedoraproject.org. --allow-sync-ptr=TRUE

because of https://docs.pagure.org/bind-dyndb-ldap/BIND9/SyncPTR.html . Then it just sends a signal to the client tests telling them to enrol now, and waits.

The client tests then attempt to enrol. The error happens immediately when they try this, when the enrolment process tries to download the required packages. We configure the tests to use the repo from the compose under test - e.g. https://kojipkgs.fedoraproject.org/compose/rawhide/Fedora-Rawhide-20220815.n.0/compose/Everything/x86_64/os/ for yesterday's Rawhide; this is to make sure tests use the exact packages from the compose being tested, not earlier ones (if they run before the compose is synced to mirrors) or later ones (if we re-run them after another compose has run and been synced). It's resolving the host 'kojipkgs.fedoraproject.org' that fails.

As for "how the softhsm is set up", what info do you need exactly, and how would I get it? Thanks!

Looking at the latest Rawhide failure, I do actually have logs from the server in that case, including some messages from bind that look interesting:

    Aug 15 09:05:59 ipa001.test.openqa.fedoraproject.org named[7581]: EVP_PKEY_fromdata_init failed (crypto failure)
    Aug 15 09:05:59 ipa001.test.openqa.fedoraproject.org named[7581]: error:03000096:digital envelope routines::operation not supported for this keytype:crypto/evp/pmeth_gn.c:354:
    Aug 15 09:05:59 ipa001.test.openqa.fedoraproject.org named[7581]: EVP_PKEY_fromdata_init failed (crypto failure)
    Aug 15 09:05:59 ipa001.test.openqa.fedoraproject.org named[7581]: error:03000096:digital envelope routines::operation not supported for this keytype:crypto/evp/pmeth_gn.c:354:
    Aug 15 09:05:59 ipa001.test.openqa.fedoraproject.org named[7581]: EVP_PKEY_fromdata_init failed (crypto failure)
    Aug 15 09:05:59 ipa001.test.openqa.fedoraproject.org named[7581]: error:03000096:digital envelope routines::operation not supported for this keytype:crypto/evp/pmeth_gn.c:354:
    Aug 15 09:05:59 ipa001.test.openqa.fedoraproject.org named[7581]: EVP_PKEY_fromdata_init failed (crypto failure)
    Aug 15 09:05:59 ipa001.test.openqa.fedoraproject.org named[7581]: error:03000096:digital envelope routines::operation not supported for this keytype:crypto/evp/pmeth_gn.c:354:
    Aug 15 09:08:52 ipa001.test.openqa.fedoraproject.org named[7581]: EVP_PKEY_fromdata_init failed (crypto failure)

There are several more repetitions of that error, and they are around the time the client tests would be doing name resolution, I believe.

Ok, on another try (with a bit more fresh head) I can see that I can reproduce the libssh failure also with Fedora 36 and with https://bodhi.fedoraproject.org/updates/FEDORA-2022-2f6e9a0b6c

From the libssh I was able to debug this to the following patch, which I would like to give some more testing before I update the package in Fedora: https://github.com/OpenSC/libp11/pull/470

Thanks a lot for that! I'd like to have it in Rawhide at least as soon as possible. We blocked the broken build from going into F37, but it's in Rawhide already, and it makes the tests fail on every Rawhide compose and every Rawhide update, which really messes up my openQA dashboard overview :P Thanks again!

The upstream issue is merged now, but the libssh now fails in another test -- I am going to check if there is something we can do about that and update rawhide depending on the results.

So I did a scratch build downstream with the PR applied as a patch and ran the openQA tests on it, but unfortunately they still fail :| One of the tests got an interesting error on the server end that I never saw before. I'm re-running them now to see if the failures are consistent. Tests are still passing on F37 with openssl-pkcs11-0.4.12-1.fc37.

The failures with the scratch build seem to be consistent. I also confirmed that the server is still showing the same error messages from comment 2. I also did a scratch build reverting to the -1 state, and as expected, the tests passed with that build.

For openQA purposes I'm gonna test disabling dnssec in the test entirely. It's a big problem for these tests to fail on every Rawhide compose and update; it makes it much harder to keep on top of other problems when a large chunk of updates always have a pile of failures.

It does look like disabling dnssec on the server end avoids the problem for openQA's tests, so we can live with that for now. Obviously it'd be better if openQA was testing dnssec, though.

BetaBlocker +1

Luna: you have to vote in the Pagure ticket - https://pagure.io/fedora-qa/blocker-review/issue/852 - not the bug report. The pagure tickets exist so blocker discussion doesn't clutter up the bug report.

Discussed during the 2022-08-22 blocker review meeting: [0]

The decision to delay the classification of this as a blocker bug was made so adamw can do some more research and get the story of what exactly is affected and what needs doing sorted out.

[0] https://meetbot.fedoraproject.org/fedora-blocker-review/2022-08-22/f37-blocker-review.2022-08-22-16.01.txt

I tested and confirmed this affects upgrades too - I ran an upgrade test where the server and client were deployed as F37, then the server upgraded to Rawhide; the client should then have upgraded to Rawhide too, but the upgrade process failed on this bug.

I guess technically as long as we hold the -2 update out of F37 *this* bug is not an F37 Beta blocker, but it's a problem that the underlying https://bugzilla.redhat.com/show_bug.cgi?id=2115865 is still present. Not sure if it's a release-blocking problem. I guess I can move the nomination over there for now.

Adam, I have another version of the patch, that reworked a lot of referencing and dereferencing. Can you run some of your tests to check if it works for you? https://github.com/OpenSC/libp11/pull/471

I have a scratch build here: https://koji.fedoraproject.org/koji/taskinfo?taskID=91211042

Nope, sorry, still looks bad. Same error messages.

In libssh, I managed to solve the issue with changes to libssh only as it was doing weird things with the ENGINEs initialization and deinitialization, which brought the whole engine and openssl into weird state when the engine worked, but no deinitialization was invoked: https://gitlab.com/libssh/libssh-mirror/-/merge_requests/278

This was a problem already on Fedora 36 with the new openssl-pkcs11 update. The problem you describe is happening only on Fedora 37 so let me try to start a vm or container with Fedora 37 to give it some try.

Adam, what is the status of this bug?

I'm pretty sure it's fine now. We still have the test do `--no-dnssec-validation` on upgrade tests, but that's for #1999321. Non-upgrade tests of 36, 37 and Rawhide are all working fine with dnssec enabled.
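
For anyone who wants to line up the named errors quoted in this bug with the time the client tests do name resolution, here is a minimal sketch. It assumes the journal line layout seen in the log excerpt in this report; `LINE_RE`, `count_failures`, and `SAMPLE` are hypothetical names made up for illustration, not part of any tool mentioned here.

```python
import re
from collections import Counter

# Layout assumed from the journal excerpt quoted in this bug:
# "<Mon> <day> <HH:MM:SS> <host> named[<pid>]: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) named\[(?P<pid>\d+)\]: (?P<msg>.*)$"
)

# The message text reported by named in this bug.
FAILURE = "EVP_PKEY_fromdata_init failed"


def count_failures(lines):
    """Return a Counter mapping timestamp -> number of failure messages."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and FAILURE in match.group("msg"):
            counts[match.group("ts")] += 1
    return counts


# Three lines copied from the excerpt in this report, used as sample input.
SAMPLE = [
    "Aug 15 09:05:59 ipa001.test.openqa.fedoraproject.org named[7581]: "
    "EVP_PKEY_fromdata_init failed (crypto failure)",
    "Aug 15 09:05:59 ipa001.test.openqa.fedoraproject.org named[7581]: "
    "error:03000096:digital envelope routines::operation not supported "
    "for this keytype:crypto/evp/pmeth_gn.c:354:",
    "Aug 15 09:08:52 ipa001.test.openqa.fedoraproject.org named[7581]: "
    "EVP_PKEY_fromdata_init failed (crypto failure)",
]

if __name__ == "__main__":
    for ts, n in sorted(count_failures(SAMPLE).items()):
        print(ts, n)
```

Feeding it the server's saved journal text instead of `SAMPLE` would give a per-second count of the failures, which can then be compared by eye with the timestamps of the client-side resolution errors.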