Note: This is a public test instance of Red Hat Bugzilla. The data contained within is a snapshot of the live data so any changes you make will not be reflected in the production Bugzilla. Email is disabled so feel free to test any aspect of the site that you want. File any problems you find or give feedback at bugzilla.redhat.com.
Bug 1470352
Summary: | dnf --refresh update broken since 26 update | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Kostya Vasilyev <kmansoft> | ||||||||
Component: | nss | Assignee: | Daiki Ueno <dueno> | ||||||||
Status: | CLOSED ERRATA | QA Contact: | Fedora Extras Quality Assurance <extras-qa> | ||||||||
Severity: | unspecified | Docs Contact: | |||||||||
Priority: | unspecified | ||||||||||
Version: | 26 | CC: | alex.ploumistos, alexus_m, dhgutteridge, dmach, dueno, elio.maldonado.batiz, fabrice, freddy, georgmueller, jlebon, kartochka378, kdudka, kengert, kmansoft, lars.s.jensen, lupinix.fedora, mhatina, mrunge, netwiz, packaging-team-maint, rpm-software-management, vmukhame, yann | ||||||||
Target Milestone: | --- | ||||||||||
Target Release: | --- | ||||||||||
Hardware: | Unspecified | ||||||||||
OS: | Unspecified | ||||||||||
Whiteboard: | |||||||||||
Fixed In Version: | Doc Type: | If docs needed, set a value | |||||||||
Doc Text: | Story Points: | --- | |||||||||
Clone Of: | Environment: | ||||||||||
Last Closed: | 2017-08-11 09:53:48 UTC | Type: | Bug | ||||||||
Regression: | --- | Mount Type: | --- | ||||||||
Documentation: | --- | CRM: | |||||||||
Verified Versions: | Category: | --- | |||||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||||
Embargoed: | |||||||||||
Bug Depends On: | 1460051 | ||||||||||
Bug Blocks: | |||||||||||
Attachments: |
|
Description
Kostya Vasilyev
2017-07-12 19:32:34 UTC
Not sure if this may be the cause, but found this just now: $ sudo tail -f /var/log/dnf.librepo.log 22:24:48 lr_handle_perform: Downloading/Locating yum repo 22:24:48 lr_yum_download_remote: Downloading/Copying repo.. 22:24:48 lr_yum_download_repomd: Downloading repomd.xml via mirrorlist 22:24:48 lr_download: Target: repodata/repomd.xml (-) 22:24:48 lr_prepare_lrmirrors: Preparing internal mirror list for handle id: 0x56472177ccd0 22:24:48 lr_prepare_lrmirrors: Mirror: https://download.sublimetext.com/rpm/dev/x86_64 22:24:48 select_next_target: Selecting mirror for: repodata/repomd.xml 22:24:48 prepare_next_transfer: URL: https://download.sublimetext.com/rpm/dev/x86_64/repodata/repomd.xml 22:24:48 add_librepo_xattr: Cannot set xattr user.Librepo.DownloadInProgress (fd: 35): Operation not supported 22:24:48 lr_download: Downloading started My file systems are ext4, not sure about xattr support: - /etc/fstab doesn't have user_xattr for / - findmnt isn't showing xattr either So maybe this is harmless and unrelated, but... --- Oh and despite the "Downloading started" message, it didn't... DNF is locked up and can't even be aborted by Ctrl+C (as I already mentioned above). Created attachment 1297196 [details]
strace
Last few lines...
The last one was the same on the several times it locked up -- futex(... WAIT ...)
stat("/etc/pki/nssdb/cert9.db-journal", 0x7ffc19e49ff0) = -1 ENOENT (No such file or directory)
fstat(11, {st_mode=S_IFREG|0644, st_size=9216, ...}) = 0
pread64(11, "\0\0\0\2\0\0\0\t\0\0\0\0\0\0\0\0", 16, 24) = 16
fstat(11, {st_mode=S_IFREG|0644, st_size=9216, ...}) = 0
stat("/etc/pki/nssdb/cert9.db-wal", 0x7ffc19e49ff0) = -1 ENOENT (No such file or directory)
fstat(11, {st_mode=S_IFREG|0644, st_size=9216, ...}) = 0
fcntl(11, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = 0
fcntl(11, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=1073741824, l_len=1}) = 0
fcntl(11, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=1073741826, l_len=510}) = 0
fcntl(11, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=1073741824, l_len=1}) = 0
stat("/etc/pki/nssdb/cert9.db-journal", 0x7ffc19e49ff0) = -1 ENOENT (No such file or directory)
fstat(11, {st_mode=S_IFREG|0644, st_size=9216, ...}) = 0
pread64(11, "\0\0\0\2\0\0\0\t\0\0\0\0\0\0\0\0", 16, 24) = 16
fstat(11, {st_mode=S_IFREG|0644, st_size=9216, ...}) = 0
stat("/etc/pki/nssdb/cert9.db-wal", 0x7ffc19e49ff0) = -1 ENOENT (No such file or directory)
fstat(11, {st_mode=S_IFREG|0644, st_size=9216, ...}) = 0
fcntl(11, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = 0
futex(0x55c7543cd900, FUTEX_WAIT_PRIVATE, 2, NULL)
same here, happens 3-4 times, some time "dnf upgrade" succeed, some stuck no idea why (network is perfect), librepo log exact like posted above can you attach full dnf.librepo.log? Created attachment 1298293 [details]
dnf.librepo.log
dnf.librepo.log as requested
it seems like it is timeouting on mirror.yandex.ru, could it be the case? Um, I've been using mirror.yandex.ru for a couple of years now, and not seen anything like this... Anything's possible I suppose (higher load due to the F26 release), however: - There is are no progress messages at the point where it's stuck - There are no error messages - Can't be aborted with Ctrl+C This kind of makes it look like it's not a network issue, and more like a concurrency bug (?). I just left it stuck for several minutes (again) just to be sure: 1 - It was once again stuck at: futex(0x562bfe5e7510, FUTEX_WAIT_PRIVATE, 2, NULL 2 - The list of network connections was: $ sudo ss -nap | egrep "dnf" tcp CLOSE-WAIT 348 0 192.168.88.11:50818 158.69.195.211:443 users:(("dnf",pid=15251,fd=30)) which is: 211.195.69.158.in-addr.arpa. 41671 IN PTR mirrorlist05.ovh.rpmfusion.net. I'm not super familiar with what CLOSE-WAIT means. 3 - After about 5-7 minutes, it was still stuck, nothing new in strace 4 - Still could not be aborted with Ctrl+C --- console output was: $ sudo strace dnf -v --refresh update 2> dnf.strace.log Loaded plugins: builddep, config-manager, copr, debug, debuginfo-install, download, generate_completion_cache, needs-restarting, playground, repoclosure, repograph, repomanage, reposync, system-upgrade DNF version: 2.5.1 cachedir: /var/cache/dnf reviving: 'fedora-cisco-openh264' can be revived - repomd matches. not found deltainfo for: Fedora 26 openh264 (From Cisco) - x86_64 not found updateinfo for: Fedora 26 openh264 (From Cisco) - x86_64 fedora-cisco-openh264: using metadata from Fri 24 Mar 2017 08:58:22 PM MSK. reviving: 'updates-testing' can be revived - repomd matches. updates-testing: using metadata from Fri 14 Jul 2017 02:23:44 AM MSK. reviving: failed for 'updates', mismatched repomd. Fedora 26 - x86_64 - Updates 9.3 MB/s | 4.3 MB 00:00 updates: using metadata from Fri 14 Jul 2017 03:48:03 PM MSK. reviving: 'fedora' can be revived - repomd matches. not found deltainfo for: Fedora 26 - x86_64 not found updateinfo for: Fedora 26 - x86_64 fedora: using metadata from Wed 05 Jul 2017 11:31:38 PM MSK. reviving: 'google-chrome' can be revived - repomd matches. not found deltainfo for: google-chrome not found updateinfo for: google-chrome google-chrome: using metadata from Wed 12 Jul 2017 09:05:54 PM MSK. reviving: 'home_snwh_paper' can be revived - repomd matches. not found deltainfo for: Paper Project (Fedora_25) not found updateinfo for: Paper Project (Fedora_25) home_snwh_paper: using metadata from Fri 10 Mar 2017 06:11:23 AM MSK. reviving: 'rpmfusion-free-updates' can be revived - metalink checksums match. not found deltainfo for: RPM Fusion for Fedora 26 - Free - Updates not found updateinfo for: RPM Fusion for Fedora 26 - Free - Updates rpmfusion-free-updates: using metadata from Thu 13 Jul 2017 04:16:07 PM MSK. reviving: 'rpmfusion-free' can be revived - metalink checksums match. not found deltainfo for: RPM Fusion for Fedora 26 - Free not found updateinfo for: RPM Fusion for Fedora 26 - Free rpmfusion-free: using metadata from Mon 10 Jul 2017 03:06:27 PM MSK. reviving: failed for 'rpmfusion-nonfree-updates', mismatched sha256 sum. --- But let's say I'm wrong and it is a network issue. - Shouldn't it time out after a reasonable time limit (15 - 30 seconds - a minute?) with an error message? - Shouldn't it be possible to kill it with Ctrl+C? Tried disabling rpmfusion repos, like this: dnf -v --refresh --disablerepo rpmfusion-free --disablerepo rpmfusion-free-updates --disablerepo rpmfusion-nonfree --disablerepo rpmfusion-nonfree-updates update Makes it somewhat harder to reproduce, but still happens say 1 run out of 3-5. Stacktrace: #0 0x00007fa005874fad in __lll_lock_wait () at /lib64/libpthread.so.0 #1 0x00007fa00586df44 in pthread_mutex_lock () at /lib64/libpthread.so.0 #2 0x00007f9fff96a1b9 in PR_Lock () at /lib64/libnspr4.so #3 0x00007fa000231cb5 in nssSlot_IsTokenPresent () at /lib64/libnss3.so #4 0x00007fa000231ee6 in nssSlot_GetToken () at /lib64/libnss3.so #5 0x00007fa00022c011 in nssTrustDomain_FindCertificateByIssuerAndSerialNumber () at /lib64/libnss3.so #6 0x00007fa00022c150 in nssTrustDomain_FindCertificateByEncodedCertificate () at /lib64/libnss3.so #7 0x00007fa00022737b in CERT_NewTempCertificate () at /lib64/libnss3.so #8 0x00007fa000724ef3 in ssl3_CompleteHandleCertificate () at /lib64/libssl3.so #9 0x00007fa0007275b6 in ssl3_HandleHandshakeMessage () at /lib64/libssl3.so #10 0x00007fa00072ad0a in ssl3_HandleRecord () at /lib64/libssl3.so #11 0x00007fa00072ca00 in ssl3_GatherCompleteHandshake () at /lib64/libssl3.so #12 0x00007fa000732e69 in SSL_ForceHandshake () at /lib64/libssl3.so #13 0x00007fa001bbdef5 in nss_connect_common () at /lib64/libcurl.so.4 #14 0x00007fa001bba550 in Curl_ssl_connect_nonblocking () at /lib64/libcurl.so.4 #15 0x00007fa001b6efd2 in https_connecting () at /lib64/libcurl.so.4 #16 0x00007fa001b95f36 in multi_runsingle () at /lib64/libcurl.so.4 #17 0x00007fa001b96fb3 in curl_multi_perform () at /lib64/libcurl.so.4 #18 0x00007fa0020196fe in lr_download () at /lib64/librepo.so.0 #19 0x00007fa002019a69 in lr_download_target () at /lib64/librepo.so.0 #20 0x00007fa002019b16 in lr_download_url () at /lib64/librepo.so.0 #21 0x00007fa002028117 in lr_yum_perform () at /lib64/librepo.so.0 #22 0x00007fa00201ebb9 in lr_handle_perform () at /lib64/librepo.so.0 #23 0x00007fa002238c92 in py_perform () at /usr/lib64/python3.6/site-packages/librepo/_librepo.so ---- Looks like it's trying to complete an SSL handshake - and is stuck waiting for "something", infinitely, trying to read the system "CA database". This matches what I'm seeing in strace output, just a bit above the futex, which is accessing a SQLite database presumably containing the certificates: stat("/etc/pki/nssdb/cert9.db-wal", 0x7ffe2a4be660) = -1 ENOENT (No such file or directory) fstat(11, {st_mode=S_IFREG|0644, st_size=9216, ...}) = 0 fcntl(11, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = 0 fcntl(11, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=1073741824, l_len=1}) = 0 fcntl(11, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=1073741826, l_len=510}) = 0 fcntl(11, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=1073741824, l_len=1}) = 0 stat("/etc/pki/nssdb/cert9.db-journal", 0x7ffe2a4be670) = -1 ENOENT (No such file or directory) fstat(11, {st_mode=S_IFREG|0644, st_size=9216, ...}) = 0 pread64(11, "\0\0\0\2\0\0\0\t\0\0\0\0\0\0\0\0", 16, 24) = 16 fstat(11, {st_mode=S_IFREG|0644, st_size=9216, ...}) = 0 stat("/etc/pki/nssdb/cert9.db-wal", 0x7ffe2a4be670) = -1 ENOENT (No such file or directory) fstat(11, {st_mode=S_IFREG|0644, st_size=9216, ...}) = 0 fcntl(11, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = 0 fcntl(11, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=1073741824, l_len=1}) = 0 fcntl(11, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=1073741826, l_len=510}) = 0 fcntl(11, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=1073741824, l_len=1}) = 0 stat("/etc/pki/nssdb/cert9.db-journal", 0x7ffe2a4be3c0) = -1 ENOENT (No such file or directory) --- /etc/pki/nssdb/cert9.db does exist, it's non-empty, has a schema, a single table called nssPublic, and no data in it. There also is a zero-sized /etc/pki/cert9.db (???) I've been seeing this as well. It started sometime yesterday (14/07) when I noticed that the make cache operation that runs on a timer was stuck. I am still on F25. So far, running a "dnf clean all" before each transaction seems to do the trick. (In reply to Alexander Ploumistos from comment #9) > I've been seeing this as well. It started sometime yesterday (14/07) when I > noticed that the make cache operation that runs on a timer was stuck. I am > still on F25. So far, running a "dnf clean all" before each transaction > seems to do the trick. Wonder what updates you got recently? In particular, any of the libraries from my stack trace above: nss librepo libcurl ? For me on Fedora 26: nss-3.31.0-1.0.fc26.x86_64 librepo-1.7.20-3.fc26.x86_64 libcurl-7.53.1-7.fc26.x86_64 Created attachment 1298944 [details] last successful update transaction (In reply to Kostya Vasilyev from comment #10) > Wonder what updates you got recently? I checked my dnf history and I am attaching the last transaction before the problem appeared. > In particular, any of the libraries from my stack trace above: > > nss > librepo > libcurl > > ? > > For me on Fedora 26: > > nss-3.31.0-1.0.fc26.x86_64 > librepo-1.7.20-3.fc26.x86_64 > libcurl-7.53.1-7.fc26.x86_64 It seems that we only have nss in common (nss-3.31.0-1.0.fc25.x86_64). The other two are: librepo-1.7.18-3.fc25.x86_64 libcurl-7.51.0-7.fc25.x86_64 I haven't had anything crashing yet, just dnf waiting indefinitely and it responds to SIGTERM immediately. Definitely "dnf clean all" works this for me. I'm having the same problems with: # rpm -qa | grep -E '^(librepo-|nss-|libcurl-)' | sort libcurl-7.53.1-7.fc26.x86_64 librepo-1.7.20-3.fc26.x86_64 nss-3.31.0-1.0.fc26.x86_64 Same here (also same versions Steven Haigh reported). Downgrading to the NSS version here fixed the problem in F26 for me: https://koji.fedoraproject.org/koji/buildinfo?buildID=888916 Version there is: nss-3.30.2-1.1.fc26 After downgrading to this version, I can confirm dnf works successfully again. Leave karma here for F26: https://bodhi.fedoraproject.org/updates/FEDORA-2017-244f799ac9 I'm seeing this on F25 as well. I was running nss-3.31.0-1.0.fc25, like Alexander Ploumistos. Downgrading to nss-3.30.2-1.1.fc25.x86_64 resolved the problem for me. Added Daiki Ueno to this report as its his build (and this is filed under dnf, not nss so may be missed) Same here, running fc25, the only viable workaround for me is to downgrade to nss-3.30. nspr-4.15.0-1.fc26 nss-3.31.0-1.1.fc26 nss-softokn-3.31.0-1.0.fc26 nss-util-3.31.0-1.0.fc26 has been submitted as an update to Fedora 26. https://bodhi.fedoraproject.org/updates/FEDORA-2017-244f799ac9 Can confirm that updating to nss-3.31.0-1.1.fc26 fixes the issue. Have left +1 karma for the bodhi build. *** Bug 1471466 has been marked as a duplicate of this bug. *** *** Bug 1472954 has been marked as a duplicate of this bug. *** nspr-4.15.0-1.fc26, nss-3.31.0-1.1.fc26, nss-softokn-3.31.0-1.0.fc26, nss-util-3.31.0-1.0.fc26 has been pushed to the Fedora 26 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2017-244f799ac9 nspr-4.15.0-1.fc26, nss-3.31.0-1.1.fc26, nss-softokn-3.31.0-1.0.fc26, nss-util-3.31.0-1.0.fc26 has been pushed to the Fedora 26 stable repository. If problems still persist, please make note of it in this bug report. |