Bug 1146967
Summary: | Latest firmware update causes SIGILL on xbeginq instruction on Haswell processors | |
---|---|---|---
Product: | [Fedora] Fedora | Reporter: | Amit Shah <amit.shah>
Component: | glibc | Assignee: | Carlos O'Donell <codonell>
Status: | CLOSED ERRATA | QA Contact: | Fedora Extras Quality Assurance <extras-qa>
Severity: | unspecified | Docs Contact: |
Priority: | unspecified | |
Version: | 21 | CC: | amit.shah, awilliam, bojan, branto, codonell, fweimer, hancockrwd, hdegoede, jakub, kparal, law, luto, nonamedotc, pfrankli, spoyarek, thetaeridanus
Target Milestone: | --- | Keywords: | CommonBugs, Reopened
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | https://fedoraproject.org/wiki/Common_F21_bugs#haswell-microcode | |
Fixed In Version: | glibc-2.20-5.fc21 | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2014-09-30 03:47:18 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Attachments: | | |
Description
Amit Shah
2014-09-26 13:02:45 UTC
Created attachment 941563 [details]
cpuinfo
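For anyone comparing their own machine against the attached cpuinfo, here is a minimal sketch (not part of the original report) that pulls the TSX-related feature flags out of /proc/cpuinfo-style text. It assumes the usual Linux flag names `hle` and `rtm`; the function name `tsx_flags` is made up for illustration.

```python
def tsx_flags(cpuinfo_text):
    """Return the TSX-related flags (hle, rtm) found in /proc/cpuinfo-style text."""
    found = set()
    for line in cpuinfo_text.splitlines():
        # Each entry looks like "flags\t\t: fpu vme ... rtm ..."
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            found |= {"hle", "rtm"} & set(value.split())
    return found
```

On a live system this could be called as `tsx_flags(open("/proc/cpuinfo").read())`. An empty result only says the kernel is not advertising TSX; it says nothing about whether a particular binary still contains TSX instructions.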
We are spinning an F20, F21, and Rawhide glibc with lock elision disabled. This will hopefully mean that nobody ends up with a broken system if they update glibc *and* the microcode_ctl package at the same time.

Scratch builds for F20, F21 and Rawhide in progress.
Rawhide: http://koji.fedoraproject.org/koji/taskinfo?taskID=7702857
F21: http://koji.fedoraproject.org/koji/taskinfo?taskID=7702869
F20: http://koji.fedoraproject.org/koji/taskinfo?taskID=7702870

The rawhide build has some testing issues I'm sorting out right now.
Final build for f21: http://koji.fedoraproject.org/koji/taskinfo?taskID=7703592
Final build for f20: http://koji.fedoraproject.org/koji/taskinfo?taskID=7703676

Some related reading on this:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=762195
https://bugs.launchpad.net/intel/+bug/1370352
https://lkml.org/lkml/2014/9/18/218

glibc-2.18-16.fc20 has been submitted as an update for Fedora 20.
https://admin.fedoraproject.org/updates/glibc-2.18-16.fc20

glibc-2.20-4.fc21 has been submitted as an update for Fedora 21.
https://admin.fedoraproject.org/updates/glibc-2.20-4.fc21

*** Bug 1147062 has been marked as a duplicate of this bug. ***

I've just tested the F-21 update for this, and I'm afraid that the problem is still present there. I've also regenerated my initrd to make sure that it included the new glibc too, but that did not help.

There's a long discussion about a real fix here:
http://thread.gmane.org/gmane.linux.kernel/1790211
No great solution yet.

(In reply to Hans de Goede from comment #10)
> I've just tested the F-21 update for this, and I'm afraid that the problem
> is still present there. I've also regenerated my initrd to make sure that
> that included the new glibc too, but that did not help.

Are you certain? This update absolutely removes elision, you shouldn't have any TSX usage going on after the update to glibc-2.20-4.fc21. Can you confirm the version of your installed glibc is correct?
Can you track down if *all* your binaries fault or just some of them (statically compiled against libpthread)?

(In reply to Carlos O'Donell from comment #12)
> (In reply to Hans de Goede from comment #10)
> > I've just tested the F-21 update for this, and I'm afraid that the problem
> > is still present there. I've also regenerated my initrd to make sure that
> > that included the new glibc too, but that did not help.
>
> Are you certain? This update absolutely removes elision, you shouldn't have
> any TSX usage going on after the update to glibc-2.20-4.fc21. Can you
> confirm the version of your installed glibc is correct? Can you track down
> if *all* your binaries fault or just some of them (statically compiled
> against libpthread)?

OK, I think I found the problem. There is a code path in rwlock that is using TSX unconditionally. I'm going to fix that and push out another build.

Hans,

Would you mind testing this scratch build?
http://koji.fedoraproject.org/koji/taskinfo?taskID=7707772

It should fully disable TSX usage in libpthread.so.0. I don't have easy access to a box that I can do this kind of testing on, e.g. microcode updates etc.

(In reply to Carlos O'Donell from comment #12)
> (In reply to Hans de Goede from comment #10)
> > I've just tested the F-21 update for this, and I'm afraid that the problem
> > is still present there. I've also regenerated my initrd to make sure that
> > that included the new glibc too, but that did not help.
>
> Are you certain? This update absolutely removes elision, you shouldn't have
> any TSX usage going on after the update to glibc-2.20-4.fc21. Can you
> confirm the version of your installed glibc is correct?

Yes, I double checked I had the correct version (and regenerated my initrd and rebooted) before putting in the comment that the update does not fix things.

(In reply to Carlos O'Donell from comment #14)
> Hans,
>
> Would you mind testing this scratch build?
> http://koji.fedoraproject.org/koji/taskinfo?taskID=7707772
>
> It should fully disable TSX usage in libpthread.so.0. I don't have easy
> access to a box that I can do this kind of testing on e.g. microcode updates
> etc.

In the meantime I've installed this update:
https://admin.fedoraproject.org/updates/dracut-038-29.git20140903.fc21,kernel-3.16.3-302.fc21
which fixes things with less of a big-hammer approach, and that fixes things too.

I can still reproduce the problem by booting an older kernel though. I've verified that booting an older kernel still exhibits the problem, then I've installed your glibc scratch build, and I can confirm that the problem is gone, even when using the older kernel, when using the glibc from the scratch build.

Package glibc-2.18-16.fc20:
* should fix your issue,
* was pushed to the Fedora 20 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing glibc-2.18-16.fc20'
as soon as you are able to, then reboot.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2014-11586/glibc-2.18-16.fc20
then log in and leave karma (feedback).

(In reply to Hans de Goede from comment #15)
> Which fixes things with less of a big-hammer approach, and that fixes things
> too.
>
> I can still reproduce the problem by booting an older kernel though. I've
> verified that booting an older kernel still exhibits the problem, then I've
> installed your glibc scratch build, and I can confirm that the problem is
> gone, even when using the older kernel, when using the glibc from the
> scratch build.

We are going to push the new glibc into F21 to fix this problem for anyone that doesn't want to upgrade their kernel. I think the conservative approach of both a kernel fix and a runtime fix is best here, given that missing an update could cause your box to break if you install a new microcode_ctl.
Final F21 build here:
http://koji.fedoraproject.org/koji/taskinfo?taskID=7709522

*** Bug 1147118 has been marked as a duplicate of this bug. ***

OK, final update for FC21 with a full fix was just pushed into Bodhi.
https://admin.fedoraproject.org/updates/FEDORA-2014-11673/glibc-2.20-5.fc21

glibc-2.18-16.fc20 has been pushed to the Fedora 20 stable repository. If problems still persist, please make note of it in this bug report.

Given the kernel is going to inject the microcode on start, and no execution of 'cpuid' is going to see those tmem instructions being exposed by the cpu, the glibc fix may not be needed at all? I think we can revert the glibc fix given the kernel has been fixed appropriately already. Relevant kernel bug is bug 1083716.

(In reply to Amit Shah from comment #21)
> Given the kernel is going to inject the microcode on start, and no execution
> of 'cpuid' is going to see those tmem instructions being exposed by the cpu,
> the glibc fix may not be needed at all? I think we can revert the glibc fix
> given the kernel has been fixed appropriately already. Relevant kernel bug
> is bug 1083716.

The glibc fix is still useful because it fixes TSX code that sneaked out of the --enable-lock-elision configuration, which could potentially cause problems later. Also, there is no point keeping the bits enabled because if TSX does get enabled in future microcode and it behaves differently (breaking glibc expectations), we'll have to rush to patch older systems.

We do need to re-enable elision for s390* in f21 and rawhide:
https://lists.fedoraproject.org/pipermail/glibc/2014-September/000062.html

That update to F-20, -16, appears broken. See my comments in the update.

I have no idea why xrdp would just hang like that, but going back to either -11 or -14 immediately fixes the problem.

I'm running this on an i686 VM, which is most likely running on VMWare ESX or something like that (I don't control this bit).
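Since several builds (-4, -5, -11, -14, -16) are being compared in this thread, a rough sketch of checking an installed package string against the builds named above as carrying the fix may help. This is an illustration only: `parse_nvr`, `has_fix` and the `FIXED` table are hypothetical names, and the comparison is a simplified numeric one, not rpm's real version-comparison algorithm.

```python
import re

# Builds this thread names as fixed, keyed by dist tag (taken from the comments above).
FIXED = {"fc20": "glibc-2.18-16.fc20", "fc21": "glibc-2.20-5.fc21"}


def parse_nvr(nvr):
    """Split e.g. 'glibc-2.20-5.fc21' into ('glibc', (2, 20), 5, 'fc21')."""
    m = re.fullmatch(r"(.+)-(\d+(?:\.\d+)*)-(\d+)\.(\w+)", nvr)
    if m is None:
        raise ValueError("unrecognised package string: %r" % nvr)
    name, ver, rel, dist = m.groups()
    return name, tuple(int(p) for p in ver.split(".")), int(rel), dist


def has_fix(installed):
    """True if `installed` is at or beyond the fixed build for its dist tag."""
    name, ver, rel, dist = parse_nvr(installed)
    if name != "glibc" or dist not in FIXED:
        raise ValueError("no known fixed build for %r" % installed)
    _, fixed_ver, fixed_rel, _ = parse_nvr(FIXED[dist])
    return (ver, rel) >= (fixed_ver, fixed_rel)
```

The installed string would typically come from something like `rpm -q glibc` with the architecture suffix stripped; that step is left out here.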
(In reply to Bojan Smojver from comment #23)
> That update to F-20, -16, appears broken. See my comments in the update.
>
> I have no idea why xrdp would just hang like that, but going back to either
> -11 or -14 immediately fixes the problem.
>
> I'm running this on an i686 VM, which is most likely running on VMWare ESX
> or something like that (I don't control this bit).

Are you able to remote ssh into the box, attach a debugger, and do a backtrace to see where it's hung?

(In reply to Carlos O'Donell from comment #24)
> Are you able to remote ssh into the box, attach a debugger, and do a
> backtrace to see where it's hung?

Yeah, ssh works. Xrdp runs a couple of processes, I can attach to both before I login and see. Maybe something that is forked off crashes or something like that. No idea at this point - the logs have nothing useful.

A quick strace of the remaining xrdp processes just has one of them sitting in select.

(In reply to Bojan Smojver from comment #25)
> Yeah, ssh works. Xrdp runs a couple of processes, I can attach to both
> before I login and see. Maybe something that is forked off crashes or
> something like that. No idea at this point - the logs have nothing useful.
>
> A quick strace of the remaining xrdp processes just has one of them sitting
> in select.

Wow - did I make a fool of myself or what? Xrdp works now with -16.

Yesterday when I upgraded, it hung on login. I rebooted the VM. Hung again. I reverted to -11, rebooted. Worked. Installed -14, rebooted, worked.

No idea... Anyhow, must have been some other condition somewhere that coincided with the glibc upgrade or something.

(In reply to Bojan Smojver from comment #26)
> Wow - did I make a fool of myself or what? Xrdp works now with -16.

Bojan,

You are not a fool. I am incredibly appreciative of people like you who are willing to step up and say something is broken and help out. I have infinite patience for that kind of dedication. Thank you for raising the issue.
I'm glad it turned out to be nothing, but it might not have been.

glibc-2.20-5.fc21 has been pushed to the Fedora 21 stable repository. If problems still persist, please make note of it in this bug report.

We are seeing something similar to this in rados binary [1]. On shutdown, the program calls rados_shutdown() which calls the appropriate destructors. In particular, it calls ~RWLock() which issues pthread_rwlock_unlock(). This causes the program to receive SIGILL signal. Debugging with gdb, it seems that the instruction that causes this is xend [2] which is an Intel TSX instruction.

I am no expert in this matter but it seems that this issue is not fully resolved, yet. I've looked at the patches and xbegin seems to be explicitly disabled there. Maybe, we need to explicitly disable xend in the code as well?

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1144794 ; in short just run 'rados df' to reproduce (you need to have ceph-common installed)

[2] layout asm in gdb shows this as the crashing line:

>│0x7ffff6c75153 <__GI___pthread_rwlock_unlock+19> xend |

(In reply to Boris Ranto from comment #29)
> We are seeing something similar to this in rados binary [1]. On shutdown,
> the program calls rados_shutdown() which calls the appropriate destructors.
> In particular, it calls ~RWLock() which issues pthread_rwlock_unlock(). This
> causes the program to receive SIGILL signal. Debugging with gdb, it seems
> that the instruction that causes this is xend [2] which is an Intel TSX
> instruction.
>
> I am no expert in this matter but it seems that this issue is not fully
> resolved, yet. I've looked at the patches and xbegin seems to be explicitly
> disabled there. Maybe, we need to explicitly disable xend in the code as
> well?
>
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1144794 ; in short just run
> 'rados df' to reproduce (you need to have ceph-common installed)
>
> [2] layout asm in gdb shows this as the crashing line:
>
> >│0x7ffff6c75153 <__GI___pthread_rwlock_unlock+19> xend |

Please open a new bug for this and we can triage there. The 2.20-5 version should have fixed this.

*** Bug 1146749 has been marked as a duplicate of this bug. ***
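As a quick first look at whether a given library build still carries TSX instructions such as the xend reported above, one could count the raw opcode byte patterns in the file. This is a heuristic sketch only, not what anyone in this thread ran: the same bytes can occur in data or inside other instructions, so a hit is a hint to go and disassemble, not proof. The names `TSX_OPCODES` and `count_tsx_patterns` are made up for illustration.

```python
# Opcode bytes for the RTM instructions; xbegin and xabort are followed by an
# operand (rel32 / imm8), so only their fixed leading bytes are matched here.
TSX_OPCODES = {
    "xbegin": bytes([0xC7, 0xF8]),
    "xend": bytes([0x0F, 0x01, 0xD5]),
    "xtest": bytes([0x0F, 0x01, 0xD6]),
    "xabort": bytes([0xC6, 0xF8]),
}


def count_tsx_patterns(blob):
    """Count occurrences of each TSX opcode byte pattern in a bytes object."""
    return {name: blob.count(pattern) for name, pattern in TSX_OPCODES.items()}
```

It could be used as `count_tsx_patterns(open(path, "rb").read())`, where `path` points at the libpthread in question (the exact location, e.g. /lib64/libpthread.so.0, is an assumption and varies by system). The short two-byte patterns in particular will produce false positives, so a disassembler remains the reliable check.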