
Bug 1146967

Summary: Latest firmware update causes SIGILL on xbeginq instruction on Haswell processors
Product: [Fedora] Fedora
Reporter: Amit Shah <amit.shah>
Component: glibc
Assignee: Carlos O'Donell <codonell>
Status: CLOSED ERRATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 21
CC: amit.shah, awilliam, bojan, branto, codonell, fweimer, hancockrwd, hdegoede, jakub, kparal, law, luto, nonamedotc, pfrankli, spoyarek, thetaeridanus
Target Milestone: ---
Keywords: CommonBugs, Reopened
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard: https://fedoraproject.org/wiki/Common_F21_bugs#haswell-microcode
Fixed In Version: glibc-2.20-5.fc21
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-09-30 03:47:18 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments (Description: Flags):
gdb analysis of core dump: none
cpuinfo: none

Description Amit Shah 2014-09-26 13:02:45 UTC
Created attachment 941562 [details]
gdb analysis of core dump

Description of problem:

The 2.1-8 update to microcode_ctl incorporates the latest Intel firmware.  This update causes the RTM and HLE instructions in glibc's lock elision code to raise SIGILL, crashing all apps on my Haswell system.  The first crash happens in systemd-udevd, so I don't get much further in the boot sequence.

Siddhesh and I found this while debugging systemd-udevd core dumps, and we found xbeginq was the instruction getting SIGILL.

While checking for Intel errata, Siddhesh found a reference to https://lkml.org/lkml/2014/9/18/218, which mentions that this firmware behaviour (continuing to advertise the hle/rtm instructions while making them SIGILL) might be deliberate rather than a bug.

We might need fixes across the kernel and glibc for this, or quirks for this hardware and microcode update in glibc and the kernel.

Core dump info in attachments.

microcode_ctl-2.1-8 causes the badness; 2.1-7 is fine (which has older firmware).
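
For anyone who wants to check what their own machine reports, here is a rough sketch that pulls the microcode revision and the TSX-related flags out of /proc/cpuinfo text. The function name `tsx_report` is made up for illustration. It only shows what the CPU *advertises*; as described above, the instructions can still SIGILL while the flags are present.

```python
def tsx_report(cpuinfo_text):
    """Return (microcode, flags) for the first processor entry.

    `flags` is the subset of {"hle", "rtm"} listed on the "flags" line.
    `microcode` is the raw value of the "microcode" line, or None if absent.
    """
    microcode = None
    advertised = set()
    seen_field = False
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            if seen_field:
                break  # blank line: end of the first processor entry
            continue
        seen_field = True
        key = key.strip()
        if key == "microcode" and microcode is None:
            microcode = value.strip()
        elif key == "flags":
            advertised = {"hle", "rtm"} & set(value.split())
    return microcode, advertised
```

On a Linux box it could be called as `tsx_report(open("/proc/cpuinfo").read())`. Only the first processor entry is inspected, on the assumption that all cores run the same microcode.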

Comment 1 Amit Shah 2014-09-26 13:03:21 UTC
Created attachment 941563 [details]
cpuinfo

Comment 2 Carlos O'Donell 2014-09-26 13:04:42 UTC
We are spinning an F20, F21, and Rawhide glibc with lock elision disabled. This will hopefully mean that nobody ends up with a broken system if they update glibc *and* the microcode_ctl package at the same time.

Comment 3 Carlos O'Donell 2014-09-26 13:16:55 UTC
Scratch builds for F20, F21 and Rawhide in progress.

Comment 5 Carlos O'Donell 2014-09-26 17:01:39 UTC
The rawhide build has some testing issues I'm sorting out right now.

Final build for f21:
http://koji.fedoraproject.org/koji/taskinfo?taskID=7703592

Final build for f20:
http://koji.fedoraproject.org/koji/taskinfo?taskID=7703676

Comment 7 Fedora Update System 2014-09-26 18:19:12 UTC
glibc-2.18-16.fc20 has been submitted as an update for Fedora 20.
https://admin.fedoraproject.org/updates/glibc-2.18-16.fc20

Comment 8 Fedora Update System 2014-09-26 18:19:20 UTC
glibc-2.20-4.fc21 has been submitted as an update for Fedora 21.
https://admin.fedoraproject.org/updates/glibc-2.20-4.fc21

Comment 9 Carlos O'Donell 2014-09-26 18:35:18 UTC
*** Bug 1147062 has been marked as a duplicate of this bug. ***

Comment 10 Hans de Goede 2014-09-26 18:53:22 UTC
I've just tested the F-21 update for this, and I'm afraid that the problem is still present there. I've also regenerated my initrd to make sure that it included the new glibc too, but that did not help.

Comment 11 Andy Lutomirski 2014-09-26 18:54:44 UTC
There's a long discussion about a real fix here:

http://thread.gmane.org/gmane.linux.kernel/1790211

No great solution yet.

Comment 12 Carlos O'Donell 2014-09-27 03:59:32 UTC
(In reply to Hans de Goede from comment #10)
> I've just tested the F-21 update for this, and I'm afraid that the problem
> is still present there. I've also regenerated my initrd to make sure that
> that included the new glibc too, but that did not help.

Are you certain? This update absolutely removes elision; you shouldn't have any TSX usage going on after the update to glibc-2.20-4.fc21. Can you confirm the version of your installed glibc is correct? Can you track down whether *all* your binaries fault or just some of them (e.g. those statically compiled against libpthread)?

Comment 13 Carlos O'Donell 2014-09-27 04:44:53 UTC
(In reply to Carlos O'Donell from comment #12)
> (In reply to Hans de Goede from comment #10)
> > I've just tested the F-21 update for this, and I'm afraid that the problem
> > is still present there. I've also regenerated my initrd to make sure that
> > that included the new glibc too, but that did not help.
> 
> Are you certain? This update absolutely removes elision, you shouldn't have
> any TSX usage going on after the update to glibc-2.20-4.fc21. Can you
> confirm the version of your installed glibc is correct? Can you track down
> if *all* your binaries fault or just some of them (statically compiled
> against libpthread)?

OK, I think I found the problem. There is a code path in rwlock that is using TSX unconditionally. I'm going to fix that and push out another build.

Comment 14 Carlos O'Donell 2014-09-27 04:55:24 UTC
Hans,

Would you mind testing this scratch build?
http://koji.fedoraproject.org/koji/taskinfo?taskID=7707772

It should fully disable TSX usage in libpthread.so.0. I don't have easy access to a box where I can do this kind of testing (e.g. microcode updates).
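
One way to double-check a build for leftover TSX usage is to scan a disassembly of the library for the relevant mnemonics. The sketch below assumes text in the usual `objdump -d` layout (address, hex bytes and instruction separated by tabs); the helper name `find_tsx` and the exact list of mnemonics are my own choices, not anything taken from the glibc patches.

```python
import re

# RTM instructions plus the HLE prefixes. AT&T-syntax disassembly may append
# an operand-size suffix to xbegin (this bug's summary shows "xbeginq").
TSX_MNEMONIC = re.compile(r"^(xbegin[wlq]?|xend|xabort|xtest|xacquire|xrelease)$")


def find_tsx(disassembly):
    """Return a list of (address, mnemonic) for TSX instructions found."""
    hits = []
    for line in disassembly.splitlines():
        # Instruction lines look like "  addr:<TAB>hex bytes<TAB>mnemonic operands".
        parts = line.split("\t")
        if len(parts) < 3 or not parts[0].strip().endswith(":"):
            continue  # symbol headers, byte continuation lines, blank lines
        fields = parts[2].split()
        if fields and TSX_MNEMONIC.match(fields[0]):
            hits.append((parts[0].strip().rstrip(":"), fields[0]))
    return hits
```

An empty result for the output of `objdump -d` on libpthread.so.0 would suggest that no TSX instruction was emitted at all. A non-empty result is not necessarily a bug, since the code may still be guarded by a runtime check.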

Comment 15 Hans de Goede 2014-09-27 09:08:46 UTC
(In reply to Carlos O'Donell from comment #12)
> (In reply to Hans de Goede from comment #10)
> > I've just tested the F-21 update for this, and I'm afraid that the problem
> > is still present there. I've also regenerated my initrd to make sure that
> > that included the new glibc too, but that did not help.
> 
> Are you certain? This update absolutely removes elision, you shouldn't have
> any TSX usage going on after the update to glibc-2.20-4.fc21. Can you
> confirm the version of your installed glibc is correct?

Yes, I double-checked that I had the correct version (and regenerated my initrd and rebooted) before commenting that the update does not fix things.

(In reply to Carlos O'Donell from comment #14)
> Hans,
> 
> Would you mind testing this scratch build?
> http://koji.fedoraproject.org/koji/taskinfo?taskID=7707772
> 
> It should fully disable TSX usage in libpthread.so.0. I don't have easy
> access to a box that I can do this kind of testing on e.g. microcode updates
> etc.

In the meantime I've installed this update:

https://admin.fedoraproject.org/updates/dracut-038-29.git20140903.fc21,kernel-3.16.3-302.fc21

That fixes things too, with less of a big-hammer approach.

I can still reproduce the problem by booting an older kernel, though. After verifying that, I installed your glibc scratch build, and I can confirm that with the scratch-build glibc the problem is gone, even when using the older kernel.

Comment 16 Fedora Update System 2014-09-27 09:42:15 UTC
Package glibc-2.18-16.fc20:
* should fix your issue,
* was pushed to the Fedora 20 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing glibc-2.18-16.fc20'
as soon as you are able to, then reboot.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2014-11586/glibc-2.18-16.fc20
then log in and leave karma (feedback).

Comment 17 Carlos O'Donell 2014-09-27 16:50:47 UTC
(In reply to Hans de Goede from comment #15)
> That fixes things too, with less of a big-hammer approach.
> 
> I can still reproduce the problem by booting an older kernel though. I've
> verified that booting an older kernel still exhibits the problem, then I've
> installed your glibc scratch build, and I can confirm that the problem is
> gone, even when using the older kernel, when using the glibc from the
> scratch build.

We are going to push the new glibc into F21 to fix this problem for anyone that doesn't want to upgrade their kernel.

I think the conservative approach of both a kernel fix and a runtime fix is best here, given that missing an update could break your box if you install a new microcode_ctl.

Final F21 build here:
http://koji.fedoraproject.org/koji/taskinfo?taskID=7709522

Comment 18 Zbigniew Jędrzejewski-Szmek 2014-09-27 23:10:41 UTC
*** Bug 1147118 has been marked as a duplicate of this bug. ***

Comment 19 Carlos O'Donell 2014-09-28 16:49:00 UTC
OK, the final update for F21 with a full fix was just pushed into Bodhi.

https://admin.fedoraproject.org/updates/FEDORA-2014-11673/glibc-2.20-5.fc21
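
To check whether an installed package string (such as the one `rpm -q glibc` reports) is at or past a given build, something like the following could be used. This is a deliberately simplified comparison, not rpm's real version-comparison algorithm: it ignores epochs and dist tags and only looks at the numeric version and the leading number of the release. The helper names are hypothetical.

```python
import re


def parse_nvr(nvr):
    """Split 'name-version-release' into (name, version tuple, release number)."""
    name, version, release = nvr.rsplit("-", 2)
    version_key = tuple(int(part) for part in version.split("."))
    # Only the leading integer of the release is compared ("5.fc21" -> 5).
    release_key = int(re.match(r"\d+", release).group())
    return name, version_key, release_key


def at_least(installed, wanted):
    """True if `installed` is the same package as `wanted` and not older."""
    i_name, i_ver, i_rel = parse_nvr(installed)
    w_name, w_ver, w_rel = parse_nvr(wanted)
    if i_name != w_name:
        raise ValueError("cannot compare different packages")
    return (i_ver, i_rel) >= (w_ver, w_rel)


# The first F21 update lacked the rwlock fix; the second one has it.
print(at_least("glibc-2.20-4.fc21", "glibc-2.20-5.fc21"))
print(at_least("glibc-2.20-5.fc21", "glibc-2.20-5.fc21"))
```

For anything beyond a quick sanity check, rpm's own tooling is the safer choice, since it handles the corner cases this sketch skips.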

Comment 20 Fedora Update System 2014-09-29 04:04:45 UTC
glibc-2.18-16.fc20 has been pushed to the Fedora 20 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 21 Amit Shah 2014-09-29 06:01:08 UTC
Given the kernel is going to inject the microcode on start, and no execution of 'cpuid' is going to see those tmem instructions being exposed by the cpu, the glibc fix may not be needed at all?  I think we can revert the glibc fix given the kernel has been fixed appropriately already.  Relevant kernel bug is bug 1083716.
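
For reference, the feature bits in question live in CPUID leaf 7, subleaf 0, register EBX: bit 4 is HLE and bit 11 is RTM. A small sketch of decoding them from a raw EBX value follows; how you obtain that value (for example from a cpuid dump tool) is left open, and the function name is only illustrative.

```python
HLE_BIT = 4   # CPUID.(EAX=7,ECX=0):EBX bit 4
RTM_BIT = 11  # CPUID.(EAX=7,ECX=0):EBX bit 11


def decode_tsx(ebx):
    """Report which TSX features a CPUID leaf 7 EBX value advertises."""
    return {
        "hle": bool((ebx >> HLE_BIT) & 1),
        "rtm": bool((ebx >> RTM_BIT) & 1),
    }
```

If both entries come back False after the microcode has been loaded, code that gates on CPUID should never reach the TSX paths.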

Comment 22 Siddhesh Poyarekar 2014-09-29 08:48:46 UTC
(In reply to Amit Shah from comment #21)
> Given the kernel is going to inject the microcode on start, and no execution
> of 'cpuid' is going to see those tmem instructions being exposed by the cpu,
> the glibc fix may not be needed at all?  I think we can revert the glibc fix
> given the kernel has been fixed appropriately already.  Relevant kernel bug
> is bug 1083716.

The glibc fix is still useful because it fixes TSX code that sneaked out of the --enable-lock-elision configuration, which could potentially cause problems later.  Also, there is no point keeping the bits enabled because if TSX does get enabled in future microcode and it behaves differently (breaking glibc expectations), we'll have to rush to patch older systems.

We do need to re-enable elision for s390* in f21 and rawhide:

https://lists.fedoraproject.org/pipermail/glibc/2014-September/000062.html

Comment 23 Bojan Smojver 2014-09-29 22:56:03 UTC
That update to F-20, -16, appears broken. See my comments in the update.

I have no idea why xrdp would just hang like that, but going back to either -11 or -14 immediately fixes the problem.

I'm running this on an i686 VM, which is most likely running on VMWare ESX or something like that (I don't control this bit).

Comment 24 Carlos O'Donell 2014-09-29 22:59:16 UTC
(In reply to Bojan Smojver from comment #23)
> That update to F-20, -16, appears broken. See my comments in the update.
> 
> I have no idea why xrdp would just hang like that, but going back to either
> -11 or -14 immediately fixes the problem.
> 
> I'm running this on an i686 VM, which is most likely running on VMWare ESX
> or something like that (I don't control this bit).

Are you able to ssh into the box remotely, attach a debugger, and do a backtrace to see where it's hung?

Comment 25 Bojan Smojver 2014-09-29 23:22:47 UTC
(In reply to Carlos O'Donell from comment #24)
 
> Are you able to remote ssh into the box, attach a debugger, and do a
> backtrace to see where it's hung?

Yeah, ssh works. Xrdp runs a couple of processes; I can attach to both before I log in and see. Maybe something that is forked off crashes or something like that. No idea at this point - the logs have nothing useful.

A quick strace of the remaining xrdp processes just has one of them sitting in select.

Comment 26 Bojan Smojver 2014-09-30 03:10:59 UTC
(In reply to Bojan Smojver from comment #25)
 
> Yeah, ssh works. Xrdp runs a couple of processes, I can attach to both
> before I login and see. Maybe something that is forked off crashes or
> something like that. No idea at this point - the logs have nothing useful.
> 
> A quick strace of the remaining xrdp processes just has one of them sitting
> in select.

Wow - did I make a fool of myself or what? Xrdp works now with -16.

Yesterday when I upgraded, it hung on login. I rebooted the VM. Hung again. I reverted to -11, rebooted. Worked. Installed -14, rebooted, worked.

No idea...

Anyhow, it must have been some other condition somewhere that coincided with the glibc upgrade or something.

Comment 27 Carlos O'Donell 2014-09-30 03:47:18 UTC
(In reply to Bojan Smojver from comment #26)
> Wow - did I make a fool of myself or what? Xrdp works now with -16.

Bojan, you are not a fool. I am incredibly appreciative of people like you who are willing to step up and say something is broken and help out. I have infinite patience for that kind of dedication. Thank you for raising the issue. I'm glad it turned out to be nothing, but it might not have been.

Comment 28 Fedora Update System 2014-10-03 04:05:44 UTC
glibc-2.20-5.fc21 has been pushed to the Fedora 21 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 29 Boris Ranto 2014-10-31 12:22:08 UTC
We are seeing something similar to this in rados binary [1]. On shutdown, the program calls rados_shutdown() which calls the appropriate destructors. In particular, it calls ~RWLock() which issues pthread_rwlock_unlock(). This causes the program to receive SIGILL signal. Debugging with gdb, it seems that the instruction that causes this is xend [2] which is an Intel TSX instruction.

I am no expert in this matter but it seems that this issue is not fully resolved, yet. I've looked at the patches and xbegin seems to be explicitly disabled there. Maybe, we need to explicitly disable xend in the code as well?

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1144794 ; in short just run 'rados df' to reproduce (you need to have ceph-common installed)

[2] layout asm in gdb shows this as the crashing line:

>│0x7ffff6c75153 <__GI___pthread_rwlock_unlock+19>        xend   |

Comment 30 Carlos O'Donell 2014-10-31 12:59:43 UTC
(In reply to Boris Ranto from comment #29)
> We are seeing something similar to this in rados binary [1]. On shutdown,
> the program calls rados_shutdown() which calls the appropriate destructors.
> In particular, it calls ~RWLock() which issues pthread_rwlock_unlock(). This
> causes the program to receive SIGILL signal. Debugging with gdb, it seems
> that the instruction that causes this is xend [2] which is an Intel TSX
> instruction.
> 
> I am no expert in this matter but it seems that this issue is not fully
> resolved, yet. I've looked at the patches and xbegin seems to be explicitly
> disabled there. Maybe, we need to explicitly disable xend in the code as
> well?
> 
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1144794 ; in short just run
> 'rados df' to reproduce (you need to have ceph-common installed)
> 
> [2] layout asm in gdb shows this as the crashing line:
> 
> >│0x7ffff6c75153 <__GI___pthread_rwlock_unlock+19>        xend   |

Please open a new bug for this and we can triage there.

The 2.20-5 version should have fixed this.

Comment 31 Robert Hancock 2015-05-30 00:10:43 UTC
*** Bug 1146749 has been marked as a duplicate of this bug. ***