Bug 1318596
Summary: | unable to handle kernel NULL pointer dereference at (null) in _find_next_bit
---|---
Product: | [Fedora] Fedora
Component: | kernel
Version: | rawhide
Status: | CLOSED RAWHIDE
Severity: | unspecified
Priority: | unspecified
Hardware: | Unspecified
OS: | Unspecified
Reporter: | Richard W.M. Jones <rjones>
Assignee: | Kernel Maintainer List <kernel-maint>
QA Contact: | Fedora Extras Quality Assurance <extras-qa>
CC: | gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda, mchehab
Doc Type: | Bug Fix
Type: | Bug
Last Closed: | 2016-03-18 13:59:50 UTC
Bug Blocks: | 910269
Description
Richard W.M. Jones
2016-03-17 10:25:18 UTC
Is this plain QEMU or KVM? I tested this kernel on a KVM guest before I sent it to koji.

Plain QEMU (software emulation / TCG). I wonder if it could be a bug in TCG?

Here's the disassembly with the apparently failing instruction marked with asterisks:

ffffffff814698a0 <_find_next_bit.part.0>:
ffffffff814698a0:  48 89 c8              mov    %rcx,%rax
ffffffff814698a3:  48 89 d1              mov    %rdx,%rcx
ffffffff814698a6:  49 c7 c0 ff ff ff ff  mov    $0xffffffffffffffff,%r8
ffffffff814698ad:  48 c1 e9 06           shr    $0x6,%rcx
ffffffff814698b1:  49 89 c1              mov    %rax,%r9
ffffffff814698b4:  55                    push   %rbp
ffffffff814698b5:  4c 33 0c cf           xor    (%rdi,%rcx,8),%r9   ******
ffffffff814698b9:  89 d1                 mov    %edx,%ecx
ffffffff814698bb:  48 83 e2 c0           and    $0xffffffffffffffc0,%rdx

Doesn't appear to be doing anything very strange that would cause TCG to fail, so I'm guessing it's not a decoding failure or a brand new instruction.

It looks as if the failing path is:

cpumask_any_but
 -> calls for_each_cpu
 -> calls cpumask_next
 -> calls find_next_bit

This is a single CPU virtual machine. I'm a bit lost after that, but the first parameter of find_next_bit appears to be NULL for some reason.

I'm guessing this is probably because of commit c25323c07345a843a56a294047b130dfd9250fad, where the topology_core_cpumask that was added to calibrate_delay_is_known is interacting badly with your emulated machine.

It would be helpful if you could distill the log down to a command we can use to run qemu in a similar setup.

Created attachment 1137468
reproducer.sh
A reproducer is attached. It just requires the kernel + qemu.
The test is very timing sensitive. I found that it only reproduced
about 1 in 10 times. It seems more likely to reproduce if the host
machine is busy. I did a bunch of git pulls and kernel compiles at the
same time, and that seems to make it reproduce more reliably.
I should note that it's expected that the kernel will panic because there is no initramfs nor root filesystem. If you hit that panic, then you *didn't* reproduce the bug. You only reproduce the bug if the kernel crashes with the stack trace shown in comment 0.

Finally worked out the right incantation to run the script over and over again until you hit the failure:

while ./reproducer.sh >& /tmp/log ; ! grep -sq calibrate_delay_is_known /tmp/log; do echo -n .; done

(that's all on a single line)

Looks like we have a fix upstream:

https://lkml.org/lkml/2016/3/18/74

Patch included in the rc8.git8.1.fc25 build. Thanks for the report and testing.

(In reply to Josh Boyer from comment #10)
> Patch included in the rc8.git8.1.fc25 build. Thanks for the report and
> testing.

Er, rc0.git8.1.fc25 obviously. Sigh.
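
For readers following the analysis earlier in this bug (cpumask_any_but -> for_each_cpu -> cpumask_next -> find_next_bit, with a NULL first parameter), here is a rough user-space sketch of that call path. It is not the kernel source: the sketch_ names are hypothetical, the `invert` argument is only inferred from the register use in the disassembly, and the assert() stands in for the fault the kernel took. The point is that the very first word load, addr[start / BITS_PER_LONG], is the kind of access the marked instruction (xor (%rdi,%rcx,8),%r9 after shr $0x6,%rcx) would perform, so a NULL bitmap with start 0 dereferences address 0.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/*
 * Simplified model of a find_next_bit-style scan (hypothetical name, not
 * the kernel implementation). Returns the index of the first set bit at
 * or after `start`, or `nbits` if there is none. `invert` is XORed into
 * every word, so passing ~0UL would search for zero bits instead.
 */
static unsigned long sketch_find_next_bit(const unsigned long *addr,
                                          unsigned long nbits,
                                          unsigned long start,
                                          unsigned long invert)
{
    unsigned long tmp;

    if (start >= nbits)
        return nbits;

    /* The kernel has no such check; a NULL bitmap faults on the load below. */
    assert(addr != NULL);

    /* First word load: index is start >> 6 on a 64-bit machine. */
    tmp = addr[start / BITS_PER_LONG] ^ invert;

    /* Ignore bits below `start` in the first word, then align `start`. */
    tmp &= ~0UL << (start % BITS_PER_LONG);
    start -= start % BITS_PER_LONG;

    while (tmp == 0) {
        start += BITS_PER_LONG;
        if (start >= nbits)
            return nbits;
        tmp = addr[start / BITS_PER_LONG] ^ invert;
    }

    /* Advance to the lowest set bit in the current word. */
    while ((tmp & 1UL) == 0) {
        tmp >>= 1;
        start++;
    }
    return start < nbits ? start : nbits;
}

/*
 * Simplified model of cpumask_any_but: the first CPU set in `mask` other
 * than `cpu`, or `nr_cpus` if there is none.
 */
static unsigned long sketch_cpumask_any_but(const unsigned long *mask,
                                            unsigned long nr_cpus,
                                            unsigned long cpu)
{
    unsigned long i = sketch_find_next_bit(mask, nr_cpus, 0, 0);

    while (i < nr_cpus && i == cpu)
        i = sketch_find_next_bit(mask, nr_cpus, i + 1, 0);
    return i;
}
```

With a valid one-bit mask on a single CPU machine, sketch_cpumask_any_but(mask, 1, 0) simply returns 1 (no other CPU). The crash in this bug corresponds to the mask pointer itself being NULL, which in this sketch trips the assert before the first load.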