Bug 1807668 - kernel panic on Penguin Computing Valkre 2040 Gigabyte Blade ARM (CN88xx)
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 31
Hardware: aarch64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Mark Salter
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: ARMTracker
Reported: 2020-02-26 21:17 UTC by Rachel Sibley
Modified: 2020-03-15 22:12 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-03-15 22:12:44 UTC
Type: Bug
Embargoed:



Description Rachel Sibley 2020-02-26 21:17:22 UTC
1. Please describe the problem:
During kernel CI testing of upstream kernel 5.6.0-rc1 on the Fedora 31 GA distro, we are seeing a kernel panic on Penguin Valkre 2040 Gigabyte Blade ARM (CN88xx) systems. We have also seen a hang on a similar system, a Gigabyte R120 ARM (CN88xx) server, using the same kernel.

2. What is the Version-Release number of the kernel:
5.6.0-rc1

3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :

I have only seen it happen with the 5.6.0-rc1 kernel.


4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:

Yes, I have seen it 3 times already.


5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:

Not sure yet.

6. Are you running any modules that are not shipped directly with Fedora's kernel?:
No

7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.

Full test logs, kernel tarball, and configs are found here:
https://cki-artifacts.s3.us-east-2.amazonaws.com/index.html?prefix=datawarehouse/2020/02/25/457325

The log of interest is
https://cki-artifacts.s3.us-east-2.amazonaws.com/datawarehouse/2020/02/25/457325/aarch64_1_console.log

Comment 1 Jeff Bastian 2020-02-26 21:25:52 UTC
Copy of the console log:

Linux version 5.6.0-rc1-214527a.cki
[    0.000000] Machine model: Cavium ThunderX CN88XX board 
[    0.000000] earlycon: pl11 at MMIO 0x000087e024000000 (options '') 
[    0.000000] printk: bootconsole [pl11] enabled 
[    0.000000] efi: Getting EFI parameters from FDT: 
[    0.000000] efi: EFI v2.40 by American Megatrends 
[    0.000000] efi:  ESRT=0x10ffe0dff18  SMBIOS 3.0=0xfffb0000  ACPI 2.0=0x10ffff30000  MEMRESERVE=0x10ffdc72e18  
[    0.000000] esrt: Reserving ESRT space from 0x0000010ffe0dff18 to 0x0000010ffe0dff50. 
[    0.000000] cma: Reserved 64 MiB at 0x00000000fbc00000 
[    0.000000] OF: NUMA: parsing numa-distance-map-v1 
[    0.000000] NUMA: Warning: invalid memblk node 512 [mem 0x00500000-0x00dfffff] 
[    0.000000] NUMA: Faking a node at [mem 0x0000000000500000-0x0000010fffffffff] 
[    0.000000] NUMA: NODE_DATA [mem 0x10fef0c6300-0x10fef0d7fff] 
[    0.000000] Zone ranges: 
[    0.000000]   DMA      [mem 0x0000000000500000-0x000000003fffffff] 
[    0.000000]   DMA32    [mem 0x0000000040000000-0x00000000ffffffff] 
[    0.000000]   Normal   [mem 0x0000000100000000-0x0000010fffffffff] 
[    0.000000] Movable zone start for each node 
[    0.000000] Early memory node ranges 
[    0.000000]   node   0: [mem 0x0000000000500000-0x0000000000dfffff] 
[    0.000000]   node   0: [mem 0x0000000000e00000-0x000000000fffffff] 
[    0.000000]   node   0: [mem 0x0000000010000000-0x000000001012bfff] 
[    0.000000]   node   0: [mem 0x000000001012c000-0x00000000fff9ffff] 
[    0.000000]   node   0: [mem 0x00000000fffa0000-0x00000000ffffffff] 
[    0.000000]   node   0: [mem 0x0000000100000000-0x0000000fff0fffff] 
[    0.000000]   node   0: [mem 0x0000010001400000-0x0000010ff5eaffff] 
[    0.000000]   node   0: [mem 0x0000010ff5eb0000-0x0000010ff5ecffff] 
[    0.000000]   node   0: [mem 0x0000010ff5ed0000-0x0000010ffb92ffff] 
[    0.000000]   node   0: [mem 0x0000010ffb930000-0x0000010ffc22ffff] 
[    0.000000]   node   0: [mem 0x0000010ffc230000-0x0000010ffda2ffff] 
[    0.000000]   node   0: [mem 0x0000010ffda30000-0x0000010ffda9ffff] 
[    0.000000]   node   0: [mem 0x0000010ffdaa0000-0x0000010ffdbeffff] 
[    0.000000]   node   0: [mem 0x0000010ffdbf0000-0x0000010ffdc3ffff] 
[    0.000000]   node   0: [mem 0x0000010ffdc40000-0x0000010ffdd0ffff] 
[    0.000000]   node   0: [mem 0x0000010ffdd10000-0x0000010ffdd1ffff] 
[    0.000000]   node   0: [mem 0x0000010ffdd20000-0x0000010ffe08ffff] 
[    0.000000]   node   0: [mem 0x0000010ffe090000-0x0000010ffe3fffff] 
[    0.000000]   node   0: [mem 0x0000010ffe400000-0x0000010fffefffff] 
[    0.000000]   node   0: [mem 0x0000010ffff00000-0x0000010ffff2ffff] 
[    0.000000]   node   0: [mem 0x0000010ffff30000-0x0000010ffff3ffff] 
[    0.000000]   node   0: [mem 0x0000010ffff40000-0x0000010ffffeffff] 
[    0.000000]   node   0: [mem 0x0000010fffff0000-0x0000010fffffffff] 
[    0.000000] Zeroed struct page in unavailable ranges: 1312 pages 
[    0.000000] Initmem setup node 0 [mem 0x0000000000500000-0x0000010fffffffff] 
[    0.000000] psci: probing for conduit method from DT. 
[    0.000000] psci: PSCIv0.2 detected in firmware. 
[    0.000000] psci: Using standard PSCI v0.2 function IDs 
[    0.000000] psci: Trusted OS resident on physical CPU 0x0 
[    0.000000] percpu: Embedded 31 pages/cpu s87512 r8192 d31272 u126976 
[    0.000000] Detected VIPT I-cache on CPU0 
[    0.000000] CPU features: detected: GIC system register CPU interface 
[    0.000000] CPU features: detected: Software prefetching using PRFM 
[    0.000000] CPU features: detected: Cavium erratum 27456 
[    0.000000] CPU features: detected: Cavium erratum 30115 
[    0.000000] CPU features: kernel page table isolation forced OFF by ARM64_WORKAROUND_CAVIUM_27456 
[    0.000000] ARM_SMCCC_ARCH_WORKAROUND_1 missing from firmware 
[    0.000000] Built 1 zonelists, mobility grouping on.  Total pages: 33020064 
[    0.000000] Policy zone: Normal 
[    0.000000] Kernel command line: BOOT_IMAGE=(hd0,gpt2)/vmlinuz-5.6.0-rc1-214527a.cki root=/dev/mapper/fedora_penguin--valkre2040--01-root ro arm-smmu.disable_bypass=n earlycon=pl011,0x87e024000000 iommu.passthrough=1 rd.lvm.lv=fedora_penguin-valkre2040-01/root rd.lvm.lv=fedora_penguin-valkre2040-01/swap console=ttyAMA0 
[    0.000000] printk: log_buf_len individual max cpu contribution: 4096 bytes 
[    0.000000] printk: log_buf_len total cpu_extra contributions: 389120 bytes 
[    0.000000] printk: log_buf_len min size: 262144 bytes 
[    0.000000] printk: log_buf_len: 1048576 bytes 
[    0.000000] printk: early log buf free: 255900(97%) 
[    0.000000] Dentry cache hash table entries: 8388608 (order: 14, 67108864 bytes, linear) 
[    0.000000] Inode-cache hash table entries: 4194304 (order: 13, 33554432 bytes, linear) 
[    0.000000] mem auto-init: stack:off, heap alloc:off, heap free:off 
[    0.000000] software IO TLB: mapped [mem 0x3bfff000-0x3ffff000] (64MB) 
[    0.000000] Memory: 131492696K/134176768K available (12220K kernel code, 2752K rwdata, 6280K rodata, 6400K init, 8257K bss, 2618536K reserved, 65536K cma-reserved) 
[    0.000000] random: get_random_u64 called from kmem_cache_open+0x30/0x4a0 with crng_init=0 
[    0.000000] SLUB: HWalign=128, Order=0-3, MinObjects=0, CPUs=96, Nodes=1 
[    0.000000] ftrace: allocating 43265 entries in 170 pages 
[    0.000000] ftrace: allocated 170 pages with 4 groups 
[    0.000000] rcu: Hierarchical RCU implementation. 
[    0.000000] rcu: 	RCU restricting CPUs from NR_CPUS=4096 to nr_cpu_ids=96. 
[    0.000000] 	Tasks RCU enabled. 
[    0.000000] rcu: RCU calculated value of scheduler-enlistment delay is 10 jiffies. 
[    0.000000] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=96 
[    0.000000] NR_IRQS: 64, nr_irqs: 64, preallocated irqs: 0 
[    0.000000] GICv3: GIC: Using split EOI/Deactivate mode 
[    0.000000] GICv3: 128 SPIs implemented 
[    0.000000] GICv3: 0 Extended SPIs implemented 
[    0.000000] Internal error: synchronous external abort: 96000210 [#1] SMP 
[    0.000000] Modules linked in: 
[    0.000000] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.6.0-rc1-214527a.cki #1 
[    0.000000] Hardware name: Cavium ThunderX CN88XX board (DT) 
[    0.000000] pstate: 60400085 (nZCv daIf +PAN -UAO) 
[    0.000000] pc : __raw_readl+0x0/0x8 
[    0.000000] lr : gic_init_bases+0x110/0x4b0 
[    0.000000] sp : ffff8000118e3dd0 
[    0.000000] x29: ffff8000118e3dd0 x28: 00000000020a0018  
[    0.000000] x27: 0000000000000018 x26: 0000000000000000  
[    0.000000] x25: 0000000000000002 x24: ffff010fe7f02680  
[    0.000000] x23: 0000000000000000 x22: ffff800010d129c0  
[    0.000000] x21: ffff010fef13b990 x20: 00000000009b0404  
[    0.000000] x19: ffff8000118f6340 x18: 0000000000000005  
[    0.000000] x17: 000000005ec1b98f x16: 00000000f5fbaae0  
[    0.000000] x15: 0000000000000010 x14: ffffffffffffffff  
[    0.000000] x13: ffff8000918e3b5f x12: ffff8000118e3b6c  
[    0.000000] x11: ffff80001192a000 x10: ffff80001105b630  
[    0.000000] x9 : ffff80001017d330 x8 : 0000000000000068  
[    0.000000] x7 : 000000000000000d x6 : ffff800011c59be9  
[    0.000000] x5 : 0000000000000001 x4 : 0000000000000000  
[    0.000000] x3 : 0000000000000000 x2 : 00000000ffffffff  
[    0.000000] x1 : ffff80001192ad78 x0 : ffff80001263000c  
[    0.000000] Call trace: 
[    0.000000]  __raw_readl+0x0/0x8 
[    0.000000]  gic_of_init+0x170/0x1f8 
[    0.000000]  of_irq_init+0x1e4/0x3c4 
[    0.000000]  irqchip_init+0x1c/0x40 
[    0.000000]  init_IRQ+0xd8/0x108 
[    0.000000]  start_kernel+0x3e8/0x574 
[    0.000000] Code: 1a8007e0 d65f03c0 d538d080 d65f03c0 (b9400000)  
[    0.000000] ---[ end trace fb262ad5a6fb6046 ]--- 
[    0.000000] Kernel panic - not syncing: Attempted to kill the idle task! 
[    0.000000] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---

Comment 2 Jeff Bastian 2020-02-26 21:32:43 UTC
The 5.5.5-200.fc31 kernel was used to install the 5.6.0-rc1 kernel on this system, so it looks like there is a regression in the new 5.6 kernel.

Comment 3 Paul Whalen 2020-02-27 00:06:58 UTC
I think I saw this on the ThunderX2, and it was fixed with RC2. Have you tried later kernels? (RC3 is also out now)

Comment 4 Jeff Bastian 2020-02-27 22:07:07 UTC
Unfortunately 5.6 rc3 fails the same way:

[    0.000000] Internal error: synchronous external abort: 96000210 [#1] SMP
[    0.000000] Modules linked in:
[    0.000000] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.6.0-0.rc3.1.elrdy.aarch64 #1
[    0.000000] Hardware name: Cavium ThunderX CN88XX board (DT)
[    0.000000] pstate: 60000085 (nZCv daIf -PAN -UAO)
[    0.000000] pc : __raw_readl+0x0/0x8
[    0.000000] lr : gic_init_bases+0x110/0x57c
[    0.000000] sp : ffff8000113efd60
[    0.000000] x29: ffff8000113efd60 x28: 0000000000000000 
[    0.000000] x27: 0000000000000018 x26: 0000000000000002 
[    0.000000] x25: ffff800012ac0000 x24: ffff01000146af00 
[    0.000000] x23: 0000000000000000 x22: ffff010ffeed3760 
[    0.000000] x21: ffff800010b42a40 x20: 00000000009b0404 
[    0.000000] x19: ffff800011420290 x18: 0000000000000010 
[    0.000000] x17: 00000000b74a6603 x16: 0000000000000000 
[    0.000000] x15: ffffffffffffffff x14: ffff800011413948 
[    0.000000] x13: ffff8000913efad7 x12: ffff8000113efae4 
[    0.000000] x11: ffff800011451000 x10: ffff8000113efa60 
[    0.000000] x9 : ffff80001016a864 x8 : ffff800010670c98 
[    0.000000] x7 : 0000000000000063 x6 : ffff800011665bf9 
[    0.000000] x5 : 0000000000000001 x4 : 0000000000000000 
[    0.000000] x3 : 0000000000000000 x2 : 0000000000000000 
[    0.000000] x1 : 0000000000000000 x0 : ffff800012ac000c 
[    0.000000] Call trace:
[    0.000000]  __raw_readl+0x0/0x8
[    0.000000]  gic_of_init+0x184/0x220
[    0.000000]  of_irq_init+0x204/0x3d0
[    0.000000]  irqchip_init+0x1c/0x40
[    0.000000]  init_IRQ+0xe0/0x150
[    0.000000]  start_kernel+0x5c4/0x7a0
[    0.000000] Code: 52800000 d65f03c0 d538d080 d65f03c0 (b9400000) 
[    0.000000] ---[ end trace 38f01c1d6a66ca51 ]---
[    0.000000] Kernel panic - not syncing: Fatal exception
[    0.000000] ---[ end Kernel panic - not syncing: Fatal exception ]---



I also tried booting with ACPI instead of DeviceTree and it still panics, although the trace is a little different:

[    0.000000] Internal error: synchronous external abort: 96000210 [#1] SMP 
[    0.000000] Modules linked in: 
[    0.000000] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.6.0-0.rc3.1.elrdy.aarch64 #1 
[    0.000000] Hardware name: Cavium ThunderX CN88XX board (DT) 
[    0.000000] pstate: 60000085 (nZCv daIf -PAN -UAO) 
[    0.000000] pc : __raw_readl+0x0/0x8 
[    0.000000] lr : gic_init_bases+0x110/0x57c 
[    0.000000] sp : ffff8000113efd30 
[    0.000000] x29: ffff8000113efd30 x28: fffffc1ffe404364  
[    0.000000] x27: 0000000000000001 x26: 0000000000000000  
[    0.000000] x25: ffff8000113efe80 x24: ffff000fc006ae80  
[    0.000000] x23: 0000000000000000 x22: ffff000fc006b000  
[    0.000000] x21: ffff800010b42a40 x20: 00000000009b0404  
[    0.000000] x19: ffff800011420290 x18: 0000000000000010  
[    0.000000] x17: 00000000b74a6603 x16: 0000000000000000  
[    0.000000] x15: ffffffffffffffff x14: ffff800011413948  
[    0.000000] x13: ffff8000913efaa7 x12: ffff8000113efab4  
[    0.000000] x11: ffff800011451000 x10: ffff8000113efa30  
[    0.000000] x9 : ffff80001016a864 x8 : ffff800010670c98  
[    0.000000] x7 : 00000000000000d9 x6 : ffff800011665bf9  
[    0.000000] x5 : 0000000000000001 x4 : 0000000000000000  
[    0.000000] x3 : 0000000000000000 x2 : 0000000000000000  
[    0.000000] x1 : 0000000000000000 x0 : ffff800012ac000c  
[    0.000000] Call trace: 
[    0.000000]  __raw_readl+0x0/0x8 
[    0.000000]  gic_acpi_init+0x130/0x260 
[    0.000000]  acpi_match_madt+0x4c/0x80 
[    0.000000]  acpi_table_parse_entries_array+0x174/0x25c 
[    0.000000]  acpi_table_parse_entries+0x48/0x68 
[    0.000000]  acpi_table_parse_madt+0x2c/0x34 
[    0.000000]  __acpi_probe_device_table+0x88/0xe0 
[    0.000000]  irqchip_init+0x38/0x40 
[    0.000000]  init_IRQ+0xe0/0x150 
[    0.000000]  start_kernel+0x5c4/0x7a0 
[    0.000000] Code: 52800000 d65f03c0 d538d080 d65f03c0 (b9400000)  
[    0.000000] ---[ end trace 38f01c1d6a66ca51 ]--- 
[    0.000000] Kernel panic - not syncing: Fatal exception 
[    0.000000] ---[ end Kernel panic - not syncing: Fatal exception ]---

Comment 5 Mark Salter 2020-02-27 23:46:50 UTC
So, "Internal error: synchronous external abort: 96000210"

The "96" part of the error code means:

  Data Abort taken without a change in Exception level.
  Used for MMU faults generated by data accesses, alignment faults other than those
  caused by Stack Pointer misalignment, and synchronous External aborts, including
  synchronous parity or ECC errors. Not used for debug related exceptions.

The 210 part means:

  * it was an external data abort (it was triggered off core)
  * not on translation table walk
  * it was a read

I'll take a look. Should be easy enough to track down.
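
The decoding above can be checked mechanically. The following is a minimal sketch, not kernel code: it splits the value 96000210 into fields, assuming the ESR_ELx layout for data aborts from the Arm architecture manual. The function name `decode_esr` and the dictionary keys are made up for illustration. Note that the top byte 0x96 covers both the exception class (bits 31:26) and the instruction-length bit (bit 25).

```python
# Sketch: split an AArch64 ESR value into the fields discussed above.
# Field positions are assumed from the Arm ARM layout for data aborts;
# decode_esr and its keys are hypothetical names used only for this example.

EC_DABT_SAME_EL = 0x25      # Data Abort taken without a change in Exception level
DFSC_SYNC_EXTERNAL = 0x10   # synchronous external abort, not on translation table walk


def decode_esr(esr: int) -> dict:
    iss = esr & 0x1FFFFFF            # instruction specific syndrome, bits [24:0]
    return {
        "ec": (esr >> 26) & 0x3F,    # exception class, bits [31:26]
        "il": (esr >> 25) & 0x1,     # 1 = 32-bit instruction
        "ea": (iss >> 9) & 0x1,      # implementation-defined external abort type
        "wnr": (iss >> 6) & 0x1,     # 0 = read, 1 = write
        "dfsc": iss & 0x3F,          # data fault status code, bits [5:0]
    }


fields = decode_esr(0x96000210)
for name, value in fields.items():
    print(f"{name:5s} = {value:#x}")
```

For this value the exception class is 0x25, the fault status code is 0x10, and the write-not-read bit is clear, which matches the three bullet points above.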

Comment 6 Mark Salter 2020-03-07 23:51:12 UTC
So, it took a minute to get a machine to test on, but the problem turns out
to be commit f2d834092ee2 "irqchip/gic-v3: Add GICv4.1 VPEID size discovery"
which includes a read of the GICv4 TYPER2 register in a path also used by GICv3.
Some GICv3 implementations will return a zero (hopefully) for an unimplemented
register, but ThunderX signals an SEA to the core. The solution is to avoid
the read for GICv3. I sent a patch to do this to the maintainer:

  https://lkml.org/lkml/2020/3/7/254
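
To illustrate the idea of skipping the read on GICv3, here is a toy model in Python. It is not the patch itself, and the patch that finally landed may take a different form. The class `ToyDistributor`, the exception `SyncExternalAbort` and the function `read_typer2_guarded` are hypothetical names for this sketch. The register offsets (GICD_TYPER2 at 0x000C, GICD_PIDR2 at 0xFFE8) and the architecture revision field in bits 7:4 of PIDR2 are assumed from the GIC architecture specification; the 0x000C offset is consistent with x0 ending in ...000c in the traces.

```python
# Toy model of the failure and of a guarded read; not kernel code.
# All class and function names here are hypothetical.

GICD_TYPER2 = 0x000C        # offset of the faulting read
GICD_PIDR2 = 0xFFE8         # ID register holding the architecture revision
PIDR2_ARCH_MASK = 0xF0      # ArchRev field, bits [7:4]
PIDR2_ARCH_GICV3 = 0x30
PIDR2_ARCH_GICV4 = 0x40


class SyncExternalAbort(Exception):
    """Stands in for the SEA signalled on an unimplemented register."""


class ToyDistributor:
    def __init__(self, regs, abort_on_missing):
        self.regs = regs
        self.abort_on_missing = abort_on_missing

    def readl(self, offset):
        if offset in self.regs:
            return self.regs[offset]
        if self.abort_on_missing:
            raise SyncExternalAbort(hex(offset))
        return 0  # some implementations return zero instead of aborting


def read_typer2_guarded(dist):
    """Only touch GICD_TYPER2 when the distributor reports GICv4 or later."""
    arch = dist.readl(GICD_PIDR2) & PIDR2_ARCH_MASK
    if arch < PIDR2_ARCH_GICV4:
        return 0
    return dist.readl(GICD_TYPER2)


# A GICv3 distributor that aborts on unimplemented registers, like ThunderX.
thunderx = ToyDistributor({GICD_PIDR2: PIDR2_ARCH_GICV3}, abort_on_missing=True)
print(hex(read_typer2_guarded(thunderx)))
```

An unguarded `thunderx.readl(GICD_TYPER2)` raises `SyncExternalAbort` in this model, which mirrors the panic in the traces above, while the guarded read returns 0 without touching the register.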

Comment 7 Mark Salter 2020-03-11 16:55:03 UTC
Marvell responded with an erratum and Marc respun the patch accordingly.

  https://lkml.org/lkml/2020/3/11/457

I'll update the bug when it lands in Linus' tree.

Comment 8 Mark Salter 2020-03-15 22:12:44 UTC
Fix is in v5.6-rc6

