1497720 – Beta-1.5: Newly installed system does not boot on ppc64le/ppc64 baremetal machine.

Keywords:
Status:	CLOSED WORKSFORME
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	27
Hardware:	ppc64le
OS:	Linux
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	PPCTracker

Reported:	2017-10-02 14:04 UTC by Éric Fintzel
Modified:	2018-11-27 23:21 UTC (History)
CC List:	25 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2018-11-27 23:21:56 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments
dmesg log for ppc64le (178.85 KB, text/plain) 2017-10-02 14:06 UTC, Éric Fintzel
dmesg log for ppc64 (186.79 KB, text/plain) 2017-10-02 14:06 UTC, Éric Fintzel
Fedora 26 ppc64le logs (anaconda, dmesg, lshw, lspci) (657.06 KB, application/x-gzip) 2017-10-03 09:52 UTC, Éric Fintzel
Comparing the F26 and F27 dmesg logs (27.24 KB, text/plain) 2017-10-17 20:18 UTC, Ben Crocker
dmesg log for ppc64le (Minimal Server install, no kworker error) (336.42 KB, text/plain) 2017-10-27 14:44 UTC, Éric Fintzel

Description Éric Fintzel 2017-10-02 14:04:59 UTC

Using Beta-1.5 DVD Server ISO.
Occurs on ppc64le and ppc64.

After a successful installation on a Power8 baremetal machine, the system does not boot to the login prompt. Instead, some error messages and information (regarding PCI, atombios, radeon) are displayed, and then we get a :/# prompt with a limited shell.

Note Fedora 26 is able to install and reboot successfully on this machine.

The dmesg logs (for ppc64le and ppc64) are provided as attachments.

Comment 1 Éric Fintzel 2017-10-02 14:06:22 UTC

Created attachment 1333249 [details]
dmesg log for ppc64le

Comment 2 Éric Fintzel 2017-10-02 14:06:59 UTC

Created attachment 1333250 [details]
dmesg log for ppc64

Comment 3 Dan Horák 2017-10-02 14:34:04 UTC

Eric, what machine type is it?

Comment 4 Éric Fintzel 2017-10-03 07:42:19 UTC

Dan, machine type:

Power8
Server-8247-21L-SN212907A
FW840.00 (TV840_056)
Machine type-model: 8247-21L
Serial number: 212907A

Comment 5 Éric Fintzel 2017-10-03 07:44:16 UTC

Note that Beta-1.5 (like 1.3 and 1.2) was tested successfully on an LPAR (ppc64le and ppc64 modes): install + reboot OK.

Comment 6 Dan Horák 2017-10-03 07:59:29 UTC

What I don't get is why the installation passed but the first system boot fails; the same kernel should be used in both cases.

Eric, I suppose your machine has a discrete Radeon card plugged in, right?

Comment 7 Éric Fintzel 2017-10-03 09:52:33 UTC

Created attachment 1333572 [details]
Fedora 26 ppc64le logs (anaconda, dmesg, lshw, lspci)

Comment 8 Éric Fintzel 2017-10-03 09:53:01 UTC

Dan, I do not have physical access to this machine (in the Toulouse lab).

I re-installed Fedora 26 on it to capture hardware information. I provide the F26-ppc64le.tgz archive containing the anaconda logs, dmesg, lshw and lspci outputs (with some references to Radeon).

I have always used this machine headless over VNC, and the Radeon card never caused any interference before F27.

Comment 9 Éric Fintzel 2017-10-16 13:36:37 UTC

Adding:

modprobe.blacklist=radeon

to the kernel boot arguments prevents the radeon module from loading and allows the system to boot to the login prompt.

This can be used as a workaround to verify the Anaconda installation.
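
To confirm that the workaround is actually in effect on a booted system, the kernel command line can be inspected. The following is a minimal sketch, not part of the original report: the function names are hypothetical, and it assumes the argument takes a comma-separated list of module names.

```python
def blacklisted_modules(cmdline):
    """Return the set of module names listed in modprobe.blacklist= arguments."""
    modules = set()
    for arg in cmdline.split():
        key, sep, value = arg.partition("=")
        if sep and key == "modprobe.blacklist":
            # The value is treated as a comma-separated list of module names.
            modules.update(name for name in value.split(",") if name)
    return modules


def is_blacklisted(cmdline, module):
    """True if `module` appears in any modprobe.blacklist= argument."""
    return module in blacklisted_modules(cmdline)
```

On the installed system the input would typically be the contents of /proc/cmdline, for example `is_blacklisted(open("/proc/cmdline").read(), "radeon")`.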

Comment 10 Dan Horák 2017-10-17 14:57:54 UTC

adding Ben to CC for his opinion

Ben, doesn't this problem sound familiar to you? Looks to me that newer kernels (4.13?) don't like a Radeon card in a Power8 system.

Thanks, Dan

Comment 11 Ben Crocker 2017-10-17 20:14:58 UTC

I have a similar configuration, except that it has more CPUs and more RAM,
but it DOES have the same CEDAR (FirePro 2270) graphics card.

I am running F26 on this system.

See the next comment for my notes on the dmesg logs you sent for F26 and F27,
respectively.

Comment 12 Ben Crocker 2017-10-17 20:18:31 UTC

Created attachment 1339865 [details]
Comparing the F26 and F27 dmesg logs

See the attachment for the promised comparison between the
F26 and F27 dmesg logs.

Comment 13 Jérôme Glisse 2017-10-19 00:20:34 UTC

Just as a data point, this line:
pci 0001:03     : [PE# 00] Using 64-bit DMA iommu bypass (through TVE#0)

is suspicious and likely one of the root causes, but I am not sure what "DMA iommu bypass" means.

Comment 14 Éric Fintzel 2017-10-27 14:44:13 UTC

Created attachment 1344328 [details]
dmesg log for ppc64le (Minimal Server install, no kworker error)

Interestingly, when the Minimal Server install is used, the reboot after installation proceeds to the login prompt. The kworker process error does not occur.

Comment 15 Ben Crocker 2017-11-15 16:10:05 UTC

I built an absolute-latest (4.14.0-rc4) kernel and it appeared to boot OK.
However, the dmesg log shows a lot of troublesome messages concerning
the graphics card, e.g.:

[nnnnn.nnnnnn] [drm] initializing kernel modesetting (CEDAR 0x1002:0x68F2 0x1002:0x0126 0x00).
[nnnnn.nnnnnn] pci 0001:09     : [PE# 02] Using 64-bit DMA iommu bypass (through TVE#0)
...
[nnnnn.nnnnnn] EEH: Frozen PHB#1-PE#2 detected
[nnnnn.nnnnnn] EEH: PE location: U78CB.001.WZS00U9-P1-C12, PHB location: N/A
[nnnnn.nnnnnn] [c000000ff438f7f0] [d0000000169de870] r600_irq_init+0x4b8/0x4e0 [radeon]
[nnnnn.nnnnnn] [c000000ff438f830] [d000000016a09570] evergreen_startup+0x1548/0x2e10 [radeon]
[nnnnn.nnnnnn] [c000000ff438f8e0] [d000000016a0b1a8] evergreen_init+0x240/0x480 [radeon]
[nnnnn.nnnnnn] [c000000ff438f950] [d000000016963100] radeon_device_init+0x638/0xd10 [radeon]
[nnnnn.nnnnnn] [c000000ff438f9e0] [d000000016966534] radeon_driver_load_kms+0xec/0x2d0 [radeon]
[nnnnn.nnnnnn] [c000000ff438fa60] [d000000012d2b19c] drm_dev_register+0x1d4/0x290 [drm]
[nnnnn.nnnnnn] [c000000ff438fb00] [d000000012d2c1fc] drm_get_pci_dev+0xc4/0x210 [drm]
[nnnnn.nnnnnn] [c000000ff438fb90] [d000000016960820] radeon_pci_probe+0xc8/0x110 [radeon]
[nnnnn.nnnnnn] EEH: Detected PCI bus error on PHB#1-PE#2
[nnnnn.nnnnnn] EEH: This PCI device has failed 1 times in the last hour
[nnnnn.nnnnnn] EEH: Notify device drivers to shutdown
[nnnnn.nnnnnn] EEH: Collect temporary log
[nnnnn.nnnnnn] EEH: of node=0001:09:00.1
[nnnnn.nnnnnn] EEH: PCI device/vendor: aa681002
[nnnnn.nnnnnn] EEH: PCI cmd/status register: 00100140
[nnnnn.nnnnnn] EEH: PCI-E capabilities and status follow:
[nnnnn.nnnnnn] EEH: PCI-E 00: 0012a010 00648fa1 0000293e 09000d02 
[nnnnn.nnnnnn] EEH: PCI-E 10: 10120000 00000000 00000000 00000000 
[nnnnn.nnnnnn] EEH: PCI-E 20: 00000000 
[nnnnn.nnnnnn] EEH: PCI-E AER capability register set follows:
[nnnnn.nnnnnn] EEH: PCI-E AER 00: 00010001 00000000 00000000 00062030 
[nnnnn.nnnnnn] EEH: PCI-E AER 10: 00000000 00002000 000001e0 00000000 
[nnnnn.nnnnnn] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000 
[nnnnn.nnnnnn] EEH: PCI-E AER 30: 00000000 00000000 
[nnnnn.nnnnnn] EEH: of node=0001:09:00.0
[nnnnn.nnnnnn] EEH: PCI device/vendor: 68f21002
[nnnnn.nnnnnn] EEH: PCI cmd/status register: 00100546
[nnnnn.nnnnnn] EEH: PCI-E capabilities and status follow:
[nnnnn.nnnnnn] EEH: PCI-E 00: 0012a010 00648fa1 0000293e 09000d02 
[nnnnn.nnnnnn] EEH: PCI-E 10: 10120000 00000000 00000000 00000000 
[nnnnn.nnnnnn] EEH: PCI-E 20: 00000000 
[nnnnn.nnnnnn] EEH: PCI-E AER capability register set follows:
[nnnnn.nnnnnn] EEH: PCI-E AER 00: 00010001 00000000 00000000 00062030 
[nnnnn.nnnnnn] EEH: PCI-E AER 10: 00000000 00002000 000001e0 00000000 
[nnnnn.nnnnnn] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000 
[nnnnn.nnnnnn] EEH: PCI-E AER 30: 00000000 00000000 
[nnnnn.nnnnnn] EEH: Reset with hotplug activity
[nnnnn.nnnnnn] iommu: Removing device 0001:09:00.1 from group 1
[nnnnn.nnnnnn] [drm:r600_ring_test [radeon]] *ERROR* radeon: ring 0 test failed (scratch(0x8504)=0xFFFFFFFF)
[nnnnn.nnnnnn] radeon 0001:09:00.0: disabling GPU acceleration
...
[nnnnn.nnnnnn] EEH: 2100000 reads ignored for recovering device at location=U78CB.001.WZS00U9-P1-C12 driver=radeon pci addr=0001:09:00.0
[nnnnn.nnnnnn] EEH: Might be infinite loop in radeon driver
<followed by stack trace>

etc.
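
The "PCI device/vendor" values in the EEH dump above can be matched against the IDs in the [drm] line (0x1002:0x68F2). The sketch below assumes the value is the first 32-bit word of PCI configuration space, where the vendor ID occupies the low 16 bits and the device ID the high 16 bits; the function names are hypothetical.

```python
def split_device_vendor(dword_hex):
    """Split an EEH 'PCI device/vendor' hex string into (device_id, vendor_id)."""
    value = int(dword_hex, 16)
    vendor_id = value & 0xFFFF          # low 16 bits: vendor ID
    device_id = (value >> 16) & 0xFFFF  # high 16 bits: device ID
    return device_id, vendor_id


def as_vendor_device(dword_hex):
    """Format the value as 'vendor:device', the order used by lspci -n."""
    device_id, vendor_id = split_device_vendor(dword_hex)
    return "{:04x}:{:04x}".format(vendor_id, device_id)
```

With the values from the log, `as_vendor_device("68f21002")` gives `1002:68f2` for function 0001:09:00.0 and `as_vendor_device("aa681002")` gives `1002:aa68` for function 0001:09:00.1, so both functions in the frozen PE carry the same vendor ID.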

Comment 16 Ben Crocker 2017-11-15 16:11:41 UTC

EVENTUALLY the FirePro card gets properly initialized, apparently;
I am able to start the X server and run accelerated GL applications,
e.g. glxgears.

Comment 17 Ben Crocker 2017-11-15 16:18:11 UTC

Particularly troubling are these lines:

[nnnnn.nnnnnn] pci 0001:09     : [PE# 02] Using 64-bit DMA iommu bypass (through TVE#0)
[nnnnn.nnnnnn] EEH: Frozen PHB#1-PE#2 detected <followed by stack trace>

Contrast with this line from older kernels:

[nnnnn.nnnnnn] radeon 0001:09:00.0: Using 32-bit DMA via iommu
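
A saved dmesg log can be checked for these lines mechanically. This is only a sketch: the patterns are copied from the messages quoted in this report, and the function names and the "bypass plus frozen PE" heuristic are assumptions made for illustration.

```python
import re

# Message fragments quoted in this bug report.
SIGNATURES = {
    "bypass_64bit": re.compile(r"Using 64-bit DMA iommu bypass"),
    "dma_32bit": re.compile(r"Using 32-bit DMA via iommu"),
    "eeh_frozen": re.compile(r"EEH: Frozen PHB#\d+-PE#\d+ detected"),
}


def count_signatures(lines):
    """Count how many log lines match each signature."""
    counts = {name: 0 for name in SIGNATURES}
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def looks_like_this_bug(lines):
    """Heuristic: the 64-bit bypass was enabled and a PE was later frozen."""
    counts = count_signatures(lines)
    return counts["bypass_64bit"] > 0 and counts["eeh_frozen"] > 0
```

For a log file on disk, something like `looks_like_this_bug(open("dmesg.log", errors="replace"))` would apply the check line by line; the file name here is a placeholder.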

Comment 18 Ben Crocker 2017-11-15 16:22:17 UTC

I bisected the kernel to find out where the "bypass" and "Frozen" lines
cropped up, and tracked the behavior down to the following commit:

commit 07d306c838c5c30196619baae36107d0615e459b
Merge: a3ddacbae5ab c013b65ad8a1
Author: Linus Torvalds <torvalds>
Date:   Tue Jul 11 09:59:37 2017 -0700

    Merge git://www.linux-watchdog.org/linux-watchdog
    
    Pull watchdog updates from Wim Van Sebroeck:
    
     - Add Renesas RZ/A WDT Watchdog driver
    
     - STM32 Independent WatchDoG (IWDG) support
    
     - UniPhier watchdog support
    
     - Add F71868 support
    
     - Add support for NCT6793D and NCT6795D
    
     - dw_wdt: add reset lines support
    
     - core: add option to avoid early handling of watchdog
    
     - core: introduce watchdog_worker_should_ping helper
    
     - Cleanups and improvements for sama5d4, intel-mid_wdt, s3c2410_wdt,
       orion_wdt, gpio_wdt, it87_wdt, meson_wdt, davinci_wdt, bcm47xx_wdt,
       zx2967_wdt, cadence_wdt
    
    * git://www.linux-watchdog.org/linux-watchdog: (32 commits)
      watchdog: introduce watchdog_worker_should_ping helper
      watchdog: uniphier: add UniPhier watchdog driver
      dt-bindings: watchdog: add description for UniPhier WDT controller
      watchdog: cadence_wdt: make of_device_ids const.
      watchdog: zx2967: constify zx2967_wdt_ops.
      watchdog: bcm47xx_wdt: constify bcm47xx_wdt_hard_ops and bcm47xx_wdt_soft_ops
      watchdog: davinci: Add missing clk_disable_unprepare().
      watchdog: davinci: Handle return value of clk_prepare_enable
      watchdog: meson: Handle return value of clk_prepare_enable
      watchdog: it87: Add support for various Super-IO chips
      watchdog: it87: Use infrastructure to stop watchdog on reboot
      watchdog: it87: Drop support for resetting watchdog though CIR and Game port
      watchdog: it87: Convert to use watchdog core infrastructure
      watchdog: it87: Drop FSF mailing address
      watchdog: dw_wdt: get reset lines from dt
      watchdog: bindings: dw_wdt: add reset lines
      watchdog: w83627hf: Add support for NCT6793D and NCT6795D
      watchdog: core: add option to avoid early handling of watchdog
      watchdog: f71808e_wdt: Add F71868 support
      watchdog: Add STM32 IWDG driver
      ...

This is a huge commit.  Analysis continues.

Comment 19 Ben Crocker 2017-11-18 00:33:40 UTC

I bisected the problem with different endpoints, and tracked it down to this commit:

commit 8e3f1b1d8255105f31556aacf8aeb6071b00d469
Author: Russell Currey <ruscur>
Date:   Wed Jun 21 17:18:04 2017 +1000

    powerpc/powernv/pci: Enable 64-bit devices to access >4GB DMA space
  
    On PHB3/POWER8 systems, devices can select between two different sections
    of address space, TVE#0 and TVE#1.  TVE#0 is intended for 32bit devices
    that aren't capable of addressing more than 4GB.  Selecting TVE#1 instead,
    with the capability of addressing over 4GB, is performed by setting bit 59
    of a PCI address.
  
    However, some devices aren't capable of addressing at least 59 bits, but
    still want more than 4GB of DMA space.  In order to enable this, reconfigure
    TVE#0 to be suitable for 64-bit devices by allocating memory past the
    initial 4GB that is inaccessible by 64-bit DMAs.
  
    This bypass mode is only enabled if a device requests 4GB or more of DMA
    address space, if the system has PHB3 (POWER8 systems), and if the device
    does not share a PE with any devices from different vendors.
  
    Signed-off-by: Russell Currey <ruscur>
    Signed-off-by: Michael Ellerman <mpe.au>

The later commit that fixed at least some problems introduced by
8e3f1b1d8255105f31556aacf8aeb6071b00d469:

commit 253fd51e2f533552ae35a0c661705da6c4842c1b
Author: Alistair Popple <alistair.au>
Date:   Wed Jul 26 15:26:40 2017 +1000

    powerpc/powernv/pci: Return failure for some uses of dma_set_mask()

    Commit 8e3f1b1d8255 ("powerpc/powernv/pci: Enable 64-bit devices to access
    >4GB DMA space") introduced the ability for PCI device drivers to request a
    DMA mask between 64 and 32 bits and actually get a mask greater than
    32-bits. However currently if certain machine configuration dependent
    conditions are not meet the code silently falls back to a 32-bit mask.

    This makes it hard for device drivers to detect which mask they actually
    got. Instead we should return an error when the request could not be
    fulfilled which allows drivers to either fallback or implement other
    workarounds as documented in DMA-API-HOWTO.txt.

    Signed-off-by: Alistair Popple <alistair.au>
    Acked-by: Russell Currey <ruscur>
    Signed-off-by: Michael Ellerman <mpe.au>

appears not to be the answer, at least not for the AMD FirePro 2270
on which the problem was reported.

The signature of the problem is that as soon as we start using the
64-bit DMA iommu bypass (through TVE#0) for the Radeon card, we get
"EEH: Frozen PHB#1-PE#2 detected" messages and a TON of EEH messages
with stack traces from the Radeon driver, messages about atombios
[being] stuck in a loop, etc.

I have not yet had a chance to test this with a different card.
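
The address arithmetic described in commit 8e3f1b1d8255 can be illustrated with a small model. This is a sketch based only on the commit message quoted above, not on the kernel source: the function names are hypothetical, and reading "requests 4GB or more" as "a DMA mask wider than 32 bits" is an interpretation.

```python
TVE1_SELECT_BIT = 59  # per the commit message, bit 59 of a PCI address selects TVE#1


def dma_bit_mask(bits):
    """Mask with the low `bits` bits set, like the kernel's DMA_BIT_MASK(n)."""
    return (1 << bits) - 1


def can_select_tve1(dma_bits):
    """True if a device with this DMA mask can drive address bit 59."""
    return bool(dma_bit_mask(dma_bits) & (1 << TVE1_SELECT_BIT))


def bypass_eligible(dma_bits, is_phb3, pe_vendor_ids):
    """Model of the three conditions listed in the commit message."""
    wants_more_than_4gb = dma_bits > 32            # interpretation, see lead-in
    single_vendor = len(set(pe_vendor_ids)) <= 1   # no foreign vendor in the PE
    return wants_more_than_4gb and is_phb3 and single_vendor
```

Under this model, a device asking for a 40-bit mask on a PHB3 system, alone in its PE with functions from a single vendor, cannot reach TVE#1 (`can_select_tve1(40)` is false) but does qualify for the TVE#0 bypass (`bypass_eligible(40, True, [0x1002, 0x1002])` is true), which is consistent with the "Using 64-bit DMA iommu bypass (through TVE#0)" line in the logs.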

Comment 20 Ben Crocker 2018-01-09 14:59:25 UTC

I have now had a chance to swap out the FirePro 2270 for
an Embedded Radeon E6465 (with a newer Caicos GPU and more VRAM),
and this problem does not occur.

Comment 21 Laura Abbott 2018-02-20 19:53:43 UTC

We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  As kernel maintainers, we try to keep up with bugzilla but due to the rate at which the upstream kernel project moves, bugs may be fixed without any indication to us. Due to this, we are doing a mass bug update across all of the Fedora 27 kernel bugs.
 
Fedora 27 has now been rebased to 4.15.3-300.fc27.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.
 
If you experience different issues, please open a new bug report for those.

Comment 22 Ben Crocker 2018-02-22 00:12:12 UTC

This problem may be fixed, or at least worked around, by this patch I
submitted upstream 02/21/2018:

[PATCH] drm/radeon: insist on 32-bit DMA for Cedar

In radeon_device_init, set the need_dma32 flag for Cedar chips
(e.g. FirePro 2270).  This fixes, or at least works around, a bug
on PowerPC exposed by last year's commits

8e3f1b1d8255105f31556aacf8aeb6071b00d469 (Russell Currey)

and

253fd51e2f533552ae35a0c661705da6c4842c1b (Alistair Popple)

which enabled the 64-bit DMA iommu bypass.

This caused the device to freeze, in some cases unrecoverably, and is
the subject of several bug reports internal to Red Hat.

Signed-off-by: Ben Crocker <bcrocker>
---
 drivers/gpu/drm/radeon/radeon_device.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/radeon/radeon_device.c b/drivers/gpu/drm/radeon/radeon_device.c
index ffc10cadcf34..02538903830d 100644
--- a/drivers/gpu/drm/radeon/radeon_device.c
+++ b/drivers/gpu/drm/radeon/radeon_device.c
@@ -1395,7 +1395,10 @@ int radeon_device_init(struct radeon_device *rdev,
        if (rdev->flags & RADEON_IS_AGP)
                rdev->need_dma32 = true;
        if ((rdev->flags & RADEON_IS_PCI) &&
-           (rdev->family <= CHIP_RS740))
+           (rdev->family <= CHIP_RS740 || rdev->family == CHIP_CEDAR))
+               rdev->need_dma32 = true;
+       if ((rdev->flags & RADEON_IS_PCIE) &&
+           (rdev->family == CHIP_CEDAR))
                rdev->need_dma32 = true;

        dma_bits = rdev->need_dma32 ? 32 : 40;

Comment 23 Ben Crocker 2018-02-22 18:57:19 UTC

Here is a revised version of the patch; I think it's in final form now:


Author: Ben Crocker <bcrocker>
Date:   Thu Feb 22 17:50:45 2018 -0500

    drm/radeon: insist on 32-bit DMA for Cedar on PPC64/PPC64LE
    
    In radeon_device_init, set the need_dma32 flag for Cedar chips
    (e.g. FirePro 2270).  This fixes, or at least works around, a bug
    on PowerPC exposed by last year's commits
    
    8e3f1b1d8255105f31556aacf8aeb6071b00d469 (Russell Currey)
    
    and
    
    253fd51e2f533552ae35a0c661705da6c4842c1b (Alistair Popple)
    
    which enabled the 64-bit DMA iommu bypass.
    
    This caused the device to freeze, in some cases unrecoverably, and is
    the subject of several bug reports internal to Red Hat.
    
    Signed-off-by: Ben Crocker <bcrocker>

diff --git a/drivers/gpu/drm/radeon/radeon_device.c b/drivers/gpu/drm/radeon/radeon_device.c
index ffc10cadcf34..32b577c776b9 100644
--- a/drivers/gpu/drm/radeon/radeon_device.c
+++ b/drivers/gpu/drm/radeon/radeon_device.c
@@ -1397,6 +1397,10 @@ int radeon_device_init(struct radeon_device *rdev,
        if ((rdev->flags & RADEON_IS_PCI) &&
            (rdev->family <= CHIP_RS740))
                rdev->need_dma32 = true;
+#ifdef CONFIG_PPC64
+       if (rdev->family == CHIP_CEDAR)
+               rdev->need_dma32 = true;
+#endif
 
        dma_bits = rdev->need_dma32 ? 32 : 40;
        r = pci_set_dma_mask(rdev->pdev, DMA_BIT_MASK(dma_bits));
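
The effect of the revised patch on the DMA mask width can be summarized with a small model. This is a userspace sketch of the logic shown in the diff, not driver code: the function name and parameters are hypothetical, and `is_legacy_pci` stands in for the existing `RADEON_IS_PCI` and `family <= CHIP_RS740` test.

```python
def radeon_dma_bits(family, is_agp=False, is_legacy_pci=False, is_ppc64=False):
    """Return the DMA mask width (32 or 40) that the patched code would request."""
    need_dma32 = False
    if is_agp:
        need_dma32 = True
    if is_legacy_pci:
        # Stand-in for: (flags & RADEON_IS_PCI) && (family <= CHIP_RS740)
        need_dma32 = True
    if is_ppc64 and family == "CEDAR":
        # The new case added by the patch, compiled only under CONFIG_PPC64.
        need_dma32 = True
    return 32 if need_dma32 else 40
```

In this model a Cedar card on ppc64 gets a 32-bit mask, so it would no longer ask for more than 4GB of DMA space, while other families (for example the Caicos card from Comment 20) and Cedar on other architectures keep the 40-bit mask.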

Comment 25 Laura Abbott 2018-10-01 21:25:21 UTC

We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 27 kernel bugs.
 
Fedora 27 has now been rebased to 4.18.10-100.fc27.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.
 
If you have moved on to Fedora 28 or Fedora 29, and are still experiencing this issue, please change the version to Fedora 28 or 29.
 
If you experience different issues, please open a new bug report for those.

Comment 26 Ben Cotton 2018-11-27 14:12:12 UTC

This message is a reminder that Fedora 27 is nearing its end of life.
On 2018-Nov-30 Fedora will stop maintaining and issuing updates for
Fedora 27. It is Fedora's policy to close all bug reports from releases
that are no longer maintained. At that time this bug will be closed as
EOL if it remains open with a Fedora 'version' of '27'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 27 reaches end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 27 Éric Fintzel 2018-11-27 23:21:56 UTC

Checked with Fedora 27 Server official ISO images (for ppc64 and ppc64le).
The problem no longer occurs, so I am changing the status to CLOSED.
