1757249 – Trying to boot ppc64 VM on power9 host from image stored on NFS fails

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	30
Hardware:	ppc64le
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1757250 (view as bug list)
Depends On:
Blocks:	PPCTracker

Reported:	2019-10-01 00:01 UTC by Adam Williamson
Modified:	2019-10-03 22:08 UTC (History)
CC List:	30 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-10-03 22:06:42 UTC
Type:	Bug
Embargoed:
Dependent Products:

Description Adam Williamson 2019-10-01 00:01:55 UTC

We just got a couple of new power9 boxes that are intended to be used as openQA worker hosts. Unfortunately, they don't seem to be able to boot VMs from a scsi-cd 'drive', which is how 90% of openQA tests run. This is what happens when I try (from a console, using a command which is basically the same as what the openQA worker process does):

====

[root@openqa-ppc64le-02 adamwill][PROD]# /usr/bin/qemu-system-ppc64 -nographic -global isa-fdc.driveA= -m 4096 -machine usb=off -cpu host -netdev user,id=qanet0 -device rtl8139,netdev=qanet0,mac=52:54:00:12:34:56 -device virtio-scsi-pci,id=scsi0 -device nec-usb-xhci -device usb-tablet -device usb-kbd -smp 1 -enable-kvm -no-shutdown -drive media=cdrom,if=none,id=cd0,format=raw,file=/var/lib/openqa/share/factory/iso/Fedora-Server-dvd-ppc64le-Rawhide-20190930.n.0.iso -device scsi-cd,drive=cd0,bus=scsi0.0 -boot once=d,menu=on,splash-time=5000


SLOF **********************************************************************
QEMU Starting
 Build Date = Jan 31 2019 12:46:26
 FW Version = mockbuild@ release 20180702
 Press "s" to enter Open Firmware.

Press F12 for boot menu.

Populating /vdevice methods
Populating /vdevice/vty@71000000
Populating /vdevice/nvram@71000001
Populating /pci@800000020000000
                     00 0000 (D) : 1234 1111    qemu vga
                     00 0800 (D) : 10ec 8139    network [ ethernet ]
                     00 1000 (D) : 1af4 1004    virtio [ scsi ]
Populating /pci@800000020000000/scsi@2
       SCSI: Looking for devices
          100000000000000 CD-ROM   : "QEMU     QEMU CD-ROM      2.5+"
                     00 1800 (D) : 1033 0194    serial bus [ usb-xhci ]
No NVRAM common partition, re-initializing...
Installing QEMU fb



Scanning USB 
  XHCI: Initializing
    USB Keyboard 
No console specified using screen & keyboard
     


  Welcome to Open Firmware

  Copyright (c) 2004, 2017 IBM Corporation All rights reserved.
  This program and the accompanying materials are made available
  under the terms of the BSD License available at
  http://www.opensource.org/licenses/bsd-license.php


Trying to load:  from: /pci@800000020000000/scsi@2/disk@100000000000000 ...   Successfully loaded
SCSI-DISK: /pci@800000020000000/scsi@2/disk@100000000000000:0,read-blocks failed
SCSI-DISK: Status 2 [CHECK CONDITION] Sense b [ABORTED COMMAND] ASC 0 ASCQ 6 
SCSI-DISK: /pci@800000020000000/scsi@2/disk@100000000000000:0,read-blocks failed
SCSI-DISK: Status 2 [CHECK CONDITION] Sense b [ABORTED COMMAND] ASC 0 ASCQ 6 
SCSI-DISK: /pci@800000020000000/scsi@2/disk@100000000000000:0,read-blocks failed
SCSI-DISK: Status 2 [CHECK CONDITION] Sense b [ABORTED COMMAND] ASC 0 ASCQ 6 
SCSI-DISK: /pci@800000020000000/scsi@2/disk@100000000000000:0,read-blocks failed
SCSI-DISK: Status 2 [CHECK CONDITION] Sense b [ABORTED COMMAND] ASC 0 ASCQ 6 

====

this is with:

[root@openqa-ppc64le-02 adamwill][PROD]# rpm -q qemu-system-ppc
qemu-system-ppc-3.1.1-2.fc30.ppc64le
[root@openqa-ppc64le-02 adamwill][PROD]# rpm -q openbios
openbios-20181005-2.git441a84d.fc30.noarch
[root@openqa-ppc64le-02 adamwill][PROD]#
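
For readers unfamiliar with the SLOF error lines above, here is a minimal Python sketch that decodes them. It assumes the numbers SLOF prints are hexadecimal (the sense key is shown as "b"). The lookup tables are deliberately partial: they cover only the values seen in this report plus a few common neighbours from the SCSI standard. The function name decode_sense and the table names are illustrative and are not part of SLOF or QEMU.

```python
import re

# Partial lookup tables (assumption: enough for this report, not exhaustive).
STATUS = {0x00: "GOOD", 0x02: "CHECK CONDITION", 0x08: "BUSY"}
SENSE_KEY = {
    0x0: "NO SENSE",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
    0xB: "ABORTED COMMAND",
}
ASC_ASCQ = {
    (0x00, 0x00): "NO ADDITIONAL SENSE INFORMATION",
    (0x00, 0x06): "I/O PROCESS TERMINATED",
}

# Matches lines such as:
# SCSI-DISK: Status 2 [CHECK CONDITION] Sense b [ABORTED COMMAND] ASC 0 ASCQ 6
LINE_RE = re.compile(
    r"Status\s+([0-9a-fA-F]+).*?Sense\s+([0-9a-fA-F]+).*?"
    r"ASC\s+([0-9a-fA-F]+)\s+ASCQ\s+([0-9a-fA-F]+)"
)


def decode_sense(line):
    """Return a dict describing one SLOF SCSI-DISK status line, or None."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    # Assumption: all four fields are printed in hexadecimal.
    status, key, asc, ascq = (int(group, 16) for group in match.groups())
    return {
        "status": STATUS.get(status, "unknown"),
        "sense_key": SENSE_KEY.get(key, "unknown"),
        "additional": ASC_ASCQ.get((asc, ascq), "unknown"),
    }
```

Fed the line from the log, this reports CHECK CONDITION, ABORTED COMMAND and I/O PROCESS TERMINATED. That is a generic "the I/O was terminated" report, which is at least consistent with the read failing somewhere below the emulated CD-ROM rather than inside the guest.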

Comment 1 Adam Williamson 2019-10-01 00:04:19 UTC

Same command works fine on a power8 box with the same packages. CCing Cleber Rosa and David Gibson (at Cleber's suggestion).

Comment 2 David Gibson 2019-10-01 01:31:56 UTC

Huh.  Alas I have no quick ideas on this.  Fwiw openbios is not relevant here - the pseries machine uses the SLOF firmware, not openbios.

It's really bizarre that this works on power8 but not power9, since the disk device is fully emulated and should be identical in both cases.

Comment 3 Adam Williamson 2019-10-01 01:45:22 UTC

What package does this 'slof' firmware live in then?

Comment 4 Adam Williamson 2019-10-01 01:47:30 UTC

Also, if it would help, we can probably get you shell access to both boxes. CCing Kevin Fenzi who can help with that. Boxes which are affected by this bug are openqa-ppc64le-02.qa.fedoraproject.org and openqa-ppc64le-03.qa.fedoraproject.org . Box which is *not* affected is openqa-ppc64le-01.qa.fedoraproject.org .

Comment 5 Adam Williamson 2019-10-01 01:48:39 UTC

Oh, SLOF is in the package SLOF. That package is at SLOF-0.1.git20180702-3.fc30.noarch on both the affected and the unaffected systems.

Comment 6 Laurent Vivier 2019-10-01 06:50:56 UTC

Did you check the sha256sum of the iso?

Comment 7 Adam Williamson 2019-10-01 07:00:49 UTC

Laurent: the power8 and power9 boxes are accessing the exact same file, on an NFS share they both have access to. Plus this is happening with multiple ISOs.

Comment 8 Laurent Vivier 2019-10-01 07:18:07 UTC

Could you try with "-M pseries,max-cpu-compat=power8" on the power9 box?

Comment 9 Laurent Vivier 2019-10-01 07:38:34 UTC

And also, what is the result of "lscpu" and "update_flash -d" on the power9 host?

Thanks

Comment 10 Adam Williamson 2019-10-01 14:55:16 UTC

"Could you try with "-M pseries,max-cpu-compat=power8" on the power9 box?"

Same result.

"And also, what is the result of "lscpu" and "update_flash -d" on the power9 host?"

[root@openqa-ppc64le-02 adamwill][PROD]# lscpu
Architecture:         ppc64le
Byte Order:           Little Endian
CPU(s):               128
On-line CPU(s) list:  0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76,80,84,88,92,96,100,104,108,112,116,120,124
Off-line CPU(s) list: 1-3,5-7,9-11,13-15,17-19,21-23,25-27,29-31,33-35,37-39,41-43,45-47,49-51,53-55,57-59,61-63,65-67,69-71,73-75,77-79,81-83,85-87,89-91,93-95,97-99,101-103,105-107,109-111,113-115,117-119,121-123,125-127
Thread(s) per core:   1
Core(s) per socket:   16
Socket(s):            2
NUMA node(s):         2
Model:                2.2 (pvr 004e 1202)
Model name:           POWER9, altivec supported
CPU max MHz:          3800.0000
CPU min MHz:          2166.0000
L1d cache:            32K
L1i cache:            32K
L2 cache:             512K
L3 cache:             10240K
NUMA node0 CPU(s):    0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60
NUMA node8 CPU(s):    64,68,72,76,80,84,88,92,96,100,104,108,112,116,120,124
[root@openqa-ppc64le-02 adamwill][PROD]# update_flash -d

Firmware version:
 Product Version       : SUPERMICRO-P9DSU-V1.16-20180531-prod
 Product Extra         : bmc-firmware-version-1.27
 Product Extra         : hcode-hw051018a.op920
 Product Extra         : hostboot-f911e5c-pda8239f
 Product Extra         : linux-4.16.7-openpower2-pbc45895
 Product Extra         : machine-xml-218a77a
 Product Extra         : occ-77bb5e6-p623d1cd
 Product Extra         : petitboot-v1.7.1-pf773c0d
 Product Extra         : sbe-8e0105e
 Product Extra         : skiboot-v6.0-p1da203b

Comment 11 Adam Williamson 2019-10-01 14:58:01 UTC

Hum, here's an odd thing - I found *one* image that seems to work: https://kojipkgs.fedoraproject.org/compose/rawhide/Fedora-Rawhide-20190928.n.2/compose/Server/ppc64le/iso/Fedora-Server-dvd-ppc64le-Rawhide-20190928.n.2.iso . That one boots at least past the CHECK CONDITION errors and tries to start a kernel. Every other one I tried so far hit the CHECK CONDITION errors.

Comment 12 Dan Horák 2019-10-01 15:14:47 UTC

One note on the hardware setup: P9 can run virtualization with SMT enabled (unlike P8).

Comment 13 Laurent Vivier 2019-10-01 16:33:52 UTC

(In reply to Adam Williamson from comment #0)

If you add "-vga none" to the command line, we will be able to see more logs; for the moment they are sent to the VGA console.

Comment 14 Adam Williamson 2019-10-01 23:09:43 UTC

Booting that way displays the grub menu, but when I pick any boot choice on it, the 'ABORTED COMMAND' errors are shown very briefly and then the grub menu comes back. I don't *think* any additional information appears.

Comment 15 Laurent Vivier 2019-10-02 09:37:08 UTC

I've tested on the same kind of system:

Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  4
Core(s) per socket:  16
Socket(s):           2
NUMA node(s):        2
Model:               2.2 (pvr 004e 1202)
Model name:          POWER9, altivec supported
CPU max MHz:         3800.0000
CPU min MHz:         2166.0000
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            10240K
NUMA node0 CPU(s):   0-63
NUMA node8 CPU(s):   64-127

Firmware version:
 Product Version       : SUPERMICRO-P9DSU-V1.16-20180531-imp
 Product Extra         : bmc-firmware-version-1.23
 Product Extra         : hcode-hw051018a.op920
 Product Extra         : hostboot-f911e5c-pda8239f
 Product Extra         : linux-4.16.7-openpower2-pbc45895
 Product Extra         : machine-xml-218a77a
 Product Extra         : occ-77bb5e6-p623d1cd
 Product Extra         : petitboot-v1.7.1-pf773c0d
 Product Extra         : sbe-8e0105e
 Product Extra         : skiboot-v6.0-p1da203b

Only the bmc-firmware-version differs.

qemu-system-ppc-3.1.1-2.fc30.ppc64le
SLOF-0.1.git20180702-3.fc30.noarch
kernel-5.2.17-200.fc30.ppc64le

Image:

https://kojipkgs.fedoraproject.org/compose/rawhide/Fedora-Rawhide-20190930.n.0/compose/Server/ppc64le/iso/Fedora-Server-dvd-ppc64le-Rawhide-20190930.n.0.iso
sha256: 62f9d2fbed4e4e3ff7966ea8c6818d5f669a116b7845af007378c397b237b739

I've put the iso on an NFS server and started the same command:

/usr/bin/qemu-system-ppc64 -nographic -global isa-fdc.driveA= -m 4096 -machine usb=off -cpu host -netdev user,id=qanet0 -device rtl8139,netdev=qanet0,mac=52:54:00:12:34:56 -device virtio-scsi-pci,id=scsi0 -device nec-usb-xhci -device usb-tablet -device usb-kbd -smp 1 -enable-kvm -no-shutdown -drive media=cdrom,if=none,id=cd0,format=raw,file=/local_dir/Fedora-Server-dvd-ppc64le-Rawhide-20190930.n.0.iso -device scsi-cd,drive=cd0,bus=scsi0.0 -boot once=d,menu=on,splash-time=5000 -vga none

But it works fine for me.
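
The two lscpu listings in this thread differ in one more respect than the BMC firmware: the reporter's host shows only every fourth CPU on-line, while this one shows 0-127. The Python sketch below makes that difference explicit. It assumes 4 hardware threads per core (128 CPUs over 2 sockets of 16 cores) and that the threads of a core are numbered consecutively. The function names are illustrative, and nothing in this report shows that SMT state is related to the failure.

```python
def expand_cpu_list(spec):
    """Expand an lscpu-style list such as "0-3,8,12-13" into a list of ints."""
    cpus = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def online_threads_per_core(online_spec, hw_threads_per_core=4):
    """Return the sorted distinct counts of on-line threads per core.

    Assumption: CPU n belongs to core n // hw_threads_per_core.
    """
    counts = {}
    for cpu in expand_cpu_list(online_spec):
        core = cpu // hw_threads_per_core
        counts[core] = counts.get(core, 0) + 1
    return sorted(set(counts.values()))


# On-line lists as reported in comment 10 (every fourth CPU) and comment 15.
reporter_host = ",".join(str(n) for n in range(0, 128, 4))
test_host = "0-127"
```

With these inputs the reporter's host has one on-line thread per core (SMT off) and the test host has four (SMT on), which matches the "Thread(s) per core" lines in the two listings.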

Comment 16 Adam Williamson 2019-10-02 15:04:38 UTC

Thanks a lot for testing! Well, that's quite bizarre.

Here's another bizarre thing: I just tried it again with that image, and it worked. Worked twice in a row. But it's still failing with other images; right now I have these:

Fedora-Server-dvd-ppc64le-Rawhide-20190930.n.0.iso - WORKS
Fedora-Server-dvd-ppc64le-Rawhide-20190928.n.2.iso - gets halfway through boot, then FAILS
Fedora-Server-dvd-ppc64le-Rawhide-20191001.n.1.iso - FAILS
Fedora-Server-netinst-ppc64le-31-20190926.n.0.iso - FAILS
Fedora-Server-netinst-ppc64le-Rawhide-20191001.n.1.iso - gets halfway through boot, then FAILS
Fedora-Server-netinst-ppc64le-Rawhide-20190928.n.2.iso - FAILS
Fedora-Server-netinst-ppc64le-31-20190926.n.0.iso - FAILS

can you try a few different ISOs your side and see what happens?

Comment 17 Laurent Vivier 2019-10-02 17:13:08 UTC

Fedora-Server-dvd-ppc64le-Rawhide-20190930.n.0.iso - WORKS
Fedora-Server-dvd-ppc64le-Rawhide-20190928.n.2.iso - WORKS
Fedora-Server-dvd-ppc64le-Rawhide-20191001.n.1.iso - WORKS
Fedora-Server-netinst-ppc64le-31_Beta-1.1.iso      - WORKS
...

I suspect you have a problem with your NFS connection.

If you have enough bandwidth, you can try to launch the VM directly with the https URL:

... -drive media=cdrom,if=none,id=cd0,format=raw,file=https://kojipkgs.fedoraproject.org/compose/31/Fedora-31-20190911.0/compose/Server/ppc64le/iso/Fedora-Server-netinst-ppc64le-31_Beta-1.1.iso ...

Comment 18 Adam Williamson 2019-10-02 17:53:51 UTC

Like I said: the other worker host has no problems, using the same NFS share. I'll try retrieving some locally and booting them from there, though.

Comment 19 Adam Williamson 2019-10-02 23:08:40 UTC

Huh, so I think you're right to suspect the NFS share. I tested downloading one of the ISOs locally and it always boots OK, but booting it off the NFS share fails.

However, this is still odd. First, as I said, the power8 host has no issue, using the same share. And also:

[root@openqa-ppc64le-02 adamwill][PROD]# sha256sum /var/lib/openqa/share/factory/iso/Fedora-Server-netinst-ppc64le-31-20190926.n.0.iso
70f69cdfe8f49c825a9ed1264c6907965535d8a7dd581064e93d6dfbaca5c61e  /var/lib/openqa/share/factory/iso/Fedora-Server-netinst-ppc64le-31-20190926.n.0.iso
[root@openqa-ppc64le-02 adamwill][PROD]# sha256sum ./Fedora-Server-netinst-ppc64le-31-20190926.n.0.iso 
70f69cdfe8f49c825a9ed1264c6907965535d8a7dd581064e93d6dfbaca5c61e  ./Fedora-Server-netinst-ppc64le-31-20190926.n.0.iso

I get the same sha256sum for the local copy and the copy on the NFS share. Yet somehow accessing the file *via qemu* over the NFS share seems to go wrong?
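
One difference between the two access patterns is that sha256sum reads the file once, front to back, while a guest booting from a CD image reads blocks at scattered offsets. The Python sketch below compares two copies of a file at random block-aligned offsets as a rough stand-in for that pattern. It is only an approximation under stated assumptions: the 2048-byte block size, the sample count and the function name are all illustrative, reads may be served from the page cache, and a kernel-level failure might surface as an OSError rather than as a data mismatch.

```python
import os
import random


def compare_random_blocks(path_a, path_b, block_size=2048, samples=1000, seed=0):
    """Compare two files at random block-aligned offsets.

    Returns the list of offsets where the blocks differed (one entry per
    mismatching sample). Raises ValueError if the file sizes differ.
    """
    size = os.path.getsize(path_a)
    if size != os.path.getsize(path_b):
        raise ValueError("files differ in size")
    if size == 0:
        return []
    rng = random.Random(seed)          # fixed seed so runs are repeatable
    last_block = (size - 1) // block_size
    mismatches = []
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        for _ in range(samples):
            offset = rng.randint(0, last_block) * block_size
            file_a.seek(offset)
            file_b.seek(offset)
            if file_a.read(block_size) != file_b.read(block_size):
                mismatches.append(offset)
    return mismatches
```

Called with the NFS path and the local copy shown above, an empty list would mean no sampled block differed; an exception from one of the reads would point at the I/O path rather than at the file contents.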

Comment 20 Adam Williamson 2019-10-02 23:15:23 UTC

There is something in dmesg on one of the affected power9 hosts:

[270546.745301] Oops: Kernel access of bad area, sig: 11 [#1]
[270546.745329] Faulting instruction address: 0xc00800000f2cafa8
[270546.745412] LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=1024 NUMA PowerNV
[270546.745428] Modules linked in: kvm_hv kvm rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache binfmt_misc ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6_tables ipt_REJECT nf_reject_ipv4 xt_state xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter ip_tables sunrpc vmx_crypto at24 regmap_i2c ofpart powernv_flash ses i40e enclosure ipmi_powernv ipmi_devintf scsi_transport_sas opal_prd mtd ipmi_msghandler i2c_opal crct10dif_vpmsum joydev rtc_opal xfs raid456 async_raid6_recov async_memcpy async_pq async_xor raid1 xor async_tx raid6_pq libcrc32c ast i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm megaraid_sas crc32c_vpmsum aacraid drm_panel_orientation_quirks i2c_core
[270546.745569] CPU: 24 PID: 5481 Comm: kworker/u257:3 Not tainted 5.2.17-200.fc30.ppc64le #1
[270546.745625] Workqueue: rpciod rpc_async_schedule [sunrpc]
[270546.745638] NIP:  c00800000f2cafa8 LR: c00800000f2caf60 CTR: c00800000f2caee8
[270546.745654] REGS: c0000015e6a67880 TRAP: 0380   Not tainted  (5.2.17-200.fc30.ppc64le)
[270546.745668] MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 22004442  XER: 00000000
[270546.745702] CFAR: c00800000f2caf64 IRQMASK: 0 
                GPR00: c00800000f2caf60 c0000015e6a67b10 c00800000f378000 0000000000000000 
                GPR04: 0000000000000001 0000000000000000 0000000400000000 c0000016e06a8600 
                GPR08: 0000000000000001 0000000000000000 c00800000f320530 c0000000016f9cf8 
                GPR12: c00800000f2caee8 c000001ffffe4000 c000000000158168 c000201cb917f000 
                GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 
                GPR20: c00000000159db58 0000000000000000 0000000000000040 fffffffffffffe00 
                GPR24: 0000000000000000 0000000000000001 c00800000e194784 c00000146a696b30 
                GPR28: c000201c58b1e400 0000000000000004 c00000146a696b00 c000201c5e42d000 
[270546.745855] NIP [c00800000f2cafa8] nfs4_open_prepare+0xc0/0x2d0 [nfsv4]
[270546.745880] LR [c00800000f2caf60] nfs4_open_prepare+0x78/0x2d0 [nfsv4]
[270546.745893] Call Trace:
[270546.745901] [c0000015e6a67b10] [c0000015e6a67c18] 0xc0000015e6a67c18 (unreliable)
[270546.745931] [c0000015e6a67b50] [c00800000e10315c] rpc_prepare_task+0x34/0x50 [sunrpc]
[270546.745960] [c0000015e6a67b70] [c00800000e10ddf8] __rpc_execute+0xd0/0x580 [sunrpc]
[270546.745988] [c0000015e6a67c50] [c00800000e10e2ec] rpc_async_schedule+0x44/0x80 [sunrpc]
[270546.746007] [c0000015e6a67c80] [c00000000014e8fc] process_one_work+0x26c/0x520
[270546.746024] [c0000015e6a67d20] [c00000000014ec38] worker_thread+0x88/0x5c0
[270546.746040] [c0000015e6a67db0] [c0000000001582b4] kthread+0x154/0x1a0
[270546.746056] [c0000015e6a67e20] [c00000000000c1cc] ret_from_kernel_thread+0x5c/0x70
[270546.746071] Instruction dump:
[270546.746081] 7fc3f378 e8010010 eb81ffe0 eba1ffe8 ebc1fff0 ebe1fff8 7c0803a6 4bffdc7c 
[270546.746101] 60000000 60000000 e93f0380 e9290038 <e869ffb0> 2c230000 41820030 815f002c 
[270546.746122] ---[ end trace 94d0409b1ba0be04 ]---

Sure looks a bit suspicious...but I don't see it on the *other* power9 host, but that one fails the same. Hmm.
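
To make it easier to check other hosts for the same oops, here is a small Python sketch that pulls the faulting symbol, offset and module out of the "NIP [...]" line of a powerpc oops like the one above. The regular expression is written against the format seen in this log only; the names NIP_RE and parse_nip are illustrative.

```python
import re

# Matches e.g.:
# [270546.745855] NIP [c00800000f2cafa8] nfs4_open_prepare+0xc0/0x2d0 [nfsv4]
NIP_RE = re.compile(
    r"NIP \[([0-9a-f]+)\] (\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)(?: \[(\w+)\])?"
)


def parse_nip(line):
    """Return details of the faulting instruction, or None if no match."""
    match = NIP_RE.search(line)
    if match is None:
        return None
    address, symbol, offset, size, module = match.groups()
    return {
        "address": int(address, 16),
        "symbol": symbol,
        "offset": int(offset, 16),
        "size": int(size, 16),
        "module": module,  # None for symbols built into the kernel image
    }
```

For the line in this oops the result is nfs4_open_prepare at offset 0xc0 in module nfsv4. As a sanity check, the address minus the offset is c00800000f2caee8, the same value the register dump shows in CTR and GPR12.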

Comment 21 Adam Williamson 2019-10-02 23:44:09 UTC

Hey, hopeful news, though - upgrades always make everything better, right? So I kicked both power9 boxes to kernel 5.3.2-300.fc30.ppc64le , rebooted, and tried again. They seem to be behaving much better now. I've activated them in openQA again and I'm running a test set of jobs to see how they do.

Comment 22 Adam Williamson 2019-10-03 22:06:42 UTC

So, yeah, *this* bug seems like it was some kinda kernel issue. It seems like it exists in 5.2.17-200.fc30 but not 5.2.9-200.fc30 (which is what the power8 box was on) or 5.3.2-300.fc30.

However, sadly, I'm still having issues: although the tests now seem to always start correctly on the new boxes, they frequently suddenly die partway through with no obvious explanation showing up in any logs I can find. That's gonna be fun to debug. Let's close this one for now, though.
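
The kernel versions named in this comment are easy to misorder, because a plain string sort puts 5.2.17 before 5.2.9. The Python sketch below sorts them numerically and lists what was observed for each. The observation labels are paraphrased from this thread, the helper names are illustrative, and the 5.2.9 data point comes from a different machine (the power8 host), so this is a summary rather than a clean bisection.

```python
def kernel_key(nvr):
    """Turn '5.2.17-200.fc30' into ((5, 2, 17), 200) for numeric sorting."""
    version, _, release = nvr.partition("-")
    version_key = tuple(int(part) for part in version.split("."))
    release_key = int(release.split(".", 1)[0]) if release else 0
    return version_key, release_key


# Observations as reported in this bug (labels are paraphrased).
OBSERVED = {
    "5.2.9-200.fc30": "boots from NFS (power8 host)",
    "5.2.17-200.fc30": "fails from NFS (power9 hosts)",
    "5.3.2-300.fc30": "starts correctly from NFS (power9 hosts)",
}


def ordered_observations(observed):
    """Return (kernel, observation) pairs in ascending kernel order."""
    return [(nvr, observed[nvr]) for nvr in sorted(observed, key=kernel_key)]
```

In numeric order the failing kernel sits between two working ones, which fits the reading that the problem was introduced and then fixed upstream, although the thread does not identify the responsible change.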

Comment 23 Adam Williamson 2019-10-03 22:08:01 UTC

*** Bug 1757250 has been marked as a duplicate of this bug. ***
