Bug 1701078 - [Fedora29][aarch64][Gigabyte][r270] Internal error: Oops: 96000004 [#1] SMP
Summary: [Fedora29][aarch64][Gigabyte][r270] Internal error: Oops: 96000004 [#1] SMP
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 29
Hardware: aarch64
OS: All
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: ARMTracker
Reported: 2019-04-18 01:03 UTC by PaulB
Modified: 2019-12-24 10:54 UTC (History)
31 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-05-08 18:20:29 UTC
Type: Bug
Embargoed:



Description PaulB 2019-04-18 01:03:14 UTC
1. Please describe the problem:
aarch64 Gigabyte r270 systems with latest firmware (T49) fail to install
Fedora29 due to:
Internal error: Oops: 96000004 [#1] SMP 

2. What is the Version-Release number of the kernel:
distro: Fedora-29 Everything aarch64
kernel: 4.18.5-300.fc29.aarch64
anaconda: 29.24.3-1.fc29

3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :
unknown - I will test Fedora28 and follow up.


4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:
yes - this fails consistently

5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:
unknown - I will test Rawhide and follow up.


6. Are you running any modules that are not shipped directly with Fedora's kernel?:
no

7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.
---------------
see issue here:
---------------
distro: Fedora-29 Everything aarch64
kernel: 4.18.5-300.fc29.aarch64
anaconda: 29.24.3-1.fc29
https://beaker.engineering.redhat.com/jobs/3476180
http://beaker-archive.host.prod.eng.bos.redhat.com/beaker-logs/2019/04/34761/3476180/6744468/console.log
---<-snip->---
   
[anaconda]1:main* 2:shell  3:log  4:storage-log >Switch tab: Alt+Tab | Help: F1   
Starting installer, one moment... 
[   77.477558] Internal error: Oops: 96000004 [#1] SMP 
[   77.482437] Modules linked in: scsi_dh_rdac scsi_dh_emc scsi_dsysfillrect sysimgblt fb_sys_fops ghash_ce drm gpio_keys thunder_bgx thunderx_zip mdio_thunder thunder_xcv mdio_cavium of_mdio i2c_thunderx fixed_phy libphy thunderx_mmc sunrpc lrw dm_crypt dm_round_robin dm_multipath linear raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi squashfs zstd_decompress xxhash cramfs 
[   77.543857] CPU: 79 PID: 2657 Comm: NetworkManager Tainted: G        W         4.18.5-300.fc29.aarch64 #1 
[   77.553410] Hardware name: GIGABYTE R270-T60-00/MT60-SC0-00, BIOS T49 02/02/2018 
[   77.560799] pstate: 20400005 (nzCv daif +PAN -UAO) 
[   77.565593] pc : find_next_and_bit+0xc/0x70 
[   77.569769] lr : cpumask_local_spread+0x80/0x158 
[   77.574375] sp : ffff00002fe7b260 
[   77.577678] x29: ffff00002fe7b260 x28: 0000000000000011  
[   77.582981] x27: 0000000000000000 x26: 00000000006000c0  
[   77.588286] x25: ffff8107a20570b5 x24: 0000000000000000  
[   77.593590] x23: ffff000009854a38 x22: ffff00000966af58  
[   77.598894] x21: ffff00000966b1c4 x20: 0000000000000001  
[   77.604196] x19: 0000000000000001 x18: 00000000fffffffc  
[   77.609499] x17: 0000000000000000 x16: 0000000000000000  
[   77.614802] x15: 0000000000000001 x14: ffffffffffffffff  
[   77.620104] x13: ffff000000000000 x12: 0000000000000028  
[   77.625407] x11: 0101010101010101 x10: ffffff7fff7f7f7f  
[   77.630709] x9 : 0000000000000000 x8 : ffff81079a546880  
[   77.636012] x7 : 0000000000000000 x6 : 000000000000007f  
[   77.641314] x5 : 0000000000000080 x4 : 0000000000000000  
[   77.646617] x3 : 0000000000000000 x2 : 0000000000000060  
[   77.651920] x1 : ffff00000966af58 x0 : 0000000000000000  
[   77.657226] Process NetworkManager (pid: 2657, stack limit = 0x0000000024edb7a9) 
[   77.664611] Call trace: 
[   77.667049]  find_next_and_bit+0xc/0x70 
[   77.670887]  nicvf_open+0x81c/0x958 [nicvf] 
[   77.675066]  __dev_open+0xd4/0x170 
[   77.678458]  __dev_change_flags+0x168/0x1c8 
[   77.682630]  dev_change_flags+0x34/0x70 
[   77.686461]  do_setlink+0x270/0xb98 
[   77.689939]  rtnl_newlink+0x3d0/0x700 
[   77.693590]  rtnetlink_rcv_msg+0x20c/0x2c0 
[   77.697679]  netlink_rcv_skb+0x40/0xf8 
[   77.701418]  rtnetlink_rcv+0x28/0x38 
[   77.704983]  netlink_unicast+0x1a4/0x278 
[   77.708894]  netlink_sendmsg+0x1a0/0x350 
[   77.712809]  sock_sendmsg+0x4c/0x68 
[   77.716287]  ___sys_sendmsg+0x230/0x260 
[   77.720112]  __sys_sendmsg+0x54/0x98 
[   77.723677]  sys_sendmsg+0x38/0x48 
[   77.727073]  __sys_trace_return+0x0/0x4 
[   77.730901] Code: d65f03c0 eb03005f 54000329 d346fc64 (f8647806)  
[   77.736987] ---[ end trace c6a1706aff09b047 ]--- 
---<-snip->---

Best,
-pbunyan
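
For anyone triaging a console log like the one above, here is a minimal sketch of pulling the frames out of an arm64 "Call trace:" section so the faulting function and any module-owned frames stand out. The function name `parse_call_trace` and the regular expression are illustrative assumptions based only on the log format shown in this report; they are not part of any kernel tooling.

```python
import re

# Matches frames such as "[   77.670887]  nicvf_open+0x81c/0x958 [nicvf]".
# The pattern is an assumption derived from the log lines in this bug.
FRAME_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+(?P<symbol>[\w.]+)"
    r"\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>\w+)\])?\s*$"
)


def parse_call_trace(log_text):
    """Return (symbol, offset, module) tuples from the first call trace."""
    frames = []
    in_trace = False
    for line in log_text.splitlines():
        if "Call trace:" in line:
            in_trace = True
            continue
        if not in_trace:
            continue
        match = FRAME_RE.match(line.strip())
        if match is None:
            # The first line that is not a frame ends the trace.
            break
        frames.append(
            (match["symbol"], int(match["offset"], 16), match["module"])
        )
    return frames
```

Applied to the Fedora 29 log, the first frame is `find_next_and_bit` and the first frame that belongs to a module is `nicvf_open` in `nicvf`.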

Comment 1 PaulB 2019-04-18 01:17:17 UTC
All,
---------------------
Here is a reproducer:
---------------------
 distro: Fedora-29 Everything aarch64
 kernel: 4.18.5-300.fc29.aarch64
 anaconda: 29.24.3-1.fc29
 https://beaker.engineering.redhat.com/jobs/3476413

Note systems that reproduce this issue are gigabyte-r270:
 GIGABYTE R270  BIOS T49 02/02/2018 


fwiw... the gigabyte-r120 systems with BIOS T49 install Fedora29 without issue:
 distro: Fedora-29 Everything aarch64
 kernel: 4.18.5-300.fc29.aarch64
 anaconda: 29.24.3-1.fc29
 host: gigabyte-r120 
 bios: BIOS T49 02/02/2018 
 https://beaker.engineering.redhat.com/jobs/3476078 - PASS
 https://beaker.engineering.redhat.com/jobs/3476079 - PASS



Best,
-pbunyan

Comment 2 PaulB 2019-04-24 02:14:19 UTC
All,
------------------------------------
Answering the outstanding questions:
------------------------------------

https://bugzilla.redhat.com/show_bug.cgi?id=1701078#c0
---<-snip->---
3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :
unknown - I will test Fedora28 and follow up.

yes - this issue is reproduced with Fedora28.
see here - https://beaker.engineering.redhat.com/jobs/3482697 - FAIL




https://bugzilla.redhat.com/show_bug.cgi?id=1701078#c0
---<-snip->---
4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:
yes - this fails consistently

5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:
unknown - I will test Rawhide and follow up.

yes - Fedora-Rawhide-20190417.n.0 Everything aarch64 also fails.
      Though, the failure is different.
see here: https://beaker.engineering.redhat.com/jobs/3484673
http://beaker-archive.host.prod.eng.bos.redhat.com/beaker-logs/2019/04/34846/3484673/6760278/console.log
---<-snip->---
[   59.756861] ---[ end trace 9061ffef8a40d3d7 ]--- 
[   59.761519] WARNING: CPU: 7 PID: 1669 at arch/arm64/mm/numa.c:60 cpumask_of_node+0x44/0x70 
[   59.769778] Modules linked in: vfat fat nicvf cavium_ptp cavium_rng_vf crct10dif_ce ghash_ce nicpf joydev mdio_thunder thunder_bgx mdio_cavium thunderx_zip thunder_xcv thunderx_edac cavium_rng ipmi_ssif ipmi_devintf ipmi_msghandler ast i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm gpio_keys i2c_thunderx thunderx_mmc 
[   59.800902] CPU: 7 PID: 1669 Comm: NetworkManager Tainted: G        W         5.1.0-0.rc5.git1.1.fc31.aarch64 #1 
[   59.811066] Hardware name: GIGABYTE R270-T65-00/MT60-SC5-00, BIOS T49 02/02/2018 
[   59.818454] pstate: 60400005 (nZCv daif +PAN -UAO) 
[   59.823238] pc : cpumask_of_node+0x44/0x70 
[   59.827327] lr : cpumask_local_spread+0xb8/0x160 
[   59.831935] sp : ffff00001d76b340 
[   59.835241] x29: ffff00001d76b340 x28: ffff81077e9ed44d  
[   59.840546] x27: 0000000000000000 x26: 0000000000000001  
[   59.845851] x25: ffff000011845374 x24: 0000000000000000  
[   59.851156] x23: ffff000011845374 x22: ffff000011845098  
[   59.856460] x21: 0000000000000001 x20: 0000000000000001  
[   59.861765] x19: 0000000000000001 x18: 00000000fffffffc  
[   59.867070] x17: 0000000000000000 x16: 0000000000000000  
[   59.872374] x15: 0000000000000001 x14: ffffffffffffffff  
[   59.877678] x13: ffff000000000000 x12: 0000000000000028  
[   59.882983] x11: 0101010101010101 x10: ffff7f7f7f7fff7f  
[   59.888288] x9 : 0000000000000000 x8 : ffff8107a938e180  
[   59.893592] x7 : 0000000000000000 x6 : 0000000000000000  
[   59.898897] x5 : 0000000000000080 x4 : ffffffffffffffff  
[   59.904201] x3 : 0000000000000000 x2 : 0000000000000000  
[   59.909506] x1 : 0000000000000060 x0 : 0000000000000001  
[   59.914810] Call trace: 
[   59.917249]  cpumask_of_node+0x44/0x70 
[   59.920991]  cpumask_local_spread+0xb8/0x160 
[   59.925257]  nicvf_register_interrupts+0x324/0x388 [nicvf] 
[   59.930737]  nicvf_open+0x2a8/0x6f8 [nicvf] 
[   59.934913]  __dev_open+0xdc/0x178 
[   59.938307]  __dev_change_flags+0x170/0x1c8 
[   59.942482]  dev_change_flags+0x3c/0x78 
[   59.946311]  do_setlink+0x7c8/0x9a0 
[   59.949792]  __rtnl_newlink+0x590/0x6a8 
[   59.953620]  rtnl_newlink+0x54/0x80 
[   59.957101]  rtnetlink_rcv_msg+0x184/0x538 
[   59.961189]  netlink_rcv_skb+0x40/0xf8 
[   59.964930]  rtnetlink_rcv+0x28/0x38 
[   59.968497]  netlink_unicast+0x15c/0x1d0 
[   59.972412]  netlink_sendmsg+0x1b0/0x350 
[   59.976326]  sock_sendmsg+0x4c/0x68 
[   59.979807]  ___sys_sendmsg+0x288/0x2b8 
[   59.983634]  __sys_sendmsg+0x64/0xa0 
[   59.987202]  __arm64_sys_sendmsg+0x2c/0x38 
[   59.991291]  el0_svc_common+0x78/0x128 
[   59.995032]  el0_svc_handler+0x38/0x78 
[   59.998773]  el0_svc+0x8/0xc 
[   60.001646] irq event stamp: 0 
[   60.004692] hardirqs last  enabled at (0): [<0000000000000000>]           (null) 
[   60.012080] hardirqs last disabled at (0): [<ffff0000100f97c4>] copy_process.isra.0.part.0+0x304/0x1500 
[   60.021464] softirqs last  enabled at (0): [<ffff0000100f97c4>] copy_process.isra.0.part.0+0x304/0x1500 
[   60.030852] softirqs last disabled at (0): [<0000000000000000>]           (null) 
[   60.038246] ---[ end trace 9061ffef8a40d3d8 ]---
---<-snip->---


Best,
-pbunyan

Comment 3 PaulB 2019-04-24 02:15:43 UTC
pwhalen,
What's the process for nursing an aarch64 Fedora BZ along?
Who assigns the Fedora BZ?

Thank you, Paul.

Best,
-pbunyan

Comment 4 Peter Robinson 2019-04-24 14:03:04 UTC
Do you have any SR-IOV VF (virtual function) or similar functionality enabled in the BIOS?

Comment 5 Paul Whalen 2019-04-24 17:03:50 UTC
(In reply to PaulB from comment #3)
> pwhalen,
> What's the process for nursing an aarch64 Fedora BZ along?
> Who assigns the Fedora BZ?

It should get picked up by one of the maintainers. Added to the ARM Tracker and I'll keep an eye on it.

Comment 6 PaulB 2019-04-24 20:26:50 UTC
(In reply to Paul Whalen from comment #5)
> (In reply to PaulB from comment #3)
> > pwhalen,
> > What's the process for nursing an aarch64 Fedora BZ along?
> > Who assigns the Fedora BZ?
> 
> It should get picked up by one of the maintainers. Added to the ARM Tracker
> and I'll keep an eye on it.

Paul,
Please add the Fedora maintainer to the cc list so the BZ is on their radar.

Thank you.
Best,
-pbunyan

Comment 7 Jeremy Linton 2019-04-24 21:31:15 UTC
I think the VFs must be enabled, since nicvf_main only gets triggered if VFs are found.

But in the 5.1 crash I suspect there is an error in the SRAT/SLIT/DSDT (looking closer, maybe the NIC is trying to set the node, and no such node exists), which means the node request likely isn't valid. If a couple of prints are sprinkled around nicvf_set_irq_affinity() and compared with the ACPI node information, I'm guessing you will see a mismatch.
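
Short of adding prints to the driver, one way to look for that kind of mismatch from userspace on a system that does boot is to compare the node each PCI device reports with the set of online nodes. The sketch below assumes the values were read from the standard sysfs files `/sys/bus/pci/devices/<address>/numa_node` and `/sys/devices/system/node/online`; the helper names `parse_node_list` and `find_node_mismatches` are hypothetical.

```python
def parse_node_list(text):
    """Expand a sysfs list such as '0-1' or '0,2-3' into a set of ints."""
    nodes = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            nodes.update(range(int(low), int(high) + 1))
        else:
            nodes.add(int(part))
    return nodes


def find_node_mismatches(device_nodes, online_nodes):
    """Return the devices whose reported node is not an online node.

    device_nodes maps a PCI address to the integer read from its
    numa_node file. A value of -1 means no node was assigned, and it
    is reported too, since it is also worth a second look here.
    """
    return {
        device: node
        for device, node in device_nodes.items()
        if node not in online_nodes
    }
```

Any device that shows up in the result is a candidate for the bad node request described above.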

Comment 8 Jeremy Linton 2019-04-24 21:32:54 UTC
oh, just to complete the install, try `modprobe.blacklist=nicvf` on the kernel command line.

Comment 9 Jeremy Poulin 2019-04-30 19:52:54 UTC
Setting `modprobe.blacklist=nicvf` appears to result in a dracut initqueue timeout and drops you into the dracut emergency shell.
https://beaker.engineering.redhat.com/jobs/3507807

That said, this could be an issue with the underlying gigabyte r270 that I'm using.
I've queued a copy of Paul's initial recipe on my host to verify that I can reproduce the original issue:
https://beaker.engineering.redhat.com/jobs/3507865

I will update this issue with links to the console logs as soon as the jobs complete.

Comment 10 Jeremy Poulin 2019-04-30 20:42:22 UTC
Results confirm my findings:
w/ modprobe.blacklist=nicvf -> Dracut initqueue timeout
http://beaker-archive.host.prod.eng.bos.redhat.com/beaker-logs/2019/04/35078/3507807/6807486/console.log

w/out modprobe.blacklist=nicvf -> Panic reproduced
http://beaker-archive.host.prod.eng.bos.redhat.com/beaker-logs/2019/04/35078/3507865/6807589/console.log

Comment 11 Jeremy Poulin 2019-05-01 14:00:33 UTC
Is there anything further we can test on our side to assist with the debugging process?

Comment 12 Peter Robinson 2019-05-01 15:48:40 UTC
Can you clarify what releases/kernels you are testing on?

It would likely be useful to re-test on F-30 GA release.

Can you also clarify whether there have been any changes in the firmware around NICs/SR-IOV etc.?

Comment 13 PaulB 2019-05-01 17:15:35 UTC
(In reply to Peter Robinson from comment #12)
> Can you clarify what releases/kernels you are testing on?

Questions have already been answered in the previous comments of this BZ:
-----------------------------------
Fedora29 [4.18.5-300.fc29.aarch64]:
-----------------------------------
https://bugzilla.redhat.com/show_bug.cgi?id=1701078#c0
see Beaker job: https://beaker.engineering.redhat.com/jobs/3476180

-----------------------------------
Fedora28 [4.16.3-301.fc28.aarch64]:
-----------------------------------
https://bugzilla.redhat.com/show_bug.cgi?id=1701078#c2
see Beaker job: https://beaker.engineering.redhat.com/jobs/3482697

--------------------------------------------------------------
Fedora-Rawhide-20190417.n.0 [5.1.0-0.rc5.git1.1.fc31.aarch64]: 
--------------------------------------------------------------
https://bugzilla.redhat.com/show_bug.cgi?id=1701078#c2
see Beaker job: https://beaker.engineering.redhat.com/jobs/3484673

> 
> It would likely be useful to re-test on F-30 GA release.
> 
Jeremy Poulin <jpoulin>, please retest F-30 GA release, for Peter.

> Can you also clarify whether there have been any changes in the firmware
> around NICs/SR-IOV etc.
Firmware version T49 has been around for some time.
There have been no recent changes.

Best,
-pbunyan

Comment 14 Jeremy Poulin 2019-05-01 17:36:08 UTC
To Summarize the Tests I ran
============================
F29 with flag
https://beaker.engineering.redhat.com/jobs/3507807 -> http://beaker-archive.host.prod.eng.bos.redhat.com/beaker-logs/2019/04/35078/3507807/6807486/console.log
F29 without flag
https://beaker.engineering.redhat.com/jobs/3507865 -> http://beaker-archive.host.prod.eng.bos.redhat.com/beaker-logs/2019/04/35078/3507865/6807589/console.log

Both my tests were run with the following:
distro: Fedora-29 Everything aarch64
kernel: 4.18.5-300.fc29.aarch64

The test that panicked (w/out modprobe.blacklist=nicvf) got to anaconda, and was using version:
anaconda: 29.24.3-1.fc29

The test that included modprobe.blacklist=nicvf never reached the anaconda step, else I believe it would use the same version.

BIOS Date: 02/02/2018 14:11:01 Ver: T49
(This is the latest firmware to my knowledge - https://www.gigabyte.com/us/ARM-Server/R270-T65-rev-100#support-dl-bios).

> Can you also clarify whether there have been any changes in the firmware around NICs/SR-IOV etc.
I defer to Paul's answer on this.

I will run the same jobs targeting F30 GA.

Comment 15 Jeremy Poulin 2019-05-01 19:11:42 UTC
F30 Results
===========
F30 with flag
https://beaker.engineering.redhat.com/jobs/3509182 -> https://beaker.engineering.redhat.com/recipes/6810199/logs/console.log
This times out in the initqueue step just like it had for F29.

F30 without flag
https://beaker.engineering.redhat.com/jobs/3509076 -> http://beaker-archive.host.prod.eng.bos.redhat.com/beaker-logs/2019/05/35090/3509076/6809994/console.log
The panic is still present.

distro: Fedora-30 Everything aarch64
kernel: 5.0.9-301.fc30.aarch64
anaconda: 30.25.6-2.fc30
BIOS Date: 02/02/2018 14:11:01 Ver: T49

Comment 16 Peter Robinson 2019-05-01 22:37:36 UTC
(In reply to PaulB from comment #13)
> (In reply to Peter Robinson from comment #12)
> > Can you clarify what releases/kernels you are testing on?
> 
> Questions have already been answered in the previous comments of this BZ:

Actually no they weren't, there was no previous mention of F-28 in this bug at all. Please be a little bit more friendly if you actually want this dealt with!

> -----------------------------------
> Fedora29 [4.18.5-300.fc29.aarch64]:
> -----------------------------------
> https://bugzilla.redhat.com/show_bug.cgi?id=1701078#c0
> see Beaker job: https://beaker.engineering.redhat.com/jobs/3476180

This is a Fedora bug; beaker isn't publicly available to everyone who may be replying to this bug, so the references in Fedora are invalid. If the comments are private they don't exist in the Fedora community space, and I don't see any missing comment numbers so I don't believe that's the case.

> -----------------------------------
> Fedora28 [4.16.3-301.fc28.aarch64]:
> -----------------------------------
> https://bugzilla.redhat.com/show_bug.cgi?id=1701078#c2
> see Beaker job: https://beaker.engineering.redhat.com/jobs/3482697
> 
> --------------------------------------------------------------
> Fedora-Rawhide-20190417.n.0 [5.1.0-0.rc5.git1.1.fc31.aarch64]: 
> --------------------------------------------------------------
> https://bugzilla.redhat.com/show_bug.cgi?id=1701078#c2
> see Beaker job: https://beaker.engineering.redhat.com/jobs/3484673
> 
> > 
> > It would likely be useful to re-test on F-30 GA release.
> > 
> Jeremy Poulin <jpoulin>, please retest F-30 GA release, for Peter.
> 
> > Can you also clarify whether there have been any changes in the firmware
> > around NICs/SR-IOV etc.
> Firmware version T49 has been around for sometime.
> There have been no recent changes.

That does not answer my question. To reword it: has there been any specific configuration of NFV, or related changes, within the firmware configuration? Can you reset the firmware to default settings? We've had numerous ThunderX systems (both X1 and X2) confirmed running without issues on all recent versions of Fedora.

The only relatively recent issue we've had was an issue with their crypto drivers and that was some time ago and it was fixed in time for the GA release (28 I think from memory) so this issue is something specific to this system hence the questions, I am trying to ascertain what is different.

Comment 17 Jeremy Poulin 2019-05-02 18:56:32 UTC
> That does not answer my question. To reword it: has there been any specific
> configuration of NFV, or related changes, within the firmware configuration?
> Can you reset the firmware to default settings? We've had numerous ThunderX
> systems (both X1 and X2) confirmed running without issues on all recent
> versions of Fedora.
> 
> The only relatively recent issue we've had was an issue with their crypto
> drivers and that was some time ago and it was fixed in time for the GA
> release (28 I think from memory) so this issue is something specific to this
> system hence the questions, I am trying to ascertain what is different.

We checked the host in question and there were no configuration files related to NFV, kvm, or anything we thought would be related. We are in the process of resetting the firmware back to default settings, and I'm going to re-run the jobs for F29 to determine if the issues persist with default settings. Do you need me to test anything outside of F29 on the fresh host?

Thanks!

Comment 18 Peter Robinson 2019-05-02 19:07:05 UTC
F-30 would be good, in Fedora we never re-spin the installers so ultimately we need to look forward to F-30+

Comment 19 Jeremy Poulin 2019-05-02 19:08:48 UTC
Good to know. I will include results for F-30. :)

Some additional information since I'm not sure what might be relevant:

== Platform Information ==
Manufacturer: Cavium
Product Name: ThunderX CRB
BIOS Version: T49
BIOS Release Date: 02/02/2018

== Firmware Information ==
Product Name: MergePoint EMS
Product Information: MergePoint Embedded Management Software
Firmware Version: 7.70
Firmware Updated: 06 Oct 2016, 19:13:24 (UTC+0000)
ASIC Type: ast2400

== CPLD Information ==
MB CPLD Version: R06
BPB CPLD Version: R03

Additionally, there didn't seem to be any NFV-related options in the firmware configuration.

Comment 20 Jeremy Linton (ARM) 2019-05-02 19:53:39 UTC
So, it may be the actual difference between the machines is a variation in the thunderX model. The problematic one from the log is a CN8890-2000BG2601-ST-Y-G, AKA it has a bunch of extra accelerators that aren't part of the normal CP model. It also looks like the machine is booting in DT mode, which AFAIK, is not really optimal as this is an enterprise platform.

You might add `acpi=force` or assure that the firmware is running in ACPI mode.

Looking at the log (usually I too tend to keep fedora defects "community" by using a public id), it seems there are a number of firmware problems with node and IOMMU Ids:

[   30.985300] Failed to set up IOMMU for device 0000:01:01.4; retaining platform DMA ops 
[   30.993496] thunderx_mmc: probe of 0000:01:01.4 failed with error -2 
[   30.993759] Failed to set up IOMMU for device 0000:01:01.3; retaining platform DMA ops 
[   30.999979] Failed to set up IOMMU for device 0004:01:01.4; retaining platform DMA ops 
[   31.008101] libphy: mdio_thunder: probed 
[   31.015939] thunderx_mmc: probe of 0004:01:01.4 failed with error -2 
[   31.016588] thunder_xcv, ver 1.0 
[   31.016707] Failed to set up IOMMU for device 0000:01:09.2; retaining platform DMA ops 
[   31.020400] mdio_thunder 0000:01:01.3: Added bus at 87e005003800 
[   31.030270] thunder_bgx, ver 1.0 
[   31.030388] Failed to set up IOMMU for device 0000:03:00.0; retaining platform DMA ops 
[   31.037402] libphy: mdio_thunder: probed 
[   31.043258] Failed to set up IOMMU for device 0000:01:10.0; retaining platform DMA ops 
[   31.046642] mdio_thunder 0000:01:01.3: Added bus at 87e005003880 
[   31.072434] Failed to set up IOMMU for device 0004:01:01.3; retaining platform DMA ops 
[   31.072949] i2c-thunderx 0000:01:09.2: Probed. Set system clock to 800000000 
[   31.073916] input: soc@0:gpio-keys as /devices/platform/soc@0/soc@0:gpio-keys/input/input3 
[   31.083501] libphy: mdio_thunder: probed 
[   31.092988] i2c-thunderx 0000:01:09.2: SMBUS alert not active on this bus 
[   31.101645] mdio_thunder 0004:01:01.3: Added bus at 97e005003800 
[   31.105526] ThunderX-ZIP 0000:03:00.0: Found ZIP device 0 177d:a01a on Node 0 
[   31.105603] Failed to set up IOMMU for device 0000:01:09.4; retaining platform DMA ops 
[   31.112374] libphy: mdio_thunder: probed 
[   31.118482] thunder_bgx 0000:01:10.0: BGX0 QLM mode: XFI 
[   31.118552] Failed to set up IOMMU for device 0004:03:00.0; retaining platform DMA ops 
[   31.118570] ThunderX-ZIP 0004:03:00.0: Found ZIP device 1 177d:a01a on Node -1 
[   31.123141] alg: No test for lzs (lzs-cavium) 
[   31.125549] mdio_thunder 0004:01:01.3: Added bus at 97e005003880 
[   31.133824] i2c-thunderx 0000:01:09.4: Probed. Set system clock to 800000000 
[   31.141709] alg: No test for lzs (lzs-scomp-cavium) 
[   31.142626] i2c-thunderx 0000:01:09.4: SMBUS alert not active on this bus 
[   31.144143] audit: type=1130 audit(1556737441.150:7): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=systemd-udev-trigger comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success' 
[   31.207445] Failed to set up IOMMU for device 0004:01:09.4; retaining platform DMA ops 
[   31.208088] Failed to set up IOMMU for device 0000:01:10.1; retaining platform DMA ops 
[   31.216015] i2c-thunderx 0004:01:09.4: Probed. Set system clock to 800000000 
[   31.223415] thunder_bgx 0000:01:10.1: BGX1 QLM mode: XLAUI 
[   31.230429] i2c-thunderx 0004:01:09.4: SMBUS alert not active on this bus 
[   31.236398] Failed to set up IOMMU for device 0004:01:10.0; retaining platform DMA ops 
[   31.250801] thunder_bgx 0004:01:10.0: BGX2 QLM mode: XLAUI 
[   31.256714] Failed to set up IOMMU for device 0004:01:10.1; retaining platform DMA ops 
[   31.264754] thunder_bgx 0004:01:10.1: BGX3 QLM mode: XLAUI 
[   31.277055] nicpf, ver 1.0 
[   31.279912] Failed to set up IOMMU for device 0002:01:00.0; retaining platform DMA ops 
[   31.280575] Failed to set up IOMMU for device 0008:21:00.0; retaining platform DMA ops  

I've put rrichter on CC; he may be able to help point in the right direction.
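
To make it easier to compare hosts, here is a rough sketch that collects the two kinds of hints visible in the excerpt above: devices for which IOMMU setup failed, and the node each driver reports for its device. The regular expressions are assumptions fitted to the messages quoted in this comment, and `summarize_firmware_hints` is a hypothetical helper name.

```python
import re

# PCI address in domain:bus:device.function form, e.g. 0004:03:00.0.
PCI_ADDR = r"[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]"

IOMMU_RE = re.compile(
    r"Failed to set up IOMMU for device (?P<dev>" + PCI_ADDR + r")"
)
NODE_RE = re.compile(
    r"(?P<dev>" + PCI_ADDR + r"): .* on Node (?P<node>-?\d+)"
)


def summarize_firmware_hints(log_text):
    """Return (iommu_failures, node_reports) found in a console log.

    iommu_failures is a list of PCI addresses in log order.
    node_reports maps a PCI address to the node number a driver printed.
    """
    iommu_failures = []
    node_reports = {}
    for line in log_text.splitlines():
        match = IOMMU_RE.search(line)
        if match:
            iommu_failures.append(match["dev"])
            continue
        match = NODE_RE.search(line)
        if match:
            node_reports[match["dev"]] = int(match["node"])
    return iommu_failures, node_reports
```

On the lines quoted above this would report node 0 for 0000:03:00.0 and node -1 for 0004:03:00.0.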

Comment 21 Jeremy Poulin 2019-05-03 18:58:45 UTC
No new information was obtained from running an install post firmware reset. I've explicitly listed the results below, just to be consistent.

My next test will be to try to force ACPI mode, as Jeremy suggested in https://bugzilla.redhat.com/show_bug.cgi?id=1701078#c20.


Post Firmware Reset Results
===========================
F30
---------------------------
w/out modprobe.blacklist=nicvf
https://beaker.engineering.redhat.com/jobs/3511486 -> http://beaker-archive.host.prod.eng.bos.redhat.com/beaker-logs/2019/05/35114/3511486/6815368/console.log

Panic still occurs.

w/ modprobe.blacklist=nicvf
https://beaker.engineering.redhat.com/jobs/3511487 -> http://beaker-archive.host.prod.eng.bos.redhat.com/beaker-logs/2019/05/35114/3511487/6815369/console.log

Timeout on initqueue still drops to emergency shell.

F29
---------------------------
w/out modprobe.blacklist=nicvf
https://beaker.engineering.redhat.com/jobs/3511488 -> http://beaker-archive.host.prod.eng.bos.redhat.com/beaker-logs/2019/05/35114/3511488/6815370/console.log

Panic still occurs.

w/ modprobe.blacklist=nicvf
https://beaker.engineering.redhat.com/jobs/3511489 -> http://beaker-archive.host.prod.eng.bos.redhat.com/beaker-logs/2019/05/35114/3511489/6815371/console.log

Timeout on initqueue still drops to emergency shell.

Comment 22 PaulB 2019-05-03 21:06:51 UTC
(In reply to Jeremy Poulin from comment #17)
> > That does not answer my question. To reword it: has there been any specific
> > configuration of NFV, or related changes, within the firmware configuration?
> > Can you reset the firmware to default settings? We've had numerous ThunderX
> > systems (both X1 and X2) confirmed running without issues on all recent
> > versions of Fedora.
> > 
> > The only relatively recent issue we've had was an issue with their crypto
> > drivers and that was some time ago and it was fixed in time for the GA
> > release (28 I think from memory) so this issue is something specific to this
> > system hence the questions, I am trying to ascertain what is different.
> 
> We checked the host in question and there were no configuration files
> related to NFV, kvm, or anything we thought would be related. We are in the
> process of resetting the firmware back to default settings, and I'm going to
> re-run the jobs for F29 to determine if the issues persist with default
> settings. Do you need me to test anything outside of F29 on the fresh host?
> 
> Thanks!

All,
I am adding winson.lin to this BZ.
Winson is excellent and is our firmware contact for Gigabyte systems.
He would have first hand knowledge and access to the firmware change log.


Winson - can you assist in answering the question regarding the system firmware and NFV, please.


---------------
reference note:
---------------
All the gigabyte systems have firmware version T49.

Please note this issue is seen when installing Fedora on the gigabyte-r270 system only.
Installing the gigabyte-r120 systems with Fedora is fine:
 https://beaker.engineering.redhat.com/jobs/3476078 - PASS
 https://beaker.engineering.redhat.com/jobs/3476079 - PASS

Also RHEL8 installs fine on both gigabyte-r270 and gigabyte-r120 systems.

Best,
-pbunyan

Comment 23 winson.lin 2019-05-04 07:55:45 UTC
Hi all,

I need the power-on console log from your systems, for the CPU SKU information.

For example:

SKU:   CN8890-2000BG2601-AAP-PR-Y-G

SKU:   CN8890-2000BG2601-CP-Y-G

SKU:   CN8890-2000BG2601-ST-Y-G


https://www.marvell.com/documents/o6h6who7rnkhiicjhbfh/

ThunderX_CP: Up to 48 highly efficient cores along with integrated vSoC, multiple 10/40 GbE and high memory
bandwidth. This family is optimized for private and public cloud web servers and content delivery, web caching and
social media data analytics workloads.

ThunderX_ST: Up to 48 highly efficient cores along with integrated vSoC, multiple SATAv3 controllers, 10/40 GbE
& PCIe Gen3 ports, high memory bandwidth, dual socket coherency, and scalable fabric for east-west as well as
north-south traffic connectivity. This family includes hardware accelerators for data protection/ integrity/security,
user to user efficient data movement (RoCE) and compressed storage. This family is optimized for Hadoop, block
& object storage, distributed file storage and hot/warm/cold storage type workloads.

ThunderX_SC: Up to 48 highly efficient cores along with integrated vSoC, 10/40 GbE connectivity, multiple PCIe
Gen3 ports, high memory bandwidth, dual socket coherency, and scalable fabric for east-west as well as
north-south traffic connectivity. The hardware accelerators include Cavium’s industry leading 4th generation
NITROX and TurboDPI technology with acceleration for IPSec, SSL, Anti-virus, Anti-malware, firewall and DPI.
This family is optimized for Secure Web frontend, security appliances and Cloud RAN type workloads.

ThunderX_NT: Up to 48 highly efficient cores along with integrated vSoC, 10/40/100 GbE connectivity, multiple
PCIe Gen3 ports, high memory bandwidth, dual socket coherency, and scalable fabric with feature rich capabilities
for bandwidth provisioning, QoS, traffic shaping and tunnel termination. The hardware accelerators include high
packet throughput processing, network virtualization and data monitoring. This family is optimized for media
servers, scale-out embedded application and NFV type workloads.

BR, Winson

Comment 24 winson.lin 2019-05-04 08:11:26 UTC
>>  I need the power-on console log from your systems, for the CPU SKU information.

I need the early power-on console logs from both your gigabyte-r270 and gigabyte-r120 systems.

Thank you.

BR, Winson

Comment 25 Jeremy Poulin 2019-05-06 15:02:56 UTC
So I tested out Jeremy Linton's suggestion to use acpi=force for Fedora 30, and that appeared to install properly:

F30 w/ acpi=force
---------------------------
https://beaker.engineering.redhat.com/jobs/3513166 -> http://beaker-archive.host.prod.eng.bos.redhat.com/beaker-logs/2019/05/35131/3513166/6819262/console.log

Despite the installation working correctly, the job still aborts. The issue that is encountered with this build is that the "restraint" package is not available; however, this is a known issue and is being tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1699254.

The relevant lines from the log are below:

+ yum -y install restraint-rhts beakerlib beakerlib-redhat
Last metadata expiration check: 0:00:46 ago on Fri May  3 15:34:53 2019.
Error: 
 Problem: conflicting requests
  - package restraint-rhts-0.1.39-1.fc30eng.x86_64 does not have a compatible architecture
  - nothing provides restraint(x86-64) = 0.1.39-1.fc30eng needed by restraint-rhts-0.1.39-1.fc30eng.x86_64
(try to add '--skip-broken' to skip uninstallable packages)

Comment 26 Jeremy Poulin 2019-05-06 15:10:17 UTC
Winson,

The SKU for my r270 is:
SKU:   CN8890-2000BG2601-ST-Y-G

The SKU for Paul's r120 is:
SKU:   CN8880-1800BG2601-CP-Y-G

Is this all the information you need?

Comment 27 Jeremy Poulin 2019-05-06 16:37:04 UTC
Just to confirm that the acpi=force does the trick on F29, I ran the job again expecting that it would pass (since restraint is built for aarch64 for Fedora 29).

It works as expected.

F29 w/ acpi=force
================
https://beaker.engineering.redhat.com/jobs/3476079 -> http://beaker-archive.host.prod.eng.bos.redhat.com/beaker-logs/2019/04/34760/3476079/6744278/console.log

Comment 29 Jeff Bastian 2019-05-06 22:03:46 UTC
The upstream kernel, and thus Fedora, both prefer DeviceTree first and fall back to ACPI.  We have switched that around in RHEL to make ACPI preferred since ACPI is required by the SBSA/SBBR standards for ARM Servers (and we tried to convince upstream to do the same, but they said no).  But when testing Fedora, it's easy to forget to add acpi=force to the kernel command line args.
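
Since the option is easy to forget, a test harness could check the kernel command line before it starts. Here is a minimal sketch in C, not taken from any existing tool: the helper name cmdline_has_token is hypothetical, and it assumes the command line is a whitespace-separated string (as read from /proc/cmdline) with no quoted arguments.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: return true if `token` appears as a complete
 * whitespace-delimited word in `cmdline`. Quoted arguments are not
 * handled; this is only a sketch.
 */
bool cmdline_has_token(const char *cmdline, const char *token)
{
    size_t len = strlen(token);
    const char *p = cmdline;

    if (len == 0)
        return false;

    while (*p) {
        /* Skip separators before the next word. */
        while (*p == ' ' || *p == '\t' || *p == '\n')
            p++;

        /* Find the end of the current word. */
        const char *start = p;
        while (*p && *p != ' ' && *p != '\t' && *p != '\n')
            p++;

        /* Require an exact match, not a substring match. */
        if ((size_t)(p - start) == len && strncmp(start, token, len) == 0)
            return true;
    }
    return false;
}
```

A caller would read the single line in /proc/cmdline into a buffer and pass it in with "acpi=force" as the token. Matching whole words avoids false positives from longer options that merely contain that text.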

Comment 30 PaulB 2019-05-08 12:36:40 UTC
(In reply to winson.lin from comment #24)
> >>  I need your side system power on console log , for CPU SKU information. 
> 
> Both need early power on console log from your side gigabyte-r270 and
> gigabyte-r120 systems.
> 
> Thanks you.
> 
> BR, Winson


Winson,
Thank you for your reply.
Jeremy added the info you requested here:
 https://bugzilla.redhat.com/show_bug.cgi?id=1701078#c26

However, as you can see, this issue is resolved for Fedora with the use of acpi=force on the kernel command line.

Comment 31 Jeremy Poulin 2019-05-08 16:20:33 UTC
I don't know if it would be helpful to link back to any relevant documentation to the upstream decision to reject the change in default preference as it relates to ARM (I searched for it but was unsuccessful), but otherwise I believe this issue can be closed as resolved.

Comment 32 Jeff Bastian 2019-05-08 18:01:00 UTC
I'm also having problems now finding the discussion in the mailing list archives, but here is the patch that made ACPI the fallback mechanism upstream:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b10d79f76085b577673395daf92d6208ae09196f

If you need more documentation, let me know.


And again, the RHEL kernel has a small patch that flips the behavior around and makes ACPI preferred and DeviceTree the fallback; it just changes one variable in arch/arm64/kernel/acpi.c:

-static bool param_acpi_on __initdata;
+static bool param_acpi_on __initdata = true;
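
To show why that one-variable change flips the preference, here is a simplified user-space model of the selection logic as I understand it from arch/arm64/kernel/acpi.c. It is a sketch, not the kernel code: struct boot_inputs and uses_acpi are hypothetical names, and the real code performs further steps (such as ACPI table validation) that are left out.

```c
#include <stdbool.h>

/* Hypothetical model of the boot-time inputs; not a kernel structure. */
struct boot_inputs {
    bool acpi_off;   /* acpi=off was given on the command line */
    bool acpi_on;    /* acpi=on was given, or the built-in default is true */
    bool acpi_force; /* acpi=force was given on the command line */
    bool dt_is_stub; /* firmware passed only a stub (empty) DeviceTree */
};

/*
 * Assumed rule: ACPI is used unless it was explicitly disabled, or a
 * populated DeviceTree exists and ACPI was not requested or forced.
 */
bool uses_acpi(const struct boot_inputs *in)
{
    if (in->acpi_off)
        return false;
    return in->acpi_on || in->acpi_force || in->dt_is_stub;
}
```

Under this model, an upstream or Fedora kernel (acpi_on defaulting to false) boots with DeviceTree whenever the firmware supplies a populated one, unless acpi=force or acpi=on is passed. The RHEL patch is equivalent to starting with acpi_on set to true.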

Comment 33 Jeremy Poulin 2019-05-08 18:20:29 UTC
Thanks Jeff!
I think what you have provided is sufficient context.

I am closing this as WONTFIX since while it may be an issue, the solution has already been discussed and rejected upstream.

If someone believes that there is sufficient grounds to re-open the discussion on this topic upstream, I'd welcome them to re-open this ticket to document the results of that discussion.

Comment 34 Mark Salter 2019-05-08 19:34:23 UTC
I found one thread. There are others but all end the same way.

http://lists.infradead.org/pipermail/linux-arm-kernel/2016-December/475059.html

Comment 35 PaulB 2019-05-09 18:19:20 UTC
Winson,
I brought this issue up in today's aarch64 meeting.
Seems the thought is that if the system's firmware has an option to "enable acpi" 
then the DeviceTree would not be offered.

All the gigabyte systems have firmware version T49.
I looked thru the firmware options in T49 for both r270 and r120 and I did not see
an option to "enable acpi" specifically.

Is there a plan to add this option in future firmware release?
Apologies for the recap - but the consensus is a better resolution would be a firmware fix,
rather than kernel command line option.

==========
reference:
==========
---------------------------------------------------------------------------------------
Please note this issue is seen when installing Fedora on the gigabyte-r270 system only:
---------------------------------------------------------------------------------------
Fedora29 [4.18.5-300.fc29.aarch64]:
 https://beaker.engineering.redhat.com/jobs/3476180

Fedora28 [4.16.3-301.fc28.aarch64]:
 https://beaker.engineering.redhat.com/jobs/3482697

Fedora-Rawhide-20190417.n.0 [5.1.0-0.rc5.git1.1.fc31.aarch64]: 
 https://beaker.engineering.redhat.com/jobs/3484673

---------------------------------------------------------
Installing the gigabyte-r120 systems with Fedora is fine:
---------------------------------------------------------
 https://beaker.engineering.redhat.com/jobs/3476079 - PASS
 https://beaker.engineering.redhat.com/jobs/3476079 - PASS

Best,
-pbunyan

Comment 36 winson.lin 2019-05-10 06:23:23 UTC
Hi ALL, 

ftp://ODMcustomer:download@ftp.gigabyte.com.tw/ThunderX/BIOS/F02a/

Please download F02a for NFV. ( if still call trace , then you can adjust ACPI setup items for debug )

BR, Winson

Comment 37 winson.lin 2019-05-10 06:36:23 UTC
Hi ALL,

Has Fedora29 been used on Gigabyte ThunderX2 ARM servers?

https://www.gigabyte.com/tw/ARM-Server/

( R281-T94 / R281-T91 / R181-T92 / R181-T90  )

BR, Winson

Comment 38 PaulB 2019-05-14 14:03:16 UTC
(In reply to winson.lin from comment #37)
> Hi ALL,
> 
> About Fedora29 have use on Gigabyte ThunderX2 ARM Server ? 
> 
> https://www.gigabyte.com/tw/ARM-Server/
> 
> ( R281-T94 / R281-T91 / R181-T92 / R181-T90  )
> 
> BR, Winson

Winson,
We have no "Gigabyte" cn99xx ThunderX2 systems at this time.
The "Gigabyte" aarch64 systems we currently have are all
cn88xx ThunderX systems.

We do have other vendor cn99xx ThunderX2 systems.
However as you know each vendor has their own firmwares.
The other vendor firmware has the enable/disable acpi option in the bios.


Also I have downloaded and updated one of our R270,T60 (cn88xx) with 
firmware F02a:
 https://bugzilla.redhat.com/show_bug.cgi?id=1701078#c36

I see that firmware F02a has the enable/disable acpi option.
I have enabled acpi in the firmware and am currently retesting.
I will follow up when the results are complete.

Thank you for your attention/assistance, Winson.

Best,
-pbunyan

Comment 39 PaulB 2019-05-15 20:28:13 UTC
(In reply to PaulB from comment #38)
> (In reply to winson.lin from comment #37)
> > Hi ALL,
> > 
> > About Fedora29 have use on Gigabyte ThunderX2 ARM Server ? 
> > 
> > https://www.gigabyte.com/tw/ARM-Server/
> > 
> > ( R281-T94 / R281-T91 / R181-T92 / R181-T90  )
> > 
> > BR, Winson
> 
> Winson,
> We have no "Gigabyte" cn99xx ThunderX2 systems at this time.
> The "Gigabyte" aarch64 systems we currently have are all
> cn88xx ThunderX systems.
> 
> We do have other vendor cn99xx ThunderX2 systems.
> However as you know each vendor has their own firmwares.
> The other vendor firmware has the enable/disable acpi option in the bios.
> 
> 
> Also I have downloaded and updated one of our R270,T60 (cn88xx) with 
> firmware F02a:
>  https://bugzilla.redhat.com/show_bug.cgi?id=1701078#c36
> 
> I see that firmware F02a has the enable/disable acpi option.
> I have enabled acpi in the firmware and am currently retesting.
> I will follow up when the results are complete.
> 
> Thank you for your attention/assistance, Winson.
> 
> Best,
> -pbunyan


All,
Retesting  R270,T60 (cn88xx) with firmware F02a (with acpi enabled in the bios),
I am, unfortunately, able to reproduce this issue:
 Fedora29: https://beaker.engineering.redhat.com/jobs/3536831
 Fedora30: https://beaker.engineering.redhat.com/jobs/3536832

So it seems the bios option is NOT working as expected in firmware F02a.

Best,
-pbunyan

Comment 40 yili 2019-12-24 10:54:02 UTC
Tested on R270 with firmware F02 (https://www.gigabyte.cn/ARM-Server/R270-T64-rev-110/support#support-dl-bios/)
and was able to reproduce the issue on kernel 4.19.90 from https://www.kernel.org/

