Bug 1937129 - page fault with nouveau on jetson-tk1
Summary: page fault with nouveau on jetson-tk1
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 34
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: ARMTracker
Reported: 2021-03-09 22:17 UTC by Nicolas Chauvet (kwizart)
Modified: 2021-08-17 14:36 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-08-17 14:10:02 UTC
Type: Bug
Embargoed:


Attachments
dmesg with fedora kernel. (deleted)
2021-03-10 14:30 UTC, Nicolas Chauvet (kwizart)

Description Nicolas Chauvet (kwizart) 2021-03-09 22:17:45 UTC
Description of problem:

I'm hitting a page fault (kernel BUG) in the nouveau driver when running gnome-shell with the Fedora kernel on a jetson-tk1 (armhfp).



Version-Release number of selected component (if applicable):
kernel-5.11.5-300.fc34.armv7hl

How reproducible:
always

Steps to Reproduce:
1. On a jetson-tk1 with GNOME installed, run: systemctl isolate graphical

Actual results:
page:1706ccc7 refcount:0 mapcount:0 mapping:29d7e10e index:0x10039 pfn:0xf0481
aops:anon_aops.1 ino:48d7
flags: 0xf800000()
raw: 0f800000 eec8a24c efbe1678 c2686110 00010039 00000000 ffffffff 00000000
raw: 00000000
page dumped because: VM_BUG_ON_PAGE(((unsigned int) page_ref_count(page) + 127u <= 127u))
------------[ cut here ]------------
kernel BUG at include/linux/mm.h:1179!
Internal error: Oops - BUG: 0 [#1] SMP ARM
Modules linked in: rfkill ofpart spi_nor mtd snd_soc_tegra30_i2s snd_soc_tegra_pcm tegra_drm snd_soc_tegra_rt5640 snd_soc_tegra_utils snd_soc_rt5640 snd_hda_codec_hdmi snd_soc_rl6231 snd_hd>
CPU: 2 PID: 859 Comm: gnome-shell Not tainted 5.11.5-300.fc34.armv7hl #1
Hardware name: NVIDIA Tegra SoC (Flattened Device Tree)
PC is at get_page+0x20/0x38
LR is at __dump_page+0x110/0x464
pc : [<c04caeec>]    lr : [<c04c69ec>]    psr: 60000113
sp : c73ebdf0  ip : 2eb7a000  fp : a747e000
r10: a747f000  r9 : 0000071f  r8 : c44b5600
r7 : a747f000  r6 : c75d21fc  r5 : 00000000  r4 : eec8a224
r3 : 00000027  r2 : 00000027  r1 : 00000000  r0 : 00000059
Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
Control: 10c5387d  Table: 873e406a  DAC: 00000051
Process gnome-shell (pid: 859, stack limit = 0xdd661172)
Stack: (0xc73ebdf0 to 0xc73ec000)
bde0:                                     eec8a224 c04cd81c c2b399c0 eefc919c
be00: eec8a224 0000071f c2b399c0 a747f000 000f0481 00000001 00000000 c04cdbb4
be20: 00000000 00000000 00000000 00000000 00000000 c5089000 00000001 c7208d00
be40: c2b399c0 00000001 0000071f c04cdcec 0000071f 00000000 000f0481 bf29a5c0
be60: 0000071f 00000001 00000080 00000010 00000000 00000001 00000000 00000000
be80: 00000000 00000000 c352ea30 00000000 c73ebef4 c5089000 c2b399c0 c73ebfb0
bea0: 00000040 c44b5648 00000800 bf3bda84 c73ebef4 c2b399c0 00000255 a747e000
bec0: c73ebfb0 c04cb7f4 00000001 c2b399c0 00000255 c04ceff0 c1b9a894 bebbf974
bee0: c5ec4200 c099f9f4 fffffff3 c099f9f4 00000000 c2b399c0 00000255 00100cca
bf00: 00010038 a747e000 c73e69d0 c73e69d0 00000000 00000000 00000000 00000000
bf20: 00000000 eefc9164 c03002a4 c73ebfb0 a747e000 c2b399c0 c44b5600 00000805
bf40: 00000255 c44b5648 00000800 c0d37788 00000000 c04023b8 c5bdfa28 c5bdf800
bf60: c5bdfa1c 00000805 a747e000 ffffffff c73ebfb0 c1510e20 aef3db00 00001000
bf80: 00000000 c031400c a747e000 00000805 c73ebfb0 0000906e ae9757f0 40000010
bfa0: ffffffff 10c5387d 10c5387d c0300e80 0000906e 2001e000 a747e000 a747e000
bfc0: 0152f7f8 0152eee8 0152f7f8 0152ed38 0152ee58 aef3db00 00001000 00000000
bfe0: 00000000 bebbf9c0 afd22fec ae9757f0 40000010 ffffffff 00000000 00000000
[<c04caeec>] (get_page) from [<c04cd81c>] (insert_page+0xa8/0x114)
[<c04cd81c>] (insert_page) from [<c04cdbb4>] (__vm_insert_mixed+0x94/0x1ac)
[<c04cdbb4>] (__vm_insert_mixed) from [<c04cdcec>] (vmf_insert_mixed_prot+0x20/0x28)
[<c04cdcec>] (vmf_insert_mixed_prot) from [<bf29a5c0>] (ttm_bo_vm_fault_reserved+0x280/0x318 [ttm])
[<bf29a5c0>] (ttm_bo_vm_fault_reserved [ttm]) from [<bf3bda84>] (nouveau_ttm_fault+0x60/0x90 [nouveau])
[<bf3bda84>] (nouveau_ttm_fault [nouveau]) from [<c04cb7f4>] (__do_fault+0x58/0xb0)
[<c04cb7f4>] (__do_fault) from [<c04ceff0>] (handle_mm_fault+0x7c0/0x97c)
[<c04ceff0>] (handle_mm_fault) from [<c0d37788>] (do_page_fault+0x2c0/0x348)
[<c0d37788>] (do_page_fault) from [<c031400c>] (do_DataAbort+0x3c/0xbc)
[<c031400c>] (do_DataAbort) from [<c0300e80>] (__dabt_usr+0x40/0x60)
Exception stack(0xc73ebfb0 to 0xc73ebff8)
bfa0:                                     0000906e 2001e000 a747e000 a747e000
bfc0: 0152f7f8 0152eee8 0152f7f8 0152ed38 0152ee58 aef3db00 00001000 00000000
bfe0: 00000000 bebbf9c0 afd22fec ae9757f0 40000010 ffffffff
Code: e353007f 8a000002 e59f1014 ebffef94 (e7f001f2) 
---[ end trace 38b95f8878f32175 ]---

Expected results:
No page fault.

Additional info:
I can't reproduce this with the grate downstream kernel, which is based on linux-next 20210302.
I will try to reproduce it with vanilla linux-next in the coming days.

Comment 1 Nicolas Chauvet (kwizart) 2021-03-10 12:42:24 UTC
FYI, I can't reproduce this with linux-next 20210302.

Will try with 5.12-rc1...

Comment 2 Nicolas Chauvet (kwizart) 2021-03-10 13:10:57 UTC
5.12-rc1 still has the page fault bug, but the fault triggered is a different one (in polkitd). With this kernel I do get a graphical display, but it is too unstable to verify GPU acceleration.


[   58.003759] BUG: Bad page state in process polkitd  pfn:ee9b1
[   58.009509] page:8a64ce78 refcount:2 mapcount:129 mapping:473e54ab index:0x0 pfn:0xee9b1
[   58.017597] aops:0xc0b0ea14 ino:1749
[   58.021177] flags: 0x40000000()
[   58.024339] raw: 40000000 00000100 00000122 c43d81f8 00000000 00000000 00000080 00000002
[   58.032422] page dumped because: nonzero _refcount
[   58.037204] Modules linked in: nouveau tegra_drm host1x drm_ttm_helper tegra_soctherm ttm iova zram zsmalloc xhci_tegra ci_hdrc_tegra phy_tegra_xusb ahci_tegra libahci_platform tegra124_e
[   58.061017] CPU: 2 PID: 689 Comm: polkitd Not tainted 5.12.0-rc2-tegra+ #198
[   58.068051] Hardware name: NVIDIA Tegra SoC (Flattened Device Tree)
[   58.074305] [<c010ec40>] (unwind_backtrace) from [<c010a1ec>] (show_stack+0x10/0x14)
[   58.082039] [<c010a1ec>] (show_stack) from [<c0a86b20>] (dump_stack+0xc0/0xd4)
[   58.089250] [<c0a86b20>] (dump_stack) from [<c02341ec>] (bad_page+0xdc/0x10c)
[   58.096373] [<c02341ec>] (bad_page) from [<c02383d4>] (get_page_from_freelist+0xde8/0x116c)
[   58.104709] [<c02383d4>] (get_page_from_freelist) from [<c0238cd8>] (__alloc_pages_nodemask+0x17c/0x1014)
[   58.114258] [<c0238cd8>] (__alloc_pages_nodemask) from [<c021e478>] (__pte_alloc+0x24/0x178)
[   58.122679] [<c021e478>] (__pte_alloc) from [<c021fb40>] (copy_page_range+0x6e4/0xa18)
[   58.130580] [<c021fb40>] (copy_page_range) from [<c011f154>] (dup_mm+0x328/0x458)
[   58.138050] [<c011f154>] (dup_mm) from [<c011fee4>] (copy_process+0x980/0x16c4)
[   58.145344] [<c011fee4>] (copy_process) from [<c0120e9c>] (kernel_clone+0xa4/0x3e4)
[   58.152986] [<c0120e9c>] (kernel_clone) from [<c01214a0>] (sys_clone+0x74/0x90)
[   58.160281] [<c01214a0>] (sys_clone) from [<c01000c0>] (ret_fast_syscall+0x0/0x58)
[   58.167835] Exception stack(0xc56fffa8 to 0xc56ffff0)
[   58.172873] ffa0:                   b491e078 00000001 01200011 00000000 00000000 00000000
[   58.181032] ffc0: b491e078 00000001 b4face1c 00000078 bea4a000 b491e550 00000001 bea4a264
[   58.189188] ffe0: b491e010 bea49e38 b4f018ec b4f017fc
[   58.194225] Disabling lock debugging due to kernel taint
[   58.199523] BUG: Bad page state in process polkitd  pfn:ee9b2
[   58.205253] page:8be0376d refcount:2 mapcount:129 mapping:473e54ab index:0x0 pfn:0xee9b2
[   58.213328] aops:0xc0b0ea14 ino:1749
[   58.216892] flags: 0x40000000()
[   58.220025] raw: 40000000 00000100 00000122 c43d81f8 00000000 00000000 00000080 00000002
[   58.228096] page dumped because: nonzero _refcount
[   58.232872] Modules linked in: nouveau tegra_drm host1x drm_ttm_helper tegra_soctherm ttm iova zram zsmalloc xhci_tegra ci_hdrc_tegra phy_tegra_xusb ahci_tegra libahci_platform tegra124_e
[   58.256679] CPU: 2 PID: 689 Comm: polkitd Tainted: G    B             5.12.0-rc2-tegra+ #198
[   58.265097] Hardware name: NVIDIA Tegra SoC (Flattened Device Tree)
[   58.271348] [<c010ec40>] (unwind_backtrace) from [<c010a1ec>] (show_stack+0x10/0x14)
[   58.279077] [<c010a1ec>] (show_stack) from [<c0a86b20>] (dump_stack+0xc0/0xd4)
[   58.286284] [<c0a86b20>] (dump_stack) from [<c02341ec>] (bad_page+0xdc/0x10c)
[   58.293405] [<c02341ec>] (bad_page) from [<c02383d4>] (get_page_from_freelist+0xde8/0x116c)
[   58.301739] [<c02383d4>] (get_page_from_freelist) from [<c0238cd8>] (__alloc_pages_nodemask+0x17c/0x1014)
[   58.311288] [<c0238cd8>] (__alloc_pages_nodemask) from [<c021e478>] (__pte_alloc+0x24/0x178)
[   58.319709] [<c021e478>] (__pte_alloc) from [<c021fb40>] (copy_page_range+0x6e4/0xa18)
[   58.327609] [<c021fb40>] (copy_page_range) from [<c011f154>] (dup_mm+0x328/0x458)
[   58.335077] [<c011f154>] (dup_mm) from [<c011fee4>] (copy_process+0x980/0x16c4)
[   58.342371] [<c011fee4>] (copy_process) from [<c0120e9c>] (kernel_clone+0xa4/0x3e4)
[   58.350013] [<c0120e9c>] (kernel_clone) from [<c01214a0>] (sys_clone+0x74/0x90)
[   58.357308] [<c01214a0>] (sys_clone) from [<c01000c0>] (ret_fast_syscall+0x0/0x58)
[   58.364861] Exception stack(0xc56fffa8 to 0xc56ffff0)
[   58.369900] ffa0:                   b491e078 00000001 01200011 00000000 00000000 00000000
[   58.378057] ffc0: b491e078 00000001 b4face1c 00000078 bea4a000 b491e550 00000001 bea4a264
[   58.386214] ffe0: b491e010 bea49e38 b4f018ec b4f017fc
[   58.391250] BUG: Bad page state in process polkitd  pfn:ee9b3
[   58.396981] page:32413595 refcount:2 mapcount:129 mapping:473e54ab index:0x0 pfn:0xee9b3
[   58.405054] aops:0xc0b0ea14 ino:1749

Comment 3 Nicolas Chauvet (kwizart) 2021-03-10 14:30:17 UTC
Created attachment 1762323
dmesg with fedora kernel.

Comment 4 Nicolas Chauvet (kwizart) 2021-03-10 16:37:21 UTC
As far as this bug is concerned:
5.10.16-200.fc33.armv7hl is known good (doesn't exhibit the page fault).
5.11.0-rc6-next-20210201-tegra+ is known bad (already exhibits the issue).

Comment 5 Nicolas Chauvet (kwizart) 2021-03-10 16:55:22 UTC
5.11.0-rc4-next-20210119-tegra+ is known bad.

Comment 6 Nicolas Chauvet (kwizart) 2021-03-10 19:59:58 UTC
461619f5c3242aaee9ec3f0b7072719bd86ea207 is the first bad commit
drm/nouveau: switch to new allocator

(Will try to revert on top of 5.11.5)

git bisect start
# bad: [5c8fe583cce542aa0b84adc939ce85293de36e5e] Linux 5.11-rc1
git bisect bad 5c8fe583cce542aa0b84adc939ce85293de36e5e
# good: [2c85ebc57b3e1817b6ce1a6b703928e113a90442] Linux 5.10
git bisect good 2c85ebc57b3e1817b6ce1a6b703928e113a90442
# bad: [2911ed9f47b47cb5ab87d03314b3b9fe008e607f] Merge tag 'char-misc-5.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
git bisect bad 2911ed9f47b47cb5ab87d03314b3b9fe008e607f
# bad: [ac73e3dc8acd0a3be292755db30388c3580f5674] Merge branch 'akpm' (patches from Andrew)
git bisect bad ac73e3dc8acd0a3be292755db30388c3580f5674
# bad: [b10733527bfd864605c33ab2e9a886eec317ec39] Merge tag 'amd-drm-next-5.11-2020-12-09' of git://people.freedesktop.org/~agd5f/linux into drm-next
git bisect bad b10733527bfd864605c33ab2e9a886eec317ec39
# bad: [9713158cb2a918c3f6f5522eed23cdeb61f22e75] drm/amdgpu: Add and use seperate reg headers for dcn302
git bisect bad 9713158cb2a918c3f6f5522eed23cdeb61f22e75
# bad: [c0f98d2f8b076bf3e3183aa547395f919c943a14] Merge tag 'drm-misc-next-2020-11-05' of git://anongit.freedesktop.org/drm/drm-misc into drm-next
git bisect bad c0f98d2f8b076bf3e3183aa547395f919c943a14
# good: [6a6e5988a2657cd0c91f6f1a3e7d194599248b6d] drm/ttm: replace last move_notify with delete_mem_notify
git bisect good 6a6e5988a2657cd0c91f6f1a3e7d194599248b6d
# good: [f566fdcd6cc49a9d5b5d782f56e3e7cb243f01b8] drm/i915: Force VT'd workarounds when running as a guest OS
git bisect good f566fdcd6cc49a9d5b5d782f56e3e7cb243f01b8
# good: [e76ab2cf21c38331155ea613cdf18582f011c30f] drm/i915: Remove per-platform IIR HPD masking
git bisect good e76ab2cf21c38331155ea613cdf18582f011c30f
# bad: [268af50f38b1f2199a2e85e38073d7a25c20190c] drm/panfrost: Support cache-coherent integrations
git bisect bad 268af50f38b1f2199a2e85e38073d7a25c20190c
# good: [e000650375b65ff77c5ee852b5086f58c741179e] fbdev/atafb: Remove unused extern variables
git bisect good e000650375b65ff77c5ee852b5086f58c741179e
# bad: [461619f5c3242aaee9ec3f0b7072719bd86ea207] drm/nouveau: switch to new allocator
git bisect bad 461619f5c3242aaee9ec3f0b7072719bd86ea207
# good: [d099fc8f540add80f725014fdd4f7f49f3c58911] drm/ttm: new TT backend allocation pool v3
git bisect good d099fc8f540add80f725014fdd4f7f49f3c58911
# good: [e93b2da9799e5cb97760969f3e1f02a5bdac29fe] drm/amdgpu: switch to new allocator v2
git bisect good e93b2da9799e5cb97760969f3e1f02a5bdac29fe
# good: [0fe3cf3a53b5c1205ec7d321be1185b075dff205] drm/radeon: switch to new allocator v2
git bisect good 0fe3cf3a53b5c1205ec7d321be1185b075dff205
# first bad commit: [461619f5c3242aaee9ec3f0b7072719bd86ea207] drm/nouveau: switch to new allocator

Comment 7 Nicolas Chauvet (kwizart) 2021-08-17 14:10:02 UTC
With 5.14-rc5 as a base + tegra-next + tegra-drm-next + tegra-drm-fixes (scheduled for next) + PM patches (scheduled for 5.16, but optional), and using the libdrm scheduled for the new tegra uABI:

I no longer have any issue getting a graphical display using Wayland on the Workstation spin (jetson-tk1).

Comment 8 Nicolas Chauvet (kwizart) 2021-08-17 14:36:05 UTC
Actually, it doesn't seem that reliable on a second boot, so we might need to wait for 5.16 to see more improvements (especially around IOMMU/memory/dGPU support).

