Note: This is a public test instance of Red Hat Bugzilla. The data contained within is a snapshot of the live data so any changes you make will not be reflected in the production Bugzilla. Email is disabled so feel free to test any aspect of the site that you want. File any problems you find or give feedback at bugzilla.redhat.com.
Bug 1361414 - kernel BUG at net/core/skbuff.c:104 during OpenStack tempest run
Summary: kernel BUG at net/core/skbuff.c:104 during OpenStack tempest run
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 24
Hardware: x86_64
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2016-07-29 04:21 UTC by Ian Wienand
Modified: 2016-08-23 12:48 UTC (History)
7 users (show)

Fixed In Version: kernel-4.6.7-300.fc24
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-08-23 12:48:02 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
Full log from host (99.34 KB, text/plain)
2016-07-29 04:22 UTC, Ian Wienand
no flags Details
Script to reproduce the tempest failure (3.41 KB, application/x-shellscript)
2016-07-29 05:53 UTC, Ian Wienand
no flags Details

Description Ian Wienand 2016-07-29 04:21:24 UTC
I am seeing a kernel oops during tempest runs in net/core/skbuff.c, pasted below

This is Fedora 24 (4.6.4-301.fc24.x86_64) running as a guest in a KVM environment on the OSIC cloud.  I realise this isn't much help at this point ... tempest is doing a whole bunch of adding interfaces, routers, etc. as neutron goes through it's paces.  I'll try and work on getting something more repeatable...

Just as another data point, I believe the issue happens on RAX hosts as well (Xen guests) -- however due to their different infrastructure I can not seem to get the oops from the serial log or have it logged via netconsole.  But the host dies at a similar point of testing which is very suspicious.

---
[ 7162.053779] skbuff: skb_over_panic: text:ffffffff817a3f58 len:40 put:40 head:ffff88004d9b6000 data:ffff88004d9c6010 tail:0x10038 end:0xec0 dev:qg-0d7c6880-b1
[ 7162.053921] ------------[ cut here ]------------
[ 7162.053922] kernel BUG at net/core/skbuff.c:104!
[ 7162.053924] invalid opcode: 0000 [#1] SMP 
[ 7162.053944] Modules linked in: xt_CT xt_mac xt_physdev br_netfilter xt_set ip_set_hash_net ip_set nf_conntrack_netlink nfnetlink veth xt_nat xt_REDIRECT nf_nat_redirect xt_mark xt_connmark ip6table_raw ip6table_mangle xt_comment iptable_raw xfs loop xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 nls_utf8 isofs tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_nat nfsd auth_rpcgss nfs_acl lockd grace sunrpc openvswitch nf_conntrack_ipv6 nf_nat_ipv6 nf_nat_ipv4 nf_defrag_ipv6 nf_nat libcrc32c nf_log_ipv4 nf_log_common xt_LOG xt_limit nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack kvm_intel kvm ttm ppdev drm_kms_helper irqbypass drm crct10dif_pclmul crc32_pclmul acpi_cpufreq ghash_clmulni_intel virtio_net parport_pc i2c_piix4 joydev pvpanic parport tpm_tis tpm virtio_balloon virtio_blk crc32c_intel virtio_pci virtio_ring ata_generic virtio pata_acpi serio_raw
[ 7162.053956] CPU: 3 PID: 22738 Comm: ovs-vswitchd Not tainted 4.6.4-301.fc24.x86_64 #1
[ 7162.053956] Hardware name: OpenStack Foundation OpenStack Nova, BIOS Bochs 01/01/2011
[ 7162.053957] task: ffff880189a89e80 ti: ffff8801ac350000 task.ti: ffff8801ac350000
[ 7162.053978] RIP: 0010:[<ffffffff816ac334>]  [<ffffffff816ac334>] skb_panic+0x64/0x70
[ 7162.053988] RSP: 0018:ffff88023fd83c88  EFLAGS: 00010282
[ 7162.053988] RAX: 0000000000000091 RBX: ffff88002d5ca400 RCX: 0000000000000002
[ 7162.053989] RDX: 0000000000000000 RSI: 0000000000000296 RDI: 0000000000000296
[ 7162.053990] RBP: ffff88023fd83ca8 R08: 000000000f99a6c7 R09: 0000000000000091
[ 7162.053990] R10: ffff880237002f00 R11: 000000000000b1e4 R12: ffff88002d4b5e80
[ 7162.053991] R13: ffff88023fd83d20 R14: ffffffff81d07f00 R15: 0000000000000000
[ 7162.053992] FS:  00007f79b25e6a40(0000) GS:ffff88023fd80000(0000) knlGS:0000000000000000
[ 7162.053992] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7162.053993] CR2: 000056163abd8008 CR3: 0000000189871000 CR4: 00000000001406e0
[ 7162.053998] Stack:
[ 7162.053999]  ffff88004d9c6010 0000000000010038 0000000000000ec0 ffff8801477b3000
[ 7162.054003]  ffff88023fd83cb8 ffffffff816ac84d ffff88023fd83cf0 ffffffff817a3f58
[ 7162.054004]  ffff88002d5ca400 ffff88023fd83d20 ffff88002d4b5e80 ffff8801477b3000
[ 7162.054004] Call Trace:
[ 7162.054011]  <IRQ> 
[ 7162.054011]  [<ffffffff816ac84d>] skb_put+0x4d/0x50
[ 7162.054019]  [<ffffffff817a3f58>] ip6_mc_hdr.constprop.39+0x58/0xd0
[ 7162.054020]  [<ffffffff817a4101>] mld_newpack+0x131/0x1c0
[ 7162.054032]  [<ffffffff810651ce>] ? kvm_clock_read+0x1e/0x20
[ 7162.054034]  [<ffffffff817a421d>] add_grhead.isra.27+0x8d/0xa0
[ 7162.054035]  [<ffffffff817a4f10>] add_grec+0x390/0x4b0
[ 7162.054036]  [<ffffffff817a5df2>] mld_ifc_timer_expire+0x1f2/0x290
[ 7162.054037]  [<ffffffff817a5c00>] ? igmp6_timer_handler+0x70/0x70
[ 7162.054044]  [<ffffffff81112eb5>] call_timer_fn+0x35/0x120
[ 7162.054045]  [<ffffffff817a5c00>] ? igmp6_timer_handler+0x70/0x70
[ 7162.054046]  [<ffffffff811134fe>] run_timer_softirq+0x21e/0x2d0
[ 7162.054051]  [<ffffffff817e34e2>] __do_softirq+0x112/0x2e7
[ 7162.054056]  [<ffffffff810ad145>] irq_exit+0x105/0x110
[ 7162.054057]  [<ffffffff817e32e2>] smp_apic_timer_interrupt+0x42/0x50
[ 7162.054058]  [<ffffffff817e141c>] apic_timer_interrupt+0x8c/0xa0
[ 7162.054063]  <EOI> 
[ 7162.054063]  [<ffffffff810ffb58>] ? console_unlock+0x348/0x590
[ 7162.054065]  [<ffffffff81100037>] vprintk_emit+0x297/0x490
[ 7162.054066]  [<ffffffff811003b9>] vprintk_default+0x29/0x40
[ 7162.054073]  [<ffffffff811b68cd>] printk+0x57/0x73
[ 7162.054086]  [<ffffffff81408f41>] ? __nla_reserve+0x41/0x50
[ 7162.054091]  [<ffffffff816c53f6>] __dev_set_promiscuity+0xf6/0x1d0
[ 7162.054092]  [<ffffffff81408fe6>] ? nla_put+0x36/0x40
[ 7162.054098]  [<ffffffffc0282fa1>] ? ovs_vport_get_upcall_portids+0x71/0x80 [openvswitch]
[ 7162.054100]  [<ffffffff816c5794>] dev_set_promiscuity+0x24/0x50
[ 7162.054102]  [<ffffffffc02832d1>] internal_dev_destroy+0x31/0x60 [openvswitch]
[ 7162.054104]  [<ffffffffc0282c66>] ovs_vport_del+0x46/0x50 [openvswitch]
[ 7162.054106]  [<ffffffffc027798b>] ovs_vport_cmd_del+0x12b/0x190 [openvswitch]
[ 7162.054112]  [<ffffffff816fb04b>] genl_family_rcv_msg+0x1db/0x3c0
[ 7162.054118]  [<ffffffff812234c1>] ? __slab_free+0xc1/0x2a0
[ 7162.054119]  [<ffffffff816ad04b>] ? __alloc_skb+0x5b/0x1f0
[ 7162.054120]  [<ffffffff816fb230>] ? genl_family_rcv_msg+0x3c0/0x3c0
[ 7162.054122]  [<ffffffff816fb2b7>] genl_rcv_msg+0x87/0xc0
[ 7162.054123]  [<ffffffff816f873b>] ? netlink_lookup+0xdb/0x140
[ 7162.054124]  [<ffffffff816fa7b4>] netlink_rcv_skb+0xa4/0xc0
[ 7162.054125]  [<ffffffff816fae58>] genl_rcv+0x28/0x40
[ 7162.054126]  [<ffffffff816fa1a1>] netlink_unicast+0x191/0x220
[ 7162.054127]  [<ffffffff816fa527>] netlink_sendmsg+0x2f7/0x3b0
[ 7162.054130]  [<ffffffff816a50b8>] sock_sendmsg+0x38/0x50
[ 7162.054131]  [<ffffffff816a5b6e>] ___sys_sendmsg+0x28e/0x2a0
[ 7162.054141]  [<ffffffff816a4a3f>] ? sock_destroy_inode+0x2f/0x40
[ 7162.054145]  [<ffffffff81264e6b>] ? destroy_inode+0x3b/0x60
[ 7162.054147]  [<ffffffff81264fac>] ? evict+0x11c/0x180
[ 7162.054148]  [<ffffffff8126045e>] ? dentry_free+0x4e/0x90
[ 7162.054148]  [<ffffffff812611dd>] ? __dentry_kill+0x13d/0x1b0
[ 7162.054149]  [<ffffffff81261416>] ? dput+0x1c6/0x200
[ 7162.054151]  [<ffffffff816a64c4>] __sys_sendmsg+0x54/0x90
[ 7162.054152]  [<ffffffff816a6512>] SyS_sendmsg+0x12/0x20
[ 7162.054159]  [<ffffffff81003d92>] do_syscall_64+0x62/0x110
[ 7162.054160]  [<ffffffff817e0821>] entry_SYSCALL64_slow_path+0x25/0x25
[ 7162.054170] Code: cc 00 00 00 48 89 44 24 10 8b 87 c8 00 00 00 48 89 44 24 08 48 8b 87 d8 00 00 00 48 c7 c7 20 58 ac 81 48 89 04 24 e8 42 a5 b0 ff <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 e8 c7 cb d3 ff 
[ 7162.054171] RIP  [<ffffffff816ac334>] skb_panic+0x64/0x70
[ 7162.054172]  RSP <ffff88023fd83c88>
[ 7162.054443] ---[ end trace 1314dd437df3d236 ]---
[ 7162.054444] Kernel panic - not syncing: Fatal exception in interrupt
[ 7162.054472] Kernel Offset: disabled
[ 7162.055439] ---[ end Kernel panic - not syncing: Fatal exception in interrupt
---

Comment 1 Ian Wienand 2016-07-29 04:22:32 UTC
Created attachment 1185355 [details]
Full log from host

Full serial log from host

Comment 2 Ian Wienand 2016-07-29 05:53:45 UTC
Created attachment 1185383 [details]
Script to reproduce the tempest failure

Script to reproduce tempest failure

Comment 3 Ian Wienand 2016-08-01 05:29:42 UTC
This replicates with the current rawhide kernel 4.8.0-0.rc0.git3.1.fc26.x86_64

---
[ 2256.904429] ------------[ cut here ]------------
[ 2256.904430] kernel BUG at net/core/skbuff.c:105!
[ 2256.904431] invalid opcode: 0000 [#1] SMP
[ 2256.904455] Modules linked in: nf_conntrack_netlink xt_CT xt_mac xt_physdev br_netfilter xt_set ip_set_hash_net ip_set nfnetlink veth xt_nat xt_REDIRECT nf_nat_redirect xt_mark xt_connmark ip6table_raw ip6table_mangle xt_comment iptable_raw xfs loop xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 nls_utf8 isofs tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_nat nfsd auth_rpcgss nfs_acl lockd grace sunrpc openvswitch nf_conntrack_ipv6 nf_nat_ipv6 nf_nat_ipv4 nf_defrag_ipv6 nf_nat libcrc32c nf_log_ipv4 nf_log_common xt_LOG xt_limit nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ppdev joydev ttm drm_kms_helper virtio_balloon parport_pc acpi_cpufreq pvpanic drm i2c_piix4
[ 2256.904461]  tpm_tis parport tpm virtio_net virtio_blk crc32c_intel serio_raw virtio_pci virtio_ring ata_generic pata_acpi virtio
[ 2256.904464] CPU: 3 PID: 6510 Comm: httpd Not tainted 4.8.0-0.rc0.git3.1.fc26.x86_64 #1
[ 2256.904464] Hardware name: OpenStack Foundation OpenStack Nova, BIOS Bochs 01/01/2011
[ 2256.904465] task: ffffa01bea8eb240 task.stack: ffffa01bea8fc000
[ 2256.904482] RIP: 0010:[<ffffffff8377ed64>]  [<ffffffff8377ed64>] skb_panic+0x64/0x70
[ 2256.904483] RSP: 0000:ffffa01c37403c18  EFLAGS: 00010286
[ 2256.904484] RAX: 0000000000000091 RBX: ffffa01a899d4300 RCX: ffffa01bea8eb240
[ 2256.904484] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000286
[ 2256.904485] RBP: ffffa01c37403c38 R08: 0000000000000001 R09: 0000000000000001
[ 2256.904485] R10: ffffa01a0a0be000 R11: 000000000001cbe4 R12: ffffa01a45845480
[ 2256.904486] R13: ffffa01c37403cb0 R14: ffffffff83f75ca0 R15: 0000000000000000
[ 2256.904487] FS:  00007f265a42c700(0000) GS:ffffa01c37400000(0000) knlGS:0000000000000000
[ 2256.904488] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2256.904489] CR2: 00007f264dd13468 CR3: 00000001ea3ee000 CR4: 00000000001406e0
[ 2256.904493] Stack:
[ 2256.904496]  ffffa01a0a0ce010 0000000000010038 0000000000000ec0 ffffa01c354a4000
[ 2256.904497]  ffffa01c37403c48 ffffffff8377efdd ffffa01c37403c80 ffffffff838a4348
[ 2256.904498]  ffffa01a899d4300 ffffa01c37403cb0 ffffa01a45845480 ffffa01c354a4000
[ 2256.904498] Call Trace:
[ 2256.904504]  <IRQ> 
[ 2256.904504]  [<ffffffff8377efdd>] skb_put+0x4d/0x50
[ 2256.904513]  [<ffffffff838a4348>] ip6_mc_hdr.constprop.42+0x58/0xd0
[ 2256.904514]  [<ffffffff838a44f1>] mld_newpack+0x131/0x1c0
[ 2256.904516]  [<ffffffff838a460d>] add_grhead.isra.30+0x8d/0xa0
[ 2256.904518]  [<ffffffff838a5ff0>] add_grec+0x390/0x4b0
[ 2256.904520]  [<ffffffff838a6533>] mld_ifc_timer_expire+0x203/0x2a0
[ 2256.904521]  [<ffffffff838a6330>] ? mld_dad_timer_expire+0x60/0x60
[ 2256.904533]  [<ffffffff831390df>] call_timer_fn+0xaf/0x340
[ 2256.904534]  [<ffffffff83139035>] ? call_timer_fn+0x5/0x340
[ 2256.904536]  [<ffffffff838a6330>] ? mld_dad_timer_expire+0x60/0x60
[ 2256.904537]  [<ffffffff8313a00e>] run_timer_softirq+0x26e/0x660
[ 2256.904544]  [<ffffffff83037e89>] ? sched_clock+0x9/0x10
[ 2256.904549]  [<ffffffff830ea307>] ? sched_clock_local+0x17/0x80
[ 2256.904555]  [<ffffffff838f5846>] __do_softirq+0xd6/0x4c5
[ 2256.904559]  [<ffffffff830b5fd1>] irq_exit+0xf1/0x100
[ 2256.904560]  [<ffffffff838f5502>] smp_apic_timer_interrupt+0x42/0x50
[ 2256.904573]  [<ffffffff838f3566>] apic_timer_interrupt+0x96/0xa0
[ 2256.904575]  <EOI> 
[ 2256.904587] Code: cc 00 00 00 48 89 44 24 10 8b 87 c8 00 00 00 48 89 44 24 08 48 8b 87 d8 00 00 00 48 c7 c7 88 17 d1 83 48 89 04 24 e8 1a 5d a7 ff <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 e8 b7 42 cf ff 
[ 2256.904589] RIP  [<ffffffff8377ed64>] skb_panic+0x64/0x70
[ 2256.904589]  RSP <ffffa01c37403c18>
---

Comment 4 Ian Wienand 2016-08-02 04:44:28 UTC
Here's what I think is going on

[ 2061.740016] skbuff: skb_over_panic: text:ffffffff817a3f58 len:40 put:40 head:ffff880022d53000 data:ffff880022d63010 tail:0x10038 end:0xec0 dev:tap07c424ec-90

So tail is greater than end, by 

  0x10038 - 0xec0 = 0xf178

That is a rather suspicious number.  This seems most likely to be set by skb_reserve(skb, hlen), where

  hlen = LL_RESERVED_SPACE(dev)

Where LL_RESERVED_SPACE comes from

  #define LL_RESERVED_SPACE(dev) \
  257         ((((dev)->hard_header_len+(dev)->needed_headroom)&~(HH_DATA_MOD - 1)) + HH_DATA_MOD)

I pulled the skb reference out of the registers and checked out
dev->needed_headroom

 crash> struct sk_buff.dev ffff880152d43600
   dev = 0xffff880138885000
 crash> struct net_device.needed_headroom 0xffff880138885000
   needed_headroom = 65535

So there's our 0xffff

Now I notice in 871b642adebe300be2e50aa5f65a418510f636ec [1] we
introduced ndo_set_rx_headroom to net_device with the comment

 + * void (*ndo_set_rx_headroom)(struct net_device *dev, int needed_headroom);
 + *     This function is used to specify the headroom that the skb must
 + *     consider when allocation skb during packet reception. Setting
 + *     appropriate rx headroom value allows avoiding skb head copy on
 + *     forward. Setting a negative value reset the rx headroom to the
 + *     default value.

So I suspect we're seeing a negative value here being propagated
into net_device->needed_headroom (incorectly?).

This is a tap device 

 crash> struct net_device.name 0xffff880138885000
 name = "tap07c424ec-90\000"

But most certainly attached to an ovs bridge (not exactly sure how to go from a net_device to see what it's attached to)

The only place I can see this happening is in net/openvswitch/vport-internal_dev.c (via 3a927bc7cf9d0fbe8f4a8189dd5f8440228f64e7 [2]) where
ndo_set_rx_headroom has been implemented with

 +void internal_set_rx_headroom(struct net_device *dev, int new_hr)
 +{
 +       dev->needed_headroom = new_hr;
 +}
 +

So I'm wondering, should that be

 new_hr < 0 ? 0 : new_hdr

So that a negative value (i.e. the default mentioned in ndo_set_rx_headroom) doesn't get propagated?

Or, is this a miscalculation in mld_newpack?

[1] http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=871b642adebe300be2e50aa5f65a418510f636ec
[2] http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=3a927bc7cf9d0fbe8f4a8189dd5f8440228f64e7

Comment 5 Ian Wienand 2016-08-02 23:22:15 UTC
Thanks to Ben for pointing out to me I should have followed the device ops which leads back to it being an ovs port.  this makes even more sense

---
struct net_device 0xffff880138885000
  name = "tap07c424ec-90\000",
  [...]
  netdev_ops = 0xffffffffc0251be0,
 
struct net_device_ops 0xffffffffc024e0e0
  ndo_open = 0xffffffffc024e0e0 <internal_dev_open>,
  ndo_set_rx_headroom = 0xffffffffc024e140 <internal_set_rx_headroom>
 
kmem 0xffffffffc024e0e0
ffffffffc024e0e0 (t) internal_dev_open [openvswitch]
---

Comment 8 Laura Abbott 2016-08-12 22:05:09 UTC
I dropped the patch in rawhide, should show up in a few days. I'm going to leave this bug open until verified that it is fixed.

Comment 9 Fedora Update System 2016-08-18 15:43:38 UTC
kernel-4.6.7-300.fc24 has been submitted as an update to Fedora 24. https://bodhi.fedoraproject.org/updates/FEDORA-2016-5e24d8c350

Comment 10 Fedora Update System 2016-08-19 00:57:17 UTC
kernel-4.6.7-300.fc24 has been pushed to the Fedora 24 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-5e24d8c350

Comment 11 Fedora Update System 2016-08-23 12:47:34 UTC
kernel-4.6.7-300.fc24 has been pushed to the Fedora 24 stable repository. If problems still persist, please make note of it in this bug report.


Note You need to log in before you can comment on or make changes to this bug.