Crash on IDPF (new Intel driver) #1797

Open
ajorg opened this issue Sep 17, 2024 · 5 comments
Labels: F41, kind/bug, status/pending-testing-release (Fixed upstream. Waiting on a testing release.)

Comments

@ajorg

ajorg commented Sep 17, 2024

Describe the bug

I was trying to test CoreOS stable on GCE C3 Metal, which uses the Intel IDPF driver. I saw this crash when it tried to bring up networking on 6.10.6-200.fc40.x86_64:

[   32.819527] BUG: kernel NULL pointer dereference, address: 0000000000000008
[   32.826725] #PF: supervisor write access in kernel mode
[   32.831983] #PF: error_code(0x0002) - not-present page
[   32.837155] PGD 20c25c7067 P4D 0 
[   32.840520] Oops: Oops: 0002 [#1] PREEMPT SMP NOPTI
[   32.846818] CPU: 119 PID: 3446 Comm: NetworkManager Not tainted 6.10.6-200.fc40.x86_64 #1
[   32.855029] Hardware name: Google Google Compute Engine/izumi, BIOS 0.20240430.6-3 06/05/2024
[   32.863581] RIP: 0010:idpf_initiate_soft_reset+0x168/0x3c0 [idpf]
[   32.869734] Code: 00 48 89 18 74 30 66 66 2e 0f 1f 84 00 00 00 00 00 90 4c 63 f5 49 81 fe 00 01 00 00 0f 83 e9 01 00 00 4a 8b 54 f0 10 83 c5 01 <48> 89 5a 08 0f b7 50 08 39 ea 7f dc 66 83 7b 20 01 75 0b 48 8b 80
[   32.888522] RSP: 0018:ff84dd88213e7618 EFLAGS: 00010202
[   32.893781] RAX: ff3c3dfd93590000 RBX: ff3c3dfd6bbf0000 RCX: 0000000000000000
[   32.900952] RDX: 0000000000000000 RSI: ff3c3dfd4cb5b180 RDI: ff3c3dfd6bbf1180
[   32.908119] RBP: 0000000000000001 R08: 0000000000000000 R09: ff3c3dfd5adc8780
[   32.915286] R10: ff84dd88213e7558 R11: 00000000000000c0 R12: 0000000000000002
[   32.922454] R13: 0000000000000001 R14: 0000000000000000 R15: ff3c3dfd4cb5a000
[   32.929621] FS:  00007fe3908b5580(0000) GS:ff3c3e1c3ff80000(0000) knlGS:0000000000000000
[   32.937742] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   32.943522] CR2: 0000000000000008 CR3: 000000208900c005 CR4: 0000000000f71ef0
[   32.950687] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   32.957855] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[   32.965023] PKRU: 55555554
[   32.967768] Call Trace:
[   32.970248]  <TASK>
[   32.972384]  ? __die_body.cold+0x19/0x27
[   32.976350]  ? page_fault_oops+0x15a/0x2f0
[   32.980488]  ? kmalloc_node_track_caller_noprof+0x21c/0x4b0
[   32.986106]  ? exc_page_fault+0x7e/0x180
[   32.990072]  ? asm_exc_page_fault+0x26/0x30
[   32.994304]  ? idpf_initiate_soft_reset+0x168/0x3c0 [idpf]
[   32.999839]  ? idpf_initiate_soft_reset+0xe8/0x3c0 [idpf]
[   33.005279]  idpf_change_mtu+0x37/0x60 [idpf]
[   33.009685]  dev_set_mtu_ext+0xed/0x200
[   33.013566]  do_setlink+0x288/0x1210
[   33.017182]  ? __rmqueue_pcplist+0xc0/0xf00
[   33.021406]  ? __nla_validate_parse+0x5f/0xd80
[   33.025891]  ? post_alloc_hook+0xe1/0x130
[   33.029939]  __rtnl_newlink+0x5ba/0xab0
[   33.033816]  rtnl_newlink+0x77/0xa0
[   33.037345]  rtnetlink_rcv_msg+0x161/0x450
[   33.041480]  ? allocate_slab+0x258/0x470
[   33.045443]  ? avc_has_perm_noaudit+0x6b/0xf0
[   33.049843]  ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[   33.054500]  netlink_rcv_skb+0x50/0x100
[   33.058378]  netlink_unicast+0x240/0x370
[   33.062341]  netlink_sendmsg+0x21b/0x470
[   33.066307]  ____sys_sendmsg+0x396/0x3d0
[   33.070278]  ___sys_sendmsg+0x9a/0xe0
[   33.073986]  __sys_sendmsg+0xcc/0x100
[   33.077692]  do_syscall_64+0x82/0x160
[   33.081395]  ? __count_memcg_events+0x75/0x130
[   33.085880]  ? count_memcg_events.constprop.0+0x1a/0x30
[   33.091144]  ? handle_mm_fault+0x1f0/0x300
[   33.095285]  ? do_user_addr_fault+0x36c/0x620
[   33.099681]  ? exc_page_fault+0x7e/0x180
[   33.103641]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   33.108728] RIP: 0033:0x7fe391287a4b
[   33.112371] Code: 48 89 e5 48 83 ec 20 89 55 ec 48 89 75 f0 89 7d f8 e8 29 5c f7 ff 8b 55 ec 48 8b 75 f0 41 89 c0 8b 7d f8 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 2d 44 89 c7 48 89 45 f8 e8 81 5c f7 ff 48 8b
[   33.131156] RSP: 002b:00007ffc7e0dffc0 EFLAGS: 00000293 ORIG_RAX: 000000000000002e
[   33.138760] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fe391287a4b
[   33.145923] RDX: 0000000000000000 RSI: 00007ffc7e0e0000 RDI: 000000000000000d
[   33.153088] RBP: 00007ffc7e0dffe0 R08: 0000000000000000 R09: 0000000000000000
[   33.160253] R10: 0000000000000000 R11: 0000000000000293 R12: 000055bb82ae8390
[   33.167426] R13: 0000000000000012 R14: 00007ffc7e0e019c R15: 0000000000000000
[   33.174600]  </TASK>
[   33.176819] Modules linked in: rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common intel_ifs i10nm_edac skx_edac_common nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel sunrpc kvm spi_nor iTCO_wdt dax_hmem intel_pmc_bxt mtd pmt_telemetry cxl_acpi snd_pcsp iTCO_vendor_support rapl intel_sdsi pmt_class intel_cstate snd_pcm cxl_core ipmi_si(+) snd_timer intel_th_gth snd mei_me ipmi_devintf intel_uncore i2c_i801 ipmi_msghandler intel_th_pci einj isst_if_mbox_pci isst_if_mmio spi_intel_pci soundcore isst_if_common mei intel_vsec intel_th spi_intel i2c_ismt i2c_smbus nfnetlink xfs iaa_crypto crct10dif_pclmul crc32_pclmul nvme_tcp crc32c_intel polyval_clmulni nvme polyval_generic nvme_keyring nvme_fabrics ghash_clmulni_intel qat_4xxx sha512_ssse3 nvme_core intel_qat idpf idxd sha256_ssse3 sha1_ssse3 wmi nvme_auth crc8 idxd_bus dimlib pinctrl_emmitsburg be2iscsi bnx2i cnic uio cxgb4i cxgb4 tls cxgb3i cxgb3 mdio libcxgbi libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp
[   33.176996]  libiscsi_tcp libiscsi scsi_transport_iscsi scsi_dh_rdac scsi_dh_emc scsi_dh_alua ip6_tables ip_tables fuse dm_multipath
[   33.280231] CR2: 0000000000000008
[   33.283575] ---[ end trace 0000000000000000 ]---

[   33.303110] RIP: 0010:idpf_initiate_soft_reset+0x168/0x3c0 [idpf]
[   33.309261] Code: 00 48 89 18 74 30 66 66 2e 0f 1f 84 00 00 00 00 00 90 4c 63 f5 49 81 fe 00 01 00 00 0f 83 e9 01 00 00 4a 8b 54 f0 10 83 c5 01 <48> 89 5a 08 0f b7 50 08 39 ea 7f dc 66 83 7b 20 01 75 0b 48 8b 80
[   33.328043] RSP: 0018:ff84dd88213e7618 EFLAGS: 00010202
[   33.333299] RAX: ff3c3dfd93590000 RBX: ff3c3dfd6bbf0000 RCX: 0000000000000000
[   33.340454] RDX: 0000000000000000 RSI: ff3c3dfd4cb5b180 RDI: ff3c3dfd6bbf1180
[   33.347614] RBP: 0000000000000001 R08: 0000000000000000 R09: ff3c3dfd5adc8780
[   33.354773] R10: ff84dd88213e7558 R11: 00000000000000c0 R12: 0000000000000002
[   33.361934] R13: 0000000000000001 R14: 0000000000000000 R15: ff3c3dfd4cb5a000
[   33.369093] FS:  00007fe3908b5580(0000) GS:ff3c3e1c3ff80000(0000) knlGS:0000000000000000
[   33.377203] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   33.382978] CR2: 0000000000000008 CR3: 000000208900c005 CR4: 0000000000f71ef0
[   33.390134] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   33.397291] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[   33.404454] PKRU: 55555554
[   33.407190] note: NetworkManager[3446] exited with irqs disabled
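
For anyone reading the trace, the error code and the faulting instruction can be decoded by hand. The sketch below is only an illustration: it assumes the usual x86 page-fault error-code bits (bit 0 = page present, bit 1 = write, bit 2 = user mode) and handles just the one instruction form marked in the Code: line (REX.W opcode 0x89 with an 8-bit displacement). The function names are made up for this example and it is not a general disassembler.

```python
# Values copied from the oops above.
PF_ERROR_CODE = 0x0002
FAULTING_BYTES = bytes.fromhex("48895a08")  # the bytes at <48> 89 5a 08
RDX_AT_FAULT = 0x0000000000000000
CR2 = 0x0000000000000008

REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def decode_pf_error(code):
    """Split an x86 page-fault error code into its three low bits."""
    return {
        "page_present": bool(code & 0x1),
        "write": bool(code & 0x2),
        "user_mode": bool(code & 0x4),
    }


def decode_mov_store(insn):
    """Decode 'mov [base + disp8], src' for the narrow REX.W 89 /r case only."""
    rex, opcode, modrm, disp = insn[0], insn[1], insn[2], insn[3]
    if rex != 0x48 or opcode != 0x89:
        raise ValueError("only REX.W + 0x89 is handled in this sketch")
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 0x7, modrm & 0x7
    if mod != 0b01 or rm == 0b100:
        raise ValueError("only the disp8 form without a SIB byte is handled")
    signed_disp = disp - 256 if disp > 127 else disp
    return REG64[reg], REG64[rm], signed_disp


if __name__ == "__main__":
    print(decode_pf_error(PF_ERROR_CODE))
    src, base, disp = decode_mov_store(FAULTING_BYTES)
    print(f"mov [{base} + {disp:#x}], {src}")
    print(f"computed fault address: {RDX_AT_FAULT + disp:#x}, CR2: {CR2:#x}")
```

Read this way, the marked bytes are a store of rbx to [rdx + 0x8]. With RDX = 0 in the register dump, that is a write to address 0x8, which matches CR2 and the "supervisor write access" / "not-present page" lines.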

Reproduction steps

  1. Create a copy of the Fedora CoreOS image, adding the IDPF guest OS feature flag (see gcp: support c3 metal instance types #1794)
  2. Create an instance with machine type c3-highcpu-192-metal
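
A rough sketch of the two steps above, written as a small Python helper that only builds the gcloud command lines and does not run them. All names passed in (source image, source project, image name, instance name, zone) are placeholders, not values from this report, and a bare metal instance may need more flags than shown here.

```python
import shlex


def build_repro_commands(source_image, source_project, image_name, instance_name, zone):
    """Return the two gcloud invocations as argument lists (nothing is executed)."""
    create_image = [
        "gcloud", "compute", "images", "create", image_name,
        f"--source-image={source_image}",
        f"--source-image-project={source_project}",
        "--guest-os-features=IDPF",  # step 1: copy the image with the IDPF flag
    ]
    create_instance = [
        "gcloud", "compute", "instances", "create", instance_name,
        f"--zone={zone}",
        "--machine-type=c3-highcpu-192-metal",  # step 2: machine type from the report
        f"--image={image_name}",
    ]
    return [create_image, create_instance]


if __name__ == "__main__":
    # Placeholder values; substitute real ones before use.
    commands = build_repro_commands(
        source_image="SOURCE_IMAGE",
        source_project="SOURCE_PROJECT",
        image_name="fcos-idpf-test",
        instance_name="fcos-c3-metal-test",
        zone="ZONE",
    )
    for argv in commands:
        print(shlex.join(argv))
```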

Expected behavior

Networking should work correctly

Actual behavior

BUG: kernel NULL pointer dereference

System details

GCE C3 Bare Metal (Intel IDPF networking)

Butane or Ignition config

(none - can't reach metadata)

Additional information

I know it worked correctly at some point in the past. I think it was back in April that I last tried.

@ajorg ajorg added the kind/bug label Sep 17, 2024
@dustymabe
Member

dustymabe commented Sep 17, 2024

Can you try the latest testing-devel image? It is available from https://builds.coreos.fedoraproject.org/browser?stream=testing-devel&arch=x86_64

Those images should have the IDPF guest OS feature already so no need to copy the image.

If that doesn't work, test with the next stream release that came out today: https://fedoraproject.org/coreos/download?stream=next&arch=x86_64#download_section

@ajorg
Author

ajorg commented Sep 18, 2024

testing didn't appear to have the flag, but next did, and the driver did not crash on 6.11.0-0.rc7.56.fc41.x86_64, at least not on the first try.

@dustymabe
Member

> testing didn't appear to have the flag

Correct. I mentioned testing-devel, not testing.

> the driver did not crash on 6.11.0-0.rc7.56.fc41.x86_64 at least not on the first try.

Good. I'm guessing it's probably not worth chasing down the fix and trying to get it backported :(

@ajorg
Author

ajorg commented Sep 18, 2024

I tried booting the next image a few more times and didn't run into the same bug, so I suppose it may be fixed there.

@dustymabe dustymabe added the status/pending-next-release (Fixed upstream. Waiting on a next release.) and F41 labels Sep 20, 2024
@dustymabe
Member

The fix for this went into next stream release 41.20240916.1.0. Please try out the new release and report issues.

@dustymabe dustymabe added the status/pending-testing-release (Fixed upstream. Waiting on a testing release.) label and removed the status/pending-next-release (Fixed upstream. Waiting on a next release.) label Sep 22, 2024