
page allocation stalls #9

Open · liyi-ibm opened this issue Dec 11, 2018 · 4 comments

@liyi-ibm (Owner)

When free memory is low, applications that allocate memory have to wait until some of it is reclaimed. If the delay exceeds 10 seconds, the kernel prints a page allocation stall message.
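The stall lines quoted later in this thread follow a fixed format, so they can be mined from dmesg to see which tasks stall and for how long. This is just an illustrative userspace sketch, not kernel code; the regex is an assumption based on the 4.14 log format shown in this issue:

```python
import re

# Matches 4.14-era stall warnings, e.g.:
# "java: page allocation stalls for 16180ms, order:0, mode:0x14200ca(GFP_HIGHUSER_MOVABLE)"
STALL_RE = re.compile(
    r"(?P<comm>\S+): page allocation stalls for (?P<ms>\d+)ms, "
    r"order:(?P<order>\d+), mode:(?P<mode>0x[0-9a-f]+)"
)

def parse_stall(line):
    """Return (comm, stall_ms, order, gfp_mode), or None for non-stall lines."""
    m = STALL_RE.search(line)
    if m is None:
        return None
    return m["comm"], int(m["ms"]), int(m["order"]), m["mode"]

line = ("kernel: java: page allocation stalls for 16180ms, "
        "order:0, mode:0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null)")
print(parse_stall(line))  # → ('java', 16180, 0, '0x14200ca')
```

Feeding `dmesg` output through this makes it easy to histogram stall durations before and after applying the backported patch.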

An upstream patch has been identified that should help reduce overhead and latencies. This patch is currently being backported and tested.
commit a983b5e
Author: Johannes Weiner [email protected]
Date: Wed Jan 31 16:16:45 2018 -0800

mm: memcontrol: fix excessive complexity in memory.stat reporting

@liyi-ibm (Owner)

Committed the above patch (and its dependency) as: aa243dd, 23da735, 848ed2a

@liyi-ibm (Owner)

Note: during page allocation stalls we may still see plenty of available memory, but that memory cannot be reclaimed.

In another case, an OOM is followed by a stall; there the stall happens because the system really is out of memory:

[Wed Dec  5 22:03:18 2018] collect_kprobe.: page allocation stalls for 14860ms, order:0, mode:0x15080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO), nodemask=(null)
[Wed Dec  5 22:03:18 2018] collect_kprobe. cpuset=/ mems_allowed=0,8
[Wed Dec  5 22:03:18 2018] CPU: 44 PID: 6412 Comm: collect_kprobe. Not tainted 4.14.49-memctrl #1
[Wed Dec  5 22:03:18 2018] Call Trace:
[Wed Dec  5 22:03:18 2018] [c000001fdd04b950] [c000000000ad8b14] dump_stack+0xe8/0x154 (unreliable)
[Wed Dec  5 22:03:18 2018] [c000001fdd04b990] [c00000000028d8a8] warn_alloc+0x128/0x1c0
[Wed Dec  5 22:03:18 2018] [c000001fdd04ba40] [c00000000028e3b4] __alloc_pages_nodemask+0x9e4/0x1080
[Wed Dec  5 22:03:18 2018] [c000001fdd04bc30] [c000000000317140] alloc_pages_current+0xa0/0x130
[Wed Dec  5 22:03:18 2018] [c000001fdd04bc70] [c000000000287cc8] __get_free_pages+0x28/0x90
[Wed Dec  5 22:03:18 2018] [c000001fdd04bc90] [c0000000000f8af0] mm_init+0x150/0x2f0
[Wed Dec  5 22:03:18 2018] [c000001fdd04bcd0] [c0000000000fa738] copy_process.isra.31.part.32+0x9a8/0x19f0
[Wed Dec  5 22:03:18 2018] [c000001fdd04bdc0] [c0000000000fb97c] _do_fork+0xdc/0x4b0
[Wed Dec  5 22:03:18 2018] [c000001fdd04be30] [c00000000000c304] ppc_clone+0x8/0xc
[Wed Dec  5 22:03:18 2018] Mem-Info:
[Wed Dec  5 22:03:18 2018] active_anon:1930707 inactive_anon:1789300 isolated_anon:0
 active_file:144 inactive_file:299 isolated_file:0
 unevictable:0 dirty:0 writeback:157 unstable:0
 slab_reclaimable:47303 slab_unreclaimable:201360
 mapped:639 shmem:526 pagetables:9075 bounce:0
 free:8106 free_pcp:27 free_cma:4524
[Wed Dec  5 22:03:18 2018] Node 0 active_anon:64086656kB inactive_anon:60984640kB active_file:6080kB inactive_file:9600kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:16832kB dirty:0kB writeback:5952kB shmem:33216kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 14336kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[Wed Dec  5 22:03:18 2018] Node 8 active_anon:59478592kB inactive_anon:53530560kB active_file:3136kB inactive_file:9536kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:24064kB dirty:0kB writeback:4096kB shmem:448kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[Wed Dec  5 22:03:18 2018] Node 0 DMA free:165056kB min:179712kB low:312896kB high:446080kB active_anon:64086656kB inactive_anon:60984640kB active_file:7552kB inactive_file:4928kB unevictable:0kB writepending:0kB present:134217728kB managed:133227968kB mlocked:0kB kernel_stack:43136kB pagetables:285184kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[Wed Dec  5 22:03:18 2018] lowmem_reserve[]: 0 0 0 0
[Wed Dec  5 22:03:18 2018] Node 8 DMA free:353728kB min:180672kB low:314560kB high:448448kB active_anon:59478592kB inactive_anon:53530560kB active_file:15552kB inactive_file:17280kB unevictable:0kB writepending:0kB present:134217728kB managed:133945792kB mlocked:0kB kernel_stack:107600kB pagetables:295616kB bounce:0kB free_pcp:1728kB local_pcp:0kB free_cma:289536kB
[Wed Dec  5 22:03:18 2018] lowmem_reserve[]: 0 0 0 0
[Wed Dec  5 22:03:18 2018] Node 0 DMA: 2583*64kB (UME) 6*128kB (ME) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB 0*8192kB 0*16384kB = 166080kB
[Wed Dec  5 22:03:18 2018] Node 8 DMA: 2980*64kB (UMEC) 352*128kB (MEC) 261*256kB (C) 66*512kB (C) 10*1024kB (C) 2*2048kB (C) 0*4096kB 0*8192kB 0*16384kB = 350720kB
[Wed Dec  5 22:03:18 2018] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[Wed Dec  5 22:03:18 2018] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[Wed Dec  5 22:03:18 2018] Node 8 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[Wed Dec  5 22:03:18 2018] Node 8 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[Wed Dec  5 22:03:18 2018] 2898 total pagecache pages
[Wed Dec  5 22:03:18 2018] 763 pages in swap cache
[Wed Dec  5 22:03:18 2018] Swap cache stats: add 71856, delete 71093, find 26068/42083
[Wed Dec  5 22:03:18 2018] Free swap  = 0kB
[Wed Dec  5 22:03:18 2018] Total swap = 524224kB
[Wed Dec  5 22:03:18 2018] 4194304 pages RAM
[Wed Dec  5 22:03:18 2018] 0 pages HighMem/MovableOnly
[Wed Dec  5 22:03:18 2018] 19714 pages reserved
[Wed Dec  5 22:03:18 2018] 209920 pages cma reserved
[Wed Dec  5 22:03:18 2018] 0 pages hwpoisoned

@liyi-ibm (Owner)

[Wed Dec  5 22:03:19 2018] Node 8 active_anon:59477376kB inactive_anon:53530560kB active_file:27520kB inactive_file:14848kB unevictable:0kB isolated(anon):0kB isolated(file):128kB mapped:30400kB dirty:11264kB writeback:10432kB shmem:448kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no

Active/inactive file memory has shrunk to almost nothing, and anonymous memory can't be reclaimed because there is no free swap.
Reclaimable slab is somewhat larger, but "reclaimable" slab doesn't mean it can actually be reclaimed right now.
This looks genuinely out of memory.
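For concreteness, the figures in the dump above can be checked with back-of-the-envelope arithmetic (a sketch only; the 64kB page size is inferred from the 64kB granularity of the buddy lists in the dump):

```python
PAGE_KB = 64  # page size inferred from the 64kB buddy-list granularity above

# Figures copied from the Mem-Info dump (pages unless suffixed _kb)
active_file_pages = 144
inactive_file_pages = 299
free_swap_kb = 0
node0_dma_free_kb = 165056
node0_dma_min_kb = 179712

# Reclaimable page cache is tiny: well under 30 MB on a 256 GB box
file_cache_kb = (active_file_pages + inactive_file_pages) * PAGE_KB
print(f"file cache: {file_cache_kb} kB")  # → file cache: 28352 kB

# Anonymous memory cannot be reclaimed: swap is completely full
print(f"free swap: {free_swap_kb} kB")    # → free swap: 0 kB

# Node 0 is below its min watermark, so allocators must direct-reclaim
# (and stall) or fall through to the OOM killer
print("node0 below min:", node0_dma_free_kb < node0_dma_min_kb)  # → True
```

With nothing left to reclaim from page cache or swap, direct reclaim spins without making progress, which is exactly the 14860ms stall reported.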

@liyi-ibm (Owner)

When the stall is not caused by OOM, there is free swap and plenty of inactive file memory:

Nov 21 14:39:31 tdw-9-10-28-232 kernel: java: page allocation stalls for 16180ms, order:0, mode:0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null)
Nov 21 14:39:31 tdw-9-10-28-232 kernel: java cpuset=/ mems_allowed=0,8
Nov 21 14:39:31 tdw-9-10-28-232 kernel: CPU: 77 PID: 129639 Comm: java Tainted: G        W       4.14.49-3.ppc64le #1
Nov 21 14:39:31 tdw-9-10-28-232 kernel: Call Trace:
Nov 21 14:39:31 tdw-9-10-28-232 kernel: [c00020081ae578f0] [c000000000acb99c] dump_stack+0xb0/0xf4 (unreliable)
Nov 21 14:39:31 tdw-9-10-28-232 kernel: [c00020081ae57930] [c000000000283e28] warn_alloc+0x128/0x1c0
Nov 21 14:39:31 tdw-9-10-28-232 kernel: [c00020081ae579e0] [c000000000284934] __alloc_pages_nodemask+0x9e4/0x1080
Nov 21 14:39:31 tdw-9-10-28-232 kernel: [c00020081ae57bd0] [c00000000030e878] alloc_pages_vma+0xa8/0x2a0
Nov 21 14:39:31 tdw-9-10-28-232 kernel: [c00020081ae57c40] [c0000000002d2730] __handle_mm_fault+0xef0/0x1ee0
Nov 21 14:39:31 tdw-9-10-28-232 kernel: [c00020081ae57d20] [c0000000002d3848] handle_mm_fault+0x128/0x200
Nov 21 14:39:31 tdw-9-10-28-232 kernel: [c00020081ae57d60] [c00000000006498c] __do_page_fault+0x1cc/0x8c0
Nov 21 14:39:31 tdw-9-10-28-232 kernel: [c00020081ae57e30] [c00000000000aca4] handle_page_fault+0x18/0x38
Nov 21 14:39:31 tdw-9-10-28-232 kernel: warn_alloc_show_mem: 4 callbacks suppressed
Nov 21 14:39:31 tdw-9-10-28-232 kernel: Mem-Info:
Nov 21 14:39:31 tdw-9-10-28-232 kernel: active_anon:1420266 inactive_anon:642206 isolated_anon:0#012 active_file:263518 inactive_file:607740 isolated_file:231#012 unevictable:0 dirty:547 writeback:0 unstable:0#012 slab_reclaimable:89790 slab_unreclaimable:182653#012 mapped:2629 shmem:11995 pagetables:6249 bounce:0#012 free:5572 free_pcp:860 free_cma:44
Nov 21 14:39:31 tdw-9-10-28-232 kernel: Node 0 active_anon:44499968kB inactive_anon:19903296kB active_file:7169536kB inactive_file:15138624kB unevictable:0kB isolated(anon):0kB isolated(file):6400kB mapped:105536kB dirty:16128kB writeback:0kB shmem:527488kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
Nov 21 14:39:31 tdw-9-10-28-232 kernel: Node 8 active_anon:46397056kB inactive_anon:21197888kB active_file:9695616kB inactive_file:23756736kB unevictable:0kB isolated(anon):0kB isolated(file):8384kB mapped:62720kB dirty:18880kB writeback:0kB shmem:240192kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
Nov 21 14:39:31 tdw-9-10-28-232 kernel: Node 0 DMA free:179584kB min:179712kB low:312896kB high:446080kB active_anon:44461568kB inactive_anon:20003904kB active_file:7169536kB inactive_file:15053248kB unevictable:0kB writepending:4864kB present:134217728kB managed:133228160kB mlocked:0kB kernel_stack:149648kB pagetables:223296kB bounce:0kB free_pcp:26304kB local_pcp:0kB free_cma:0kB
Nov 21 14:39:31 tdw-9-10-28-232 kernel: lowmem_reserve[]: 0 0 0 0
Nov 21 14:39:31 tdw-9-10-28-232 kernel: Node 8 DMA free:177024kB min:180672kB low:314560kB high:448448kB active_anon:46378112kB inactive_anon:21287744kB active_file:9683072kB inactive_file:23644288kB unevictable:0kB writepending:6720kB present:134217728kB managed:133945792kB mlocked:0kB kernel_stack:95584kB pagetables:176640kB bounce:0kB free_pcp:28736kB local_pcp:0kB free_cma:2816kB
Nov 21 14:39:31 tdw-9-10-28-232 kernel: lowmem_reserve[]: 0 0 0 0
Nov 21 14:39:31 tdw-9-10-28-232 kernel: Node 0 DMA: 2126*64kB (UME) 432*128kB (UME) 6*256kB (M) 0*512kB 0*1024kB 0*2048kB 0*4096kB 0*8192kB 0*16384kB = 192896kB
Nov 21 14:39:31 tdw-9-10-28-232 kernel: Node 8 DMA: 921*64kB (UMEHC) 834*128kB (UMEHC) 112*256kB (UMH) 0*512kB 0*1024kB 0*2048kB 0*4096kB 0*8192kB 0*16384kB = 194368kB
Nov 21 14:39:31 tdw-9-10-28-232 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Nov 21 14:39:31 tdw-9-10-28-232 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Nov 21 14:39:31 tdw-9-10-28-232 kernel: Node 8 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Nov 21 14:39:31 tdw-9-10-28-232 kernel: Node 8 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Nov 21 14:39:31 tdw-9-10-28-232 kernel: 883380 total pagecache pages
Nov 21 14:39:31 tdw-9-10-28-232 kernel: 1 pages in swap cache
Nov 21 14:39:31 tdw-9-10-28-232 kernel: Swap cache stats: add 2, delete 1, find 0/0
Nov 21 14:39:31 tdw-9-10-28-232 kernel: Free swap  = 524096kB
Nov 21 14:39:31 tdw-9-10-28-232 kernel: Total swap = 524224kB
Nov 21 14:39:31 tdw-9-10-28-232 kernel: 4194304 pages RAM
Nov 21 14:39:31 tdw-9-10-28-232 kernel: 0 pages HighMem/MovableOnly
Nov 21 14:39:31 tdw-9-10-28-232 kernel: 19711 pages reserved
Nov 21 14:39:31 tdw-9-10-28-232 kernel: 209920 pages cma reserved
Nov 21 14:39:31 tdw-9-10-28-232 kernel: 0 pages hwpoisoned
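A quick sanity check on this dump (again a sketch, with the 64kB page size inferred from the buddy-list granularity above) shows the opposite picture on the cache side: tens of GB of easily reclaimable file pages and almost-free swap. Yet free memory on both nodes still sits at or below the min watermark, which is enough to force allocators into direct reclaim, and under heavy concurrent allocation that reclaim can take long enough to log a stall:

```python
PAGE_KB = 64  # 64K pages on this ppc64le system, per the dump granularity

# Figures copied from the dump above
node0 = {"free_kb": 179584, "min_kb": 179712}
node8 = {"free_kb": 177024, "min_kb": 180672}
inactive_file_pages = 607740
free_swap_kb = 524096

# Plenty of easily reclaimable page cache this time
inactive_file_gb = inactive_file_pages * PAGE_KB // 1024 // 1024
print(f"inactive file: ~{inactive_file_gb} GB")  # → ~37 GB

# But both nodes are below their min watermark, so allocations
# still have to direct-reclaim before they can succeed
for name, n in (("node0", node0), ("node8", node8)):
    print(f"{name} below min: {n['free_kb'] < n['min_kb']}")  # → True, True
```

Node 0 is only 128kB under its watermark, so this is watermark pressure rather than a true memory shortage; raising `min_free_kbytes` or reducing reclaim latency (as the backported memcontrol patch does) are the usual levers here.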

liyi-ibm pushed a commit that referenced this issue Dec 17, 2018
…ilure

While forking, if delayacct init fails due to memory shortage, it
continues expecting all delayacct users to check task->delays pointer
against NULL before dereferencing it, which all of them used to do.

Commit c96f547 ("delayacct: Account blkio completion on the correct
task"), while updating delayacct_blkio_end() to take the target task
instead of always using %current, made the function test NULL on
%current->delays and then continue to operated on @p->delays.  If
%current succeeded init while @p didn't, it leads to the following
crash.

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
 IP: __delayacct_blkio_end+0xc/0x40
 PGD 8000001fd07e1067 P4D 8000001fd07e1067 PUD 1fcffbb067 PMD 0
 Oops: 0000 [#1] SMP PTI
 CPU: 4 PID: 25774 Comm: QIOThread0 Not tainted 4.16.0-9_fbk1_rc2_1180_g6b593215b4d7 #9
 RIP: 0010:__delayacct_blkio_end+0xc/0x40
 Call Trace:
  try_to_wake_up+0x2c0/0x600
  autoremove_wake_function+0xe/0x30
  __wake_up_common+0x74/0x120
  wake_up_page_bit+0x9c/0xe0
  mpage_end_io+0x27/0x70
  blk_update_request+0x78/0x2c0
  scsi_end_request+0x2c/0x1e0
  scsi_io_completion+0x20b/0x5f0
  blk_mq_complete_request+0xa2/0x100
  ata_scsi_qc_complete+0x79/0x400
  ata_qc_complete_multiple+0x86/0xd0
  ahci_handle_port_interrupt+0xc9/0x5c0
  ahci_handle_port_intr+0x54/0xb0
  ahci_single_level_irq_intr+0x3b/0x60
  __handle_irq_event_percpu+0x43/0x190
  handle_irq_event_percpu+0x20/0x50
  handle_irq_event+0x2a/0x50
  handle_edge_irq+0x80/0x1c0
  handle_irq+0xaf/0x120
  do_IRQ+0x41/0xc0
  common_interrupt+0xf/0xf

Fix it by updating delayacct_blkio_end() check @p->delays instead.

Link: http://lkml.kernel.org/r/[email protected]
Fixes: c96f547 ("delayacct: Account blkio completion on the correct task")
Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Dave Jones <[email protected]>
Debugged-by: Dave Jones <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Josh Snyder <[email protected]>
Cc: <[email protected]>	[4.15+]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
liyi-ibm pushed a commit that referenced this issue Dec 28, 2018
[ Upstream commit 9954b80 ]

platform_domain_notifier contains a variable sized array, which the
pm_clk_notify() notifier treats as a NULL terminated array:

     for (con_id = clknb->con_ids; *con_id; con_id++)
             pm_clk_add(dev, *con_id);

Omitting the initialiser for con_ids means that the array is zero
sized, and there is no NULL terminator.  This leads to pm_clk_notify()
overrunning into what ever structure follows, which may not be NULL.
This leads to an oops:

Unable to handle kernel NULL pointer dereference at virtual address 0000008c
pgd = c0003000
[0000008c] *pgd=80000800004003c, *pmd=00000000c
Internal error: Oops: 206 [#1] PREEMPT SMP ARM
Modules linked in:c
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.16.0+ #9
Hardware name: Keystone
PC is at strlen+0x0/0x34
LR is at kstrdup+0x18/0x54
pc : [<c0623340>]    lr : [<c0111d6c>]    psr: 20000013
sp : eec73dc0  ip : eed780c0  fp : 00000001
r10: 00000000  r9 : 00000000  r8 : eed71e10
r7 : 0000008c  r6 : 0000008c  r5 : 014000c0  r4 : c03a6ff4
r3 : c09445d0  r2 : 00000000  r1 : 014000c0  r0 : 0000008c
Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
Control: 30c5387d  Table: 00003000  DAC: fffffffd
Process swapper/0 (pid: 1, stack limit = 0xeec72210)
Stack: (0xeec73dc0 to 0xeec74000)
...
[<c0623340>] (strlen) from [<c0111d6c>] (kstrdup+0x18/0x54)
[<c0111d6c>] (kstrdup) from [<c03a6ff4>] (__pm_clk_add+0x58/0x120)
[<c03a6ff4>] (__pm_clk_add) from [<c03a731c>] (pm_clk_notify+0x64/0xa8)
[<c03a731c>] (pm_clk_notify) from [<c004614c>] (notifier_call_chain+0x44/0x84)
[<c004614c>] (notifier_call_chain) from [<c0046320>] (__blocking_notifier_call_chain+0x48/0x60)
[<c0046320>] (__blocking_notifier_call_chain) from [<c0046350>] (blocking_notifier_call_chain+0x18/0x20)
[<c0046350>] (blocking_notifier_call_chain) from [<c0390234>] (device_add+0x36c/0x534)
[<c0390234>] (device_add) from [<c047fc00>] (of_platform_device_create_pdata+0x70/0xa4)
[<c047fc00>] (of_platform_device_create_pdata) from [<c047fea0>] (of_platform_bus_create+0xf0/0x1ec)
[<c047fea0>] (of_platform_bus_create) from [<c047fff8>] (of_platform_populate+0x5c/0xac)
[<c047fff8>] (of_platform_populate) from [<c08b1f04>] (of_platform_default_populate_init+0x8c/0xa8)
[<c08b1f04>] (of_platform_default_populate_init) from [<c000a78c>] (do_one_initcall+0x3c/0x164)
[<c000a78c>] (do_one_initcall) from [<c087bd9c>] (kernel_init_freeable+0x10c/0x1d0)
[<c087bd9c>] (kernel_init_freeable) from [<c0628db0>] (kernel_init+0x8/0xf0)
[<c0628db0>] (kernel_init) from [<c00090d8>] (ret_from_fork+0x14/0x3c)
Exception stack(0xeec73fb0 to 0xeec73ff8)
3fa0:                                     00000000 00000000 00000000 00000000
3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
3fe0: 00000000 00000000 00000000 00000000 00000013 00000000
Code: e3520000 1afffff7 e12fff1e c0801730 (e5d02000)
---[ end trace cafa8f148e262e80 ]---

Fix this by adding the necessary initialiser.

Fixes: fc20ffe ("ARM: keystone: add PM domain support for clock management")
Signed-off-by: Russell King <[email protected]>
Acked-by: Santosh Shilimkar <[email protected]>
Signed-off-by: Olof Johansson <[email protected]>

Signed-off-by: Sasha Levin <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
liyi-ibm pushed a commit that referenced this issue Dec 28, 2018
[ Upstream commit 2efd4fc ]

Syzbot reported a read beyond the end of the skb head when returning
IPV6_ORIGDSTADDR:

  BUG: KMSAN: kernel-infoleak in put_cmsg+0x5ef/0x860 net/core/scm.c:242
  CPU: 0 PID: 4501 Comm: syz-executor128 Not tainted 4.17.0+ #9
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
  Google 01/01/2011
  Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x185/0x1d0 lib/dump_stack.c:113
    kmsan_report+0x188/0x2a0 mm/kmsan/kmsan.c:1125
    kmsan_internal_check_memory+0x138/0x1f0 mm/kmsan/kmsan.c:1219
    kmsan_copy_to_user+0x7a/0x160 mm/kmsan/kmsan.c:1261
    copy_to_user include/linux/uaccess.h:184 [inline]
    put_cmsg+0x5ef/0x860 net/core/scm.c:242
    ip6_datagram_recv_specific_ctl+0x1cf3/0x1eb0 net/ipv6/datagram.c:719
    ip6_datagram_recv_ctl+0x41c/0x450 net/ipv6/datagram.c:733
    rawv6_recvmsg+0x10fb/0x1460 net/ipv6/raw.c:521
    [..]

This logic and its ipv4 counterpart read the destination port from
the packet at skb_transport_offset(skb) + 4.

With MSG_MORE and a local SOCK_RAW sender, syzbot was able to cook a
packet that stores headers exactly up to skb_transport_offset(skb) in
the head and the remainder in a frag.

Call pskb_may_pull before accessing the pointer to ensure that it lies
in skb head.

Link: http://lkml.kernel.org/r/CAF=yD-LEJwZj5a1-bAAj2Oy_hKmGygV6rsJ_WOrAYnv-fnayiQ@mail.gmail.com
Reported-by: [email protected]
Signed-off-by: Willem de Bruijn <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
liyi-ibm pushed a commit that referenced this issue Dec 28, 2018
…ilure

commit b512719 upstream.

While forking, if delayacct init fails due to memory shortage, it
continues expecting all delayacct users to check task->delays pointer
against NULL before dereferencing it, which all of them used to do.

Commit c96f547 ("delayacct: Account blkio completion on the correct
task"), while updating delayacct_blkio_end() to take the target task
instead of always using %current, made the function test NULL on
%current->delays and then continue to operated on @p->delays.  If
%current succeeded init while @p didn't, it leads to the following
crash.

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
 IP: __delayacct_blkio_end+0xc/0x40
 PGD 8000001fd07e1067 P4D 8000001fd07e1067 PUD 1fcffbb067 PMD 0
 Oops: 0000 [#1] SMP PTI
 CPU: 4 PID: 25774 Comm: QIOThread0 Not tainted 4.16.0-9_fbk1_rc2_1180_g6b593215b4d7 #9
 RIP: 0010:__delayacct_blkio_end+0xc/0x40
 Call Trace:
  try_to_wake_up+0x2c0/0x600
  autoremove_wake_function+0xe/0x30
  __wake_up_common+0x74/0x120
  wake_up_page_bit+0x9c/0xe0
  mpage_end_io+0x27/0x70
  blk_update_request+0x78/0x2c0
  scsi_end_request+0x2c/0x1e0
  scsi_io_completion+0x20b/0x5f0
  blk_mq_complete_request+0xa2/0x100
  ata_scsi_qc_complete+0x79/0x400
  ata_qc_complete_multiple+0x86/0xd0
  ahci_handle_port_interrupt+0xc9/0x5c0
  ahci_handle_port_intr+0x54/0xb0
  ahci_single_level_irq_intr+0x3b/0x60
  __handle_irq_event_percpu+0x43/0x190
  handle_irq_event_percpu+0x20/0x50
  handle_irq_event+0x2a/0x50
  handle_edge_irq+0x80/0x1c0
  handle_irq+0xaf/0x120
  do_IRQ+0x41/0xc0
  common_interrupt+0xf/0xf

Fix it by updating delayacct_blkio_end() check @p->delays instead.

Link: http://lkml.kernel.org/r/[email protected]
Fixes: c96f547 ("delayacct: Account blkio completion on the correct task")
Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Dave Jones <[email protected]>
Debugged-by: Dave Jones <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Josh Snyder <[email protected]>
Cc: <[email protected]>	[4.15+]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
liyi-ibm pushed a commit that referenced this issue Dec 28, 2018
[ Upstream commit 4f4616c ]

Similar to what we do when we remove a PCI function, set the
QEDF_UNLOADING flag to prevent any requests from being queued while a
vport is being deleted.  This prevents any requests from getting stuck
in limbo when the vport is unloaded or deleted.

Fixes the crash:

PID: 106676  TASK: ffff9a436aa90000  CPU: 12  COMMAND: "multipathd"
 #0 [ffff9a43567d3550] machine_kexec+522 at ffffffffaca60b2a
 #1 [ffff9a43567d35b0] __crash_kexec+114 at ffffffffacb13512
 #2 [ffff9a43567d3680] crash_kexec+48 at ffffffffacb13600
 #3 [ffff9a43567d3698] oops_end+168 at ffffffffad117768
 #4 [ffff9a43567d36c0] no_context+645 at ffffffffad106f52
 #5 [ffff9a43567d3710] __bad_area_nosemaphore+116 at ffffffffad106fe9
 #6 [ffff9a43567d3760] bad_area+70 at ffffffffad107379
 #7 [ffff9a43567d3788] __do_page_fault+1247 at ffffffffad11a8cf
 #8 [ffff9a43567d37f0] do_page_fault+53 at ffffffffad11a915
 #9 [ffff9a43567d3820] page_fault+40 at ffffffffad116768
    [exception RIP: qedf_init_task+61]
    RIP: ffffffffc0e13c2d  RSP: ffff9a43567d38d0  RFLAGS: 00010046
    RAX: 0000000000000000  RBX: ffffbe920472c738  RCX: ffff9a434fa0e3e8
    RDX: ffff9a434f695280  RSI: ffffbe920472c738  RDI: ffff9a43aa359c80
    RBP: ffff9a43567d3950   R8: 0000000000000c15   R9: ffff9a3fb09b9880
    R10: ffff9a434fa0e3e8  R11: ffff9a43567d35ce  R12: 0000000000000000
    R13: ffff9a434f695280  R14: ffff9a43aa359c80  R15: ffff9a3fb9e005c0
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018

Signed-off-by: Chad Dupuis <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
liyi-ibm pushed a commit that referenced this issue Dec 28, 2018
commit 89da619 upstream.

Kernel panic when with high memory pressure, calltrace looks like,

PID: 21439 TASK: ffff881be3afedd0 CPU: 16 COMMAND: "java"
 #0 [ffff881ec7ed7630] machine_kexec at ffffffff81059beb
 #1 [ffff881ec7ed7690] __crash_kexec at ffffffff81105942
 #2 [ffff881ec7ed7760] crash_kexec at ffffffff81105a30
 #3 [ffff881ec7ed7778] oops_end at ffffffff816902c8
 #4 [ffff881ec7ed77a0] no_context at ffffffff8167ff46
 #5 [ffff881ec7ed77f0] __bad_area_nosemaphore at ffffffff8167ffdc
 #6 [ffff881ec7ed7838] __node_set at ffffffff81680300
 #7 [ffff881ec7ed7860] __do_page_fault at ffffffff8169320f
 #8 [ffff881ec7ed78c0] do_page_fault at ffffffff816932b5
 #9 [ffff881ec7ed78f0] page_fault at ffffffff8168f4c8
    [exception RIP: _raw_spin_lock_irqsave+47]
    RIP: ffffffff8168edef RSP: ffff881ec7ed79a8 RFLAGS: 00010046
    RAX: 0000000000000246 RBX: ffffea0019740d00 RCX: ffff881ec7ed7fd8
    RDX: 0000000000020000 RSI: 0000000000000016 RDI: 0000000000000008
    RBP: ffff881ec7ed79a8 R8: 0000000000000246 R9: 000000000001a098
    R10: ffff88107ffda000 R11: 0000000000000000 R12: 0000000000000000
    R13: 0000000000000008 R14: ffff881ec7ed7a80 R15: ffff881be3afedd0
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018

It happens in the pagefault and results in double pagefault
during compacting pages when memory allocation fails.

Analysed the vmcore, the page leads to second pagefault is corrupted
with _mapcount=-256, but private=0.

It's caused by the race between migration and ballooning, and lock
missing in virtballoon_migratepage() of virtio_balloon driver.
This patch fix the bug.

Fixes: e225042 ("virtio_balloon: introduce migration primitives to balloon pages")
Cc: [email protected]
Signed-off-by: Jiang Biao <[email protected]>
Signed-off-by: Huang Chong <[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
liyi-ibm pushed a commit that referenced this issue Dec 28, 2018
[ Upstream commit 934140a ]

cachefiles_read_waiter() has the right to access a 'monitor' object by
virtue of being called under the waitqueue lock for one of the pages in its
purview.  However, it has no ref on that monitor object or on the
associated operation.

What it is allowed to do is to move the monitor object to the operation's
to_do list, but once it drops the work_lock, it's actually no longer
permitted to access that object.  However, it is trying to enqueue the
retrieval operation for processing - but it can only do this via a pointer
in the monitor object, something it shouldn't be doing.

If it doesn't enqueue the operation, the operation may not get processed.
If the order is flipped so that the enqueue is first, then it's possible
for the work processor to look at the to_do list before the monitor is
enqueued upon it.

Fix this by getting a ref on the operation so that we can trust that it
will still be there once we've added the monitor to the to_do list and
dropped the work_lock.  The op can then be enqueued after the lock is
dropped.

The bug can manifest in one of a couple of ways.  The first manifestation
looks like:

 FS-Cache:
 FS-Cache: Assertion failed
 FS-Cache: 6 == 5 is false
 ------------[ cut here ]------------
 kernel BUG at fs/fscache/operation.c:494!
 RIP: 0010:fscache_put_operation+0x1e3/0x1f0
 ...
 fscache_op_work_func+0x26/0x50
 process_one_work+0x131/0x290
 worker_thread+0x45/0x360
 kthread+0xf8/0x130
 ? create_worker+0x190/0x190
 ? kthread_cancel_work_sync+0x10/0x10
 ret_from_fork+0x1f/0x30

This is due to the operation being in the DEAD state (6) rather than
INITIALISED, COMPLETE or CANCELLED (5) because it's already passed through
fscache_put_operation().

The bug can also manifest like the following:

 kernel BUG at fs/fscache/operation.c:69!
 ...
    [exception RIP: fscache_enqueue_operation+246]
 ...
 #7 [ffff883fff083c10] fscache_enqueue_operation at ffffffffa0b793c6
 #8 [ffff883fff083c28] cachefiles_read_waiter at ffffffffa0b15a48
 #9 [ffff883fff083c48] __wake_up_common at ffffffff810af028

I'm not entirely certain as to which is line 69 in Lei's kernel, so I'm not
entirely clear which assertion failed.

Fixes: 9ae326a ("CacheFiles: A cache that backs onto a mounted filesystem")
Reported-by: Lei Xue <[email protected]>
Reported-by: Vegard Nossum <[email protected]>
Reported-by: Anthony DeRobertis <[email protected]>
Reported-by: NeilBrown <[email protected]>
Reported-by: Daniel Axtens <[email protected]>
Reported-by: Kiran Kumar Modukuri <[email protected]>
Signed-off-by: David Howells <[email protected]>
Reviewed-by: Daniel Axtens <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>