@cloehle cloehle commented Sep 30, 2025

The task in the qmap may be migration disabled. In that case, don't dispatch it to a local DSQ; bounce it back to the fallback DSQ as well.

See #2825

The task in the qmap may be migration disabled, don't dispatch it
to a local DSQ in that case but bounce it back to the fallback, too.

Signed-off-by: Christian Loehle <[email protected]>
cloehle commented Sep 30, 2025

So @arighi, this is the obvious fix for the reported issue in scx_central, but the error still triggers (although much more rarely):

[SEQ 1765]
total   :  63714331    local:     90352   queued:         1  lost:         0
timer   :   1913227 dispatch:  36556859 mismatch:     10346 retry:         0
overflow:         0

DEBUG DUMP
================================================================================

git-remote-http[3140001] triggered exit kind 1024:
  runtime error (SCX_DSQ_LOCAL[_ON] cannot move migration disabled git-remote-http[3140001] from CPU 1 to 0)

Backtrace:
  scx_exit+0x58/0x84
  task_can_run_on_remote_rq+0x16c/0x1a0
  dispatch_to_local_dsq+0x7c/0x204
  flush_dispatch_buf+0x208/0x218
  balance_scx+0x254/0x490
  __schedule+0x540/0xf6c
  schedule_idle+0x28/0x48
  do_idle+0x194/0x28c
  cpu_startup_entry+0x34/0x3c
  rest_init+0xfc/0x18c
  start_kernel+0x720/0x7e8
  __primary_switched+0x88/0x90

CPU states
----------

CPU 0   : nr_run=1 flags=0x3 cpu_rel=0 ops_qseq=82753261 pnt_seq=93459
          curr=git-remote-http[3140001] class=ext_sched_class
  idle_to_kick   : 010

 *R git-remote-http[3140001] +0ms
      scx_state/flags=3/0x5 dsq_flags=0x0 ops_state/qseq=0/0
      sticky/holding_cpu=-1/-1 dsq_id=(n/a)
      dsq_vtime=0 slice=18446744073709551615 weight=100
      cpus=fff

    scx_dump_state+0x7d0/0x924
    scx_error_irq_workfn+0x4c/0x68
    irq_work_single+0x6c/0xa8
    irq_work_run_list+0x4c/0x68
    irq_work_run+0x38/0x5c
    ipi_handler+0x20c/0x334
    handle_percpu_devid_irq+0xc0/0x208
    handle_irq_desc+0x40/0x58
    generic_handle_domain_irq+0x1c/0x28
    gic_handle_irq+0x4c/0x11c
    call_on_irq_stack+0x30/0x48
    do_interrupt_handler+0xd4/0xd8
    el1_interrupt+0x34/0x64
    el1h_64_irq_handler+0x18/0x24
    el1h_64_irq+0x6c/0x70
    finish_task_switch.isra.0+0xbc/0x2c8
    anon_pipe_write+0x50/0x4ec
    vfs_write+0x310/0x370
    ksys_write+0xec/0x108
    __arm64_sys_write+0x1c/0x28
    invoke_syscall+0x48/0x110
    el0_svc_common.constprop.0+0x40/0xe0
    do_el0_svc+0x1c/0x28
    el0_svc+0x110/0x160
    el0t_64_sync_handler+0xa0/0xe4
    el0t_64_sync+0x17c/0x180

CPU 2   : nr_run=1 flags=0x3 cpu_rel=0 ops_qseq=2681279 pnt_seq=979267
          curr=git[3140093] class=ext_sched_class

 *R git[3140093] -4ms
      scx_state/flags=3/0x5 dsq_flags=0x0 ops_state/qseq=0/0
      sticky/holding_cpu=-1/-1 dsq_id=(n/a)
      dsq_vtime=0 slice=18446744073709551615 weight=100
      cpus=fff

CPU 3   : nr_run=1 flags=0x3 cpu_rel=0 ops_qseq=2688836 pnt_seq=979891
          curr=git[3140094] class=ext_sched_class

 *R git[3140094] -4ms
      scx_state/flags=3/0x5 dsq_flags=0x0 ops_state/qseq=0/0
      sticky/holding_cpu=-1/-1 dsq_id=(n/a)
      dsq_vtime=0 slice=18446744073709551615 weight=100
      cpus=fff

    __lock_acquire+0x41c/0x1f90
    lock_acquire+0x1d4/0x35c
    0xffff00008d799cc0

Event counters
--------------
              SCX_EV_SELECT_CPU_FALLBACK:                0
       SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE:                0
               SCX_EV_DISPATCH_KEEP_LAST:                0
                 SCX_EV_ENQ_SKIP_EXITING:            18567
      SCX_EV_ENQ_SKIP_MIGRATION_DISABLED:              311
                 SCX_EV_REFILL_SLICE_DFL:            18878
                  SCX_EV_BYPASS_DURATION:         42471590
                  SCX_EV_BYPASS_DISPATCH:                0
                  SCX_EV_BYPASS_ACTIVATE:                1

================================================================================

EXIT: runtime error (SCX_DSQ_LOCAL[_ON] cannot move migration disabled git-remote-http[3140001] from CPU 1 to 0)

or another one:

[SEQ 676]
total   :   2199383    local:     78867   queued:         0  lost:         0
timer   :    679883 dispatch:   1434583 mismatch:       703 retry:         0
overflow:         0

DEBUG DUMP
================================================================================

gdm-session-wor[3221019] triggered exit kind 1024:
  runtime error (SCX_DSQ_LOCAL[_ON] cannot move migration disabled gdm-session-wor[3221019] from CPU 4 to 9)

Backtrace:
  scx_exit+0x58/0x84
  task_can_run_on_remote_rq+0x16c/0x1a0
  dispatch_to_local_dsq+0x7c/0x204
  flush_dispatch_buf+0x208/0x218
  scx_bpf_dsq_move_to_local+0x60/0x108
  bpf_prog_9d0dfc0886e0eb02_central_dispatch+0x1a8/0x278
  bpf__sched_ext_ops_dispatch+0x50/0x74
  balance_scx+0x220/0x490
  __schedule+0x540/0xf6c
  schedule+0x48/0x15c
  schedule_hrtimeout_range_clock+0xe8/0x128
  schedule_hrtimeout_range+0x14/0x20
  poll_schedule_timeout.constprop.0+0x4c/0x9c
  do_sys_poll+0x494/0x554
  __arm64_sys_ppoll+0xac/0x138
  invoke_syscall+0x48/0x110
  el0_svc_common.constprop.0+0x40/0xe0
  do_el0_svc+0x1c/0x28
  el0_svc+0x110/0x160
  el0t_64_sync_handler+0xa0/0xe4
  el0t_64_sync+0x17c/0x180

CPU states
----------

CPU 0   : nr_run=3 flags=0x3 cpu_rel=0 ops_qseq=84542076 pnt_seq=94264
          curr=gdm-session-wor[3221019] class=ext_sched_class
  idle_to_kick   : 200

 *R gdm-session-wor[3221019] +0ms
      scx_state/flags=3/0x5 dsq_flags=0x0 ops_state/qseq=0/0
      sticky/holding_cpu=-1/-1 dsq_id=(n/a)
      dsq_vtime=0 slice=0 weight=100
      cpus=fff

    scx_dump_state+0x7d0/0x924
    scx_error_irq_workfn+0x4c/0x68
    irq_work_single+0x6c/0xa8
    irq_work_run_list+0x4c/0x68
    irq_work_run+0x38/0x5c
    ipi_handler+0x20c/0x334
    handle_percpu_devid_irq+0xc0/0x208
    handle_irq_desc+0x40/0x58
    generic_handle_domain_irq+0x1c/0x28
    gic_handle_irq+0x4c/0x11c
    call_on_irq_stack+0x30/0x48
    do_interrupt_handler+0xd4/0xd8
    el1_interrupt+0x34/0x64
    el1h_64_irq_handler+0x18/0x24
    el1h_64_irq+0x6c/0x70
    finish_task_switch.isra.0+0xbc/0x2c8
    rcu_is_watching+0x5c/0x70
    lock_release+0x264/0x338
    tty_release_struct+0x78/0x8c
    tty_release+0x3b8/0x550
    __fput+0xcc/0x2dc
    fput_close_sync+0x40/0x110
    __arm64_sys_close+0x38/0x7c
    invoke_syscall+0x48/0x110
    el0_svc_common.constprop.0+0x40/0xe0
    do_el0_svc+0x1c/0x28
    el0_svc+0x110/0x160
    el0t_64_sync_handler+0xa0/0xe4
    el0t_64_sync+0x17c/0x180

  R NetworkManager[560] +0ms
      scx_state/flags=3/0x9 dsq_flags=0x0 ops_state/qseq=2/84542074
      sticky/holding_cpu=-1/-1 dsq_id=(n/a)
      dsq_vtime=0 slice=0 weight=100
      cpus=fff

    do_sys_poll+0x494/0x554
    __arm64_sys_ppoll+0xac/0x138
    invoke_syscall+0x48/0x110
    el0_svc_common.constprop.0+0x40/0xe0
    do_el0_svc+0x1c/0x28
    el0_svc+0x110/0x160
    el0t_64_sync_handler+0xa0/0xe4
    el0t_64_sync+0x17c/0x180

  R gdbus[448] +0ms
      scx_state/flags=3/0x9 dsq_flags=0x0 ops_state/qseq=2/84542075
      sticky/holding_cpu=-1/-1 dsq_id=(n/a)
      dsq_vtime=0 slice=18446744073709551615 weight=100
      cpus=fff

    do_sys_poll+0x494/0x554
    __arm64_sys_ppoll+0xac/0x138
    invoke_syscall+0x48/0x110
    el0_svc_common.constprop.0+0x40/0xe0
    do_el0_svc+0x1c/0x28
    el0_svc+0x110/0x160
    el0t_64_sync_handler+0xa0/0xe4
    el0t_64_sync+0x17c/0x180

CPU 1   : nr_run=1 flags=0x3 cpu_rel=0 ops_qseq=2718437 pnt_seq=1012235
          curr=ld[3220978] class=ext_sched_class

 *R ld[3220978] -24ms
      scx_state/flags=3/0x5 dsq_flags=0x0 ops_state/qseq=0/0
      sticky/holding_cpu=-1/-1 dsq_id=(n/a)
      dsq_vtime=0 slice=18446744073709551615 weight=100
      cpus=fff

    0xffff800090cc3a70

CPU 2   : nr_run=1 flags=0x3 cpu_rel=0 ops_qseq=2717389 pnt_seq=1037269
          curr=cix_audio_switc[3221030] class=ext_sched_class

 *S cix_audio_switc[3221030] +0ms
      scx_state/flags=3/0x5 dsq_flags=0x0 ops_state/qseq=0/0
      sticky/holding_cpu=-1/-1 dsq_id=(n/a)
      dsq_vtime=0 slice=18446744073709551615 weight=100
      cpus=fff

CPU 3   : nr_run=1 flags=0x3 cpu_rel=0 ops_qseq=2723810 pnt_seq=1038160
          curr=gdbus[3142960] class=ext_sched_class

 *R gdbus[3142960] +0ms
      scx_state/flags=3/0x5 dsq_flags=0x0 ops_state/qseq=0/0
      sticky/holding_cpu=-1/-1 dsq_id=(n/a)
      dsq_vtime=0 slice=18446744073709551615 weight=100
      cpus=fff

CPU 6   : nr_run=1 flags=0x3 cpu_rel=0 ops_qseq=10665451 pnt_seq=3947563
          curr=systemd-logind[468] class=ext_sched_class

 *R systemd-logind[468] +0ms
      scx_state/flags=3/0x5 dsq_flags=0x0 ops_state/qseq=0/0
      sticky/holding_cpu=-1/-1 dsq_id=(n/a)
      dsq_vtime=0 slice=18446744073709551615 weight=100
      cpus=fff

    vfs_rename+0x554/0xa20
    do_renameat2+0x318/0x4e8
    __arm64_sys_renameat+0x50/0x68
    invoke_syscall+0x48/0x110
    el0_svc_common.constprop.0+0xc0/0xe0
    do_el0_svc+0x1c/0x28
    el0_svc+0x110/0x160
    el0t_64_sync_handler+0xa0/0xe4
    el0t_64_sync+0x17c/0x180

Event counters
--------------
              SCX_EV_SELECT_CPU_FALLBACK:                0
       SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE:                0
               SCX_EV_DISPATCH_KEEP_LAST:                0
                 SCX_EV_ENQ_SKIP_EXITING:            29390
      SCX_EV_ENQ_SKIP_MIGRATION_DISABLED:              457
                 SCX_EV_REFILL_SLICE_DFL:            29858
                  SCX_EV_BYPASS_DURATION:         88182574
                  SCX_EV_BYPASS_DISPATCH:               11
                  SCX_EV_BYPASS_ACTIVATE:                1

================================================================================

EXIT: runtime error (SCX_DSQ_LOCAL[_ON] cannot move migration disabled gdm-session-wor[3221019] from CPU 4 to 9)

@cloehle cloehle marked this pull request as draft September 30, 2025 10:13
arighi commented Sep 30, 2025

Thanks @cloehle, out of curiosity, can you reproduce this with scx_tickless as well?

  * bounce it to the fallback dsq.
  */
- if (!bpf_cpumask_test_cpu(cpu, p->cpus_ptr)) {
+ if (!bpf_cpumask_test_cpu(cpu, p->cpus_ptr) || is_migration_disabled(p)) {
Since we're in ops.dispatch() I think we need to check p->migration_disabled directly here. Can you try this instead of using is_migration_disabled()?
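
For illustration, here is a minimal plain-C model of the condition under discussion. It is not the BPF code from this PR: `struct mock_task` and `must_bounce_to_fallback()` are hypothetical names, and a 64-bit mask stands in for `p->cpus_ptr`. It only shows the intended logic: bounce the task when the target CPU is outside its affinity, or when the raw `migration_disabled` counter is nonzero.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the task_struct fields the check reads. */
struct mock_task {
	uint64_t cpus_mask;              /* bit n set => task may run on CPU n */
	unsigned int migration_disabled; /* nonzero => task must stay on its CPU */
};

/*
 * Return true if the task must go to the fallback DSQ instead of the
 * local DSQ of @cpu. This reads the counter directly and uses no helper.
 */
static bool must_bounce_to_fallback(const struct mock_task *p, int cpu)
{
	bool allowed = cpu >= 0 && cpu < 64 &&
		       (p->cpus_mask & (UINT64_C(1) << cpu)) != 0;

	return !allowed || p->migration_disabled != 0;
}
```

In the scheduler, the equivalent condition would test `p->migration_disabled` on the real task in place of `is_migration_disabled(p)`. Whether that closes the remaining window seen in the dumps above still needs to be confirmed by testing.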
