From e3120e460aebcd3d5d29ec61d25f420bb28bcabd Mon Sep 17 00:00:00 2001
From: Florian Lehner
Date: Mon, 28 Oct 2024 11:23:17 +0100
Subject: [PATCH] [RFC] design doc: proposal for off-cpu profiling (#144)

Signed-off-by: Florian Lehner
---
 design-docs/00001-off-cpu-profiling/README.md | 213 ++++++++++++++++++
 1 file changed, 213 insertions(+)
 create mode 100644 design-docs/00001-off-cpu-profiling/README.md

diff --git a/design-docs/00001-off-cpu-profiling/README.md b/design-docs/00001-off-cpu-profiling/README.md
new file mode 100644
index 00000000..ccde893c
--- /dev/null
+++ b/design-docs/00001-off-cpu-profiling/README.md
@@ -0,0 +1,213 @@

Off-CPU Profiling
=============================

# Meta

- **Author(s)**: Florian Lehner
- **Start Date**: 2024-06-01
- **Goal End Date**: 2024-10-31
- **Primary Reviewers**: https://github.com/orgs/open-telemetry/teams/ebpf-profiler-maintainers

# Abstract

The OTel Profiling Agent, while effective for on-CPU profiling, faces limitations in identifying
application blockages that introduce latency.

```mermaid
gantt
    dateFormat SSS
    axisFormat %L
    title Database query of 100ms
    section Thread Execution
    On-CPU: on, 0, 20ms
    Off-CPU: after on, 80ms
```
Latency impact example[^1].

To address this, the OTel Profiling Agent should extend its capabilities to include off-CPU
profiling. By combining on-CPU and off-CPU profiling, the OTel Profiling Agent can provide a more
comprehensive understanding of application and system performance. This enables identifying
bottlenecks and optimizing resource utilization, which leads to reduced energy consumption
and a smaller environmental footprint.

# Scope

This document focuses on the hook points and the additional value that off-CPU profiling can provide
to the OTel Profiling Agent.

## Success criteria

The OTel Profiling Agent should be extended in a way that existing profiling and stack unwinding
capabilities are reused to enable off-CPU profiling. Off-CPU profiling should be an optional
feature that can be enabled in addition to sampling-based on-CPU profiling.

## Non-success criteria

Off-CPU profiling is not a replacement for dedicated analysis of disk I/O, memory allocation,
network I/O, lock contention, or other specific performance topics. It can only serve as an
indicator for further investigation into these dedicated areas.

Visualization and analysis of the off-CPU profiling information, as well as correlating this data
with on-CPU profiling information, is not within the scope of this proposal.

# Proposal

The OTel Profiling Agent is a sampling-based profiler that utilizes the perf subsystem as entry
point for frequent stack unwinding. By default a sampling frequency of [20Hz](https://github.com/open-telemetry/opentelemetry-ebpf-profiler/blob/dd0c20701b191975d6c13408c92d7fed637119da/cli_flags.go#L24)
is used.

The eBPF program [`perf_event/native_tracer_entry`](https://github.com/open-telemetry/opentelemetry-ebpf-profiler/blob/dd0c20701b191975d6c13408c92d7fed637119da/support/ebpf/native_stack_trace.ebpf.c#L860C6-L860C36)
is the entry program that starts the stack unwinding. To do so, it collects information like the
data stored in the CPU registers before starting the stack unwinding routine via tail calls. The
tail call destinations for the stack unwinding, like [`perf_event/unwind_native`](https://github.com/open-telemetry/opentelemetry-ebpf-profiler/blob/dd0c20701b191975d6c13408c92d7fed637119da/support/ebpf/native_stack_trace.ebpf.c#L751),
are generic eBPF programs that should be repurposed for off-CPU profiling.
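To make this pattern concrete, the following eBPF C sketch shows an entry program that tail calls
into a program array of unwinding programs. The map name `unwind_progs`, the index `UNWIND_NATIVE`,
and the program body are hypothetical and only illustrate the mechanism, not the agent's actual
implementation:

```c
// Minimal sketch of the tail-call pattern; names are illustrative only.
#include <linux/bpf.h>
#include <linux/bpf_perf_event.h>
#include <bpf/bpf_helpers.h>

// Program array holding the generic unwinding programs as tail call targets.
struct {
    __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
    __uint(max_entries, 8);
    __uint(key_size, sizeof(__u32));
    __uint(value_size, sizeof(__u32));
} unwind_progs SEC(".maps");

#define UNWIND_NATIVE 0 // illustrative index of the native unwinder

SEC("perf_event")
int native_tracer_entry(struct bpf_perf_event_data *ctx)
{
    // ... record the CPU registers from ctx->regs for the unwinder ...

    // Hand over to the generic stack unwinding routine; on success this
    // call does not return.
    bpf_tail_call(ctx, &unwind_progs, UNWIND_NATIVE);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```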
In the following proposal, options are evaluated for using additional hooks as entry points for
stack unwinding in order to enable off-CPU profiling capabilities.

With tracepoints and kprobes, the Linux kernel provides two instrumentation mechanisms that allow
monitoring and analyzing the behavior of the system. To keep the impact of profiling minimal,
tracepoints are preferred over kprobes, as the former are more performant and statically defined in
the Linux kernel code.

A list of all tracepoints in the scope of the Linux kernel scheduler can be retrieved with
`sudo bpftrace -l 'tracepoint:sched*'`. While most of these tracepoints in the Linux kernel
scheduler are specific to a process, kernel, or other event, this proposal focuses on generic
scheduler tracepoints.

## Technical background

In the Linux kernel, it is the scheduler's responsibility to manage tasks[^2] and provide them with
CPU resources. In this concept, [`__schedule()`](https://github.com/torvalds/linux/blob/5be63fc19fcaa4c236b307420483578a56986a37/kernel/sched/core.c#L6398)
is the central function that takes CPU resources from tasks, grants them to other tasks, and
performs the CPU context switch.

## Risks

All of the following proposed options face the same common challenge: it is possible to overload
the system by profiling every scheduling event. All proposed options mitigate this risk by

1. Ignoring the scheduler's idle task.
2. Using a sampling approach to reduce the number of profiled scheduling events. The exact amount of
   sampling should be configurable.

The OTel Profiling Agent uses a technique that can be described as "lazy loading". Every time
the eBPF program of the OTel Profiling Agent [encounters a PID that is unknown](https://github.com/open-telemetry/opentelemetry-ebpf-profiler/blob/dd0c20701b191975d6c13408c92d7fed637119da/support/ebpf/native_stack_trace.ebpf.c#L845-L846),
it informs the user space component about this new process. The entry hook for off-CPU profiling
will also have to perform this check and inform the user space component, using the existing
mechanism and inhibition strategy. If performance issues arise in this existing mechanism, the
inhibition algorithm should be revisited and updated.

## Option A

Attach stack unwinding functionality to the tracepoint `tracepoint:sched:sched_switch`. This
tracepoint is called every time the Linux kernel scheduler takes resources from a task before
assigning these resources to another task.

Similar to the eBPF program [`perf_event/native_tracer_entry`](https://github.com/open-telemetry/opentelemetry-ebpf-profiler/blob/dd0c20701b191975d6c13408c92d7fed637119da/support/ebpf/native_stack_trace.ebpf.c#L860C6-L860C36),
a new eBPF program of type `tracepoint` needs to be written that can act as entry point and tail
call into the generic stack unwinding routines, as sketched below.
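The following eBPF C sketch outlines what such a tracepoint entry program could look like. The
program array `unwind_progs`, the index `UNWIND_NATIVE`, and the fixed 3% sampling rate are
hypothetical placeholders; a real implementation would reuse the agent's existing maps and make
the rate configurable:

```c
// Hypothetical Option A entry program; identifiers are illustrative.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
    __uint(max_entries, 8);
    __uint(key_size, sizeof(__u32));
    __uint(value_size, sizeof(__u32));
} unwind_progs SEC(".maps");

#define UNWIND_NATIVE 0    // illustrative tail call index
#define SAMPLE_PERCENT 3   // should be made configurable

SEC("tracepoint/sched/sched_switch")
int tracepoint__sched_switch(void *ctx)
{
    __u64 id = bpf_get_current_pid_tgid();

    if ((__u32)id == 0)
        return 0; // skip the scheduler's idle task

    if (bpf_get_prandom_u32() % 100 >= SAMPLE_PERCENT)
        return 0; // overload prevention: profile only a fraction of events

    // ... report unknown PIDs to user space via the existing mechanism,
    //     collect the register state, then start the stack unwinding ...
    bpf_tail_call(ctx, &unwind_progs, UNWIND_NATIVE);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```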
### Concept

The following [bpftrace](https://github.com/bpftrace/bpftrace) script showcases Option A:
```bash
#!/usr/bin/env bpftrace

tracepoint:sched:sched_switch
{
    if (tid == 0) {
        // Skip the idle task
        return;
    }
    if (rand % 100 >= 3) {
        // Overload prevention - make sure only 3% of scheduling events are profiled
        return;
    }

    printf("PID %d is taken off from CPU\n", pid);
    printf("%s", ustack());
    printf("\n");
}
```

## Option B

Use a two-step method to not only get stack information but also record how long tasks were
taken off CPU.

In a first step, use the tracepoint `tracepoint:sched:sched_switch` to record which task was taken
off CPU along with a timestamp. In a second hook at `kprobe:finish_task_switch.isra.0`, check
whether the task was seen before. If the task was seen before in the tracepoint, calculate the time
the task was off CPU and unwind the stack. Only the second step should tail call into further stack
unwinding routines, similar to [`perf_event/native_tracer_entry`](https://github.com/open-telemetry/opentelemetry-ebpf-profiler/blob/dd0c20701b191975d6c13408c92d7fed637119da/support/ebpf/native_stack_trace.ebpf.c#L860C6-L860C36).
To communicate tasks between the two hooks, a `BPF_MAP_TYPE_LRU_HASH` eBPF map should be used, with
the return value of `bpf_get_current_pid_tgid()` as key and the timestamp in nanoseconds as value
(see the eBPF C sketch after the concept script below).

### Concept

The following [bpftrace](https://github.com/bpftrace/bpftrace) script showcases Option B:
```bash
#!/usr/bin/env bpftrace

tracepoint:sched:sched_switch
{
    if (tid == 0) {
        // Skip the idle task
        return;
    }
    if (rand % 100 >= 3) {
        // Overload prevention - make sure only 3% of scheduling events are profiled
        return;
    }
    @task[tid] = nsecs;
}

kprobe:finish_task_switch.isra.0
/@task[tid]/
{
    $off_start = @task[tid];
    delete(@task[tid]);
    printf("PID %d was off CPU for %d nsecs\n", pid, nsecs - $off_start);
    printf("%s", ustack());
    printf("\n");
}
```
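The following eBPF C sketch illustrates the two hooks and the `BPF_MAP_TYPE_LRU_HASH` map
connecting them. Map and program names are illustrative, the 3% sampling rate is a placeholder,
and the tail call into the generic unwinding routines is elided:

```c
// Hypothetical sketch of Option B's two-hook approach.
#include <linux/bpf.h>
#include <linux/ptrace.h>
#include <bpf/bpf_helpers.h>

// Timestamps of sampled sched_switch events, keyed by the return value of
// bpf_get_current_pid_tgid(). The LRU semantics bound memory usage even if
// an entry is never consumed by the second hook.
struct {
    __uint(type, BPF_MAP_TYPE_LRU_HASH);
    __uint(max_entries, 65536);
    __type(key, __u64);
    __type(value, __u64); // timestamp in nanoseconds
} off_cpu_start SEC(".maps");

SEC("tracepoint/sched/sched_switch")
int tracepoint__sched_switch(void *ctx)
{
    __u64 id = bpf_get_current_pid_tgid();

    if ((__u32)id == 0)
        return 0; // skip the scheduler's idle task
    if (bpf_get_prandom_u32() % 100 >= 3)
        return 0; // overload prevention: sample only ~3% of events

    __u64 ts = bpf_ktime_get_ns();
    bpf_map_update_elem(&off_cpu_start, &id, &ts, BPF_ANY);
    return 0;
}

// Depending on the kernel build, the symbol may carry a compiler-generated
// suffix, e.g. finish_task_switch.isra.0.
SEC("kprobe/finish_task_switch")
int kprobe__finish_task_switch(struct pt_regs *ctx)
{
    __u64 id = bpf_get_current_pid_tgid();
    __u64 *start = bpf_map_lookup_elem(&off_cpu_start, &id);

    if (!start)
        return 0; // this task was not sampled in sched_switch

    __u64 off_time = bpf_ktime_get_ns() - *start;
    bpf_map_delete_elem(&off_cpu_start, &id);
    bpf_printk("PID %u was off CPU for %llu nsecs", (__u32)(id >> 32), off_time);

    // ... attach off_time to the sample and tail call into the generic
    //     stack unwinding routines ...
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```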
## Sampling vs. Aggregation

Both proposed options leverage sampling techniques for off-CPU profiling. While aggregation in the
eBPF space can potentially reduce performance overhead by communicating only aggregated data to the
user space component, it introduces additional complexity in managing the data. Additionally, it can
be more challenging to analyze the aggregated data effectively, as it requires careful consideration
of aggregation techniques.
As the architecture of the stack unwinding routines in the OTel Profiling Agent is focused on a
sampling approach, the proposed options follow this idea.

## Limitations

Both proposed options focus on events of the Linux kernel scheduler. The resulting data is therefore
limited to events triggered by the Linux kernel scheduler. Scheduling events of language-specific
and language-internal schedulers, like the Go runtime scheduler, are not covered by the proposed
general approach.

# Author's preference

My preference is Option B, as it provides latency information in addition to off-CPU stack traces,
which is crucial for latency analysis.

Option B might be a bit more complex, as it utilizes two hooks along with an eBPF map for keeping
state across these two hooks, compared to Option A with its single hook on
`tracepoint:sched:sched_switch`. The additional hook on `kprobe:finish_task_switch` for Option B
might also introduce some latency, as kprobes are less performant than tracepoints. But the latency
information along with the off-CPU stack trace justifies these drawbacks from my point of view.

As both options attach to very frequently called scheduler events, they face the same risks.
Mitigating these risks with the [described approaches](#risks) is essential.

# Decision

In https://github.com/open-telemetry/opentelemetry-ebpf-profiler/pull/144 it was agreed to go
forward with Option B.

[^1]: Inspired by `Systems Performance` by Brendan Gregg, Figure 1.3 `Disk I/O latency example`.
[^2]: The scheduler does not know about the concept of processes and process groups and treats
everything as a task.