
feat(bpf): use time window for bpf sampling to replace per call based sampling #1723

Open · wants to merge 1 commit into main from sample-window
Conversation

rootfs (Contributor) commented on Aug 22, 2024

Per @vimalk78's finding, per-call-based BPF sampling has very large CPU time variations.

This PR changes to time-window-based sampling. The CPU time is much more consistent and close to the probing results, while the overhead is reduced even further.

Disclaimer: some of the code is generated by ChatGPT.

| Active Time (ms) | Idle Time (ms) | Average kepler_sched_switch_trace bpf runtime (ns) |
| --- | --- | --- |
| 5 | 95 | 400 |
| 20 | 80 | 875 |
| 50 | 50 | 1500 |
| 80 | 20 | 2100 |
| 1000 (default) | 0 | 2500 |

github-actions bot commented on Aug 22, 2024

🤖 SeineSailor

Here is a concise summary of the pull request changes:

Summary: This pull request introduces significant changes to the BPF (Berkeley Packet Filter) implementation, replacing per-call sampling with time window-based sampling. This new approach reduces CPU time variation and overhead. Additionally, a minor internal change is made to the dcgm.Init() function call.

Key Modifications:

  1. Time Window-Based Sampling for BPF: The pull request replaces per-call sampling with time window-based sampling, reducing CPU time variation and overhead. This change affects multiple files, including kepler.bpf.h, exporter.go, kepler_bpfeb.go, kepler_bpfel.go, config.go, test_bpfeb.go, and test_bpfel.go.
  2. Global Parameters and BPF Maps: New global parameters for tracking and non-tracking periods are added, along with two BPF maps to manage the tracking state.
  3. Updated do_kepler_sched_switch_trace Function: The function now checks a tracking flag and updates the sampling state based on elapsed time (see the sketch after this list).
  4. Minor Internal Change: The dcgm.Init() function call is updated to use config.GetDCGMHostEngineEndpoint() instead of config.DCGMHostEngineEndpoint.
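As a rough illustration of item 3 (a minimal sketch, not the PR's exact code: `ACTIVE_WINDOW_NS`, `IDLE_WINDOW_NS`, and `start_time_map` are assumed names here, while `tracking_flag_map` appears in the review hunks below):

```c
// Hedged sketch of time-window sampling: sample calls during an active
// window, skip them during an idle window, and flip state once the
// current window has elapsed.
static __always_inline int should_sample(void)
{
    u32 key = 0;
    u64 now = bpf_ktime_get_ns();
    u32 *tracking = bpf_map_lookup_elem(&tracking_flag_map, &key);
    u64 *start = bpf_map_lookup_elem(&start_time_map, &key);
    if (!tracking || !start)
        return 1; // fail open: sample every call
    u64 window = *tracking ? ACTIVE_WINDOW_NS : IDLE_WINDOW_NS;
    if (now - *start >= window) {
        *tracking = !*tracking; // flip active <-> idle
        *start = now;           // restart the window
    }
    return *tracking;
}
```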

Impact on Codebase:

  • The BPF implementation is significantly altered, but the external interface remains unchanged.
  • The code generated by ChatGPT may require further review.

Suggestions for Improvement:

  • It would be beneficial to include more detailed comments or documentation explaining the reasoning behind the changes and how they improve the BPF implementation.
  • Consider adding tests to verify the correctness of the new time window-based sampling approach.
  • Review the code generated by ChatGPT to ensure it meets the project's coding standards and best practices.

rootfs (Contributor, Author) commented on Aug 22, 2024

converting to draft, pending test results.

@rootfs rootfs marked this pull request as draft August 22, 2024 15:22
vimalk78 (Collaborator) commented:
Test results:

Below is a comparison of two keplers, one with the sampling window enabled (100 ms active, 1000 ms idle), the other without sampling.

We can see that on bare metal, the two keplers produce very close values for package power and core power, because the ratio of BPF CPU time with sampling is very close to that without sampling.

[chart: process cpu time, exhaustive vs sampling]

[chart: process core joules, exhaustive vs sampling]

[chart: process package joules, exhaustive vs sampling]

[chart: kepler cpu time, exhaustive vs sampling]
As expected, the kepler with sampling consumes less CPU time and fewer CPU instructions than the kepler without sampling.

@rootfs rootfs force-pushed the sample-window branch 6 times, most recently from 113e3bc to d88e8b8 Compare September 16, 2024 19:53
rootfs (Contributor, Author) commented on Sep 16, 2024

@dave-tucker @sthaha @marceloamaral PTAL, thanks

@rootfs rootfs marked this pull request as ready for review September 16, 2024 20:30
marceloamaral (Collaborator) commented:
@rootfs @dave-tucker I’m concerned about the impact on VM power estimation. If we're undercounting CPU time, the power consumption will be underestimated as well.

To address this, we need to extrapolate the results, similar to how Linux handles counter multiplexing. For instance, if we collected data for only 1 second out of a 5-second window, we should multiply the results by 5 to estimate for the full 5 seconds.

All we need to do is track the collection interval and adjust the results accordingly.
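A minimal sketch of that adjustment, assuming the active/idle window lengths are known (this mirrors the `value * time_enabled / time_running` scaling Linux applies to multiplexed perf counters):

```c
// Illustrative only: scale CPU time sampled during the active window up
// to the full active + idle period, e.g. 1 s of data in a 5 s window
// multiplies the result by 5.
static double extrapolate_cpu_time(double sampled_ms,
                                   double active_window_ms,
                                   double idle_window_ms)
{
    return sampled_ms * (active_window_ms + idle_window_ms)
                      / active_window_ms;
}
```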

dave-tucker (Collaborator) left a comment:

Some comments on the code.

I'm not going to repeat myself, so here is my canned response: #1685 (review)

@vimalk78 I'm not surprised the totals or the estimation are pretty much the same.
But can you run the same test again and show the distribution of CPU time for each process on the system?

My bet is we will find that per-process CPU time, CPU instructions, etc. are totally off.
Comparing to top or scaphandre etc. would yield very different results.

Given we only care about CPU time (at the moment), the most efficient option is to not use eBPF at all and just read utime from /proc/$pid/stat.
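For reference, a minimal sketch of that alternative (not Kepler code; field positions follow proc(5), where utime is field 14):

```c
// Read a process's user-mode CPU time from /proc/<pid>/stat.
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

// Returns utime in seconds, or -1.0 on error.
static double read_utime_seconds(pid_t pid)
{
    char path[64], buf[1024];
    snprintf(path, sizeof(path), "/proc/%d/stat", (int)pid);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1.0;
    size_t n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    // comm (field 2) may contain spaces, so parse from the last ')'.
    char *p = strrchr(buf, ')');
    if (!p)
        return -1.0;
    unsigned long utime;
    // Skip state (3), ppid..tpgid (4-8), flags (9), fault counters
    // (10-13), then read utime (14) in clock ticks.
    if (sscanf(p + 2, "%*c %*d %*d %*d %*d %*d %*u %*lu %*lu %*lu %*lu %lu",
               &utime) != 1)
        return -1.0;
    return (double)utime / (double)sysconf(_SC_CLK_TCK);
}
```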


```c
// BPF map to track whether we are in the tracking period or not
struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
```
dave-tucker (Collaborator):

consider using a PERCPU_ARRAY since this would have the effect of making the time window per-cpu also, which may be desirable.
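A minimal sketch of the suggested definition (mirroring the map in the hunk above with only the type changed; the key/value layout is assumed):

```c
// Per-CPU variant: each CPU gets its own copy of the flag, so the
// sampling window advances independently on each CPU.
struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, u32);
	__type(value, u32);
} tracking_flag_map SEC(".maps");
```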

rootfs (Contributor, Author):

Changed to PERCPU_ARRAY.

```c
counter_sched_switch--;
// Retrieve tracking flag and start time
u32 key = 0;
u32 *tracking_flag = bpf_map_lookup_elem(&tracking_flag_map, &key);
```
dave-tucker (Collaborator):

Given that map lookups are the most expensive part of the eBPF code, it would be better to reduce them where possible. There's no reason to store tracking_flag in a map as far as I can tell, since its value doesn't need to persist between invocations.

rootfs (Contributor, Author):

There is an idea of having the Kepler userspace program set the tracking flag. The actual mechanism is not quite clear yet; I will remove this map if that turns out to be a dead end.

```go
	CPUArchOverride = getConfig("CPU_ARCH_OVERRIDE", "")
	MaxLookupRetry = getIntConfig("MAX_LOOKUP_RETRY", defaultMaxLookupRetry)
	BPFSampleRate = getIntConfig("EXPERIMENTAL_BPF_SAMPLE_RATE", 0)
	BPFActiveSampleWindowMS = getIntConfig("EXPERIMENTAL_BPF_ACTIVE_SAMPLE_WINDOW_MS", 1000)
```
dave-tucker (Collaborator):

Why are the default values in the code 20 and 80, but here they are 1000 and 0?
If this is for coexistence with the other sampling feature, it may be easier to set them all to 0 and update the eBPF code to only evaluate this code path if both ACTIVE and IDLE values are > 0 (see the sketch below).
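A sketch of how that guard could look at the top of the handler (the window variables are illustrative, and `should_sample()` is the hypothetical helper sketched earlier in this thread):

```c
// Fall through to the existing behaviour unless both windows are set.
if (active_window_ms > 0 && idle_window_ms > 0 && !should_sample())
	return 0; // outside the active window: skip this event
```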

rootfs (Contributor, Author):

Fixed.

rootfs (Contributor, Author) commented on Sep 17, 2024

@marceloamaral good point! At the moment the sampled CPU time is not extrapolated. We can consider different scaling factors. One approach in my plan is to find the max and min CPU time from each sample and use the mean CPU time to extrapolate over the entire active + idle duration (rough sketch below). This will account for variable CPU utilization conditions. If that proves effective, we will then discuss removing the EXPERIMENTAL prefix from these params. wdyt?
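A rough sketch of that idea (all names hypothetical; the midrange of min/max stands in for the mean):

```c
// Illustrative: estimate a representative per-sample CPU time from the
// min/max observed in the active window, then extrapolate it over the
// whole active + idle duration.
static double extrapolate_from_samples(double min_sample_ms,
                                       double max_sample_ms,
                                       double active_ms, double idle_ms)
{
	double mean_ms = (min_sample_ms + max_sample_ms) / 2.0;
	return mean_ms * (active_ms + idle_ms) / active_ms;
}
```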
