feat(bpf): use time window for bpf sampling to replace per call based sampling #1723
Conversation
🤖 SeineSailor Here is a concise summary of the pull request changes:

Summary: This pull request introduces significant changes to the BPF (Berkeley Packet Filter) implementation, replacing per-call sampling with time window-based sampling. This new approach reduces CPU time variation and overhead. Additionally, a minor internal change is made to the …

Key Modifications: …

Impact on Codebase: …

Suggestions for Improvement: …
Converting to draft, pending test results.
Force-pushed 113e3bc → d88e8b8
@dave-tucker @sthaha @marceloamaral PTAL, thanks
@rootfs @dave-tucker I’m concerned about the impact on VM power estimation. If we're undercounting CPU time, the power consumption will be underestimated as well. To address this, we need to extrapolate the results, similar to how Linux handles counter multiplexing. For instance, if we collected data for only 1 second out of a 5-second window, we should multiply the results by 5 to estimate for the full 5 seconds. All we need to do is track the collection interval and adjust the results accordingly.
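A minimal sketch of that extrapolation, with illustrative names (not Kepler's actual API): scale the sampled CPU time by the ratio of the full window to the sampled portion.

```c
#include <stdint.h>

// Sketch only: extrapolate CPU time sampled during the active window
// to the full active + idle window, e.g. 1 s sampled out of a 5 s
// window multiplies the result by 5.
static uint64_t extrapolate_cpu_time_ns(uint64_t sampled_ns,
                                        uint64_t active_window_ms,
                                        uint64_t idle_window_ms)
{
    if (active_window_ms == 0)
        return 0;
    return sampled_ns * (active_window_ms + idle_window_ms) / active_window_ms;
}
```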
Some comments on the code.
I'm not going to repeat myself, so here is my canned response: #1685 (review)
@vimalk78 I'm not surprised the totals or the estimation are pretty much the same. But can you run the same test again and show the distribution of CPU time for each process on the system? My bet is that we will find per-process CPU time, CPU instructions, etc. are totally off; comparing to top or scaphandre would yield very different results.

Given that we only care about CPU time (at the moment), the most efficient option is to not use eBPF at all and just read utime from /proc/$pid/stat.
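For reference, a minimal sketch of that approach (field positions per proc(5); the function name is illustrative and error handling is kept short):

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

// Read a process's user-mode CPU time (utime, field 14 of
// /proc/<pid>/stat) in clock ticks; divide by sysconf(_SC_CLK_TCK)
// to convert to seconds.
long read_utime_ticks(pid_t pid)
{
    char path[64], buf[1024];
    snprintf(path, sizeof(path), "/proc/%d/stat", (int)pid);

    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    size_t n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';

    // comm (field 2) may contain spaces, so resume after the last ')'
    char *p = strrchr(buf, ')');
    if (!p)
        return -1;

    unsigned long utime = 0;
    // skip fields 3-13: state ppid pgrp session tty_nr tpgid flags
    //                   minflt cminflt majflt cmajflt
    if (sscanf(p + 1, " %*c %*d %*d %*d %*d %*d %*u %*lu %*lu %*lu %*lu %lu",
               &utime) != 1)
        return -1;
    return (long)utime;
}
```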
```c
// BPF map to track whether we are in the tracking period or not
// (single-slot array; key/value types match the lookup code below)
struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);
    __type(key, u32);
    __type(value, u32);
} tracking_flag_map SEC(".maps");
```
Consider using a PERCPU_ARRAY, since this would have the effect of making the time window per-CPU as well, which may be desirable.
Changed to PERCPU_ARRAY.
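For reference, a sketch of the per-CPU variant (the rest of the definition mirrors the array above):

```c
// Per-CPU variant: each CPU gets its own copy of the single-slot
// flag, so the sampling time window is tracked independently per CPU.
struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
    __uint(max_entries, 1);
    __type(key, u32);
    __type(value, u32);
} tracking_flag_map SEC(".maps");
```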
```c
counter_sched_switch--;
// Retrieve tracking flag and start time
u32 key = 0;
u32 *tracking_flag = bpf_map_lookup_elem(&tracking_flag_map, &key);
```
Given that map lookups are the most expensive part of the eBPF code, it would be better to reduce them where possible. There's no reason to store tracking_flag in a map as far as I can tell, since its value doesn't need to persist between invocations.
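A sketch of one way to avoid the lookup, assuming a libbpf-style program (section and identifier names are illustrative): libbpf backs globals with an implicit data map, but reads compile to direct loads rather than a helper call.

```c
#include <linux/types.h>
#include <bpf/bpf_helpers.h>

__u32 tracking_flag = 0; // writable from userspace via the skeleton's .data

SEC("raw_tp/sched_switch")
int trace_sched_switch(void *ctx)
{
    if (!tracking_flag)
        return 0; // outside the active sampling window
    /* ... account CPU time as before ... */
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```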
We are thinking of having the Kepler userspace program set the tracking flag. The actual mechanism is not quite clear yet; I will remove this map if that turns out to be a dead end.
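One possible mechanism, sketched with standard libbpf userspace calls (the pin path is hypothetical):

```c
#include <bpf/bpf.h>
#include <linux/types.h>

// Hypothetical toggle: look up the pinned map and flip the flag.
int set_tracking_flag(int enabled)
{
    int fd = bpf_obj_get("/sys/fs/bpf/tracking_flag_map"); /* hypothetical pin path */
    if (fd < 0)
        return -1;

    __u32 key = 0;
    __u32 value = enabled ? 1 : 0;
    // Note: with a PERCPU_ARRAY the value argument must be an array
    // with one element per possible CPU; a plain ARRAY takes one value.
    return bpf_map_update_elem(fd, &key, &value, BPF_ANY);
}
```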
pkg/config/config.go (outdated)
```go
CPUArchOverride         = getConfig("CPU_ARCH_OVERRIDE", "")
MaxLookupRetry          = getIntConfig("MAX_LOOKUP_RETRY", defaultMaxLookupRetry)
BPFSampleRate           = getIntConfig("EXPERIMENTAL_BPF_SAMPLE_RATE", 0)
BPFActiveSampleWindowMS = getIntConfig("EXPERIMENTAL_BPF_ACTIVE_SAMPLE_WINDOW_MS", 1000)
```
Why are the default values in the code 20 and 80, but here they are 1000 and 0? If this is for coexistence with the other sampling feature, it may be easier to set them all to 0 and update the eBPF code to only evaluate this code path if both the ACTIVE and IDLE values are > 0.
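A sketch of that guard on the eBPF side (identifier names are illustrative; userspace would set these read-only globals before load):

```c
#include <linux/types.h>
#include <bpf/bpf_helpers.h>

volatile const __u32 active_sample_window_ms = 0;
volatile const __u32 idle_sample_window_ms = 0;

// Only evaluate the time-window code path if both windows are
// configured (> 0), so it can coexist with the other sampling feature.
static __always_inline int time_window_sampling_enabled(void)
{
    return active_sample_window_ms > 0 && idle_sample_window_ms > 0;
}
```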
fixed
@marceloamaral good point! At the moment the sampled CPU time is not extrapolated. We can consider different scaling factors. One approach in my plan is to find the max and min CPU time from each sample and use the mean CPU time to extrapolate over the entire active + idle duration; this will account for variable CPU utilization conditions. If that proves effective, we will then discuss removing the …
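A sketch of that plan, with illustrative names: average the per-sample CPU time, then scale it over the whole active + idle duration.

```c
#include <stddef.h>
#include <stdint.h>

// Sketch only: extrapolate from the mean of per-sample CPU times,
// assuming the rate observed while sampling also holds during the
// unsampled (idle) part of the window.
static uint64_t extrapolate_mean_ns(const uint64_t *samples, size_t n,
                                    uint64_t active_ms, uint64_t idle_ms)
{
    if (n == 0 || active_ms == 0)
        return 0;
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += samples[i];
    return (sum / n) * (active_ms + idle_ms) / active_ms;
}
```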
… sampling (Signed-off-by: Huamin Chen <[email protected]>)

Force-pushed d88e8b8 → 5c3f51e
Force-pushed 6176d95 → 5c3f51e
From @vimalk78's findings, per-call based BPF sampling has very large CPU time variations. This PR changes to time window based sampling: the CPU time is much more consistent and close to the probing results, while the overhead is reduced even further.

Disclaimer: some of the code is generated by ChatGPT.
Benchmark results (table omitted): kepler_sched_switch_trace, BPF runtime (ns).