Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose BPF map kernel memory use by tracing policy #2984

Merged
merged 14 commits into from
Oct 14, 2024

Conversation

mtardy
Copy link
Member

@mtardy mtardy commented Oct 7, 2024

This allows users to understand how much kernel memory the BPF maps loaded by the program loaded by their TracingPolicy consume. This information can be retrieved from the gRPC API, by using tetra for example, or from the metrics server.

gRPC API with tetra

$ tetra tp list
ID   NAME                 STATE     FILTERID   NAMESPACE   SENSORS          KERNELMEMORY
1    file-monitoring      enabled   0          (global)    generic_kprobe   5.20 MB
2    sys-symlink-passwd   enabled   0          (global)    generic_kprobe   3.54 MB
3    accept               enabled   0          (global)    generic_kprobe   5.87 MB
4    connect              enabled   0          (global)    generic_kprobe   4.46 MB

Metrics server using curl

$ curl localhost:2112/metrics | grep kernel_memory
# HELP tetragon_tracingpolicy_kernel_memory_bytes The amount of kernel memory in bytes used by policy's sensors non-shared BPF maps (memlock).
# TYPE tetragon_tracingpolicy_kernel_memory_bytes gauge
tetragon_tracingpolicy_kernel_memory_bytes{policy="accept"} 5.865864e+06
tetragon_tracingpolicy_kernel_memory_bytes{policy="connect"} 4.458824e+06
tetragon_tracingpolicy_kernel_memory_bytes{policy="file-monitoring"} 5.197112e+06
tetragon_tracingpolicy_kernel_memory_bytes{policy="sys-symlink-passwd"} 3.543976e+06

Note: this PR looks a little bit like #2090.

Expose BPF map kernel memory use by tracing policy via the gRPC API and the metrics. Use `tetra tp list` to see the breakdown of BPF map memory use by policy or look for the `tetragon_tracingpolicy_kernel_memory_bytes` metric.

@mtardy mtardy added the release-note/minor This PR introduces a minor user-visible change label Oct 7, 2024
Copy link

netlify bot commented Oct 7, 2024

Deploy Preview for tetragon ready!

Name Link
🔨 Latest commit 4a92d64
🔍 Latest deploy log https://app.netlify.com/sites/tetragon/deploys/67056c4865969f0008d97c19
😎 Deploy Preview https://deploy-preview-2984--tetragon.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@mtardy mtardy force-pushed the pr/mtardy/memory-metrics-policy branch 3 times, most recently from 30c69ba to 07ae544 Compare October 8, 2024 15:17
In order to specify another path than the base path, for example to only
look for a specific policy, or sensor, or program thanks to the
hierarchical organization of the Tetragon BPF fs.

Signed-off-by: Mahe Tardy <[email protected]>
@mtardy mtardy force-pushed the pr/mtardy/memory-metrics-policy branch from 07ae544 to 46ef8f4 Compare October 8, 2024 15:40
mtardy added 3 commits October 8, 2024 19:22
Also add _bytes suffix to memlock field for precision. This is mostly
for consistency and ease of use.

Signed-off-by: Mahe Tardy <[email protected]>
Introduce a new field that will be used in the next commit to inform the
user how much kernel memory is roughly used by the policy.

Signed-off-by: Mahe Tardy <[email protected]>
@mtardy mtardy force-pushed the pr/mtardy/memory-metrics-policy branch 2 times, most recently from 4a92d64 to 582b2f8 Compare October 8, 2024 17:35
@mtardy mtardy marked this pull request as ready for review October 8, 2024 17:35
@mtardy mtardy requested a review from a team as a code owner October 8, 2024 17:35
@mtardy mtardy requested a review from kkourt October 8, 2024 17:35
pkg/sensors/program/map.go Outdated Show resolved Hide resolved
@mtardy mtardy changed the title Add kernel memory use by tracing policy Expose BPF map kernel memory use by tracing policy Oct 8, 2024
@mtardy mtardy force-pushed the pr/mtardy/memory-metrics-policy branch 3 times, most recently from 2ba13bb to e84d69a Compare October 9, 2024 11:08
@mtardy mtardy requested a review from olsajiri October 9, 2024 17:40
Copy link
Contributor

@kkourt kkourt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this looks great!

Please find some comments below.

cmd/tetra/common/utils.go Show resolved Hide resolved
pkg/sensors/program/loader.go Show resolved Hide resolved
pkg/sensors/manager_test.go Show resolved Hide resolved
pkg/sensors/program/map.go Outdated Show resolved Hide resolved
For improved readability. Use the JSON output in both commands for
accurate results.

Now the output is adapted to the nearest sensible unit:
	[...]   NAMESPACE   SENSORS          KERNELMEMORY
	[...]   (global)    generic_kprobe   4.46 MB
	[...]   (global)    generic_kprobe   5.20 MB
	[...]   (global)    generic_kprobe   3.54 MB
	[...]   (global)    generic_kprobe   5.87 MB

Signed-off-by: Mahe Tardy <[email protected]>
These operations might eventually be taken care of by cilium/ebpf
directly and we can remove this, see 21be59580a51 ("map: add map_extra,
memlock, frozen to MapInfo").

I moved this to the bpf package as it made little sense to import
pkg/bugtool to just perform those operations on maps.

Signed-off-by: Mahe Tardy <[email protected]>
We keep the set of maps that are loaded for a program so that we can do
accounting on the memory use of those maps. Not doing the accounting
immediately allows for more flexibility in computing which part of the
memory is actually shared between multiple programs/sensors.

Note that we do that at the lower level of the loader because reading
the Program.Maps, we might miss some of them (like heap maps for
example). Indeed, the one you can list loaded from loadMaps and such are
the one that we might pin for:
* sharing them amongst programs
* using them in userspace

Signed-off-by: Mahe Tardy <[email protected]>
Retrieve memlock from each map loaded by a sensor.

Signed-off-by: Mahe Tardy <[email protected]>
@mtardy mtardy force-pushed the pr/mtardy/memory-metrics-policy branch 2 times, most recently from 0cbc81a to 08f6962 Compare October 10, 2024 15:58
Avoid to double count for shared maps, like the execve_maps for example.

I decided keep a record of global maps for accounting.  At first I
implemented it by reading the global BPF fs directory, this might use
too much CPU since we should be aware of what are the current global
maps since we load them ourselves, so instead I hooked in the loading of
global maps.

Signed-off-by: Mahe Tardy <[email protected]>
This will now be used not only for the state but also for the kernel
memory so I'm making the name more generic.

Signed-off-by: Mahe Tardy <[email protected]>
Similarly to TracingPolicy state, you can retrieve this information
through the gRPC API by listing the policies or using the metrics
interface.

Signed-off-by: Mahe Tardy <[email protected]>
Fix an unexpected panic along.

Signed-off-by: Mahe Tardy <[email protected]>
So that this code is triggered via tests and doesn't fail silently if
cilium/ebpf changes its behavior, like suggested in cilium/ebpf#1566.

Signed-off-by: Mahe Tardy <[email protected]>
@mtardy mtardy force-pushed the pr/mtardy/memory-metrics-policy branch from 08f6962 to cbb1b35 Compare October 10, 2024 16:00
@mtardy mtardy requested a review from kkourt October 10, 2024 16:01
@jrfastab jrfastab merged commit fc59f62 into main Oct 14, 2024
46 checks passed
@jrfastab jrfastab deleted the pr/mtardy/memory-metrics-policy branch October 14, 2024 16:38
@@ -63,6 +70,7 @@ func collect(ch chan<- prometheus.Metric) {
for _, policy := range list.Policies {
state := policy.State
counters[state]++
ch <- policyKernelMemory.MustMetric(float64(policy.KernelMemoryBytes), policy.Name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to report this metric for policies that are disabled or in error state?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mmh I haven't tried, hopefully it should show up as zero if it's disabled since progs are unloaded and similarly for error 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I tried and the memory state is actually not refreshed properly on disabled, it needs a fix, thanks for the remark.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-note/minor This PR introduces a minor user-visible change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants