Implement monitoring of major and minor page faults based on the page fault tracepoint
Description
In this version, events come from three sources. The first is the initial page fault information, which we read from /proc, process into per-thread data, and store in the thread table; we then wrap these threads as kindlingevent objects to stay consistent with the data interface of the upper-level collector. The second is the perf buffer, which carries captured page fault events: when the kernel captures a process exit event, it sends the data to user space, and the collector consumes it. The third is an eBPF map: user space reads the page fault values from the map at a fixed interval (5 seconds by default).
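As a rough illustration of the first source, the sketch below reads per-thread page fault counters from /proc. It assumes the counters are taken from `/proc/<pid>/task/<tid>/stat` (fields 10 and 12, `minflt` and `majflt`, as documented in proc(5)); the description only says the data comes from /proc, so the exact file and the helper names `parse_stat_faults` and `read_thread_faults` are assumptions, not the code in this PR.

```python
import os
from typing import Dict, Tuple


def parse_stat_faults(stat_line: str) -> Tuple[int, int]:
    """Return (minflt, majflt) from one line of a /proc stat file."""
    # Field 2 (comm) is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the LAST closing parenthesis.
    _, _, rest = stat_line.rpartition(")")
    fields = rest.split()
    # fields[0] is field 3 (state), so field 10 (minflt) is fields[7]
    # and field 12 (majflt) is fields[9].
    return int(fields[7]), int(fields[9])


def read_thread_faults(pid: int) -> Dict[int, Tuple[int, int]]:
    """Map each tid of `pid` to its (minflt, majflt) counters."""
    faults = {}
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/stat") as f:
                faults[int(tid)] = parse_stat_faults(f.read())
        except (FileNotFoundError, ProcessLookupError):
            continue  # the thread exited between listdir() and open()
    return faults
```

Calling `read_thread_faults(os.getpid())` on a Linux host returns one entry per thread of the current process, which is the thread-granularity shape described above.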
Motivation and Context
We need page fault information to assess the memory usage of processes and threads on the system, so that we can troubleshoot problems and faults. We provide two metrics, kindling_pagefault_major_total and kindling_pagefault_minor_total, with the following meanings:
(1) kindling_pagefault_major_total: the number of major page faults captured on the system. Scenarios such as mmap and swapping can cause major page faults. Each value is associated with the specific thread that generated the faults, and we identify it with 11 labels: {node, namespace, workload_kind, workload_name, service, pod, container, container_id, IP, PID, tid}.
(2) kindling_pagefault_minor_total: the number of minor page faults captured on the system. For example, allocating memory with malloc and then using it causes minor page faults. Each value is likewise associated with the specific thread that generated the faults and is identified by the same 11 labels: {node, namespace, workload_kind, workload_name, service, pod, container, container_id, IP, PID, tid}.
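To make the label set concrete, the sketch below renders one sample of either metric in a Prometheus-style text line. The label names are copied from the list above as written; the helper `format_sample` and all label values in the usage note are hypothetical, and the sketch does not escape special characters in label values. It is not the exporter used by Kindling.

```python
LABELS = (
    "node", "namespace", "workload_kind", "workload_name", "service",
    "pod", "container", "container_id", "IP", "PID", "tid",
)


def format_sample(metric: str, labels: dict, value: int) -> str:
    """Render `metric{label="value",...} value` with all 11 labels present."""
    missing = [name for name in LABELS if name not in labels]
    if missing:
        raise ValueError(f"missing labels: {missing}")
    # Emit labels in a fixed order so that output is stable across calls.
    body = ",".join(f'{name}="{labels[name]}"' for name in LABELS)
    return f"{metric}{{{body}}} {value}"
```

For example, passing `"kindling_pagefault_major_total"`, a dictionary holding a value for each of the 11 labels, and a counter value produces a single line keyed by the thread that generated the faults.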
How Has This Been Tested?
Here we use the stressapptest program to increase memory pressure on the system (project address: https://github.com/stressapptest/stressapptest). Stressful Application Test (stressapptest is its Unix name) is a memory interface test. It tries to maximize randomized traffic to memory from processor and I/O, with the intent of creating a realistic high-load situation in order to test the existing hardware devices in a computer. It has been used at Google for some time and is now available under the Apache 2.0 license. We use it for both correctness verification and performance testing.
We accumulate the per-thread data belonging to the same stressapptest process and compare the total with the output of the command
ps -eo maj_flt,min_flt,pid,cmd
to verify correctness. We also use stressapptest to put pressure on memory in order to test the performance of the program.
Kernel version:
Linux localhost.localdomain 3.10.0-1160.el7.x86_64
Linux version:
CentOS 7.5
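The correctness check described above (summing per-thread counters for one process and comparing against `ps -eo maj_flt,min_flt,pid,cmd`) could be scripted roughly as follows. This is a minimal sketch: the input shape `(pid, tid, major, minor)`, the helper names, and the `tolerance` parameter are assumptions for illustration. Because the counters keep growing while a process runs, two readings taken at different moments may differ slightly, which is why a tolerance is included.

```python
from collections import defaultdict


def sum_by_pid(thread_rows):
    """Sum (pid, tid, major, minor) rows into {pid: (major, minor)}."""
    totals = defaultdict(lambda: [0, 0])
    for pid, _tid, major, minor in thread_rows:
        totals[pid][0] += major
        totals[pid][1] += minor
    return {pid: tuple(counts) for pid, counts in totals.items()}


def parse_ps(output):
    """Parse `ps -eo maj_flt,min_flt,pid,cmd` output into {pid: (major, minor)}."""
    result = {}
    for line in output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 3)  # cmd may contain spaces; keep it whole
        if len(parts) < 3:
            continue
        major, minor, pid = (int(p) for p in parts[:3])
        result[pid] = (major, minor)
    return result


def mismatches(ours, from_ps, tolerance=0):
    """Return pids whose totals differ from ps by more than `tolerance`."""
    bad = []
    for pid, (major, minor) in ours.items():
        if pid not in from_ps:
            continue  # the process was not listed by ps
        ps_major, ps_minor = from_ps[pid]
        if abs(major - ps_major) > tolerance or abs(minor - ps_minor) > tolerance:
            bad.append(pid)
    return bad
```

On a live system the `output` argument would come from something like `subprocess.run(["ps", "-eo", "maj_flt,min_flt,pid,cmd"], capture_output=True, text=True).stdout`.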
The core processing logic of this feature lives mainly in agent-libs. In kindling, we mainly implement the conversion, processing and encapsulation of the initialization data, and the page fault pipeline. All work builds on the original architecture, with getpagefaultevent added.