achieve the page fault #15

yaofighting · 2022-07-13T07:30:35Z

Implement monitoring of major and minor page faults based on the page fault tracepoint.

Description

In this version, our events come from three sources:
(1) The initialization information for page faults, which we read from /proc, process into data at thread granularity, and store in the thread table. We then encapsulate these threads as kindlingevent to stay consistent with the upper collector data interface.
(2) Page fault events captured through the perf buffer. When the kernel captures a process exit event, it sends data to user space, and we consume this data in the collector.
(3) An eBPF map, whose pagefault values are read from user mode at a fixed interval (5 seconds by default).
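As an illustration of the first source, here is a minimal sketch of pulling per-thread fault counters out of /proc. It assumes the counters are taken from the stat file of each thread (/proc/<pid>/task/<tid>/stat), where minflt is field 10 and majflt is field 12. The PR does not say which /proc file agent-libs actually reads, and the function name is hypothetical.

```python
def parse_stat_faults(stat_line):
    """Return (minflt, majflt) from one /proc/<pid>/task/<tid>/stat line.

    Hypothetical helper for illustration; not taken from agent-libs.
    """
    # Field 2 (comm) is wrapped in parentheses and may itself contain spaces
    # or ')' characters, so split on the LAST ')' rather than on whitespace.
    _, _, rest = stat_line.rpartition(")")
    fields = rest.split()
    # fields[0] is field 3 (state), so field 10 (minflt) is fields[7]
    # and field 12 (majflt) is fields[9].
    return int(fields[7]), int(fields[9])


def read_thread_faults(pid, tid):
    """Read the current (minflt, majflt) counters for one thread on Linux."""
    with open(f"/proc/{pid}/task/{tid}/stat") as f:
        return parse_stat_faults(f.read())
```

Splitting on the last closing parenthesis matters because a thread name such as "my proc" would otherwise shift every field index.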

Motivation and Context

We need page fault information to assess the memory usage of system processes and threads, so that we can troubleshoot problems and faults. We provide two metrics, kindling_pagefault_major_total and kindling_pagefault_minor_total, with the following meanings:
(1) kindling_pagefault_major_total: the major page faults captured on the system. Scenarios such as mmap and swap can generate major page faults. Each value is associated with the specific thread that generated the page faults, and we identify it with 11 labels {node, namespace, workload_kind, workload_name, service, pod, container, container_id, IP, PID, tid}.
(2) kindling_pagefault_minor_total: the minor page faults captured on the system. For example, allocating memory with malloc and then using it causes minor page faults. Each value is likewise associated with the specific thread that generated the page faults, and we identify it with the same 11 labels {node, namespace, workload_kind, workload_name, service, pod, container, container_id, IP, PID, tid}.
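To make the shape of these metrics concrete, the sketch below renders one sample as a line of Prometheus text exposition format. It assumes the metrics are exposed in that format, which this PR does not state. The helper name and the label values are made up for illustration, and the exact casing of the label names in Kindling is not confirmed here.

```python
def format_sample(name, labels, value):
    """Render one metric sample as a Prometheus text-format line.

    Hypothetical helper for illustration; not part of Kindling.
    """
    def escape(v):
        # Label values must escape backslash, double quote, and newline.
        return (str(v).replace("\\", "\\\\")
                      .replace('"', '\\"')
                      .replace("\n", "\\n"))

    body = ",".join(f'{k}="{escape(v)}"' for k, v in labels.items())
    return f"{name}{{{body}}} {value}"


# Example with a subset of the 11 labels (all values are invented).
print(format_sample(
    "kindling_pagefault_major_total",
    {"node": "node-1", "pod": "demo-pod", "pid": 1234, "tid": 1240},
    17,
))
```

Because tid is one of the labels, every thread produces its own series, so summing over tid for a given pid yields a per-process total.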

How Has This Been Tested?

We use the 'stressapptest' program to increase memory pressure on the system (project address: https://github.com/stressapptest/stressapptest). Stressful Application Test (or stressapptest, its Unix name) is a memory interface test. It tries to maximize randomized traffic to memory from processor and I/O, with the intent of creating a realistic high-load situation in order to test the existing hardware devices in a computer. It has been used at Google for some time and is now available under the Apache 2.0 license. We use it for both correctness verification and performance testing.
To verify correctness, we accumulate the per-thread data belonging to the same stressapptest process and compare the result with the output of ps -eo maj_flt,min_flt,pid,cmd. To test performance, we use stressapptest to put pressure on memory while the program is running.
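The correctness check described above can be sketched as follows. This is a minimal illustration, not the script used for this PR: the function names and the tuple layout of the per-thread samples are assumptions, and on a live system the counters keep moving between the two readings, so an exact match should not be expected.

```python
from collections import defaultdict


def sum_faults_by_pid(thread_samples):
    """Sum per-thread counters into per-process totals.

    thread_samples: iterable of (pid, tid, major, minor) tuples (assumed layout).
    Returns {pid: (major_total, minor_total)}.
    """
    totals = defaultdict(lambda: [0, 0])
    for pid, _tid, major, minor in thread_samples:
        totals[pid][0] += major
        totals[pid][1] += minor
    return {pid: (maj, mnr) for pid, (maj, mnr) in totals.items()}


def parse_ps_faults(ps_text):
    """Parse the output of `ps -eo maj_flt,min_flt,pid,cmd` into {pid: (major, minor)}."""
    result = {}
    for line in ps_text.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 3)        # cmd may contain spaces
        if len(parts) < 3:
            continue
        major, minor, pid = (int(p) for p in parts[:3])
        result[pid] = (major, minor)
    return result
```

In practice the ps text could come from subprocess.run(["ps", "-eo", "maj_flt,min_flt,pid,cmd"], capture_output=True, text=True).stdout, and the two dictionaries can then be compared for the stressapptest PID.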

kernel version:
Linux localhost.localdomain 3.10.0-1160.el7.x86_64
Linux version:
CentOS 7.5

The core processing logic of this feature is mainly implemented in agent-libs. In kindling, we mainly implement the conversion, processing, and encapsulation of the initialization data, as well as the page fault pipeline. All work is carried out on the original architecture, with getpagefaultevent added.

Signed-off-by: yaofighting <[email protected]>

support page fault

7b34ded

Signed-off-by: yaofighting <[email protected]>

yaofighting force-pushed the feat/page-fault branch from 4ffb86f to 7b34ded Compare December 20, 2022 04:15

yaofighting added 3 commits March 15, 2023 17:02

Merge branch 'kindling-dev' into feat/page-fault

ceb0e2d

Optimization: Improve the performance of pagefault.

552ede1

Signed-off-by: yaofighting <[email protected]>

add maxlen to avoid array out of bounds.

f4310ad

Signed-off-by: yaofighting <[email protected]>
