achieve the page fault #15

Open
wants to merge 4 commits into
base: kindling-dev


@yaofighting yaofighting commented Jul 13, 2022

Implement monitoring of major and minor page faults based on the page fault tracepoints.

Description

In this version, events come from three sources. The first is the initial page fault information, which we read from /proc, process into per-thread data, and store in the thread table; we then wrap these threads as kindlingevent to stay consistent with the data interface of the upper collector. The second is page fault data delivered through the perf buffer: when the kernel captures a process exit event, it sends the data to user space, and the collector consumes it. The third is an eBPF map: user space reads the page fault values from the map at a fixed interval (5 seconds by default).
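As an illustration of the first source, the sketch below reads per-thread fault counters from /proc. It is not the code in this PR: it assumes the counters are taken from `/proc/<pid>/task/<tid>/stat`, where `minflt` is field 10 and `majflt` is field 12 as documented in proc(5). The helper names `parse_stat_faults` and `read_thread_faults` are hypothetical.

```python
from pathlib import Path


def parse_stat_faults(stat_line):
    """Return (minflt, majflt) from one line of /proc/<pid>/task/<tid>/stat."""
    # Field 2 (comm) is parenthesized and may itself contain spaces or ')',
    # so split only what follows the last ')'.
    rest = stat_line[stat_line.rindex(")") + 1:].split()
    # rest[0] is field 3 (state), so field 10 is rest[7] and field 12 is rest[9].
    return int(rest[7]), int(rest[9])


def read_thread_faults(pid, proc_root="/proc"):
    """Map tid -> (minflt, majflt) for every thread of the given process."""
    faults = {}
    for task in Path(proc_root, str(pid), "task").iterdir():
        try:
            faults[int(task.name)] = parse_stat_faults((task / "stat").read_text())
        except (OSError, ValueError, IndexError):
            # The thread may exit between listing the directory and reading it.
            continue
    return faults
```

Calling `read_thread_faults(os.getpid())` on a Linux host would return one entry per thread of the current process; these initial values are what later deltas would be added to.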

Motivation and Context

We need page fault information to assess the memory usage of system processes and threads, so that we can troubleshoot problems and faults. We provide two metrics, kindling_pagefault_major_total and kindling_pagefault_minor_total, with the following meanings:
(1) kindling_pagefault_major_total: the number of major page faults captured on the system. Scenarios such as mmap and swap can cause major page faults. Each value is associated with the specific thread that generated the page faults and is identified by 11 labels {node, namespace, workload_kind, workload_name, service, pod, container, container_id, IP, PID, tid}.
(2) kindling_pagefault_minor_total: the number of minor page faults captured on the system. For example, allocating memory with malloc and then using it causes minor page faults. Each value is likewise associated with the specific thread that generated the page faults and is identified by the same 11 labels {node, namespace, workload_kind, workload_name, service, pod, container, container_id, IP, PID, tid}.
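To make the label set concrete, here is a minimal sketch that renders one sample of either metric as a line in the Prometheus text exposition format. The PR does not state the export format, so that choice is an assumption, and `format_sample` is a hypothetical helper rather than part of Kindling. The label names are copied from the list above.

```python
# Label names exactly as listed in the PR description.
LABELS = ("node", "namespace", "workload_kind", "workload_name", "service",
          "pod", "container", "container_id", "IP", "PID", "tid")


def _escape(value):
    """Escape a label value for the Prometheus text format."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_sample(metric, labels, value):
    """Render one per-thread counter sample; every one of the 11 labels is required."""
    missing = [name for name in LABELS if name not in labels]
    if missing:
        raise KeyError(f"missing labels: {missing}")
    body = ",".join(f'{name}="{_escape(labels[name])}"' for name in LABELS)
    return f"{metric}{{{body}}} {value}"
```

Because tid is part of the label set, each thread gets its own time series; process-level totals are obtained by summing over tid for a given PID.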

How Has This Been Tested?

We use the stressapptest program to increase memory pressure on the system (project page: https://github.com/stressapptest/stressapptest). Stressful Application Test (stressapptest, its Unix name) is a memory interface test. It tries to maximize randomized traffic to memory from the processor and I/O, with the intent of creating a realistic high-load situation to test the existing hardware devices in a computer. It has been used at Google for some time and is now available under the Apache 2.0 license. We use it for both correctness verification and performance testing.
To verify correctness, we sum the per-thread data belonging to the same stressapptest process and compare the result with the output of ps -eo maj_flt,min_flt,pid,cmd. To test performance, we use it to put pressure on memory while the program is running.
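The correctness check described above can be sketched as follows. This is an illustration and not the test harness used for the PR: `sum_by_process` and `parse_ps` are hypothetical helpers, and the input tuples stand in for whatever per-thread samples the collector produces. The ps invocation assumed is `ps -eo maj_flt,min_flt,pid,cmd`.

```python
from collections import defaultdict


def sum_by_process(thread_samples):
    """Sum (pid, tid, minor, major) samples into pid -> (minor, major)."""
    totals = defaultdict(lambda: [0, 0])
    for pid, _tid, minor, major in thread_samples:
        totals[pid][0] += minor
        totals[pid][1] += major
    return {pid: (t[0], t[1]) for pid, t in totals.items()}


def parse_ps(output):
    """Parse `ps -eo maj_flt,min_flt,pid,cmd` output into pid -> (minor, major)."""
    result = {}
    for line in output.splitlines()[1:]:  # the first line is the column header
        parts = line.split(None, 3)       # cmd may contain spaces, keep it whole
        if len(parts) < 3:
            continue
        major, minor, pid = (int(p) for p in parts[:3])
        result[pid] = (minor, major)
    return result
```

Comparing `sum_by_process(samples)[pid]` with `parse_ps(ps_output)[pid]` for the stressapptest PID gives the check. Because the counters keep increasing while the process runs, the two readings should be taken as close together as possible, and small differences are to be expected.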

kernel version:
Linux localhost.localdomain 3.10.0-1160.el7.x86_64
Linux version:
CentOS 7.5

The core processing logic of this feature lives in agent-libs. In kindling, we mainly convert, process, and encapsulate the initialization data and implement the page fault pipeline. All work builds on the existing architecture; the only addition is getpagefaultevent.

Signed-off-by: yaofighting <[email protected]>