fix(test/libsinsp_e2e): fixed libsinsp_e2e tests for more stability #2085
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: FedeDP The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
 * In this case, we won't push the new character, instead we will push the correct string.
 */
if(kn) {
    push__new_character(auxmap->data, &auxmap->payload_pos, '/');
Fix a bug in how modern_bpf sent cgroups when the number of path components was greater than `MAX_CGROUP_PATH_POINTERS`.
For example, for
/user.slice/user-1000.slice/[email protected]/app.slice/app-org.gnome.Terminal.slice/vte-spawn-2f17b2eb-994e-415d-bce0-44c1447d7cd2.scope
which is 7 components long (counting the first one, i.e. the root, which the kernel returns as an empty string), we would have returned to userspace:
user.slice/user-1000.slice/[email protected]/app.slice/app-org.gnome.Terminal.slice/vte-spawn-2f17b2eb-994e-415d-bce0-44c1447d7cd2.scope
i.e. the leading `/` was missing.
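A minimal userspace sketch of the failure mode described above, not the actual modern_bpf driver code. It assumes the path is available as a root-first list of components (the root being an empty string) and that only the last `max_pointers` components survive a bounded walk; `join_cgroup_path`, `max_pointers` and `restore_leading_slash` are hypothetical names introduced for illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model of a bounded cgroup path walk (illustration only).
// `components` is ordered root-first; the root is an empty string.
// Only the last `max_pointers` components are kept.
std::string join_cgroup_path(const std::vector<std::string>& components,
                             std::size_t max_pointers,
                             bool restore_leading_slash) {
    const bool truncated = components.size() > max_pointers;
    const std::size_t first = truncated ? components.size() - max_pointers : 0;

    std::string out;
    // When the walk stopped before reaching the root, the empty root
    // component is missing, so the leading '/' must be pushed explicitly.
    if(truncated && restore_leading_slash) {
        out.push_back('/');
    }
    for(std::size_t i = first; i < components.size(); ++i) {
        if(i != first) {
            out.push_back('/');
        }
        out += components[i];
    }
    return out;
}
```

With components `{"", "a", "b", "c"}` and `max_pointers` set to 3, the call without the fix yields `a/b/c`, while the call with `restore_leading_slash` yields `/a/b/c`.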
Cc @Andreagit97
@@ -30,7 +30,7 @@ constexpr const cgroup_layout DOCKER_CGROUP_LAYOUT[] = {{"/", ""}, // non-syste
 {nullptr, nullptr}};
 }

-std::string docker_linux::m_docker_sock = "/var/run/docker.sock";
+std::string docker_linux::s_docker_sock = "/var/run/docker.sock";
Since this is a static member, rename it with the `s_` prefix, as we usually do for static members.
// For cgroups like:
// /machine.slice/machine-lxc\x2d2293906\x2dlibvirt\x2dcontainer.scope/libvirt,
// account for /libvirt below.
if(cgroup.find(".scope/libvirt") != std::string::npos) {
This is a new cgroup layout for libvirt-lxc.
// Just a stupid fake FD value to signal to stop capturing events from driver and exit.
// Note: we don't use it through eventfd because we want to make sure
// that we received all events from the driver, until this very last close(FD_SIGNAL_STOP);
#define EVENTFD_SIGNAL_STOP 1
To avoid infinite loops when syscalls are dropped, once each test's `run()` callback has returned we trigger the eventfd to signal that the capture must stop and that all remaining events must be consumed.
NOTE: I avoided sending a fake syscall event (like a `close(555)`) because, if you are losing events, chances are high that you will also lose this canary event, which leads to an infinite loop.
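A minimal, Linux-only sketch of the eventfd-based stop signal discussed in these comments. The helper names (`make_stop_fd`, `request_stop`, `stop_requested`) are hypothetical and are not the test suite's API; only the `eventfd`, `write` and `read` calls are real.

```cpp
#include <sys/eventfd.h>
#include <unistd.h>
#include <cstdint>

// Hypothetical helpers (illustration only, Linux-specific).

// Create a non-blocking eventfd whose counter starts at 0.
int make_stop_fd() {
    return eventfd(0, EFD_NONBLOCK);
}

// Called once the test callback is done: adds 1 to the eventfd counter.
bool request_stop(int fd) {
    const std::uint64_t one = 1;
    return write(fd, &one, sizeof(one)) == static_cast<ssize_t>(sizeof(one));
}

// Called on every capture loop iteration. It never blocks: with EFD_NONBLOCK,
// read() fails with EAGAIN while the counter is zero, and a successful read
// resets the counter to zero.
bool stop_requested(int fd) {
    std::uint64_t value = 0;
    return read(fd, &value, sizeof(value)) == static_cast<ssize_t>(sizeof(value));
}
```

Because the signal travels through a file descriptor rather than through the captured event stream, it cannot be dropped by the driver the way a canary syscall can.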
static void do_nothing(sinsp* inspector) {}

static bool always_continue() { return true; }
static void run(const run_callback_t& run_function,
We now have two flavours of the static `run` function exposed by the `event_capture` class. One runs the test callback (`run_function`) synchronously and then gathers all generated events; since it is synchronous, its callback can also access the inspector.
The other one is async, so its callback is prevented from accessing the inspector.
Note: I kept the same name because I found it easier to integrate; the only difference between a sync and an async invocation is the type of the first parameter (`run_callback_t` vs `run_callback_async_t`).
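A minimal sketch of how two overloads sharing one name can be told apart by the callback type alone. The definitions of `run_callback_t` and `run_callback_async_t` are not shown in this conversation, so the shapes below are assumptions; `fake_inspector` is a hypothetical stand-in for the real inspector, and the real functions take more parameters.

```cpp
#include <functional>
#include <string>

// Hypothetical stand-in for the real inspector type (illustration only).
struct fake_inspector {};

// Assumed shapes: the sync callback receives the inspector, the async one does not.
using run_callback_t = std::function<void(fake_inspector*)>;
using run_callback_async_t = std::function<void()>;

// Synchronous flavour: the callback may use the inspector.
std::string run(const run_callback_t& cb, fake_inspector* inspector) {
    cb(inspector);
    return "sync";
}

// Asynchronous flavour: the callback has no access to the inspector.
// It is invoked inline here only to keep the sketch small.
std::string run(const run_callback_async_t& cb, fake_inspector*) {
    cb();
    return "async";
}
```

Under these assumptions, passing a lambda that takes a `fake_inspector*` selects the first overload and passing a lambda with no parameters selects the second, because each `std::function` type only accepts callables invocable with its own argument list.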
Perf diff from master - unit tests
Heap diff from master - unit tests
Heap diff from master - scap file
Benchmarks diff from master
if(m_mode != SINSP_MODE_NODRIVER && m_dump) {
    dumper->dump(event);
}
handle_eventfd_request();
Before doing anything else, check if any eventfd request was sent; this is non-blocking of course.
If you are wondering why I did not just use e.g. `close(MY_FAKE_FD)` to signal the capture to end: during my tests the signaling event sometimes got lost, and then ♾️
Of course, having a `read` on each loop iteration creates a small back-pressure on the drivers.
But tests that generate lots of syscall traffic now enable only the sinsp state sc set of syscalls, so in that case no `read` will appear in our syscall event loop ;)
The latest changes enable, by default, the whole sc set minus `read` and `readv` to avoid the back-pressure (i.e. we do not push eventfd read related events to userspace).
Of course, some tests rely on reads; in that case they must explicitly enable `libsinsp::events::all_sc_set()`.
Also, the sc set is now a parameter of the `event_capture::run()` APIs.
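A minimal sketch of the "whole set minus `read` and `readv`" default described in this comment. Syscalls are modelled as plain strings here, which is an assumption made for illustration; the real code works on the set returned by `libsinsp::events::all_sc_set()`, and `default_test_sc_set` is a hypothetical name.

```cpp
#include <set>
#include <string>

// Hypothetical model: syscalls as strings instead of the real sc codes.
using sc_set = std::set<std::string>;

// Return `all` without the syscalls generated by the per-iteration
// eventfd read, so that those events are never pushed to userspace.
sc_set default_test_sc_set(sc_set all) {
    all.erase("read");
    all.erase("readv");
    return all;
}
```

A test that relies on reads would skip this filtering and pass the full set instead.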
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #2085 +/- ##
==========================================
- Coverage 73.58% 73.58% -0.01%
==========================================
Files 253 253
Lines 31869 31872 +3
Branches 5649 5635 -14
==========================================
Hits 23452 23452
+ Misses 8416 8405 -11
- Partials 1 15 +14
FAIL() << "caught exception " << e.what();
}
auto capture_stats_str = capture_stats(m_inspector.get());
std::cout << capture_stats_str << "\n";
Always print `capture_stats` upon leaving so that we know whether any events were dropped.
X64 kernel testing matrix
ARM64 kernel testing matrix
/milestone 0.19.0
Tests with new fixes:
We are still losing some events sometimes, and that leads to failures. Let's see if I can avoid them.
std::string event_capture::s_engine_string = KMOD_ENGINE;
std::string event_capture::s_engine_path;
unsigned long event_capture::s_buffer_dim = DEFAULT_DRIVER_BUFFER_BYTES_DIM * 4;
Use big buffers to avoid losing events.
Arm64 modern-ebpf tests failed again with the latest code, even though there were no drops :/
… in e2e tests. Signed-off-by: Federico Di Pierro <[email protected]>
Signed-off-by: Federico Di Pierro <[email protected]>
…s time to leave. The `close` syscall might get lost leading to an infinite loop; instead, now we ask to the main thread to leave using thread safe eventfd, and the main thread will dequeue all remaining events until an error is returned by sinsp::next. Signed-off-by: Federico Di Pierro <[email protected]>
Signed-off-by: Federico Di Pierro <[email protected]>
…er resolving on newer linux systemd systems. This fixes the `sys_call_test.container_libvirt` running on my machine. Also, let event_capture always print capture stats for us. Signed-off-by: Federico Di Pierro <[email protected]>
…ed_early`. Signed-off-by: Federico Di Pierro <[email protected]>
Signed-off-by: Federico Di Pierro <[email protected]>
… is a static. Rename `m_docker_sock` to `s_docker_sock` to highlight that it is static. Signed-off-by: Federico Di Pierro <[email protected]>
…ents > MAX_CGROUP_PATH_POINTERS Signed-off-by: Federico Di Pierro <[email protected]>
…erver_with_connection_before_capturing_starts_ipv4m` test. Signed-off-by: Federico Di Pierro <[email protected]>
…est to avoid drops. Signed-off-by: Federico Di Pierro <[email protected]>
Signed-off-by: Federico Di Pierro <[email protected]>
Default interesting syscalls set now avoids `read` and `pread` to avoid back-pressure with `eventfd_read` being called at each loop iteration. Moreover, `event_capture::run()` now accepts a ppm_sc_set parameter to customize the sc set for the test. Finally, in rlimit related tests, reset old limits upon leaving. Signed-off-by: Federico Di Pierro <[email protected]>
Signed-off-by: Federico Di Pierro <[email protected]>
… losses. Signed-off-by: Federico Di Pierro <[email protected]>
Signed-off-by: Federico Di Pierro <[email protected]>
Signed-off-by: Federico Di Pierro <[email protected]>
… reliable. Signed-off-by: Federico Di Pierro <[email protected]>
Love this! What an amazing job! LGTM! 🚀
What type of PR is this?
/kind bug
Any specific area of the project related to this PR?
/area driver-modern-bpf
/area libsinsp
/area tests
Does this PR require a change in the driver versions?
What this PR does / why we need it:
This also fixes a couple of small issues found while testing, namely:
`fd`s upon leaving.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Will leave a self-review before removing wip.
Also, I will trigger the CI many times to see if we are really stable; see #2085 (comment) for the results.
Does this PR introduce a user-facing change?: