Slow node performance due to audit configuration #3669

njuettner · 2024-09-04T16:26:40Z

Slack Thread: https://gigantic.slack.com/archives/C6L8J93N0/p1724419948903269

TL;DR

Jenkins performs many operations, triggering numerous audit events.
The audit system struggles to keep up, causing delays.
These delays slow down Jenkins operations.

Slowed Jenkins operations take longer, generating even more audit events over time.

Context

After upgrading from release v19.3.0 to v20.1.2, that is, from

flatcar 3602.2.3

docker - 20.10.24
kernel - 5.15.142
cilium - 0.17.0

to

flatcar 3815.2.2

docker - 24.0.9
kernel - 6.1.85
cilium - 0.22.0

The customer noticed a heavy impact on node pools where Jenkins agents are running: those nodes became extremely slow. We identified writing audit messages as the bottleneck:

Sep 04 15:49:22 ip-10-150-45-200.eu-west-1.compute.internal auditd[2537]: Error receiving audit netlink packet (No buffer space available)
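
One way to confirm that audit events are being dropped is to look at the counters printed by `auditctl -s`. The following is a minimal Python sketch, not something taken from this thread: it assumes the status output is one `name value` pair per line and that it contains `lost`, `backlog` and `backlog_limit` counters. The helper names are hypothetical.

```python
import subprocess


def read_audit_status():
    """Return the raw output of `auditctl -s` (needs root on the node)."""
    result = subprocess.run(
        ["auditctl", "-s"], capture_output=True, text=True, check=True
    )
    return result.stdout


def parse_audit_status(text):
    """Parse "name value" lines into a dict, converting plain integers."""
    status = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or single-token lines
        name, value = parts[0], parts[1].strip()
        # Counters are integers; anything else is kept as a string.
        status[name] = int(value) if value.isdigit() else value
    return status


def looks_overloaded(status):
    """True if events were lost or the backlog queue has reached its limit."""
    lost = status.get("lost", 0)
    backlog = status.get("backlog", 0)
    limit = status.get("backlog_limit", 0)
    return lost > 0 or (limit > 0 and backlog >= limit)
```

On an affected node this could be called as `looks_overloaded(parse_audit_status(read_audit_status()))`, once before and once after changing the rules.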

We identified the following audit rules, which track what the system is doing:

-a always,exit -F arch=b64 -S execve -F key=auditing

This rule audits all program executions (via the execve system call) made through the 64-bit syscall ABI. It’s a broad rule that captures every program that is run.

-a always,exit -F arch=b32 -S execve -F key=auditing

Similar to the previous rule, but for the 32-bit syscall ABI (32-bit programs running on the same node). It also audits all program executions.

While Jenkins is running, nodes become unresponsive for seconds at a time.

Example:

time date

real	0m0.877s <- super slow 
user	0m0.000s
sys	0m0.002s

After flushing all audit rules, the node becomes responsive again immediately:

time date
Wed Sep  4 15:48:47 UTC 2024

real	0m0.001s
user	0m0.000s
sys	0m0.001s
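
A single `time date` run can be noisy, so it may help to repeat the measurement. Below is a minimal Python sketch under that assumption: it runs a command several times and reports the median and the worst wall-clock duration. The function names are hypothetical, and the absolute numbers include Python's own process-spawning overhead, so only the before/after comparison is meaningful.

```python
import statistics
import subprocess
import time


def time_exec(cmd, runs=20):
    """Run cmd `runs` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # Every run spawns a new process, so it goes through execve
        # and is therefore covered by the execve audit rules.
        subprocess.run(cmd, stdout=subprocess.DEVNULL, check=True)
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return (median, worst) so that occasional stalls stay visible."""
    return statistics.median(durations), max(durations)
```

For example, `summarize(time_exec(["date"]))` could be run once with the rules loaded and once after flushing them.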

We're still not sure why this happens now. It might be that Jenkins now executes more commands, or that auditd has changed since the last release.

njuettner · 2024-09-05T06:14:22Z

Additionally, we are using -w watches, which are deprecated because of poor performance:

-w /usr/bin/docker -p rwxa -k docker
-w /var/lib/docker -p rwxa -k docker
-w /etc/docker -p rwxa -k docker
-w /etc/systemd/system/docker.service.d/10-giantswarm-extra-args.conf -p rwxa -k docker
-w /etc/systemd/system/docker.service.d/01-wait-docker.conf -p rwxa -k docker
-w /usr/lib/systemd/system/docker.service -p rwxa -k docker
-w /usr/lib/systemd/system/docker.socket -p rwxa -k docker
-a always,exit -F arch=b64 -S execve -F key=auditing
-a always,exit -F arch=b32 -S execve -F key=auditing

-k key
              Set a filter key on an audit rule. This is deprecated when
              used with watches. Convert any watches to the syscall form
              of rules. It is still valid for use with deleting or
              listing rules.
-w path
              Place a watch on path. If the path is a file, it's almost
              the same as using the -F path option on a syscall rule. If
              the watch is on a directory, it's almost the same as using
              the -F dir option on a syscall rule. The -w form of
              writing watches is for backwards compatibility and is
              deprecated due to poor system performance.  Convert
              watches of this form to the syscall based form. The only
              valid options when using a watch are the -p and -k.

We need to migrate those to the syscall form:

-a always,exit -F path=/usr/bin/docker -S all -F key=docker
-a always,exit -F path=/etc/systemd/system/docker.service.d/10-giantswarm-extra-args.conf -S all -F key=docker
...
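
The mechanical part of this migration can be scripted. The following is a minimal Python sketch, not the tooling used in this thread: it rewrites a plain `-w PATH -p PERMS -k KEY` watch into the syscall form, following the man page excerpt above (`-F path` for files, `-F dir` for directories). The function name and the `is_dir` parameter are hypothetical, and the result keeps the original permissions rather than narrowing them.

```python
import os
import shlex


def watch_to_syscall_rule(rule, is_dir=os.path.isdir):
    """Rewrite a `-w PATH [-p PERMS] [-k KEY]` watch in syscall form.

    is_dir decides between -F dir= and -F path=. The default checks the
    local filesystem, so run it on a node where the paths exist.
    """
    tokens = shlex.split(rule)
    # A watch only allows -w, -p and -k, each followed by one value.
    if len(tokens) % 2 != 0:
        raise ValueError(f"unexpected token count in {rule!r}")
    values = {}
    for flag, value in zip(tokens[0::2], tokens[1::2]):
        if flag not in ("-w", "-p", "-k"):
            raise ValueError(f"not a plain watch rule: {rule!r}")
        values[flag] = value
    if "-w" not in values:
        raise ValueError(f"no -w path in {rule!r}")

    path = values["-w"]
    field = "dir" if is_dir(path) else "path"
    parts = ["-a", "always,exit", "-F", f"{field}={path}"]
    if "-p" in values:
        parts += ["-F", f"perm={values['-p']}"]
    if "-k" in values:
        parts += ["-F", f"key={values['-k']}"]
    return " ".join(parts)
```

Rules that are already in syscall form (such as the execve rules) are rejected with a `ValueError` instead of being rewritten.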

Solution for the Docker rules that improves performance:

# Audit execution of Docker binary
-a always,exit -F path=/usr/bin/docker -F perm=x -k docker_exec

# Audit writes to Docker data directory
-a always,exit -F dir=/var/lib/docker -F perm=wa -k docker_data

# Audit changes to Docker configuration
-a always,exit -F dir=/etc/docker -F perm=wa -k docker_config

# Audit changes to Docker systemd configuration files
-a always,exit -F path=/etc/systemd/system/docker.service.d/10-giantswarm-extra-args.conf -F perm=wa -k docker_systemd_config
-a always,exit -F path=/etc/systemd/system/docker.service.d/01-wait-docker.conf -F perm=wa -k docker_systemd_config

# Audit changes to Docker service and socket files
-a always,exit -F path=/usr/lib/systemd/system/docker.service -F perm=wa -k docker_service
-a always,exit -F path=/usr/lib/systemd/system/docker.socket -F perm=wa -k docker_socket

njuettner · 2024-09-05T14:43:38Z

Another idea would be changing the auditd.conf:

q_depth = 8192                  # Increase buffer size
max_log_file = 50               # Increase max log file size
num_logs = 10                   # Increase number of logs
freq = 100                      # Decrease log writing frequency
overflow_action = SUSPEND       # Suspend to prevent overflow

Currently q_depth is not set, which might also be a reason why we see these errors:
Error receiving audit netlink packet (No buffer space available)
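
To check which of these settings are explicitly present on a node, the configuration file can be parsed. This is a minimal Python sketch that assumes the usual `name = value` layout of auditd.conf; the function names and the default list of settings to check are hypothetical.

```python
def parse_auditd_conf(text):
    """Parse `name = value` lines; blank lines and # comments are skipped."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        name, value = line.split("=", 1)
        # Normalize names to lower case so lookups are case-insensitive.
        settings[name.strip().lower()] = value.strip()
    return settings


def missing_settings(settings, wanted=("q_depth", "overflow_action")):
    """Return the wanted setting names that are not explicitly configured."""
    return [name for name in wanted if name not in settings]
```

On a node, the input would typically be the contents of /etc/audit/auditd.conf, for example `missing_settings(parse_auditd_conf(open("/etc/audit/auditd.conf").read()))`.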

njuettner · 2024-09-05T14:52:14Z

Auditd seems to have been integrated by @giantswarm/team-atlas, but I'm not sure what the reason behind it was:
https://github.com/giantswarm/k8scloudconfig/releases/tag/v16.5.0

@QuentinBisson do you remember why?

njuettner · 2024-09-05T15:41:34Z

After talking with Quentin, a quick solution for the affected Vintage clusters:

Apply k8s-initiator-app on the node pools where Jenkins is running.

Remove the default rules file:
rm /etc/audit/rules.d/99-default.rules

Then reload the rules without it:

augenrules --load

For CAPI, we would need to integrate a toggle that enables auditd when needed; it should be disabled by default. PR to disable it: giantswarm/cluster#325

njuettner · 2024-09-11T13:16:45Z

CAPI: fixed (auditd is disabled by default and can be enabled at any time), but we need new CAPA and CAPZ releases.
Vintage: k8s-initiator-app is not working for removing audit rules. The main issue is auditctl and the dependency mess: the rules are kept in memory, so removing just the files isn't enough and a reload is required. See Slack thread: https://gigantic.slack.com/archives/C062HB29BDG/p1725542379870169

For Vintage, we have exhausted our options for avoiding a new release, so we need a new v20 release.

Prepare a new aws-operator release which includes a new annotation to toggle auditd. It should stay enabled by default because one customer already relies on it, so IMHO we shouldn't change that.

Auditd is included in the k8scloudconfig.

njuettner · 2024-09-11T13:25:35Z

@T-Kukawka Could Phoenix start working on it next week, please? I'm off for the next few days, otherwise I would have jumped in.

njuettner · 2024-09-11T13:27:28Z

For tracking: Adidas issue

njuettner · 2024-09-12T05:24:58Z

@T-Kukawka it looks like we don't need it after all. A final test suggests we can avoid doing a new Vintage release: https://gigantic.slack.com/archives/C062HB29BDG/p1726086709037459?thread_ts=1725542379.870169&cid=C062HB29BDG

Daniel figured out that setting hostPid: "true" might solve it.

njuettner · 2024-09-26T06:54:09Z

Everything should be covered. Marco created a new release for CAPA ❤

njuettner mentioned this issue Sep 6, 2024

Toggle auditd giantswarm/cluster#325

Merged

1 task

njuettner self-assigned this Sep 6, 2024

architectbot added the team/tenet Team Tenet label Sep 6, 2024

This was referenced Sep 6, 2024

Toggle audit giantswarm/cluster-aws#820

Merged

Toggle audit giantswarm/cluster-azure#340

Merged

njuettner removed their assignment Sep 11, 2024

yulianedyalkova assigned Gacko Sep 17, 2024

T-Kukawka mentioned this issue Sep 17, 2024

Add handling of auditd setting for CAPA migration #3674

Closed

2 tasks

njuettner self-assigned this Sep 25, 2024

njuettner closed this as completed Sep 26, 2024
