Distribute audit to multiple pods #2981

Closed
JaydipGabani opened this issue Aug 30, 2023 · 5 comments
Labels
enhancement New feature or request stale

Comments

@JaydipGabani
Contributor

JaydipGabani commented Aug 30, 2023

Describe the solution you'd like
With audit reporting violations through PubSub, splitting an audit across multiple pods to distribute the load might be helpful.

Some benefits it may provide are:

  • Eliminating single point of failure for audit pod
  • Reduced I/O throttling for each audit pod
  • Reduced audit duration

Questions to be answered:

  • How do we split up audit? by GVK?

Anything else you would like to add:

Environment:

  • Gatekeeper version:
  • Kubernetes version: (use kubectl version):
@JaydipGabani JaydipGabani added the enhancement New feature or request label Aug 30, 2023
@maxsmythe
Contributor

maxsmythe commented Aug 30, 2023

We probably don't want to be too aggressive with audit. Audit reaches out directly to the API server. Excess load could cause instability or latency in the cluster control plane, or effectively DoS the control plane by triggering throttling (see `--max-requests-inflight` and https://kubernetes.io/docs/concepts/cluster-administration/flow-control/#seats-occupied-by-a-request).

In addition, audit is less latency-sensitive than webhooks, given that it is not in the critical path of the request. More frequent audit results are definitely good, but that needs to be weighed against the potential costs (resources, impact on cluster uptime, infrastructure complexity) and the decreasing marginal utility (is returning audit results every 5 minutes significantly better than every 10 minutes? Do users require that amount of granularity?)

WRT potential benefits cited above:

Eliminating single point of failure for audit pod

I don't think this is true, and your follow-on question ("how do we split up audit? by GVK?") shows why: if we had multiple audit pods, each responsible for a discrete portion of the audit task, then all pods must be functional for an audit result to be valid.

If the probability of a pod being up for the duration of an audit run is 99%, the probability of 5 pods being up for the duration of the audit run is (99%) ^ 5 ~= 95%, which means we've decreased audit reliability while increasing operating costs. It's true I'm making some hand-wavy assumptions in the above math (e.g. the duration of an audit run being equal for all pods), but the general idea is valid, since all probabilities are <= 1.

If we wanted to improve reliability for a given audit run, we could run multiple full audits in parallel, but that would increase running costs, load on the API server, and only marginally benefit availability (one or more pods must stay alive the whole cycle, so 1 - (1 - 0.99) ^ 5 == ~ 10 nines of availability -- this likely overestimates improvement because of overlapping failure domains, such as the K8s API server itself).
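As a rough check of the arithmetic above, here is a minimal Python sketch. It assumes pod failures are independent and reuses the 99% per-pod figure and the 5-pod count from the text; the function names are illustrative only and are not part of Gatekeeper.

```python
def all_up(p: float, n: int) -> float:
    """Sharded audit: the run is valid only if all n pods stay up."""
    return p ** n


def any_up(p: float, n: int) -> float:
    """Replicated full audits: the run succeeds if at least one of n pods stays up."""
    return 1 - (1 - p) ** n


# Figures taken from the discussion: 99% per-pod uptime for one audit run, 5 pods.
p, n = 0.99, 5
sharded = all_up(p, n)
replicated = any_up(p, n)

print(f"sharded across {n} pods:    {sharded:.4f}")
print(f"replicated across {n} pods: {replicated:.10f}")
```

Under these assumptions, sharding lowers the chance of a valid run below the single-pod figure, while replication raises it at the cost of running every audit n times.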

We could use leader election and run multiple audit pods, but that just means we have a hot standby. Without leader election, K8s's Deployment infrastructure will restart a failing audit pod.

Reduced I/O throttling for each audit pod

This is not a given either. The K8s API server may be more likely to throttle if our request volume increases. Any pubsub system we use would also presumably have protection against unintended DOS attacks from over-active clients.

If we are talking about increased drive IOPS available for writing out scratch data to a pod drive, having multiple pods may do this (assuming they are writing to different storage backends), but there may be other solutions (RAM disk, SSDs, networked RAID, options vary based on cost constraints and hosting provider). I also suspect that if we are writing out enough audit data for drive throttling to be a concern, then API server load is also worth caring about, since list requests would be quite expensive.

How do we split up audit? by GVK?

This is another problem with splitting up audit. GVK is the only realistic method, since we can't really chunk lists (the only thing that keeps LIST calls paginate-able/consistent is the etcd resumption key, which is not shared across hosts). However, the number of resources is not evenly spread across GVKs (are there more ValidatingWebhookConfigurations or Pods?). There might be some correlations (I'd expect the number of Pods to be proportional to the number of Deployments), and some correlations may exist for specific users (is there a correlation between the number of Namespaces and the number of Pods? Sometimes, but it depends on how the user leverages their cluster). I suspect any answer we come up with here will only fit a subset of users.
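To make the imbalance concern concrete, here is a small Python sketch that assigns GVKs to audit pods with a stable hash and sums the objects each pod would have to list. The object counts are made up purely for illustration, and the hashing scheme is an assumption, not something Gatekeeper implements.

```python
import zlib

# Hypothetical per-GVK object counts, invented for illustration only.
counts = {
    "/v1/Pod": 12000,
    "apps/v1/Deployment": 900,
    "/v1/Namespace": 150,
    "/v1/ConfigMap": 2500,
    "admissionregistration.k8s.io/v1/ValidatingWebhookConfiguration": 8,
}


def shard_for(gvk: str, num_pods: int) -> int:
    """Stable hash, so every pod would compute the same GVK-to-pod assignment."""
    return zlib.crc32(gvk.encode("utf-8")) % num_pods


def load_per_pod(gvk_counts: dict, num_pods: int) -> list:
    """Total number of objects each pod would audit under GVK sharding."""
    loads = [0] * num_pods
    for gvk, count in gvk_counts.items():
        loads[shard_for(gvk, num_pods)] += count
    return loads


loads = load_per_pod(counts, 3)
print(loads)
```

Whichever pod is assigned the largest GVK carries at least that GVK's full count, so the slowest shard is bounded below by the biggest single kind no matter how many pods are added.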

Splitting up audit also implies non-trivial coordination problems. Currently we identify a unique audit run using a given audit ID. Are we expecting to maintain this convention? How would multiple pods build a consensus as to what ID to use? In the event one pod goes down, how do we know when it's appropriate to retire a given audit ID? These are solvable problems, but not simple ones.


stale bot commented Nov 12, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Nov 12, 2023
@stale stale bot closed this as completed Nov 26, 2023
@KKonak

KKonak commented Apr 8, 2024

@maxsmythe has this been considered any further, particularly to eliminate the single point of failure for the audit pod? I'm experiencing an issue where jobs get started during failure of the node hosting the audit pod, causing the jobs to not get mutated during this brief period of migration.

@maxsmythe
Contributor

from

causing the jobs to not get mutated during this brief period of migration.

I'm a bit confused: the audit pod should not impact mutation one way or another. It is certainly possible to run multiple mutating pods (G8r by default runs 3 in the same pod as the admission webhook IIRC).

@KKonak

KKonak commented Apr 13, 2024

I misunderstood the function of the audit pod and found a solution. We install Helm charts in the process of setting up our environment. With the preferred-during-scheduling affinity, all controllers were starting off on one node, and when that node failed they all rescheduled, leading to the temporary downtime I was seeing.

Thanks for the reply!
