v0.11.0
Version v0.11.0 released!
This release continues to improve performance and memory usage in large K8s clusters (> 5,000 pods) and provides some quality-of-life improvements. This release was tested against a large stress-testing cluster of 10,000 active pods.
- Updated internal dependencies.
- Improved logging at the `Info` level (`Info` will become the default level in a future release). Monitored injection status is now logged at the `Info` level to aid in tracking pods for which injection is pending (see the log-checking example after this list).
- Reduced the default operator event queue size, aimed at reducing retained memory during operator lag in huge clusters (> 30,000 tracked entities). In effect, this reduces Gen2 retained allocations, reducing the need for expensive Gen2 GC sweeps.
- Improved internal state indexing, reducing desired state calculations from an `O(N^3)` problem to an `O(N)` problem (see the sketch after this list). This change also significantly reduces memory complexity and reduces cluster lag in large clusters (> 5,000 pods). In effect, this increases calculation throughput by a factor of 50+ in large clusters while also reducing allocation traffic.
- Reduced allocations by improving data structure reuse and reducing closure usage along hot paths. In extreme cases, these changes significantly reduce promotion of objects from Gen0 to Gen2, reducing the need for expensive Gen2 GC sweeps.
- Increased the event stream watcher timeout (not user configurable) from 60 seconds to 10 minutes, reducing full-sync network traffic against the backplane. This may reduce load on the backplane in large clusters.
- Fixed the TLS key usage attributes of internally generated certificates to match the TLS 1.3 specification. Operator installations with incorrect certificates will automatically generate new certificates upon upgrading. This bug was found during internal testing and is not user facing, as the backplane does not appear to validate key usage at this time (see the certificate inspection example after this list).
- Speculative fix for the Agent Operator Helm chart to work around a bug found in AWS's K8s implementation, which prevented installation in `1.21` clusters.
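Pending injection status can now be followed directly from the operator's logs. A minimal check, assuming the operator runs as the `contrast-agent-operator` deployment in the `contrast-agent-operator` namespace (adjust both names to match your installation):

```sh
kubectl logs -n contrast-agent-operator deployment/contrast-agent-operator | grep -i injection
```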
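The state indexing change above can be illustrated with a minimal sketch. This is not the operator's code; the `Pod`, `Config`, and `Injector` records and the matching rules are hypothetical, and the sketch only demonstrates the general technique of replacing nested scans with prebuilt lookup tables:

```python
from dataclasses import dataclass

# Hypothetical record types for illustration only.
@dataclass
class Pod:
    name: str
    namespace: str

@dataclass
class Config:
    namespace: str
    language: str

@dataclass
class Injector:
    language: str
    image: str

# Naive approach: nested scans over every entity for every pod, roughly O(N^3)
# when pods, configs, and injectors all grow with cluster size.
def desired_states_naive(pods, configs, injectors):
    results = []
    for pod in pods:
        for cfg in configs:
            if cfg.namespace != pod.namespace:
                continue
            for inj in injectors:
                if inj.language == cfg.language:
                    results.append((pod.name, inj.image))
    return results

# Indexed approach: build lookup tables once (O(N)), then do an O(1) lookup per
# pod, making the whole calculation O(N). Assumes one config per namespace and
# one injector per language.
def desired_states_indexed(pods, configs, injectors):
    cfg_by_ns = {c.namespace: c for c in configs}
    inj_by_lang = {i.language: i for i in injectors}
    results = []
    for pod in pods:
        cfg = cfg_by_ns.get(pod.namespace)
        inj = inj_by_lang.get(cfg.language) if cfg else None
        if inj is not None:
            results.append((pod.name, inj.image))
    return results

if __name__ == "__main__":
    pods = [Pod("app-1", "team-a")]
    configs = [Config("team-a", "java")]
    injectors = [Injector("java", "contrast/agent-java")]
    assert desired_states_naive(pods, configs, injectors) == \
        desired_states_indexed(pods, configs, injectors)
```

Prebuilt indexes also cut allocation traffic, since each pod's lookup no longer materializes intermediate work inside nested loops.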
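To verify the key usage attributes of the operator's generated certificate after upgrading, the certificate can be inspected with standard tools. The secret name below is illustrative (check your installation for the actual TLS secret the operator maintains):

```sh
kubectl get secret contrast-agent-operator-tls -n contrast-agent-operator \
  -o jsonpath='{.data.tls\.crt}' | base64 -d \
  | openssl x509 -noout -text | grep -A1 "Key Usage"
```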
Known Issues
During dogfooding against our internal K8s clusters, we discovered that the TLS certificate fix can prevent newer instances of the operator from coming online during a K8s rolling deployment (due to failing health checks). This will be fixed in the next release, which will be published soon. Two workarounds can be used to continue upgrading:
- Scale the updated deployment down to 0 replicas, and then scale it back up to your standard replica count (see the kubectl example below).
- Delete and then recreate the deployment.
Upon starting and gaining a leader lock, the operator will update the TLS certificate and continue running. It is the policy of the Agent Operator to not require human intervention during point releases such as v0.10 to v0.11.
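For example, the first workaround, with a hypothetical deployment name and a standard replica count of 1 (adjust both to your installation):

```sh
kubectl scale deployment/contrast-agent-operator -n contrast-agent-operator --replicas=0
kubectl scale deployment/contrast-agent-operator -n contrast-agent-operator --replicas=1
```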
contrast/agent-operator:0.11.0
contrast/agent-operator@sha256:c298eb61975c82060b799c1b96390ab2d7087f60e64f8fc76a0a4a3cb4214bf9
quay.io/contrast/agent-operator:0.11.0
quay.io/contrast/agent-operator@sha256:c298eb61975c82060b799c1b96390ab2d7087f60e64f8fc76a0a4a3cb4214bf9