
stable-2.14.6 #11707

Merged
merged 6 commits into from
Dec 7, 2023

Conversation

olix0r
Member

@olix0r olix0r commented Dec 6, 2023

lhaussknecht and others added 4 commits December 6, 2023 21:42
Fixes #11285

Add an `imagePullSecrets` value to the linkerd-multicluster and linkerd-multicluster-link charts so that image pull secrets can be specified for the multicluster extension.

---------

Signed-off-by: Louis Haußknecht <[email protected]>
If the gateway address was a name resolving to multiple addresses, the service-mirror previously took only one of those addresses. This could lead to downtime if the gateway behind the single IP picked by the service-mirror went down. This commit fixes that behavior by considering all of the resolved IPs.

Now that we consider all the IPs returned by DNS, we make sure they are sorted lexicographically so that resolveGatewayAddress produces a stable output.

Signed-off-by: Arthur Outhenin-Chalandre <[email protected]>
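
A minimal sketch of the new resolution behavior (the shape of `resolveGatewayAddress` here is illustrative; the actual service-mirror code differs in structure and error handling):

```go
import (
	"context"
	"net"
	"sort"
)

// resolveGatewayAddress looks up every IP behind the gateway hostname instead
// of picking a single one, and sorts the result so repeated resolutions of the
// same record set yield a stable output.
func resolveGatewayAddress(ctx context.Context, gatewayHost string) ([]string, error) {
	ips, err := net.DefaultResolver.LookupIPAddr(ctx, gatewayHost)
	if err != nil {
		return nil, err
	}
	addrs := make([]string, 0, len(ips))
	for _, ip := range ips {
		addrs = append(addrs, ip.IP.String())
	}
	sort.Strings(addrs)
	return addrs, nil
}
```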
* Improve klog (client-go logs) handling

Currently, log entries in the Go-based controllers that originate in the client-go library are only visible if the controller's log level is `debug`. At lower levels (`info`, `warn`, `error`) we lose possibly important information; at `debug` we receive a flood of mostly irrelevant entries.

client-go uses [klog](https://github.com/kubernetes/klog) for its logs, which relies on [logr](https://github.com/go-logr/logr) as its backend. logr is just an interface that allows different implementations; our controllers use logrus, so this change brings in the [logrusr](https://github.com/bombsimon/logrusr) library, which provides a logrus-backed implementation of logr.

The klog verbosity level is also tweaked slightly, following the k8s [logging conventions](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-instrumentation/logging.md).

Also, given that logging output is now handled completely by logrus, some of the klog config flags are no longer required.
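
For reference, here is a minimal sketch of how klog can be routed through logrus via the logr bridge (the `initKlog` helper and the verbosity value are illustrative assumptions, not the exact controller code):

```go
import (
	"flag"

	"github.com/bombsimon/logrusr/v4"
	log "github.com/sirupsen/logrus"
	"k8s.io/klog/v2"
)

// initKlog is a hypothetical helper: it sends all klog (client-go) output
// through the process-wide logrus logger, and only raises klog verbosity
// when logrus itself is running at debug level.
func initKlog() {
	klog.SetLogger(logrusr.New(log.StandardLogger()))

	klogFlags := flag.NewFlagSet("klog", flag.ContinueOnError)
	klog.InitFlags(klogFlags)
	if log.GetLevel() >= log.DebugLevel {
		// Level 6 roughly corresponds to client-go request traces in the
		// upstream k8s logging conventions.
		_ = klogFlags.Set("v", "6")
	}
}
```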

As an example of the new behavior, when the RBAC permission to list pods is removed, the proxy-injector logs now show the complaints coming from client-go, which were previously invisible at `info` level:

```console
time="2023-11-17T20:58:30Z" level=info msg="running version dev-94ad50cd-alpeb"
time="2023-11-17T20:58:30Z" level=info msg="starting admin server on :9995"
time="2023-11-17T20:58:30Z" level=info msg="listening at :8443"
time="2023-11-17T20:58:30Z" level=info msg="waiting for caches to sync"
time="2023-11-17T20:58:30Z" level=info msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: failed to list *v1.PartialObjectMetadata: pods is forbidden: User \"system:serviceaccount:linkerd:linkerd-proxy-injector\" cannot list resource \"pods\" in API group \"\" at the cluster scope"
time="2023-11-17T20:58:30Z" level=error msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: pods is forbidden: User \"system:serviceaccount:linkerd:linkerd-proxy-injector\" cannot list resource \"pods\" in API group \"\" at the cluster scope" error="<nil>"
time="2023-11-17T20:58:32Z" level=info msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: failed to list *v1.PartialObjectMetadata: pods is forbidden: User \"system:serviceaccount:linkerd:linkerd-proxy-injector\" cannot list resource \"pods\" in API group \"\" at the cluster scope"
time="2023-11-17T20:58:32Z" level=error msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: pods is forbidden: User \"system:serviceaccount:linkerd:linkerd-proxy-injector\" cannot list resource \"pods\" in API group \"\" at the cluster scope" error="<nil>"
time="2023-11-17T20:58:34Z" level=info msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: failed to list *v1.PartialObjectMetadata: pods is forbidden: User \"system:serviceaccount:linkerd:linkerd-proxy-injector\" cannot list resource \"pods\" in API group \"\" at the cluster scope"
time="2023-11-17T20:58:34Z" level=error msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: pods is forbidden: User \"system:serviceaccount:linkerd:linkerd-proxy-injector\" cannot list resource \"pods\" in API group \"\" at the cluster scope" error="<nil>"

time="2023-11-17T20:58:40Z" level=info msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: failed to list *v1.PartialObjectMetadata: pods is forbidden: User \"system:serviceaccount:linkerd:linkerd-proxy-injector\" cannot list resource \"pods\" in API group \"\" at the cluster scope"
time="2023-11-17T20:58:40Z" level=error msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: pods is forbidden: User \"system:serviceaccount:linkerd:linkerd-proxy-injector\" cannot list resource \"pods\" in API group \"\" at the cluster scope" error="<nil>"
time="2023-11-17T20:58:48Z" level=info msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: failed to list *v1.PartialObjectMetadata: pods is forbidden: User \"system:serviceaccount:linkerd:linkerd-proxy-injector\" cannot list resource \"pods\" in API group \"\" at the cluster scope"
time="2023-11-17T20:58:48Z" level=error msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: pods is forbidden: User \"system:serviceaccount:linkerd:linkerd-proxy-injector\" cannot list resource \"pods\" in API group \"\" at the cluster scope" error="<nil>"
time="2023-11-17T20:59:05Z" level=info msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: failed to list *v1.PartialObjectMetadata: pods is forbidden: User \"system:serviceaccount:linkerd:linkerd-proxy-injector\" cannot list resource \"pods\" in API group \"\" at the cluster scope"
time="2023-11-17T20:59:05Z" level=error msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: pods is forbidden: User \"system:serviceaccount:linkerd:linkerd-proxy-injector\" cannot list resource \"pods\" in API group \"\" at the cluster scope" error="<nil>"
time="2023-11-17T20:59:30Z" level=fatal msg="failed to sync caches"
```
* Add ability to configure client-go's `QPS` and `Burst` settings

## Problem and Symptoms

When a very large number of proxies request identity in a short period of time (e.g. during large node scaling events), the identity controller attempts to validate the tokens sent by the proxies at a rate surpassing client-go's default request rate threshold. This triggers client-side throttling, which delays the proxies' initialization and can even fail their startup (after a 2m timeout). The identity controller surfaces this through log entries like this:

```
time="2023-11-08T19:50:45Z" level=error msg="error validating token for web.emojivoto.serviceaccount.identity.linkerd.cluster.local: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline"
```

## Solution

Client-go's default `QPS` is 5 and `Burst` is 10. This PR exposes those settings as entries in `values.yaml`, with defaults of 100 and 200 respectively. Note this only applies to the identity controller, as it's the only controller performing direct requests to the `kube-apiserver` in a hot path. The other controllers mostly rely on informers, and their direct calls are sporadic.
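
For illustration, this is roughly where such settings plug into client-go (the `newClientSet` helper is a hypothetical sketch, not the actual Linkerd code):

```go
import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// newClientSet is a hypothetical helper showing where QPS/Burst land on the
// rest.Config used to build the Kubernetes client.
func newClientSet(kubeConfigPath string, qps float32, burst int) (*kubernetes.Clientset, error) {
	config, err := clientcmd.BuildConfigFromFlags("", kubeConfigPath)
	if err != nil {
		return nil, err
	}
	// client-go defaults to QPS=5 and Burst=10; raising them avoids
	// client-side throttling during bursts of Certify requests.
	config.QPS = qps
	config.Burst = burst
	return kubernetes.NewForConfig(config)
}
```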

## Observability

The `QPS` and `Burst` settings in use are exposed both in a log entry emitted as soon as the controller starts, and in the new metric gauges `http_client_qps` and `http_client_burst`.
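
A sketch of how gauges like these can be registered and set with the Prometheus Go client (the metric names come from this PR; the `registerClientMetrics` helper is an assumption):

```go
import "github.com/prometheus/client_golang/prometheus"

var (
	qpsGauge = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "http_client_qps",
		Help: "QPS configured for the Kubernetes API client.",
	})
	burstGauge = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "http_client_burst",
		Help: "Burst configured for the Kubernetes API client.",
	})
)

// registerClientMetrics is a hypothetical helper that records the configured
// values so they can be scraped alongside the controller's other metrics.
func registerClientMetrics(qps float32, burst int) {
	prometheus.MustRegister(qpsGauge, burstGauge)
	qpsGauge.Set(float64(qps))
	burstGauge.Set(float64(burst))
}
```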

## Testing

You can use the following k6 script, which simulates 6k calls to the `Certify` service over one minute from emojivoto's web pod. Before running this you need to:

- Put the identity.proto and [all the other proto files](https://github.com/linkerd/linkerd2-proxy-api/tree/v0.11.0/proto) in the same directory.
- Edit the [checkRequest](https://github.com/linkerd/linkerd2/blob/edge-23.11.3/pkg/identity/service.go#L266) function and add logging statements to capture the `token` and `csr` values you can use here; they will be shown as soon as a web pod starts.

```javascript
import { Client, Stream } from 'k6/experimental/grpc';
import { sleep } from 'k6';

const client = new Client();
client.load(['.'], 'identity.proto');

// This always holds:
// req_num = (1 / req_duration ) * duration * VUs
// Given req_duration (0.5s) test duration (1m) and the target req_num (6k), we
// can solve for the required VUs:
// VUs = req_num * req_duration / duration
// VUs = 6000 * 0.5 / 60 = 50
export const options = {
  scenarios: {
    identity: {
      executor: 'constant-vus',
      vus: 50,
      duration: '1m',
    },
  },
};

export default () => {
  client.connect('localhost:8080', {
    plaintext: true,
  });

  const stream = new Stream(client, 'io.linkerd.proxy.identity.Identity/Certify');

  // Replace with your own token
  let token = "ZXlKaGJHY2lPaUpTVXpJMU5pSXNJbXRwWkNJNkluQjBaV1pUZWtaNWQyVm5OMmxmTTBkV2VUTlhWSFpqTmxwSmJYRmtNMWRSVEhwNVNHWllhUzFaZDNNaWZRLmV5SmhkV1FpT2xzaWFXUmxiblJwZEhrdWJEVmtMbWx2SWwwc0ltVjRjQ0k2TVRjd01EWTRPVFk1TUN3aWFXRjBJam94TnpBd05qQXpNamt3TENKcGMzTWlPaUpvZEhSd2N6b3ZMMnQxWW1WeWJtVjBaWE11WkdWbVlYVnNkQzV6ZG1NdVkyeDFjM1JsY2k1c2IyTmhiQ0lzSW10MVltVnlibVYwWlhNdWFXOGlPbnNpYm1GdFpYTndZV05sSWpvaVpXMXZhbWwyYjNSdklpd2ljRzlrSWpwN0ltNWhiV1VpT2lKM1pXSXRPRFUxT1dJNU4yWTNZeTEwYldJNU5TSXNJblZwWkNJNklqaGlZbUV5WWpsbExXTXdOVGN0TkRnMk1TMWhNalZsTFRjelpEY3dOV1EzWmpoaU1TSjlMQ0p6WlhKMmFXTmxZV05qYjNWdWRDSTZleUp1WVcxbElqb2lkMlZpSWl3aWRXbGtJam9pWm1JelpUQXlNRE10TmpZMU55MDBOMk0xTFRoa09EUXRORGt6WXpBM1lXUTJaak0zSW4xOUxDSnVZbVlpT2pFM01EQTJNRE15T1RBc0luTjFZaUk2SW5ONWMzUmxiVHB6WlhKMmFXTmxZV05qYjNWdWREcGxiVzlxYVhadmRHODZkMlZpSW4wLnlwMzAzZVZkeHhpamxBOG1wVjFObGZKUDB3SC03RmpUQl9PcWJ3NTNPeGU1cnNTcDNNNk96VWR6OFdhYS1hcjNkVVhQR2x2QXRDRVU2RjJUN1lKUFoxVmxxOUFZZTNvV2YwOXUzOWRodUU1ZDhEX21JUl9rWDUxY193am9UcVlORHA5ZzZ4ZFJNcW9reGg3NE9GNXFjaEFhRGtENUJNZVZ6a25kUWZtVVZwME5BdTdDMTZ3UFZWSlFmNlVXRGtnYkI1SW9UQXpxSmcyWlpyNXBBY3F5enJ0WE1rRkhSWmdvYUxVam5sN1FwX0ljWm8yYzJWWk03T2QzRjIwcFZaVzJvejlOdGt3THZoSEhSMkc5WlNJQ3RHRjdhTkYwNVR5ZC1UeU1BVnZMYnM0ZFl1clRYaHNORjhQMVk4RmFuNjE4d0x6ZUVMOUkzS1BJLUctUXRUNHhWdw==";
  // Replace with your own CSR
  let csr = "MIIBWjCCAQECAQAwRjFEMEIGA1UEAxM7d2ViLmVtb2ppdm90by5zZXJ2aWNlYWNjb3VudC5pZGVudGl0eS5saW5rZXJkLmNsdXN0ZXIubG9jYWwwWTATBgcqhkjOPQIBBggqhkjOPQMBBwNCAATKjgVXu6F+WCda3Bbq2ue6m3z6OTMfQ4Vnmekmvirip/XGyi2HbzRzjARnIzGlG8wo4EfeYBtd2MBCb50kP8F8oFkwVwYJKoZIhvcNAQkOMUowSDBGBgNVHREEPzA9gjt3ZWIuZW1vaml2b3RvLnNlcnZpY2VhY2NvdW50LmlkZW50aXR5LmxpbmtlcmQuY2x1c3Rlci5sb2NhbDAKBggqhkjOPQQDAgNHADBEAiAM7aXY8MRs/EOhtPo4+PRHuiNOV+nsmNDv5lvtJt8T+QIgFP5JAq0iq7M6ShRNkRG99ZquJ3L3TtLWMNVTPvqvvUE=";

  const data = {
    identity:                     "web.emojivoto.serviceaccount.identity.linkerd.cluster.local",
    token:                        token,
    certificate_signing_request:  csr,
  };
  stream.write(data);

  // This request takes around 2ms, so this sleep will mostly determine its final duration
  sleep(0.5);
};
```

This results in the following report:

```
scenarios: (100.00%) 1 scenario, 50 max VUs, 1m30s max duration (incl. graceful stop):
           * identity: 50 looping VUs for 1m0s (gracefulStop: 30s)

     data_received................: 6.3 MB 104 kB/s
     data_sent....................: 9.4 MB 156 kB/s
     grpc_req_duration............: avg=2.14ms   min=873.93µs med=1.9ms    max=12.89ms  p(90)=3.13ms   p(95)=3.86ms
     grpc_streams.................: 6000   99.355331/s
     grpc_streams_msgs_received...: 6000   99.355331/s
     grpc_streams_msgs_sent.......: 6000   99.355331/s
     iteration_duration...........: avg=503.16ms min=500.8ms  med=502.64ms max=532.36ms p(90)=504.05ms p(95)=505.72ms
     iterations...................: 6000   99.355331/s
     vus..........................: 50     min=50      max=50
     vus_max......................: 50     min=50      max=50

running (1m00.4s), 00/50 VUs, 6000 complete and 0 interrupted iterations
```

With the old defaults (QPS=5 and Burst=10), the latencies would be much higher and the number of completed requests much lower.
@olix0r olix0r requested a review from a team as a code owner December 6, 2023 22:02
@olix0r olix0r changed the title v2.14.6 stable-2.14.6 Dec 6, 2023
Member

@mateiidavid mateiidavid left a comment

Notes look good to me. We also need to bump the chart versions for the core extensions & control plane.

Member

@alpeb alpeb left a comment

Do you mind adding the PR/issue links to the notes?

* Bump prometheus to v2.48.0

This gets rid of most CVEs:

```bash
$ grype -q prom/prometheus:v2.47.0
NAME                                                           INSTALLED             FIXED-IN  TYPE       VULNERABILITY        SEVERITY
github.com/docker/docker                                       v24.0.4+incompatible  24.0.7    go-module  GHSA-jq35-85cj-fj4p  Medium
github.com/prometheus/alertmanager                             v0.25.0               0.25.1    go-module  GHSA-v86x-5fm3-5p7j  Medium
github.com/prometheus/alertmanager                             v0.25.0                         go-module  CVE-2023-40577       Medium
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp  v0.42.0               0.44.0    go-module  GHSA-rcjv-mgp8-qvmr  High
golang.org/x/net                                               v0.12.0               0.17.0    go-module  GHSA-4374-p667-p6c8  High
golang.org/x/net                                               v0.12.0               0.17.0    go-module  GHSA-qppj-fm5r-hxr3  Medium
golang.org/x/net                                               v0.12.0               0.13.0    go-module  GHSA-2wrh-6pvc-2jm9  Medium
google.golang.org/grpc                                         v1.56.2               1.56.3    go-module  GHSA-m425-mq94-257g  High
google.golang.org/grpc                                         v1.56.2               1.56.3    go-module  GHSA-qppj-fm5r-hxr3  Medium

$ grype -q prom/prometheus:v2.48.0
NAME                      INSTALLED             FIXED-IN  TYPE       VULNERABILITY        SEVERITY
github.com/docker/docker  v24.0.6+incompatible  24.0.7    go-module  GHSA-jq35-85cj-fj4p  Medium
```
@olix0r olix0r force-pushed the ver/stable-2.14.6 branch from f5c66a5 to 75fb714 Compare December 7, 2023 14:22
Member

@alpeb alpeb left a comment

Sweet! Only thing left is the linkerd-control-plane chart version bump.

* 0a72f1f Add imagePullSecrets to the multicluster chart. (#11287)
* 284d76b service-mirror: support gateway resolving to multiple addresses (#11499)
* 64bccd9 Improve klog (client-go logs) handling (#11632)
* 6a07e2c Add ability to configure client-go's `QPS` and `Burst` settings (#11644)
* e294c78 Bump prometheus to v2.48.0 (#11633)
* b24b0e97c stable-2.14.6
@olix0r olix0r force-pushed the ver/stable-2.14.6 branch from 75fb714 to 9893216 Compare December 7, 2023 16:02
@olix0r olix0r merged commit 11b5ab8 into release/stable-2.14 Dec 7, 2023
35 checks passed
@olix0r olix0r deleted the ver/stable-2.14.6 branch December 7, 2023 18:12