stable-2.14.6 #11707
Conversation
* Fixes #11285. Add an imagePullSecrets value to the linkerd-multicluster and linkerd-multicluster-link charts so that imagePullSecrets can be specified for the multicluster extension. Signed-off-by: Louis Haußknecht <[email protected]>
* If the gateway address was a name resolving to multiple addresses, the service-mirror previously took only one of those addresses. This behavior could lead to downtime if the gateway behind the single IP picked by the service-mirror went down. This commit fixes that by considering all of the resolved IPs. Since we now consider every IP returned by DNS, we also sort them lexicographically so that resolveGatewayAddress produces a stable output. Signed-off-by: Arthur Outhenin-Chalandre <[email protected]>
* Improve klog (client-go logs) handling

Currently, log entries in the Go-based controllers that originate in the client-go library are only visible if the controller's log level is `debug`. If the log level is lower (`info`, `warn`, `error`), we lose possibly important information; if it is `debug`, we receive a lot of entries, mostly irrelevant. client-go uses [klog](https://github.com/kubernetes/klog) for its logs, which relies on [logr](https://github.com/go-logr/logr) as its backend; logr is just an interface allowing different implementations, one of which is logrus, which we use in our controllers. So in this change we bring in the [logrusr](https://github.com/bombsimon/logrusr) library, which implements the logr interface on top of logrus. The verbosity level for klog is also tweaked a little, following the k8s [logging conventions](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-instrumentation/logging.md). Also, given that logging output is now handled completely by logrus, some of the klog config flags are no longer required.

As an example of the new behavior, in the proxy-injector logs, after removing the RBAC permission to list pods, we now see the complaints coming from client-go, which were previously invisible at `info` level:

```console
time="2023-11-17T20:58:30Z" level=info msg="running version dev-94ad50cd-alpeb"
time="2023-11-17T20:58:30Z" level=info msg="starting admin server on :9995"
time="2023-11-17T20:58:30Z" level=info msg="listening at :8443"
time="2023-11-17T20:58:30Z" level=info msg="waiting for caches to sync"
time="2023-11-17T20:58:30Z" level=info msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: failed to list *v1.PartialObjectMetadata: pods is forbidden: User \"system:serviceaccount:linkerd:linkerd-proxy-injector\" cannot list resource \"pods\" in API group \"\" at the cluster scope"
time="2023-11-17T20:58:30Z" level=error msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: pods is forbidden: User \"system:serviceaccount:linkerd:linkerd-proxy-injector\" cannot list resource \"pods\" in API group \"\" at the cluster scope" error="<nil>"
time="2023-11-17T20:58:32Z" level=info msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: failed to list *v1.PartialObjectMetadata: pods is forbidden: User \"system:serviceaccount:linkerd:linkerd-proxy-injector\" cannot list resource \"pods\" in API group \"\" at the cluster scope"
time="2023-11-17T20:58:32Z" level=error msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: pods is forbidden: User \"system:serviceaccount:linkerd:linkerd-proxy-injector\" cannot list resource \"pods\" in API group \"\" at the cluster scope" error="<nil>"
time="2023-11-17T20:58:34Z" level=info msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: failed to list *v1.PartialObjectMetadata: pods is forbidden: User \"system:serviceaccount:linkerd:linkerd-proxy-injector\" cannot list resource \"pods\" in API group \"\" at the cluster scope"
time="2023-11-17T20:58:34Z" level=error msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: pods is forbidden: User \"system:serviceaccount:linkerd:linkerd-proxy-injector\" cannot list resource \"pods\" in API group \"\" at the cluster scope" error="<nil>"
time="2023-11-17T20:58:40Z" level=info msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: failed to list *v1.PartialObjectMetadata: pods is forbidden: User \"system:serviceaccount:linkerd:linkerd-proxy-injector\" cannot list resource \"pods\" in API group \"\" at the cluster scope"
time="2023-11-17T20:58:40Z" level=error msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: pods is forbidden: User \"system:serviceaccount:linkerd:linkerd-proxy-injector\" cannot list resource \"pods\" in API group \"\" at the cluster scope" error="<nil>"
time="2023-11-17T20:58:48Z" level=info msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: failed to list *v1.PartialObjectMetadata: pods is forbidden: User \"system:serviceaccount:linkerd:linkerd-proxy-injector\" cannot list resource \"pods\" in API group \"\" at the cluster scope"
time="2023-11-17T20:58:48Z" level=error msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: pods is forbidden: User \"system:serviceaccount:linkerd:linkerd-proxy-injector\" cannot list resource \"pods\" in API group \"\" at the cluster scope" error="<nil>"
time="2023-11-17T20:59:05Z" level=info msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: failed to list *v1.PartialObjectMetadata: pods is forbidden: User \"system:serviceaccount:linkerd:linkerd-proxy-injector\" cannot list resource \"pods\" in API group \"\" at the cluster scope"
time="2023-11-17T20:59:05Z" level=error msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: pods is forbidden: User \"system:serviceaccount:linkerd:linkerd-proxy-injector\" cannot list resource \"pods\" in API group \"\" at the cluster scope" error="<nil>"
time="2023-11-17T20:59:30Z" level=fatal msg="failed to sync caches"
```
* Add ability to configure client-go's `QPS` and `Burst` settings

## Problem and Symptoms

When a very large number of proxies request identity in a short period of time (e.g. during large node scaling events), the identity controller will attempt to validate the tokens sent by the proxies at a rate surpassing client-go's default request rate threshold, triggering client-side throttling. This delays the proxies' initialization and can even fail their startup (after a 2m timeout). The identity controller surfaces this through log entries like this:

```
time="2023-11-08T19:50:45Z" level=error msg="error validating token for web.emojivoto.serviceaccount.identity.linkerd.cluster.local: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline"
```

## Solution

Client-go's default `QPS` is 5 and `Burst` is 10. This PR exposes those settings as entries in `values.yaml`, with defaults of 100 and 200 respectively. Note this only applies to the identity controller, as it's the only controller performing direct requests to the `kube-apiserver` in a hot path. The other controllers mostly rely on informers, and direct calls are sporadic. (A short sketch of how these settings map onto client-go follows the test report below.)

## Observability

The `QPS` and `Burst` settings in use are exposed both as a log entry as soon as the controller starts and as the new metric gauges `http_client_qps` and `http_client_burst`.

## Testing

You can use the following K6 script, which simulates 6k calls to the `Certify` service during one minute from emojivoto's web pod. Before running this you need to:

- Put the identity.proto and [all the other proto files](https://github.com/linkerd/linkerd2-proxy-api/tree/v0.11.0/proto) in the same directory.
- Edit the [checkRequest](https://github.com/linkerd/linkerd2/blob/edge-23.11.3/pkg/identity/service.go#L266) function, adding logging statements to figure out the `token` and `csr` entries you can use here; they will be shown as soon as a web pod starts.
```javascript
import { Client, Stream } from 'k6/experimental/grpc';
import { sleep } from 'k6';

const client = new Client();
client.load(['.'], 'identity.proto');

// This always holds:
// req_num = (1 / req_duration) * duration * VUs
// Given req_duration (0.5s), test duration (1m) and the target req_num (6k), we
// can solve for the required VUs:
// VUs = req_num * req_duration / duration
// VUs = 6000 * 0.5 / 60 = 50
export const options = {
  scenarios: {
    identity: {
      executor: 'constant-vus',
      vus: 50,
      duration: '1m',
    },
  },
};

export default () => {
  client.connect('localhost:8080', {
    plaintext: true,
  });

  const stream = new Stream(client, 'io.linkerd.proxy.identity.Identity/Certify');

  // Replace with your own token
  let token = "ZXlKaGJHY2lPaUpTVXpJMU5pSXNJbXRwWkNJNkluQjBaV1pUZWtaNWQyVm5OMmxmTTBkV2VUTlhWSHpqTmxwSmJYRmtNMWRSVEhwNVNHWllhUzFaZDNNaWZRLmV5SmhkV1FpT2xzaWFXUmxiblJwZEhrdWJEVmtMbWx2SWwwc0ltVjRjQ0k2TVRjd01EWTRPVFk1TUN3aWFXRjBJam94TnpBd05qQXpNamt3TENKcGMzTWlPaUpvZEhSd2N6b3ZMMnQxWW1WeWJtVjBaWE11WkdWbVlYVnNkQzV6ZG1NdVkyeDFjM1JsY2k1c2IyTmhiQ0lzSW10MVltVnlibVYwWlhNdWFXOGlPbnNpYm1GdFpYTndZV05sSWpvaVpXMXZhbWwyYjNSdklpd2ljRzlrSWpwN0ltNWhiV1VpT2lKM1pXSXRPRFUxT1dJNU4yWTNZeTEwYldJNU5TSXNJblZwWkNJNklqaGlZbUV5WWpsbExXTXdOVGN0TkRnMk1TMWhNalZsTFRjelpEY3dOV1EzWmpoaU1TSjlMQ0p6WlhKMmFXTmxZV05qYjNWdWRDSTZleUp1WVcxbElqb2lkMlZpSWl3aWRXbGtJam9pWm1JelpUQXlNRE10TmpZMU55MDBOMk0xTFRoa09EUXRORGt6WXpBM1lXUTJaak0zSW4xOUxDSnVZbVlpT2pFM01EQTJNRE15T1RBc0luTjFZaUk2SW5ONWMzUmxiVHB6WlhKMmFXTmxZV05qYjNWdWREcGxiVzlxYVhadmRHODZkMlZpSW4wLnlwMzAzZVZkeHhpamxBOG1wVjFObGZKUDB3SC03RmpUQl9PcWJ3NTNPeGU1cnNTcDNNNk96VWR6OFdhYS1hcjNkVVhQR2x2QXRDRVU2RjJUN1lKUFoxVmxxOUFZZTNvV2YwOXUzOWRodUU1ZDhEX21JUl9rWDUxY193am9UcVlORHA5ZzZ4ZFJNcW9reGg3NE9GNXFjaEFhRGtENUJNZVZ6a25kUWZtVVZwME5BdTdDMTZ3UFZWSlFmNlVXRGtnYkI1SW9UQXpxSmcyWlpyNXBBY3F5enJ0WE1rRkhSWmdvYUxVam5sN1FwX0ljWm8yYzJWWk03T2QzRjIwcFZaVzJvejlOdGt3THZoSEhSMkc5WlNJQ3RHRjdhTkYwNVR5ZC1UeU1BVnZMYnM0ZFl1clRYaHNORjhQMVk4RmFuNjE4d0x6ZUVMOUkzS1BJLUctUXRUNHhWdw==";

  // Replace with your own CSR
  let csr = "MIIBWjCCAQECAQAwRjFEMEIGA1UEAxM7d2ViLmVtb2ppdm90by5zZXJ2aWNlYWNjb3VudC5pZGVudGl0eS5saW5rZXJkLmNsdXN0ZXIubG9jYWwwWTATBgcqhkjOPQIBBggqhkjOPQMBBwNCAATKjgVXu6F+WCda3Bbq2ue6m3z6OTMfQ4Vnmekmvirip/XGyi2HbzRzjARnIzGlG8wo4EfeYBtd2MBCb50kP8F8oFkwVwYJKoZIhvcNAQkOMUowSDBGBgNVHREEPzA9gjt3ZWIuZW1vaml2b3RvLnNlcnZpY2VhY2NvdW50LmlkZW50aXR5LmxpbmtlcmQuY2x1c3Rlci5sb2NhbDAKBggqhkjOPQQDAgNHADBEAiAM7aXY8MRs/EOhtPo4+PRHuiNOV+nsmNDv5lvtJt8T+QIgFP5JAq0iq7M6ShRNkRG99ZquJ3L3TtLWMNVTPvqvvUE=";

  const data = {
    identity: "web.emojivoto.serviceaccount.identity.linkerd.cluster.local",
    token: token,
    certificate_signing_request: csr,
  };

  stream.write(data);

  // This request takes around 2ms, so this sleep will mostly determine its final duration
  sleep(0.5);
};
```

This results in the following report:

```
  scenarios: (100.00%) 1 scenario, 50 max VUs, 1m30s max duration (incl. graceful stop):
           * identity: 50 looping VUs for 1m0s (gracefulStop: 30s)

  data_received................: 6.3 MB 104 kB/s
  data_sent....................: 9.4 MB 156 kB/s
  grpc_req_duration............: avg=2.14ms   min=873.93µs med=1.9ms    max=12.89ms  p(90)=3.13ms   p(95)=3.86ms
  grpc_streams.................: 6000   99.355331/s
  grpc_streams_msgs_received...: 6000   99.355331/s
  grpc_streams_msgs_sent.......: 6000   99.355331/s
  iteration_duration...........: avg=503.16ms min=500.8ms  med=502.64ms max=532.36ms p(90)=504.05ms p(95)=505.72ms
  iterations...................: 6000   99.355331/s
  vus..........................: 50     min=50      max=50
  vus_max......................: 50     min=50      max=50

running (1m00.4s), 00/50 VUs, 6000 complete and 0 interrupted iterations
```

With the old defaults (QPS=5 and Burst=10), the latencies would be much higher and the number of complete requests much lower.
Notes look good to me. We also need to bump the chart versions for the core extensions & control plane.
Do you mind adding the PR/issues links to the notes?
* Bump prometheus to v2.48.0

This gets rid of most CVEs:

```bash
$ grype -q prom/prometheus:v2.47.0
NAME                                                            INSTALLED             FIXED-IN  TYPE       VULNERABILITY        SEVERITY
github.com/docker/docker                                        v24.0.4+incompatible  24.0.7    go-module  GHSA-jq35-85cj-fj4p  Medium
github.com/prometheus/alertmanager                              v0.25.0               0.25.1    go-module  GHSA-v86x-5fm3-5p7j  Medium
github.com/prometheus/alertmanager                              v0.25.0                         go-module  CVE-2023-40577       Medium
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp   v0.42.0               0.44.0    go-module  GHSA-rcjv-mgp8-qvmr  High
golang.org/x/net                                                v0.12.0               0.17.0    go-module  GHSA-4374-p667-p6c8  High
golang.org/x/net                                                v0.12.0               0.17.0    go-module  GHSA-qppj-fm5r-hxr3  Medium
golang.org/x/net                                                v0.12.0               0.13.0    go-module  GHSA-2wrh-6pvc-2jm9  Medium
google.golang.org/grpc                                          v1.56.2               1.56.3    go-module  GHSA-m425-mq94-257g  High
google.golang.org/grpc                                          v1.56.2               1.56.3    go-module  GHSA-qppj-fm5r-hxr3  Medium

$ grype -q prom/prometheus:v2.48.0
NAME                      INSTALLED             FIXED-IN  TYPE       VULNERABILITY        SEVERITY
github.com/docker/docker  v24.0.6+incompatible  24.0.7    go-module  GHSA-jq35-85cj-fj4p  Medium
```
Force-pushed from f5c66a5 to 75fb714
Sweet! Only thing left is the linkerd-control-plane chart version bump.
* 0a72f1f Add imagePullSecrets to the multicluster chart. (#11287)
* 284d76b service-mirror: support gateway resolving to multiple addresses (#11499)
* 64bccd9 Improve klog (client-go logs) handling (#11632)
* 6a07e2c Add ability to configure client-go's `QPS` and `Burst` settings (#11644)
* e294c78 Bump prometheus to v2.48.0 (#11633)
* b24b0e97c stable-2.14.6
Force-pushed from 75fb714 to 9893216