
stable-2.14.6 #11707

Merged
merged 6 commits into from
Dec 7, 2023

Conversation

olix0r
Member

@olix0r olix0r commented Dec 6, 2023

lhaussknecht and others added 4 commits December 6, 2023 21:42
Fixes #11285

Add an `imagePullSecrets` value to the linkerd-multicluster and linkerd-multicluster-link charts so that image pull secrets can be specified for the multicluster extension.

---------

Signed-off-by: Louis Haußknecht <[email protected]>
If the gateway address was a name resolving to multiple addresses, the service-mirror previously took only one of those addresses. This could lead to downtime if the gateway behind the single IP picked by the service-mirror went down. This commit fixes that behavior by considering all of the resolved IPs.

Now that we consider all the IPs returned by DNS, we make sure they are sorted lexicographically so that resolveGatewayAddress produces a stable output.

Signed-off-by: Arthur Outhenin-Chalandre <[email protected]>
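
A minimal sketch of the new resolution behavior (the shape of `resolveGatewayAddress` here is illustrative; the actual service-mirror code differs in structure and error handling):

```go
import (
	"context"
	"net"
	"sort"
)

// resolveGatewayAddress looks up every IP behind the gateway hostname instead
// of picking a single one, and sorts the result so repeated resolutions of the
// same record set yield a stable output.
func resolveGatewayAddress(ctx context.Context, gatewayHost string) ([]string, error) {
	ips, err := net.DefaultResolver.LookupIPAddr(ctx, gatewayHost)
	if err != nil {
		return nil, err
	}
	addrs := make([]string, 0, len(ips))
	for _, ip := range ips {
		addrs = append(addrs, ip.IP.String())
	}
	sort.Strings(addrs)
	return addrs, nil
}
```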
* Improve klog (client-go logs) handling

Currently, log entries in the Go-based controllers that originate in the client-go library are only visible if the controller's log level is `debug`. At lower levels (`info`, `warn`, `error`) we lose possibly important information; at `debug` we receive a flood of mostly irrelevant entries.

client-go uses [klog](https://github.com/kubernetes/klog) for its logs, which relies on [logr](https://github.com/go-logr/logr) as its backend. logr is just an interface that allows different implementations; our controllers use logrus, so this change brings in the [logrusr](https://github.com/bombsimon/logrusr) library, which provides a logrus-backed implementation of logr.

The klog verbosity level is also tweaked slightly, following the k8s [logging conventions](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-instrumentation/logging.md).

Also, given that logging output is now handled completely by logrus, some of the klog config flags are no longer required.
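
For reference, here is a minimal sketch of how klog can be routed through logrus via the logr bridge (the `initKlog` helper and the verbosity value are illustrative assumptions, not the exact controller code):

```go
import (
	"flag"

	"github.com/bombsimon/logrusr/v4"
	log "github.com/sirupsen/logrus"
	"k8s.io/klog/v2"
)

// initKlog is a hypothetical helper: it sends all klog (client-go) output
// through the process-wide logrus logger, and only raises klog verbosity
// when logrus itself is running at debug level.
func initKlog() {
	klog.SetLogger(logrusr.New(log.StandardLogger()))

	klogFlags := flag.NewFlagSet("klog", flag.ContinueOnError)
	klog.InitFlags(klogFlags)
	if log.GetLevel() >= log.DebugLevel {
		// Level 6 roughly corresponds to client-go request traces in the
		// upstream k8s logging conventions.
		_ = klogFlags.Set("v", "6")
	}
}
```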

As an example of the new behavior, when the RBAC permission to list pods is removed, the proxy-injector logs now show the complaints coming from client-go, which were previously invisible at `info` level:

```console
time="2023-11-17T20:58:30Z" level=info msg="running version dev-94ad50cd-alpeb"
time="2023-11-17T20:58:30Z" level=info msg="starting admin server on :9995"
time="2023-11-17T20:58:30Z" level=info msg="listening at :8443"
time="2023-11-17T20:58:30Z" level=info msg="waiting for caches to sync"
time="2023-11-17T20:58:30Z" level=info msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: failed to list *v1.PartialObjectMetadata: pods is forbidden: User \"system:serviceaccount:linkerd:linkerd-proxy-injector\" cannot list resource \"pods\" in API group \"\" at the cluster scope"
time="2023-11-17T20:58:30Z" level=error msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: pods is forbidden: User \"system:serviceaccount:linkerd:linkerd-proxy-injector\" cannot list resource \"pods\" in API group \"\" at the cluster scope" error="<nil>"
time="2023-11-17T20:58:32Z" level=info msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: failed to list *v1.PartialObjectMetadata: pods is forbidden: User \"system:serviceaccount:linkerd:linkerd-proxy-injector\" cannot list resource \"pods\" in API group \"\" at the cluster scope"
time="2023-11-17T20:58:32Z" level=error msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: pods is forbidden: User \"system:serviceaccount:linkerd:linkerd-proxy-injector\" cannot list resource \"pods\" in API group \"\" at the cluster scope" error="<nil>"
time="2023-11-17T20:58:34Z" level=info msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: failed to list *v1.PartialObjectMetadata: pods is forbidden: User \"system:serviceaccount:linkerd:linkerd-proxy-injector\" cannot list resource \"pods\" in API group \"\" at the cluster scope"
time="2023-11-17T20:58:34Z" level=error msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: pods is forbidden: User \"system:serviceaccount:linkerd:linkerd-proxy-injector\" cannot list resource \"pods\" in API group \"\" at the cluster scope" error="<nil>"

time="2023-11-17T20:58:40Z" level=info msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: failed to list *v1.PartialObjectMetadata: pods is forbidden: User \"system:serviceaccount:linkerd:linkerd-proxy-injector\" cannot list resource \"pods\" in API group \"\" at the cluster scope"
time="2023-11-17T20:58:40Z" level=error msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: pods is forbidden: User \"system:serviceaccount:linkerd:linkerd-proxy-injector\" cannot list resource \"pods\" in API group \"\" at the cluster scope" error="<nil>"
time="2023-11-17T20:58:48Z" level=info msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: failed to list *v1.PartialObjectMetadata: pods is forbidden: User \"system:serviceaccount:linkerd:linkerd-proxy-injector\" cannot list resource \"pods\" in API group \"\" at the cluster scope"
time="2023-11-17T20:58:48Z" level=error msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: pods is forbidden: User \"system:serviceaccount:linkerd:linkerd-proxy-injector\" cannot list resource \"pods\" in API group \"\" at the cluster scope" error="<nil>"
time="2023-11-17T20:59:05Z" level=info msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: failed to list *v1.PartialObjectMetadata: pods is forbidden: User \"system:serviceaccount:linkerd:linkerd-proxy-injector\" cannot list resource \"pods\" in API group \"\" at the cluster scope"
time="2023-11-17T20:59:05Z" level=error msg="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: pods is forbidden: User \"system:serviceaccount:linkerd:linkerd-proxy-injector\" cannot list resource \"pods\" in API group \"\" at the cluster scope" error="<nil>"
time="2023-11-17T20:59:30Z" level=fatal msg="failed to sync caches"
```
* Add ability to configure client-go's `QPS` and `Burst` settings

## Problem and Symptoms

When a very large number of proxies request identity in a short period of time (e.g. during large node scaling events), the identity controller attempts to validate the tokens sent by the proxies at a rate surpassing client-go's default request rate threshold. This triggers client-side throttling, which delays the proxies' initialization and can even fail their startup (after a 2m timeout). The identity controller surfaces this through log entries like this:

```
time="2023-11-08T19:50:45Z" level=error msg="error validating token for web.emojivoto.serviceaccount.identity.linkerd.cluster.local: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline"
```

## Solution

Client-go's default `QPS` is 5 and `Burst` is 10. This PR exposes those settings as entries in `values.yaml`, with defaults of 100 and 200 respectively. Note this only applies to the identity controller, as it's the only controller performing direct requests to the `kube-apiserver` in a hot path. The other controllers mostly rely on informers, and their direct calls are sporadic.
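
For illustration, this is roughly where such settings plug into client-go (the `newClientSet` helper is a hypothetical sketch, not the actual Linkerd code):

```go
import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// newClientSet is a hypothetical helper showing where QPS/Burst land on the
// rest.Config used to build the Kubernetes client.
func newClientSet(kubeConfigPath string, qps float32, burst int) (*kubernetes.Clientset, error) {
	config, err := clientcmd.BuildConfigFromFlags("", kubeConfigPath)
	if err != nil {
		return nil, err
	}
	// client-go defaults to QPS=5 and Burst=10; raising them avoids
	// client-side throttling during bursts of Certify requests.
	config.QPS = qps
	config.Burst = burst
	return kubernetes.NewForConfig(config)
}
```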

## Observability

The `QPS` and `Burst` settings in use are exposed both in a log entry emitted as soon as the controller starts, and in the new metric gauges `http_client_qps` and `http_client_burst`.
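
A sketch of how gauges like these can be registered and set with the Prometheus Go client (the metric names come from this PR; the `registerClientMetrics` helper is an assumption):

```go
import "github.com/prometheus/client_golang/prometheus"

var (
	qpsGauge = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "http_client_qps",
		Help: "QPS configured for the Kubernetes API client.",
	})
	burstGauge = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "http_client_burst",
		Help: "Burst configured for the Kubernetes API client.",
	})
)

// registerClientMetrics is a hypothetical helper that records the configured
// values so they can be scraped alongside the controller's other metrics.
func registerClientMetrics(qps float32, burst int) {
	prometheus.MustRegister(qpsGauge, burstGauge)
	qpsGauge.Set(float64(qps))
	burstGauge.Set(float64(burst))
}
```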

## Testing

You can use the following k6 script, which simulates 6k calls to the `Certify` service over one minute from emojivoto's web pod. Before running this you need to:

- Put the identity.proto and [all the other proto files](https://github.com/linkerd/linkerd2-proxy-api/tree/v0.11.0/proto) in the same directory.
- Edit the [checkRequest](https://github.com/linkerd/linkerd2/blob/edge-23.11.3/pkg/identity/service.go#L266) function and add logging statements to capture the `token` and `csr` values you can use here; they will be shown as soon as a web pod starts.

```javascript
import { Client, Stream } from 'k6/experimental/grpc';
import { sleep } from 'k6';

const client = new Client();
client.load(['.'], 'identity.proto');

// This always holds:
// req_num = (1 / req_duration ) * duration * VUs
// Given req_duration (0.5s) test duration (1m) and the target req_num (6k), we
// can solve for the required VUs:
// VUs = req_num * req_duration / duration
// VUs = 6000 * 0.5 / 60 = 50
export const options = {
  scenarios: {
    identity: {
      executor: 'constant-vus',
      vus: 50,
      duration: '1m',
    },
  },
};

export default () => {
  client.connect('localhost:8080', {
    plaintext: true,
  });

  const stream = new Stream(client, 'io.linkerd.proxy.identity.Identity/Certify');

  // Replace with your own token
  let token = "ZXlKaGJHY2lPaUpTVXpJMU5pSXNJbXRwWkNJNkluQjBaV1pUZWtaNWQyVm5OMmxmTTBkV2VUTlhWSFpqTmxwSmJYRmtNMWRSVEhwNVNHWllhUzFaZDNNaWZRLmV5SmhkV1FpT2xzaWFXUmxiblJwZEhrdWJEVmtMbWx2SWwwc0ltVjRjQ0k2TVRjd01EWTRPVFk1TUN3aWFXRjBJam94TnpBd05qQXpNamt3TENKcGMzTWlPaUpvZEhSd2N6b3ZMMnQxWW1WeWJtVjBaWE11WkdWbVlYVnNkQzV6ZG1NdVkyeDFjM1JsY2k1c2IyTmhiQ0lzSW10MVltVnlibVYwWlhNdWFXOGlPbnNpYm1GdFpYTndZV05sSWpvaVpXMXZhbWwyYjNSdklpd2ljRzlrSWpwN0ltNWhiV1VpT2lKM1pXSXRPRFUxT1dJNU4yWTNZeTEwYldJNU5TSXNJblZwWkNJNklqaGlZbUV5WWpsbExXTXdOVGN0TkRnMk1TMWhNalZsTFRjelpEY3dOV1EzWmpoaU1TSjlMQ0p6WlhKMmFXTmxZV05qYjNWdWRDSTZleUp1WVcxbElqb2lkMlZpSWl3aWRXbGtJam9pWm1JelpUQXlNRE10TmpZMU55MDBOMk0xTFRoa09EUXRORGt6WXpBM1lXUTJaak0zSW4xOUxDSnVZbVlpT2pFM01EQTJNRE15T1RBc0luTjFZaUk2SW5ONWMzUmxiVHB6WlhKMmFXTmxZV05qYjNWdWREcGxiVzlxYVhadmRHODZkMlZpSW4wLnlwMzAzZVZkeHhpamxBOG1wVjFObGZKUDB3SC03RmpUQl9PcWJ3NTNPeGU1cnNTcDNNNk96VWR6OFdhYS1hcjNkVVhQR2x2QXRDRVU2RjJUN1lKUFoxVmxxOUFZZTNvV2YwOXUzOWRodUU1ZDhEX21JUl9rWDUxY193am9UcVlORHA5ZzZ4ZFJNcW9reGg3NE9GNXFjaEFhRGtENUJNZVZ6a25kUWZtVVZwME5BdTdDMTZ3UFZWSlFmNlVXRGtnYkI1SW9UQXpxSmcyWlpyNXBBY3F5enJ0WE1rRkhSWmdvYUxVam5sN1FwX0ljWm8yYzJWWk03T2QzRjIwcFZaVzJvejlOdGt3THZoSEhSMkc5WlNJQ3RHRjdhTkYwNVR5ZC1UeU1BVnZMYnM0ZFl1clRYaHNORjhQMVk4RmFuNjE4d0x6ZUVMOUkzS1BJLUctUXRUNHhWdw==";
  // Replace with your own CSR
  let csr = "MIIBWjCCAQECAQAwRjFEMEIGA1UEAxM7d2ViLmVtb2ppdm90by5zZXJ2aWNlYWNjb3VudC5pZGVudGl0eS5saW5rZXJkLmNsdXN0ZXIubG9jYWwwWTATBgcqhkjOPQIBBggqhkjOPQMBBwNCAATKjgVXu6F+WCda3Bbq2ue6m3z6OTMfQ4Vnmekmvirip/XGyi2HbzRzjARnIzGlG8wo4EfeYBtd2MBCb50kP8F8oFkwVwYJKoZIhvcNAQkOMUowSDBGBgNVHREEPzA9gjt3ZWIuZW1vaml2b3RvLnNlcnZpY2VhY2NvdW50LmlkZW50aXR5LmxpbmtlcmQuY2x1c3Rlci5sb2NhbDAKBggqhkjOPQQDAgNHADBEAiAM7aXY8MRs/EOhtPo4+PRHuiNOV+nsmNDv5lvtJt8T+QIgFP5JAq0iq7M6ShRNkRG99ZquJ3L3TtLWMNVTPvqvvUE=";

  const data = {
    identity:                     "web.emojivoto.serviceaccount.identity.linkerd.cluster.local",
    token:                        token,
    certificate_signing_request:  csr,
  };
  stream.write(data);

  // This request takes around 2ms, so this sleep will mostly determine its final duration
  sleep(0.5);
};
```

This results in the following report:

```
scenarios: (100.00%) 1 scenario, 50 max VUs, 1m30s max duration (incl. graceful stop):
           * identity: 50 looping VUs for 1m0s (gracefulStop: 30s)

     data_received................: 6.3 MB 104 kB/s
     data_sent....................: 9.4 MB 156 kB/s
     grpc_req_duration............: avg=2.14ms   min=873.93µs med=1.9ms    max=12.89ms  p(90)=3.13ms   p(95)=3.86ms
     grpc_streams.................: 6000   99.355331/s
     grpc_streams_msgs_received...: 6000   99.355331/s
     grpc_streams_msgs_sent.......: 6000   99.355331/s
     iteration_duration...........: avg=503.16ms min=500.8ms  med=502.64ms max=532.36ms p(90)=504.05ms p(95)=505.72ms
     iterations...................: 6000   99.355331/s
     vus..........................: 50     min=50      max=50
     vus_max......................: 50     min=50      max=50

running (1m00.4s), 00/50 VUs, 6000 complete and 0 interrupted iterations
```

With the old defaults (QPS=5 and Burst=10), the latencies would be much higher and the number of completed requests much lower.
@olix0r olix0r requested a review from a team as a code owner December 6, 2023 22:02
@olix0r olix0r changed the title v2.14.6 stable-2.14.6 Dec 6, 2023
Member

@mateiidavid mateiidavid left a comment

Notes look good to me. We also need to bump the chart versions for the core extensions & control plane.

Member

@alpeb alpeb left a comment

Do you mind adding the PR/issue links to the notes?

* Bump prometheus to v2.48.0

This gets rid of most CVEs:

```bash
$ grype -q prom/prometheus:v2.47.0
NAME                                                           INSTALLED             FIXED-IN  TYPE       VULNERABILITY        SEVERITY
github.com/docker/docker                                       v24.0.4+incompatible  24.0.7    go-module  GHSA-jq35-85cj-fj4p  Medium
github.com/prometheus/alertmanager                             v0.25.0               0.25.1    go-module  GHSA-v86x-5fm3-5p7j  Medium
github.com/prometheus/alertmanager                             v0.25.0                         go-module  CVE-2023-40577       Medium
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp  v0.42.0               0.44.0    go-module  GHSA-rcjv-mgp8-qvmr  High
golang.org/x/net                                               v0.12.0               0.17.0    go-module  GHSA-4374-p667-p6c8  High
golang.org/x/net                                               v0.12.0               0.17.0    go-module  GHSA-qppj-fm5r-hxr3  Medium
golang.org/x/net                                               v0.12.0               0.13.0    go-module  GHSA-2wrh-6pvc-2jm9  Medium
google.golang.org/grpc                                         v1.56.2               1.56.3    go-module  GHSA-m425-mq94-257g  High
google.golang.org/grpc                                         v1.56.2               1.56.3    go-module  GHSA-qppj-fm5r-hxr3  Medium

$ grype -q prom/prometheus:v2.48.0
NAME                      INSTALLED             FIXED-IN  TYPE       VULNERABILITY        SEVERITY
github.com/docker/docker  v24.0.6+incompatible  24.0.7    go-module  GHSA-jq35-85cj-fj4p  Medium
```
@olix0r olix0r force-pushed the ver/stable-2.14.6 branch from f5c66a5 to 75fb714 Compare December 7, 2023 14:22
Member

@alpeb alpeb left a comment

Sweet! Only thing left is the linkerd-control-plane chart version bump.

* 0a72f1f Add imagePullSecrets to the multicluster chart. (#11287)
* 284d76b service-mirror: support gateway resolving to multiple addresses (#11499)
* 64bccd9 Improve klog (client-go logs) handling (#11632)
* 6a07e2c Add ability to configure client-go's `QPS` and `Burst` settings (#11644)
* e294c78 Bump prometheus to v2.48.0 (#11633)
* b24b0e97c stable-2.14.6
@olix0r olix0r force-pushed the ver/stable-2.14.6 branch from 75fb714 to 9893216 Compare December 7, 2023 16:02
@olix0r olix0r merged commit 11b5ab8 into release/stable-2.14 Dec 7, 2023
35 checks passed
@olix0r olix0r deleted the ver/stable-2.14.6 branch December 7, 2023 18:12