Remove watch timeout to allow call staggering #1296

michaely-cb · 2024-06-12T04:40:46Z

The watch calls from multus were reconnecting to the API server every minute because of a one-minute timeout set on the rest config. Reconnecting every minute puts unnecessary load on the API server, and watches with a fixed timeout are not staggered in time, so the load is not spread evenly. For watch calls, we should delegate reconnection entirely to client-go, which picks a randomized timeout for each watch request. Watches from other components (kubelet, kube-scheduler, cilium) already delegate this way.

Reference: https://github.com/kubernetes/client-go/blob/03443e7ede0e50d195b8669103ce082e735c6b94/tools/cache/reflector.go#L52-L56

Pod watch:

// prior to this change
2024-06-07T17:49:38.929150Z -> 2024-06-07T17:50:38.929483Z -> multus-daemon/v0.0.0 (linux/amd64) kubernetes/$Format -> watch -> /api/v1/pods?allowWatchBookmarks=true&fieldSelector=spec.nodeName%3Dpytest-ci-1717689938-worker&resourceVersion=312906&timeout=8m14s&timeoutSeconds=494&watch=true -> 200
2024-06-07T17:50:38.929684Z -> 2024-06-07T17:51:38.930434Z -> multus-daemon/v0.0.0 (linux/amd64) kubernetes/$Format -> watch -> /api/v1/pods?allowWatchBookmarks=true&fieldSelector=spec.nodeName%3Dpytest-ci-1717689938-worker&resourceVersion=312906&timeout=9m12s&timeoutSeconds=552&watch=true -> 200

// with this change
2024-06-12T03:44:13.024297Z -> 2024-06-12T03:53:26.025634Z -> multus-daemon/v0.0.0 (linux/amd64) kubernetes/$Format -> watch -> /api/v1/pods?allowWatchBookmarks=true&fieldSelector=spec.nodeName%3Dpytest-ci-1718094202-worker&resourceVersion=219877&timeout=9m13s&timeoutSeconds=553&watch=true -> 200
2024-06-12T03:53:26.026164Z -> 2024-06-12T03:58:38.028134Z -> multus-daemon/v0.0.0 (linux/amd64) kubernetes/$Format -> watch -> /api/v1/pods?allowWatchBookmarks=true&fieldSelector=spec.nodeName%3Dpytest-ci-1718094202-worker&resourceVersion=219883&timeout=5m12s&timeoutSeconds=312&watch=true -> 200

NAD watch:

// prior to this change
2024-06-07T17:47:38.871806Z -> 2024-06-07T17:48:38.871976Z -> multus-daemon/v0.0.0 (linux/amd64) kubernetes/$Format -> watch -> /apis/k8s.cni.cncf.io/v1/network-attachment-definitions?allowWatchBookmarks=true&resourceVersion=310731&timeout=8m50s&timeoutSeconds=530&watch=true -> 200
2024-06-07T17:48:38.872269Z -> 2024-06-07T17:49:38.873034Z -> multus-daemon/v0.0.0 (linux/amd64) kubernetes/$Format -> watch -> /apis/k8s.cni.cncf.io/v1/network-attachment-definitions?allowWatchBookmarks=true&resourceVersion=310731&timeout=7m32s&timeoutSeconds=452&watch=true -> 200

// with this change
2024-06-13T09:36:07.248638Z -> 2024-06-13T09:44:26.253022Z -> multus-daemon/v0.0.0 (linux/amd64) kubernetes/$Format -> watch -> /apis/k8s.cni.cncf.io/v1/network-attachment-definitions?allowWatchBookmarks=true&resourceVersion=550160&timeout=8m19s&timeoutSeconds=499&watch=true -> 200
2024-06-13T09:44:26.253582Z -> 2024-06-13T09:54:11.256301Z -> multus-daemon/v0.0.0 (linux/amd64) kubernetes/$Format -> watch -> /apis/k8s.cni.cncf.io/v1/network-attachment-definitions?allowWatchBookmarks=true&resourceVersion=552157&timeout=9m45s&timeoutSeconds=585&watch=true -> 200

michaely-cb · 2024-06-12T04:48:25Z

Hi @dougbtv @s1061123. Can I get a review on this PR please? Thanks!

go.mod

dougbtv · 2024-06-20T14:00:35Z

This sure sounds like an excellent fix, and overall I'm in favor of it. Is there any way we can validate that it does indeed operate as expected by reducing the API calls, e.g. via end-to-end tests, or even manually? Thanks!

michaely-cb · 2024-06-20T14:25:31Z

> is there any way that we can validate that it does indeed operate as expected by reducing the API calls? e.g. via end to end tests, or, even manually?

I manually enabled the API server audit logs and compared the call patterns. The before and after patterns are captured in the PR description: the once-a-minute reconnections happen prior to this change and not after it. In the later calls, the time at which each watch reconnects also matches the random timeout that client-go specified in the request parameters.

michaely-cb · 2024-06-27T15:47:44Z

@dougbtv Mind taking another look and rerunning CI, please?

pkg/k8sclient/kubeconfig.go

adrianchiris

overall lgtm, left one question for my own understanding :)

github-actions · 2024-11-11T02:35:05Z

This pull request is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.

Remove watch timeout to allow call staggering

2286fba

yifeng-cerebras reviewed Jun 12, 2024

go.mod (resolved)

joykent99 approved these changes Jun 12, 2024

fix tests

c4db076

adrianchiris reviewed Aug 12, 2024

pkg/k8sclient/kubeconfig.go (resolved)

adrianchiris reviewed Aug 12, 2024

github-actions bot added the Stale label Nov 11, 2024
