From 71b1fd16ce90533a6308e124189a1f22537bd3d5 Mon Sep 17 00:00:00 2001 From: Mark Rossett Date: Thu, 30 Jan 2025 13:57:49 -0800 Subject: [PATCH 01/12] 5100: Adding retroactive KEP for WinDSR and WinOverlay --- keps/prod-readiness/sig-windows/5100.yaml | 6 + .../README.md | 877 ++++++++++++++++++ .../kep.yaml | 45 + 3 files changed, 928 insertions(+) create mode 100644 keps/prod-readiness/sig-windows/5100.yaml create mode 100644 keps/sig-windows/5100-windows-dsr-and-overlay-support/README.md create mode 100644 keps/sig-windows/5100-windows-dsr-and-overlay-support/kep.yaml diff --git a/keps/prod-readiness/sig-windows/5100.yaml b/keps/prod-readiness/sig-windows/5100.yaml new file mode 100644 index 00000000000..caf6d5278b7 --- /dev/null +++ b/keps/prod-readiness/sig-windows/5100.yaml @@ -0,0 +1,6 @@ +# The KEP must have an approver from the +# "prod-readiness-approvers" group +# of http://git.k8s.io/enhancements/OWNERS_ALIASES +kep-number: 5100 +beta: + approver: "" diff --git a/keps/sig-windows/5100-windows-dsr-and-overlay-support/README.md b/keps/sig-windows/5100-windows-dsr-and-overlay-support/README.md new file mode 100644 index 00000000000..984cf9dc46d --- /dev/null +++ b/keps/sig-windows/5100-windows-dsr-and-overlay-support/README.md @@ -0,0 +1,877 @@ + +# KEP-5100: [RETROACTIVE] DSR and Overlay support in Windows kube-proxy + + + + + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [User Stories (Optional)](#user-stories-optional) + - [Story 1](#story-1) + - [Story 2](#story-2) + - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [Test Plan](#test-plan) + - [Prerequisite testing updates](#prerequisite-testing-updates) + - [Unit tests](#unit-tests) + - [Integration tests](#integration-tests) + - [e2e 
tests](#e2e-tests) + - [Graduation Criteria](#graduation-criteria) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) +- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) + + +## Release Signoff Checklist + + + +Items marked with (R) are required *prior to targeting to a milestone / release*. + +- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [ ] (R) KEP approvers have approved the KEP status as `implementable` +- [ ] (R) Design details are appropriately documented +- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) + - [ ] e2e Tests for all Beta API Operations (endpoints) + - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) + - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free +- [ ] (R) Graduation criteria is in place + - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) +- [ ] (R) Production readiness review completed +- [ ] (R) Production readiness review approved +- [ ] "Implementation History" 
section is up-to-date for milestone +- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] +- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes + + + +[kubernetes.io]: https://kubernetes.io/ +[kubernetes/enhancements]: https://git.k8s.io/enhancements +[kubernetes/kubernetes]: https://git.k8s.io/kubernetes +[kubernetes/website]: https://git.k8s.io/website + +## Summary + + + +Add support for DSR (Direct Server Return) and Overlay networking mode support for Windows kube-proxy. + +Support for both of these features was added in K8s v1.14 without a KEP. +This KEP is to retroactively document the changes made to Windows kube-proxy to support these features and provide a path for promoting these features to GA. + +## Motivation + + + +DSR support was added to Windows Server 2019 as part of the May 2020 update. +DSR provides performance optimizations by allowing the return traffic routed through load balancers to bypass the load balancer and respond directly to the client; reducing load on the load balancer and also reducing overall latency. + +Overlay networking mode is a common networking mode used in Kubernetes clusters and is required by some for some important scenarios like network policy support with Calico CNI. +Adding support for overlay networking mode in Windows kube-proxy will allow users to use more CNI soluitons with Windows nodes. + +### Goals + + + +Enable DSR and overlay networking on Windows nodes running kube-proxy in Kubernetes clusters. + +### Non-Goals + + + +## Proposal + + + +DSR and Overlay networking mode support is already implemented in Windows kube-proxy and has been extensively tested in the Windows CI pipeline. +This proposal is to promote the existing implementations to GA. 
+ +### User Stories (Optional) + + + +#### Story 1 + +As a cluster administrator, I want to enable DSR functionality on Windows nodes in order to reduce load in the Host Network Service and reduce latency for client requests. + +#### Story 2 + +As a cluster administrator, I want to be able to enable network policy on Windows nodes which requires overlay networking mode support in kube-proxy for some CNI solutions. + +### Notes/Constraints/Caveats (Optional) + + + +Overlay networking mode is not compatible with dualstack networking on Windows. + +If kube-proxy is started with both overlay networking mode and dualstack networking enabled, a warning message will be logged and the IP address space will be downgraded to IPv4 only. This is existing behavior and has not caused any reported issues. + +### Risks and Mitigations + + + +Enabling DSR and overlay networking mode support in Windows kube-proxy carries very little risk. + +For DSR, the Windows Host Network Service handles all of the logic for managing network traffic; kube-proxy only needs to specify if DSR should be used when creating/syncing load balancer rules. +Additionally, DSR must be enabled with a kube-proxy command line switch (`--enable-dsr=true`); disabling DSR can be performed by redeploying kube-proxy on Windows nodes. + +Overlay networking support in Windows has been used in the Windows CI pipelines running release-informing jobs for many releases and is considered stable. + +## Design Details + + + +Since the functionality is already implemented, the design details section will cover the current implementation. + +### DSR Enablement + +DSR is enabled by passing `--enable-dsr=true` as a command line switch to the Windows kube-proxy. +Prior to GA, kube-proxy will ensure that `WinDSR=true` is specified in the feature-gates and will fail to start if DSR is enabled without that.
+ +Checks for terminating and service endpoints handle DSR traffic differently than non-DSR traffic to adhere to behavior defined in [KEP-1669: Proxy Terminating Endpoints](https://github.com/kubernetes/enhancements/issues/1669) +- Local endpoints will be skipped when determining if all endpoints for a service are terminated if DSR is enabled and service type is load balancer. +- Non-local endpoints will be skipped when considering if all endpoints for a service are non-serving if DSR is enabled and service type is load balancer. + +The flags passed to all get, create, and update load balancer HNS calls include a flag indicating whether DSR is enabled. + + +### Overlay support + +To enable overlay networking on Windows nodes, the HNS network created on the node prior to starting kube-proxy and specified by `$KUBE_NETWORK` should be of type `Overlay`. +Prior to GA, `WinOverlay=true` must be specified in the kube-proxy feature gates. +If the specified network is of type `Overlay` and the feature gate is not set, kube-proxy will log an error and fail to start. + +Additionally, in overlay networking mode, kube-proxy needs to know the source IP address of the traffic it is proxying, which is provided by setting `--source-vip=$sourceVIP` on the kube-proxy command line.
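
The startup checks described above can be sketched as follows. This is a minimal illustration, not the kube-proxy source: the `startupConfig` type, its field names, and `validateStartup` are hypothetical, and rejecting an empty source VIP in overlay mode is an assumption drawn from the statement that kube-proxy needs `--source-vip` in that mode.

```go
package main

import (
	"errors"
	"fmt"
)

// startupConfig is a hypothetical, simplified view of the settings
// discussed in this section. It is not a real kube-proxy type.
type startupConfig struct {
	EnableDSR      bool   // value of --enable-dsr
	WinDSRGate     bool   // WinDSR feature gate
	WinOverlayGate bool   // WinOverlay feature gate
	NetworkType    string // type of the HNS network named by $KUBE_NETWORK
	SourceVIP      string // value of --source-vip
}

// validateStartup returns an error for the combinations that the text
// says must prevent kube-proxy from starting.
func validateStartup(c startupConfig) error {
	if c.EnableDSR && !c.WinDSRGate {
		return errors.New("--enable-dsr=true requires the WinDSR feature gate")
	}
	if c.NetworkType == "Overlay" {
		if !c.WinOverlayGate {
			return errors.New("Overlay network requires the WinOverlay feature gate")
		}
		// Assumption: an empty source VIP is rejected in overlay mode.
		if c.SourceVIP == "" {
			return errors.New("Overlay network requires --source-vip")
		}
	}
	return nil
}

func main() {
	cfg := startupConfig{EnableDSR: true, WinDSRGate: false}
	fmt.Println(validateStartup(cfg))
}
```

Keeping the checks in one function that returns an error mirrors the "log an error and fail to start" behavior: the caller decides how to log and exit.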
+ +Creating the endpoint varies by CNI implementation and here are two examples: + +- For Flannel, the endpoint is created prior to starting kube-proxy like in this [example](https://github.com/kubernetes-sigs/sig-windows-tools/blob/3018559a4f396972a6c89b588f6b5fab030b72f6/hostprocess/flannel/kube-proxy/start.ps1#L6-L46) +- For Calico, the endpoint is created by the node agent and queried by name prior to starting kube-proxy like in this [example](https://github.com/kubernetes-sigs/sig-windows-tools/blob/3018559a4f396972a6c89b588f6b5fab030b72f6/hostprocess/calico/kube-proxy/start.ps1#L76C1-L90C2) + +Once kube-proxy is running in overlay networking mode, the source VIP used in load balancer policy rules is chosen based on the backend endpoints using the following logic: + +a) Backend endpoints are any IPs outside the cluster ==> Choose Node's IP as the source VIP +b) Backend endpoints are IP addresses of a remote node ==> Choose Node's IP as the source VIP +c) Everything else (local pods, remote pods, Node IP of current Node) ==> Choose the specified source VIP + +Everything else is handled by the Windows HNS. + +### Test Plan + + + +[x] I/we understand the owners of the involved components may require updates to +existing tests to make this code solid enough prior to committing the changes necessary +to implement this enhancement. + +##### Prerequisite testing updates + + + +##### Unit tests + + + + + +Unit tests validating overlay networking behavior exist at https://github.com/kubernetes/kubernetes/blob/master/pkg/proxy/winkernel/proxier_test.go but must run on Windows machines so coverage is not reported in ci-kubernetes-coverage-unit. + + +##### Integration tests + + + + + +Functionality described in this KEP require Windows nodes and are primariily validated with unit and e2e tests.
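
The source VIP selection rules (a) through (c) in the overlay support section can be sketched as a small function. This is an illustrative sketch only: the `backendKind` type, its constants, and `chooseSourceVIP` are hypothetical names, and how kube-proxy actually classifies a backend is not shown here.

```go
package main

import "fmt"

// backendKind is a hypothetical classification of a load balancer's
// backend endpoints, matching cases (a) to (c) in the text.
type backendKind int

const (
	backendExternalIP   backendKind = iota // (a) IP outside the cluster
	backendRemoteNodeIP                    // (b) IP address of a remote node
	backendOther                           // (c) local pod, remote pod, or current node's IP
)

// chooseSourceVIP returns the node IP for cases (a) and (b), and the
// configured source VIP (--source-vip) for everything else.
func chooseSourceVIP(kind backendKind, nodeIP, sourceVIP string) string {
	switch kind {
	case backendExternalIP, backendRemoteNodeIP:
		return nodeIP
	default:
		return sourceVIP
	}
}

func main() {
	// Example addresses are made up for illustration.
	nodeIP, sourceVIP := "10.0.0.4", "10.244.1.2"
	fmt.Println(chooseSourceVIP(backendExternalIP, nodeIP, sourceVIP))
	fmt.Println(chooseSourceVIP(backendOther, nodeIP, sourceVIP))
}
```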
+ +##### e2e tests + + + +All Windows nodes running kube-proxy in https://testgrid.k8s.io/sig-windows-master-release#capz-windows-master have DSR and overlay networking configured. + + +### Graduation Criteria + + +#### Alpha + +N/A - This feature is already implemented. + +#### Beta + +- Test passes on testgrid with WinDSR and Winoverlay enabled on Windows nodes are running regularly. +- Unit tests validating expected behavior for both DSR and overlay networking mode are added. + +#### GA + +- 2 or mroe CNI solutions support overlay networking mode for Windows nodes. + + + +### Upgrade / Downgrade Strategy + + + +### Version Skew Strategy + + + +## Production Readiness Review Questionnaire + + + +### Feature Enablement and Rollback + + + +###### How can this feature be enabled / disabled in a live cluster? + + + +- [ ] Feature gate (also fill in values in `kep.yaml`) + - Feature gate name: + - Components depending on the feature gate: +- [ ] Other + - Describe the mechanism: + - Will enabling / disabling the feature require downtime of the control + plane? + - Will enabling / disabling the feature require downtime or reprovisioning + of a node? + +###### Does enabling the feature change any default behavior? + + + +No. +For DSR, `--enable-dsr=true` must be passed as a kube-proxy command line switch to enable the functionality. +For Overlay networking mode, the + +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? + + + +###### What happens if we reenable the feature if it was previously rolled back? + +###### Are there any tests for feature enablement/disablement? + + + +### Rollout, Upgrade and Rollback Planning + + + +###### How can a rollout or rollback fail? Can it impact already running workloads? + + + +###### What specific metrics should inform a rollback? + + + +###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? 
+ + + +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? + + + +### Monitoring Requirements + + + +###### How can an operator determine if the feature is in use by workloads? + + + +###### How can someone using this feature know that it is working for their instance? + + + +- [ ] Events + - Event Reason: +- [ ] API .status + - Condition name: + - Other field: +- [ ] Other (treat as last resort) + - Details: + +###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? + + + +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? + + + +- [ ] Metrics + - Metric name: + - [Optional] Aggregation method: + - Components exposing the metric: +- [ ] Other (treat as last resort) + - Details: + +###### Are there any missing metrics that would be useful to have to improve observability of this feature? + + + +### Dependencies + + + +###### Does this feature depend on any specific services running in the cluster? + + + +### Scalability + + + +###### Will enabling / using this feature result in any new API calls? + + + +###### Will enabling / using this feature result in introducing new API types? + + + +###### Will enabling / using this feature result in any new calls to the cloud provider? + + + +###### Will enabling / using this feature result in increasing size or count of the existing API objects? + + + +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? + + + +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? + + + +###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? + + + +### Troubleshooting + + + +###### How does this feature react if the API server and/or etcd is unavailable? 
+ +###### What are other known failure modes? + + + +###### What steps should be taken if SLOs are not being met to determine the problem? + +## Implementation History + + + +## Drawbacks + + + +## Alternatives + + + +## Infrastructure Needed (Optional) + + diff --git a/keps/sig-windows/5100-windows-dsr-and-overlay-support/kep.yaml b/keps/sig-windows/5100-windows-dsr-and-overlay-support/kep.yaml new file mode 100644 index 00000000000..ec5b4ab18db --- /dev/null +++ b/keps/sig-windows/5100-windows-dsr-and-overlay-support/kep.yaml @@ -0,0 +1,45 @@ +title: DSR and Overlay support in Windows kube-proxy +kep-number: 5100 +authors: + - "@marosset" +owning-sig: sig-windows +participating-sigs: + - sig-network +status: implementable +creation-date: 2025-01-28 +reviewers: + - "@jsturtevant" + - "@mikezappa87" +approvers: + - "@jsturtevant" + - "@mikezappa87" + + +# The target maturity stage in the current dev cycle for this KEP. +# If the purpose of this KEP is to deprecate a user-visible feature +# and a Deprecated feature gates are added, they should be deprecated|disabled|removed. +stage: beta + +# The most recent milestone for which work toward delivery of this KEP has been +# done. This can be the current (upcoming) milestone, if it is being actively +# worked on. +latest-milestone: "v1.33" + +# The milestone at which this feature was, or is targeted to be, at each stage.
+milestone: + alpha: "v1.14" + beta: "v1.33" + +# The following PRR answers are required at alpha release +# List the feature gate name and the components for which it must be enabled +feature-gates: + - name: WinDSR + components: + - kube-proxy + - name: WinOverlay + components: + - kube-proxy +disable-supported: true + +# The following PRR answers are required at beta release +metrics: From 2ee204024a16ff45efc3b82e09117c9617ab05f0 Mon Sep 17 00:00:00 2001 From: Mark Rossett Date: Thu, 30 Jan 2025 15:27:40 -0800 Subject: [PATCH 02/12] PRR updates --- .../README.md | 57 ++++++++++++++++--- 1 file changed, 50 insertions(+), 7 deletions(-) diff --git a/keps/sig-windows/5100-windows-dsr-and-overlay-support/README.md b/keps/sig-windows/5100-windows-dsr-and-overlay-support/README.md index 984cf9dc46d..62e21171b89 100644 --- a/keps/sig-windows/5100-windows-dsr-and-overlay-support/README.md +++ b/keps/sig-windows/5100-windows-dsr-and-overlay-support/README.md @@ -478,6 +478,11 @@ enhancement: cluster required to make on upgrade, in order to make use of the enhancement? --> +For DSR, `--enable-dsr=true` must be passed as a kube-proxy command line switch to enable the functionality. +This means that the upgrade/downgrade strategy is to redeploy kube-proxy with the appropriate configuration. + +For overlay networking mode, the entire cluster must be configured for overlay networking, so it is not possible to upgrade / downgrade this functionality on a per-node basis. + ### Version Skew Strategy +N/A - As long as all nodes are configured for overlay networking mode, there is no version skew strategy required since networking APIs are not changing.
+ ## Production Readiness Review Questionnaire -- [ ] Feature gate (also fill in values in `kep.yaml`) - - Feature gate name: - - Components depending on the feature gate: -- [ ] Other +For DSR support: + +- [x] Feature gate (also fill in values in `kep.yaml`) + - Feature gate name: WinDSR + - Components depending on the feature gate: kube-proxy +- [x] Other + - Describe the mechanism: DSR is enabled by passing `--enable-dsr=true` as a command line switch to the Windows kube-proxy. + - Will enabling / disabling the feature require downtime of the control + plane? No + - Will enabling / disabling the feature require downtime or reprovisioning + of a node? Yes, there will be a slight period where network traffic might not be routed correctly while kube-proxy is restarted. + Kube-proxy rules will be re-synced with/without DSR support when kube-proxy is starting up. + Nodes that handle network traffic should be drained before toggling DSR support. + +For overlay networking mode: + +- [x] Feature gate (also fill in values in `kep.yaml`) + - Feature gate name: WinOverlay + - Components depending on the feature gate: kube-proxy +- [x] Other - Describe the mechanism: - Will enabling / disabling the feature require downtime of the control - plane? + plane? Yes and no - The HNS network used by kube-proxy must be re-created with the correct type before starting kube-proxy, which can disrupt network traffic. In addition, all nodes in a cluster must use the same network type, so it is not possible to switch between overlay and bridge networking on a per-node basis. - Will enabling / disabling the feature require downtime or reprovisioning - of a node? + of a node? See above. ###### Does enabling the feature change any default behavior? @@ -554,7 +577,7 @@ automations, so be extremely careful here. No. For DSR, `--enable-dsr=true` must be passed as a kube-proxy command line switch to enable the functionality.
-For Overlay networking mode, the +For overlay networking support, behavior changes only occur if the HNS network used by kube-proxy is of type `Overlay` which would only be done intentionally as part of joining nodes to a cluster. ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? @@ -569,8 +592,14 @@ feature. NOTE: Also set `disable-supported` to `true` or `false` in `kep.yaml`. --> +For DSR, yes, DSR can be disabled by passing `--enable-dsr=false` as a kube-proxy command line switch and restarting kube-proxy. + +For Overlay, no, overlay networking mode cannot be disabled on a per-node basis. All nodes in a cluster must use the same network type so it is not possible to switch between overlay and bridge networking on a per-node basis. + ###### What happens if we reenable the feature if it was previously rolled back? +For DSR, kube-proxy should resync HNS rules and start using DSR again. + ###### Are there any tests for feature enablement/disablement? +For overlay, no, because the feature requires the cluster to be configured for overlay networking mode and cannot be enabled on a per-node basis. + +For DSR, no, but they can be added. + ### Rollout, Upgrade and Rollback Planning +For DSR a rollout or rollback shoudl not fail. Nodes can operator with DSR enabled or disabled per node in a cluster. + +For overlay networking mode support, a rollout can fail if the CNI configuration for the node and kube-proxy configuration are not in sync. This would cause nodes to never go into the Ready state. + ###### What specific metrics should inform a rollback? +Node ready state should be monitored to ensure nodes job the cluster and are properly configured to start running pods. + ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? +For DSR support yes, manual verification was done to ensure that DSR can be enabled and disabled on a node.
+ ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? +No + ### Monitoring Requirements -For DSR a rollout or rollback shoudl not fail. Nodes can operator with DSR enabled or disabled per node in a cluster. +For DSR a rollout or rollback should not fail. Nodes can operator with DSR enabled or disabled per node in a cluster. For overlay networking mode support, a rollout can fail if the CNI configuration for the node and kube-proxy configuration are not in sync. This would cause nodes to never go into the Ready state. diff --git a/keps/sig-windows/5100-windows-dsr-and-overlay-support/kep.yaml b/keps/sig-windows/5100-windows-dsr-and-overlay-support/kep.yaml index ec5b4ab18db..b1f043f8aa2 100644 --- a/keps/sig-windows/5100-windows-dsr-and-overlay-support/kep.yaml +++ b/keps/sig-windows/5100-windows-dsr-and-overlay-support/kep.yaml @@ -43,3 +43,4 @@ disable-supported: true # The following PRR answers are required at beta release metrics: + - "N/A" \ No newline at end of file From f8b9150bfa508280d143226ca99b05e0ae6db4be Mon Sep 17 00:00:00 2001 From: Mark Rossett Date: Tue, 4 Feb 2025 15:34:58 -0800 Subject: [PATCH 04/12] pr feedback --- .../5100-windows-dsr-and-overlay-support/README.md | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/keps/sig-windows/5100-windows-dsr-and-overlay-support/README.md b/keps/sig-windows/5100-windows-dsr-and-overlay-support/README.md index 64daa43acb8..9a05971fd40 100644 --- a/keps/sig-windows/5100-windows-dsr-and-overlay-support/README.md +++ b/keps/sig-windows/5100-windows-dsr-and-overlay-support/README.md @@ -196,6 +196,7 @@ demonstrate the interest in a KEP within the wider Kubernetes community. DSR support was added to Windows Server 2019 as part of the May 2020 update. 
DSR provides performance optimizations by allowing the return traffic routed through load balancers to bypass the load balancer and respond directly to the client; reducing load on the load balancer and also reducing overall latency. +More information on DSR on Windows can be found [here](https://techcommunity.microsoft.com/blog/networkingblog/direct-server-return-dsr-in-a-nutshell/693710). Overlay networking mode is a common networking mode used in Kubernetes clusters and is required by some for some important scenarios like network policy support with Calico CNI. Adding support for overlay networking mode in Windows kube-proxy will allow users to use more CNI soluitons with Windows nodes. @@ -230,6 +231,9 @@ nitty-gritty. DSR and Overlay networking mode support is already implemented in Windows kube-proxy and has been extensively tested in the Windows CI pipeline. This proposal is to promote the existing implementations to GA. +Additionally, DSR support on Windows is supported on both EKS and AKS. +Both DSR and overlay networking support have been used in the Windows CI pipelines running release-informing + ### User Stories (Optional) +If configured for use, both DSR and overlay networking will be used by any workloads that communicate with other pods/services in the cluster. + ###### How can someone using this feature know that it is working for their instance? +N/A + ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? +No + ### Dependencies +DNS and CNI solutions must be deployed in the cluster. + ### Scalability +No + ###### Will enabling / using this feature result in introducing new API types? +No + ###### Will enabling / using this feature result in any new calls to the cloud provider? +No + ###### Will enabling / using this feature result in increasing size or count of the existing API objects? 
+No + ###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? +No + ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? +Enabling DSR will increase the number of IP addresses in use on each node by 1 for the VIP used to route return traffic. + ###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? +No + ### Troubleshooting -For DSR a rollout or rollback should not fail. Nodes can operator with DSR enabled or disabled per node in a cluster. +For DSR a rollout or rollback should not fail. Nodes can operate with DSR enabled or disabled per node in a cluster. For overlay networking mode support, a rollout can fail if the CNI configuration for the node and kube-proxy configuration are not in sync. This would cause nodes to never go into the Ready state. @@ -657,7 +657,7 @@ What signals should users be paying attention to when the feature is young that might indicate a serious problem? --> -Node ready state should be monitored to ensure nodes job the cluster and are properly configured to start running pods. +Node ready state should be monitored to ensure nodes join the cluster and are properly configured to start running pods. ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? 
From 23e870e3936ab4b8f1ca707cc6b7fd6c0bd9b3bb Mon Sep 17 00:00:00 2001 From: Mark Rossetti Date: Mon, 10 Feb 2025 16:40:37 -0800 Subject: [PATCH 07/12] Filled out missing PRR section and misc updates Signed-off-by: Mark Rossetti --- keps/prod-readiness/sig-windows/5100.yaml | 2 +- .../README.md | 38 +++++++++++++++---- .../kep.yaml | 2 +- 3 files changed, 33 insertions(+), 9 deletions(-) diff --git a/keps/prod-readiness/sig-windows/5100.yaml b/keps/prod-readiness/sig-windows/5100.yaml index 8756496819f..3b3ef796083 100644 --- a/keps/prod-readiness/sig-windows/5100.yaml +++ b/keps/prod-readiness/sig-windows/5100.yaml @@ -5,4 +5,4 @@ kep-number: 5100 alpha: approver: "" beta: - approver: "@TBD" + approver: "@soltysh" diff --git a/keps/sig-windows/5100-windows-dsr-and-overlay-support/README.md b/keps/sig-windows/5100-windows-dsr-and-overlay-support/README.md index 04e1ec52bc2..b0ff12dae16 100644 --- a/keps/sig-windows/5100-windows-dsr-and-overlay-support/README.md +++ b/keps/sig-windows/5100-windows-dsr-and-overlay-support/README.md @@ -133,15 +133,15 @@ checklist items _must_ be updated for the enhancement to be released. Items marked with (R) are required *prior to targeting to a milestone / release*. 
-- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) -- [ ] (R) KEP approvers have approved the KEP status as `implementable` -- [ ] (R) Design details are appropriately documented -- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) +- [X] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [X] (R) KEP approvers have approved the KEP status as `implementable` +- [X] (R) Design details are appropriately documented +- [X] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) - [ ] e2e Tests for all Beta API Operations (endpoints) - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free -- [ ] (R) Graduation criteria is in place - - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) +- [X] (R) Graduation criteria is in place + - [X] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) - [ ] (R) Production readiness review completed - [ ] (R) Production readiness review approved - [ ] "Implementation History" section is up-to-date for milestone @@ -394,7 +394,8 @@ For Beta and GA, add links to added tests together with links to k8s-triage for https://storage.googleapis.com/k8s-triage/index.html --> -Functionality described in this KEP require Windows nodes and are 
primariily validated with unit and e2e tests. +Functionality described in this KEP requires Windows nodes and is primarily validated with unit and e2e tests. +The Kubernetes project does not currently have support for running integration tests for Windows-specific code paths. ##### e2e tests @@ -669,6 +670,17 @@ are missing a bunch of machinery and tooling and can't do that now. For DSR support yes, manual verification was done to ensure that DSR can be enabled and disabled on a node. +The steps for the manual validation were as follows: + +- Create a cluster with 1 Linux control plane node and 2 Windows worker nodes. +- Deploy a kube-proxy DaemonSet with `--feature-gates=WinDSR=true` and `--enable-dsr=true` to Windows worker nodes. +- Deploy IIS (Internet Information Services) on both Windows worker nodes and expose the service with a LoadBalancer service. +- Once the service IP became available, test that the service is reachable from each Windows node and outside of the cluster. +- Redeploy the kube-proxy DaemonSet with `--enable-dsr=false` to Windows worker nodes. +- Wait for kube-proxy to start and test that the service is still reachable from each Windows node and outside of the cluster. +- Redeploy the kube-proxy DaemonSet with `--enable-dsr=true` to Windows worker nodes. +- Wait for kube-proxy to start and test that the service is still reachable from each Windows node and outside of the cluster. + ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? +We have not observed any additional failure modes with DSR or overlay networking mode support on Windows nodes. + ###### What steps should be taken if SLOs are not being met to determine the problem?
## Implementation History @@ -928,12 +944,18 @@ Major milestones might include: - when the KEP was retired or superseded --> +- **2019-02-20** - DSR and overlay networking mode support added to Windows kube-proxy (k/k PR [#70896](https://github.com/kubernetes/kubernetes/pull/70896)) +- **2025-01-28** - [KEP #5100](https://github.com/kubernetes/enhancements/issues/5100) created to document the changes made to Windows kube-proxy to support DSR and overlay networking mode and provide a path for promoting these features to GA. + ## Drawbacks +The functionality described in this KEP is already implemented and used by various cloud providers so there are no drawbacks to implementing it. +The drawbacks for not progressing the features to GA are that this functionality may get removed from kube-proxy in the future which would result in Windows not being able to support some CNI solutions (Calico networking with network policy support) and not being able to take advantage of DSR performance optimizations. + ## Alternatives +This functionality has already merged into k/k so other alternatives have not been considered. + ## Infrastructure Needed (Optional) -Unit tests validating overlay networking behavior exist at https://github.com/kubernetes/kubernetes/blob/master/pkg/proxy/winkernel/proxier_test.go but must run on Windows machines so coverage is not reported in ci-kubernetes-coverage-unit. +Kube-proxy for Windows must run on Windows machines so coverage is not reported in ci-kubernetes-coverage-unit. +This coverage data was collected manually on a Windows Server 2022 machine: + +- k8s.io/kubernetes/pkg/proxy/winkernel: 2025-02-11 - 58.8% of statements ##### Integration tests @@ -450,6 +453,7 @@ N/A - This feature is already implemented. - Test passes on testgrid with WinDSR and Winoverlay enabled on Windows nodes are running regularly. - Unit tests validating expected behavior for both DSR and overlay networking mode are added.
+  - For DSR, unit tests validating that the feature gate is set correctly and that the correct flags are passed to HNS calls will also be added.
 
 #### GA
 
@@ -627,7 +631,8 @@ https://github.com/kubernetes/kubernetes/pull/97058/files#diff-7826f7adbc1996a05
 
 For overlay, no, because the feature requires the cluster to be configured for overlay networking mode and cannot be enabled on a per-node basis.
 
-For DSR, no, but they can be added.
+For DSR, unit tests will be added to validate that DSR is enabled and disabled correctly and that the correct flags are passed to HNS calls for each case.
+These will be required for the feature to move to beta.
 
 ### Rollout, Upgrade and Rollback Planning
 
@@ -793,6 +798,9 @@ and creating new ones, as well as about cluster-level services (e.g. DNS):
 
 DNS and CNI solutions must be deployed in the cluster.
 
+Both DSR and overlay networking modes are supported on all patch versions of Windows Server 2022 and Windows Server 2025.
+DSR requires Windows Server 2019 with May 2020 updates (or later).
+
 ### Scalability
 
+A troubleshooting guide for general Windows networking issues can be found at https://learn.microsoft.com/en-us/troubleshoot/windows-server/software-defined-networking/troubleshoot-windows-server-software-defined-networking-stack
+
+https://github.com/microsoft/SDN/ contains some additional troubleshooting scripts to collect detailed information and can help in troubleshooting
+- https://github.com/microsoft/SDN/blob/master/Kubernetes/windows/hns.v2.psm1 is a powershell module with cmdlets for inspecting HNS policies and endpoints
+- https://github.com/microsoft/SDN/blob/master/Kubernetes/windows/helper.psm1 contains useful helper functions for troubleshooting
+- https://github.com/microsoft/SDN/tree/master/Kubernetes/windows/debug contains various powershell scripts for enabling tracing, collectings stats and perf counterd, starting packet captures, etc
+
+Troubleshooting issues with Direct Server Return (DSR) on Windows:
+
+- Ensure that the kube-proxy command line switch `--enable-dsr=true` is set and `--feature-gates=WinDSR=true` is set.
+- Inspect kube-proxy logs for any warnings or errors
+- If everything looks correct, log onto the node and inspect the HNS rules to ensure DSR is enabled for the load balancer rules.
+  - Log onto the node and use `hnsdiag.exe list loadbalancers -d` to list all the load balancers and details about their rules.
+    You should see `IsDSR:true` for load balancer policies proxied by kube-proxy.
+  - You can use `hnsdiag.exe` to get detailed infromation about local networks and endpoints in addition to loadbalancers.
+- If you are still having issues create an issue at https://github.com/microsoft/windows-containers
+
+Troubleshooting issues with overlay networking mode on Windows:
+
+- Ensure that the CNI solution has either created a HNS network of type `Overlay` or that instructions provided by the CNI solution have been followed to create the network.
+- Ensure that the name of the network created above is passed to kube-proxy with the `$Env:KUBE_NETWORK` environment variable.
+- Check kube-proxy logs for any warnings or errors.
+- If everything looks correct, log onto the node and inspect the HNS rules to ensure that the source VIP is being used correctly.
+  - Log onto the node and use `hnsdiag.exe list loadbalancers -d` to list all the load balancers and details about their rules.
+    You should see the source VIP being used for load balancer policies proxied by kube-proxy.
+  - You can use `hnsdiag.exe` to get detailed infromation about local networks and endpoints in addition to loadbalancers.
+- If you are still having issues create an issue at https://github.com/microsoft/windows-containers
+
 ###### How does this feature react if the API server and/or etcd is unavailable?
 
 This feature does not change the functionality of kube-proxy or other Kubernetes components if the API server or etcd is unavailable. Kube-proxy would retain the existing behavior if the API server or etcd is unavailable, which would result in new Pod and Service endpoints not routing correctly on the nodes.
From 5944887a2ab705ebc8ea7b7ff4778af9059b5ce5 Mon Sep 17 00:00:00 2001
From: Mark Rossett
Date: Tue, 11 Feb 2025 15:50:41 -0800
Subject: [PATCH 11/12] spelling

---
 .../5100-windows-dsr-and-overlay-support/README.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/keps/sig-windows/5100-windows-dsr-and-overlay-support/README.md b/keps/sig-windows/5100-windows-dsr-and-overlay-support/README.md
index 835ef6f3888..17baa3d4fe5 100644
--- a/keps/sig-windows/5100-windows-dsr-and-overlay-support/README.md
+++ b/keps/sig-windows/5100-windows-dsr-and-overlay-support/README.md
@@ -921,7 +921,7 @@ A troubleshooting guide for general Windows networking issues can be found at ht
 https://github.com/microsoft/SDN/ contains some additional troubleshooting scripts to collect detailed information and can help in troubleshooting
 - https://github.com/microsoft/SDN/blob/master/Kubernetes/windows/hns.v2.psm1 is a powershell module with cmdlets for inspecting HNS policies and endpoints
 - https://github.com/microsoft/SDN/blob/master/Kubernetes/windows/helper.psm1 contains useful helper functions for troubleshooting
-- https://github.com/microsoft/SDN/tree/master/Kubernetes/windows/debug contains various powershell scripts for enabling tracing, collectings stats and perf counterd, starting packet captures, etc
+- https://github.com/microsoft/SDN/tree/master/Kubernetes/windows/debug contains various powershell scripts for enabling tracing, collecting stats and perf counters, starting packet captures, etc
 
 Troubleshooting issues with Direct Server Return (DSR) on Windows:
 
@@ -930,7 +930,7 @@ Troubleshooting issues with Direct Server Return (DSR) on Windows:
 - If everything looks correct, log onto the node and inspect the HNS rules to ensure DSR is enabled for the load balancer rules.
   - Log onto the node and use `hnsdiag.exe list loadbalancers -d` to list all the load balancers and details about their rules.
     You should see `IsDSR:true` for load balancer policies proxied by kube-proxy.
-  - You can use `hnsdiag.exe` to get detailed infromation about local networks and endpoints in addition to loadbalancers.
+  - You can use `hnsdiag.exe` to get detailed information about local networks and endpoints in addition to loadbalancers.
 - If you are still having issues create an issue at https://github.com/microsoft/windows-containers
 
 Troubleshooting issues with overlay networking mode on Windows:
@@ -941,7 +941,7 @@ Troubleshooting issues with overlay networking mode on Windows:
 - If everything looks correct, log onto the node and inspect the HNS rules to ensure that the source VIP is being used correctly.
   - Log onto the node and use `hnsdiag.exe list loadbalancers -d` to list all the load balancers and details about their rules.
     You should see the source VIP being used for load balancer policies proxied by kube-proxy.
-  - You can use `hnsdiag.exe` to get detailed infromation about local networks and endpoints in addition to loadbalancers.
+  - You can use `hnsdiag.exe` to get detailed information about local networks and endpoints in addition to loadbalancers.
 - If you are still having issues create an issue at https://github.com/microsoft/windows-containers
 
 ###### How does this feature react if the API server and/or etcd is unavailable?
From b85c91dc724238acf99b087e71c07030b0b66e79 Mon Sep 17 00:00:00 2001
From: Mark Rossett
Date: Tue, 11 Feb 2025 15:51:41 -0800
Subject: [PATCH 12/12] fixup

---
 keps/sig-windows/5100-windows-dsr-and-overlay-support/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/keps/sig-windows/5100-windows-dsr-and-overlay-support/README.md b/keps/sig-windows/5100-windows-dsr-and-overlay-support/README.md
index 17baa3d4fe5..a7da98a75ff 100644
--- a/keps/sig-windows/5100-windows-dsr-and-overlay-support/README.md
+++ b/keps/sig-windows/5100-windows-dsr-and-overlay-support/README.md
@@ -262,7 +262,7 @@ This might be a good place to talk about core concepts and how they relate.
 
 Overlay networking mode is not compatible with dualstack networking on Windows.
-If kube-proxy is started with both overlay networking mode and dualstack networking enabled, a warning message will be added and ip address space with be downgraded to ipv4 only. This is existing behavior and has not caused and any reported issues.
+If kube-proxy is started with both overlay networking mode and dualstack networking enabled, a warning message will be added and ip address space will be downgraded to ipv4 only. This is existing behavior and has not caused any reported issues.
 
 ### Risks and Mitigations