KEP 1880: graduation to GA #4983

Merged 1 commit on Dec 13, 2024

Member

@aojea aojea commented Nov 27, 2024

Only one bug was opened during this time, kubernetes/kubernetes#127588, and it was caused by a copy-and-paste error.

It is available in GKE https://cloud.google.com/kubernetes-engine/docs/how-to/use-beta-apis and used in production clusters.

OSS users can use it with installers that allow setting the feature gates and enabling the beta APIs; see kops kubernetes/test-infra#33864
and this blog post about using it to solve overlapping CIDR problems: https://akarat.xyz/Changing-kubernetes-CIDR-live-on-production/

It has been tested by the community: spidernet-io/spiderpool#4089 (comment)

There is also an external blog about it https://engineering.doit.com/scaling-kubernetes-how-to-seamlessly-expand-service-ip-ranges-246f392112f8

The feature has been in beta for 2 releases, v1.31 and v1.32. It is being used in production, and there is proof of usage and testing beyond the Kubernetes project. That should be enough signal to move it to GA and avoid having a permanent beta API.

@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Nov 27, 2024
@k8s-ci-robot k8s-ci-robot added the sig/network Categorizes an issue or PR as relevant to SIG Network. label Nov 27, 2024
@k8s-ci-robot k8s-ci-robot requested a review from thockin November 27, 2024 17:52
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Nov 27, 2024
@aojea aojea mentioned this pull request Nov 28, 2024
12 tasks
@aojea
Member Author

aojea commented Nov 28, 2024

/assign @thockin @danwinship

@aojea
Member Author

aojea commented Dec 2, 2024

/assign @soltysh

for PRR

@danwinship
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 3, 2024
Contributor

@soltysh soltysh left a comment

Left several comments to address.

@@ -603,6 +586,8 @@ Files:
- test/integration/servicecidr/allocator_test.go
- test/integration/servicecidr/migration_test.go
- test/integration/servicecidr/servicecidr_test.go
- test/integration/servicecidr/feature_enable_disable_test.go
- test/integration/servicecidr/perf_test.go

##### e2e tests
Contributor

Are there more e2e tests planned? From checking the codebase I see only one currently. For starters, I'm missing an e2e test covering the GA API endpoints (see my earlier comment).

Contributor

Based on the GA criteria I'm missing all 3 elements pointed out in this update:

  • 2 examples of real-world usage
  • More rigorous forms of testing—e.g., downgrade tests and scalability tests
  • Allowing time for feedback

Member Author

2 examples of real-world usage

I didn't know how to do this: since this is an opt-in feature, it is not possible to get telemetry. I know there are GKE customers using it, and that kops is able to do it; see the description of this PR for the examples I found of people using or testing it.

More rigorous forms of testing—e.g., downgrade tests and scalability tests

This is a core feature, which means that once enabled it inherits all the existing scalability testing; upgrade/downgrade coverage is added as integration tests.

Allowing time for feedback

It went beta in 1.31 and we usually leave one release for feedback. I got good internal feedback from GKE users... I also want to avoid permanent betas.

Member Author

Should I update this information in the doc, or is the description of this PR enough? I linked several places that I think should be proof that the feature is being used in production... We generally ask for a diversity of implementations to avoid favoring vendors, but in this case the feature is core functionality that will be used everywhere.

Contributor

For posterity: I discussed this with Antonio on Slack and suggested linking the public references (kops, from what he writes above, plus a mention of GKE customers).

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 9, 2024
Contributor

@soltysh soltysh left a comment

/approve
the PRR section

- Allowing time for feedback
- The feature was beta in 1.31, it has been tested by different projects and enabled in one platform [with only one bug reported](https://github.com/kubernetes/kubernetes/issues/127588).
Contributor

This is great - thank you!

-| 1.34 | GA (there are no bitmaps running) | GA on (also delete old bitmap)|
-| 1.35 | remove feature gate | remove feature gate |
+| 1.34 | GA (there are no bitmaps running) | GA (also delete old bitmap)|
+| 1.36 | remove feature gate | GA |
Contributor

Small nit: just for visibility I'd probably add the row for 1.35, some folks might not catch that it's being skipped 😉

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aojea, danwinship, soltysh

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 13, 2024
@danwinship
Contributor

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 13, 2024
@danwinship
Contributor

/hold cancel
/lgtm

After some offline discussion, this needs a "needs action" release note warning infrastructure providers that they should install an admission hook to disable this feature if they don't want to allow it, or if their cluster contains other components that need to know all of the active service CIDRs but which haven't been updated to know about the ServiceCIDR API yet. Antonio has a sample admission webhook.

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Dec 13, 2024
@k8s-ci-robot k8s-ci-robot merged commit be6efdd into kubernetes:master Dec 13, 2024
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.33 milestone Dec 13, 2024
@pohly
Contributor

pohly commented Dec 13, 2024

Antonio has a sample admission webhook.

Could VAP be used instead? It's simpler to install than a webhook.

@aojea
Member Author

aojea commented Dec 13, 2024

Could VAP be used instead? It's simpler to install than a webhook.

I will work on documenting this properly and putting the proper guardrails in place. I will also ask apimachinery whether they prefer to use an admission controller.

@aojea
Member Author

aojea commented Dec 13, 2024

VAP to block any ServiceCIDR that is not the default

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: "servicecidrs.default"
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups:   ["networking.k8s.io"]
      apiVersions: ["v1","v1beta1"]
      operations:  ["CREATE", "UPDATE"]
      resources:   ["servicecidrs"]
  validations:
    - expression: "object.metadata.name == 'kubernetes'"
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: "servicecidrs-binding"
spec:
  policyName: "servicecidrs.default"
  validationActions: [Deny,Audit]

Tested with kind

kind-config

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
runtimeConfig:
  "api/beta" : "true"
featureGates:
  "MultiCIDRServiceAllocator": true

1. Create cluster

kind create cluster --name servicecidr --config kind-servicecidr.yaml

2. Apply the policies
3. Try to create a new ServiceCIDR

apiVersion: networking.k8s.io/v1beta1
kind: ServiceCIDR
metadata:
  name: newcidr1
spec:
  cidrs:
  - 10.96.0.0/24

it is denied

 kubectl apply -f servicecidr.yaml
The servicecidrs "newcidr1" is invalid: : ValidatingAdmissionPolicy 'servicecidrs.default' with binding 'servicecidrs-binding' denied request: failed expression: object.metadata.name == 'kubernetes'
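
As a rough illustration of what this policy decides, here is a minimal Python sketch. It is not part of the PR: the function name `admit_default_only` and the dict shape of the request object are assumptions made for the example. It only mimics the CEL expression `object.metadata.name == 'kubernetes'` that the policy evaluates on CREATE and UPDATE.

```python
def admit_default_only(obj: dict) -> bool:
    """Mimic the CEL validation: object.metadata.name == 'kubernetes'.

    `obj` is a hypothetical dict form of the ServiceCIDR being admitted.
    """
    return obj.get("metadata", {}).get("name") == "kubernetes"


# The ServiceCIDR from the test above is rejected; the default one is admitted.
new_cidr = {"metadata": {"name": "newcidr1"}, "spec": {"cidrs": ["10.96.0.0/24"]}}
default_cidr = {"metadata": {"name": "kubernetes"}}

print(admit_default_only(new_cidr))
print(admit_default_only(default_cidr))
```

The real decision is made by the API server's CEL evaluator; the sketch only shows why any ServiceCIDR other than the default is denied.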

TODO:

Allow referencing parameters, so admins can define the range of IPs available

@@ -20,18 +20,18 @@ see-also:
replaces:

# The target maturity stage in the current dev cycle for this KEP.
-stage: beta
+stage: stable
Member

I forget the details. This was beta, but off by default. So it didn't get a ton of usage. Is going straight to GA safe? Will we set the lock-to-default?

Member Author

@aojea aojea Dec 14, 2024

Yes, we can avoid setting lock-to-default. That will allow cluster admins to disable the feature gate entirely without needing to use a webhook or VAP (#4983 (comment)). cc: @danwinship

@aojea
Member Author

aojea commented Dec 14, 2024

A more complex example of a VAP that defines the allowed CIDR ranges, to control which ranges users can create and to avoid footguns such as creating overlapping IP ranges

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: "servicecidrs.default"
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups:   ["networking.k8s.io"]
      apiVersions: ["v1","v1beta1"]
      operations:  ["CREATE", "UPDATE"]
      resources:   ["servicecidrs"]
  matchConditions:
  - name: 'exclude-default-servicecidr'
    expression: "object.metadata.name != 'kubernetes'"
  variables:
  - name: allowed
    expression: "['10.96.0.0/16','2001:db8::/64']"
  validations:
  - expression: "object.spec.cidrs.all(i , variables.allowed.exists(j , cidr(j).containsCIDR(i)))"
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: "servicecidrs-binding"
spec:
  policyName: "servicecidrs.default"
  validationActions: [Deny,Audit]

Test:

apiVersion: networking.k8s.io/v1beta1
kind: ServiceCIDR
metadata:
  name: newcidr1
spec:
  cidrs:
  - 10.96.0.0/24

It is within range so it is allowed

 kubectl apply -f servicecidr.yaml
servicecidr.networking.k8s.io/newcidr1 created

apiVersion: networking.k8s.io/v1beta1
kind: ServiceCIDR
metadata:
  name: newcidr2
spec:
  cidrs:
  - 10.96.0.0/24
  - fd00:1::/64

It has one CIDR outside the allowed list, so it is denied

kubectl apply -f servicecidr2.yaml
The servicecidrs "newcidr2" is invalid: : ValidatingAdmissionPolicy 'servicecidrs.default' with binding 'servicecidrs-binding' denied request: failed expression: object.spec.cidrs.all(i , variables.allowed.exists(j , cidr(j).containsCIDR(i)))

Changing the range to an allowed one

apiVersion: networking.k8s.io/v1beta1
kind: ServiceCIDR
metadata:
  name: newcidr2
spec:
  cidrs:
  - 10.96.0.0/24
  - 2001:db8::/112

now it is allowed

 kubectl apply -f servicecidr2.yaml
servicecidr.networking.k8s.io/newcidr2 created
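
To make the containment rule easier to reason about outside a cluster, here is a minimal Python sketch that approximates the policy's `matchConditions` and `validations` using the standard `ipaddress` module. It is an illustration only, not the CEL implementation: the names `ALLOWED`, `contains_cidr` and `admit` are hypothetical, and `contains_cidr` assumes that `cidr(j).containsCIDR(i)` means "i is a subnet of j within the same IP family".

```python
import ipaddress

# Assumption: mirrors the `allowed` variable declared in the policy above.
ALLOWED = ["10.96.0.0/16", "2001:db8::/64"]


def contains_cidr(outer: str, inner: str) -> bool:
    """Approximate CEL's cidr(outer).containsCIDR(inner)."""
    outer_net = ipaddress.ip_network(outer)
    inner_net = ipaddress.ip_network(inner)
    # subnet_of() raises TypeError across IP families, so compare versions first.
    return outer_net.version == inner_net.version and inner_net.subnet_of(outer_net)


def admit(obj: dict) -> bool:
    """Return True if the (hypothetical dict form of a) ServiceCIDR would pass."""
    # matchConditions: the default ServiceCIDR is excluded from the policy.
    if obj["metadata"]["name"] == "kubernetes":
        return True
    # validations: every requested CIDR must fit inside some allowed range.
    return all(
        any(contains_cidr(allowed, cidr) for allowed in ALLOWED)
        for cidr in obj["spec"]["cidrs"]
    )


# The three test objects from this comment, as plain dicts.
in_range = {"metadata": {"name": "newcidr1"}, "spec": {"cidrs": ["10.96.0.0/24"]}}
out_of_range = {"metadata": {"name": "newcidr2"},
                "spec": {"cidrs": ["10.96.0.0/24", "fd00:1::/64"]}}
fixed = {"metadata": {"name": "newcidr2"},
         "spec": {"cidrs": ["10.96.0.0/24", "2001:db8::/112"]}}

for obj in (in_range, out_of_range, fixed):
    print(obj["metadata"]["name"], obj["spec"]["cidrs"], admit(obj))
```

Under these assumptions the sketch agrees with the kubectl results above: the first and third objects are admitted, and the one containing `fd00:1::/64` is rejected.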

Thanks @JoelSpeed for this fantastic library to handle IPs and CIDRs with CEL

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/network Categorizes an issue or PR as relevant to SIG Network. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
6 participants