Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add support to VPC resource controller's CNINode CRD #2442

Merged
merged 1 commit into from
Jul 9, 2023

Conversation

haouc
Copy link
Contributor

@haouc haouc commented Jun 26, 2023

What type of PR is this?
We want to replace node labels when enabling SGP feature through VPC CNI. VPC resource controller will create the new CRD objects with one to one mapping to k8s nodes. VPC CNI is responsible to patch the CR to add features.
The resource controller's PR is aws/amazon-vpc-resource-controller-k8s#270

The tests have been done with resource controller change.

Which issue does this PR fix:

What does this PR do / Why do we need it:
We need to switch to the new CRD based implementation.

If an issue # is not available please add repro steps and logs from IPAMD/CNI showing the issue:

Testing done on this change:

  • new install
  • upgrade from older CNI 1.11.x and 1.12.x
  • downgrade to older CNI version
  • scale tests with 300 nodes

Automation added to e2e:

Will need wait for the VPC resource controller release

Will this PR introduce any new dependencies?:

Yes, depending on the new CRD

Will this break upgrades or downgrades. Has updating a running cluster been tested?:
It will not.

Does this change require updates to the CNI daemonset config files to work?:

no
Does this PR introduce any user-facing change?:

The node's label for trunk interface will no longer be added.


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@haouc haouc requested a review from a team as a code owner June 26, 2023 19:33
@haouc
Copy link
Contributor Author

haouc commented Jun 26, 2023

Note: Since this is waiting for a new release tag of VPC resource controller for an update on the CNINode CRD, it is referring to my fork repo tag by replacing at this moment. We need remove the replace before merging this PR.

Copy link
Contributor

@jdn5126 jdn5126 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change overall LGTM, no major concerns. I added a few questions and comments. For the manifest changes, can we make the aws-vpc-cni chart changes as well? Since these are only additions, they are fine to add in this PR

pkg/ipamd/ipamd.go Outdated Show resolved Hide resolved
pkg/ipamd/ipamd.go Outdated Show resolved Hide resolved
pkg/ipamd/ipamd.go Outdated Show resolved Hide resolved
pkg/ipamd/ipamd.go Outdated Show resolved Hide resolved
pkg/ipamd/ipamd.go Outdated Show resolved Hide resolved
pkg/ipamd/ipamd.go Outdated Show resolved Hide resolved
pkg/k8sapi/k8sutils.go Outdated Show resolved Hide resolved
pkg/sgpp/utils.go Outdated Show resolved Hide resolved
pkg/sgpp/utils.go Outdated Show resolved Hide resolved
@@ -30,3 +38,68 @@ func LoadEnforcingModeFromEnv() EnforcingMode {
return DefaultEnforcingMode
}
}

func AddFeatureToCNINode(ctx context.Context, k8sClient client.Client, nodeName string, featureName rc.FeatureName, featureValue string) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is not attached to any struct and is "dangling". Is it possible to attach this to a greater context, so we can avoid the complexity of passing dependencies around through method arguments (e.g. k8sClient).

Another thought -- does it make sense to decouple these two concepts:

  1. Constructs of the CNINode
  2. Application of the CNINode

Something like:

node := v1.Node{}
_, err := client.Get(ctx, &node) { ... }
cniNode := CNINodeFor(node)
client.Patch(ctx, &cniNode) { ... }

Copy link
Contributor Author

@haouc haouc Jun 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be added to ipamd.go instead of from another package. Then it can be attached to the ipamd struct.

I prefer to create the CNINode objects from VPC resource controller side since in general VPC resource controller is persisting and tracking/managing the life time of the node object from its point of view. VPC CNI will restart for many reasons and seems no need to worry about creation of this CRD object, like it doesn't care the creation of node object.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer to create the CNINode objects from VPC resource controller side since in general VPC resource controller is persisting and tracking/managing the life time of the node object from its point of view.

Huge +1. I could see taking this a step further, and taking an assumption that the CRD exists for the given node before taking any actions as a CNI. WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be nice. I am trying to find out if there is a k8s mechanism which is like a "generator" (the opposite side of finalizer) with Ownership. I think this is no such native controller to do this? In this case, we are relying on the VPC resource controller to generate the CRD object upon its owner node object's creation. Not sure if this is what you were referring to?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah there's no "generator" concept. I think of the CNINode is being owned by the CNI-agent, as it is the source of truth for the CNINode.spec. I'm not sure I see the benefit of creating the object from VPCRC, regardless of the restart strategy of VPCRC or VPCCNI.

What do you think about the following:

  • VPCCNI wakes up, and applies (create/update) the CNINode to be what it expects it to be.
  • VPCRC watches for the CNINode objects and takes action accordingly (i.e., attaching trunk eni)

Copy link
Contributor Author

@haouc haouc Jun 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the CNINode is being owned by the CNI-agent

I am not sure if this is correct since the agent will change for any reason but the CNINode should be living with the node. My preference is

  • VPC RC (sort of) manage "nodes" and their owning CNINode objects
  • VPC CNI doesn't have to worry about the CNINode creation since it should be taken care by VPC RC
  • VPC CNI shouldn't have create permission on this resource
  • VPC RC managing these objects will benefit future improvements (we can sync offline)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth a discussion, I think.

pkg/k8sapi/k8sutils.go Outdated Show resolved Hide resolved
pkg/ipamd/ipamd.go Outdated Show resolved Hide resolved
pkg/ipamd/ipamd.go Outdated Show resolved Hide resolved
pkg/utils/algo/algos.go Outdated Show resolved Hide resolved
go.mod Outdated
@@ -171,3 +173,5 @@ replace golang.org/x/crypto => golang.org/x/crypto v0.1.0
replace github.com/containernetworking/cni => github.com/containernetworking/cni v0.8.1

replace github.com/containernetworking/plugins => github.com/containernetworking/plugins v0.9.1

replace github.com/aws/amazon-vpc-resource-controller-k8s => github.com/haouc/amazon-vpc-resource-controller-k8s v1.2.3
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check out go.work files, which you can add to .gitignore and not need to edit the module: https://go.dev/doc/tutorial/workspaces

Copy link
Contributor Author

@haouc haouc Jun 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will check the doc. My understanding is the multi modules workspace in this case will keep the change local and this PR will fail in GitHub actions (won't build)? Which is fine since this replace change will be removed after the VPC RC has a new tag for CRD update and before this PR is merged. But I think I get your point.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@haouc lets please make sure PR is merged after tag is released, or other option is cut a RC release on VPC RC and use that tag here. Since it will break master builds.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, @jayanthvn. We can discuss offline. This PR has been changed to target release branch as a possible RC release.

@haouc haouc force-pushed the vpcrc-crd branch 2 times, most recently from be25a05 to 5175e74 Compare June 29, 2023 01:21
Copy link
Contributor

@ellistarn ellistarn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

version: 1.13.2
appVersion: "v1.13.2"
version: 1.13.3
appVersion: "v1.13.3"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the appVersion updated? We generally update this during release..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

- vpcresources.k8s.aws
resources:
- cninodes
verbs: ["get", "list", "patch"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's discuss on adding "patch" permission offline.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also we need to remove the update permission from node resource right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need remove node update permission. The problem is we can merge the deletion to master. I have synced with Jeff offline. We may use a feature branch instead.

@@ -138,9 +140,6 @@ const (
// vpcENIConfigLabel is used by the VPC resource controller to pick the right ENI config.
vpcENIConfigLabel = "vpc.amazonaws.com/eniConfig"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even this can be cleaned up right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cleaning work has been done in the latest update.

// Signal to VPC Resource Controller that the node is using custom networking
err := c.SetNodeLabel(ctx, vpcENIConfigLabel, eniConfigName)
// Add the feature name to CNINode of this node
err := c.AddFeatureToCNINode(ctx, rcv1alpha1.CustomNetworking, eniConfigName)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit -> something like AddFeatureToCNINodeResource or anything else maybe?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have we verified delete flow behavior (CNINode feature set/unset)? i.e, unset env variable -> terminate nodes -> new nodes launch

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't unset the features in CNINode. CNiNode itself has owner reference to the node object. When node is deleted, the CNINode will be deleted.

}
// Check that there is room for a trunk ENI to be attached:
if c.dataStore.GetENIs() >= (c.maxENI - c.unmanagedENI) {
log.Error("No slot available for a trunk ENI to be attached. Not labeling the node")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we add these 2 lines here -

podENIErrInc("askForTrunkENIIfNeeded")
log.Errorf("Failed to add SGP feature into the CNINode", err)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message has been updated in the updates.

cniNode := &rcv1alpha1.CNINode{}

if err := c.cachedK8SClient.Get(ctx, types.NamespacedName{Name: c.myNodeName}, cniNode); err != nil {
return err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel, we need to add some new metrics to track CNINode updates/deletes..you can refer here - https://github.com/aws/amazon-vpc-cni-k8s/blob/master/pkg/ipamd/ipamd.go#L187

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Client go should give us these for free

@@ -47,6 +49,7 @@ func CreateKubeClient(mapper meta.RESTMapper) (client.Client, error) {
vpcCniScheme := runtime.NewScheme()
clientgoscheme.AddToScheme(vpcCniScheme)
eniconfigscheme.AddToScheme(vpcCniScheme)
rcscheme.AddToScheme(vpcCniScheme)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have we measured the number of API server calls increase with this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are changing to use rawclient.

@haouc haouc force-pushed the vpcrc-crd branch 4 times, most recently from bacfd13 to dc0e478 Compare July 7, 2023 19:25
@jdn5126 jdn5126 changed the base branch from master to release-1.14 July 7, 2023 22:02
@haouc haouc merged commit f321fa8 into aws:release-1.14 Jul 9, 2023
4 checks passed
@haouc haouc deleted the vpcrc-crd branch July 9, 2023 17:58
jdn5126 pushed a commit to jdn5126/amazon-vpc-cni-k8s that referenced this pull request Jul 18, 2023
jdn5126 pushed a commit to jdn5126/amazon-vpc-cni-k8s that referenced this pull request Jul 24, 2023
jdn5126 added a commit that referenced this pull request Aug 22, 2023
…#2503)

* add CNINode for v1.15.0

* update ipamd event test
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants