Deployment e2e tests for MultiKueue #4103

Open · wants to merge 1 commit into base: main

Conversation

@Bobbins228
Contributor

Bobbins228 commented Jan 30, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR is the second part of enabling Deployments for MultiKueue, with #4034 being part 1.

Which issue(s) this PR fixes:

Part of #3802

Special notes for your reviewer:

@mimowo Seeing as Deployments do not have workload support, we are unable to create an adapter and utilize SyncJob to create a remote Deployment and sync its status that way.

On the other hand, as the Pods progress on the manager cluster, the Deployment's status is updated as if they were running on a single cluster.
Another note is a concern about which cluster the Pods are created in. In this implementation the Pods can run on any cluster and are not run as a group on a single cluster. Is this something we should be wary of?
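
For illustration, a minimal sketch of that status check, assuming the suite's existing deployment, ctx and k8sManagerClient variables:

ginkgo.By("Waiting for the Deployment status on the manager cluster to reflect the running Pods", func() {
	gomega.Eventually(func(g gomega.Gomega) {
		createdDeployment := &appsv1.Deployment{}
		// The Deployment object lives only on the manager cluster; its status should
		// converge once the pod-integration Workloads are admitted and the Pods run.
		g.Expect(k8sManagerClient.Get(ctx, client.ObjectKeyFromObject(deployment), createdDeployment)).To(gomega.Succeed())
		g.Expect(createdDeployment.Status.ReadyReplicas).To(gomega.Equal(*deployment.Spec.Replicas))
	}).Should(gomega.Succeed())
})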

If we are happy with this implementation I can remove the WIP.

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. labels Jan 30, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Bobbins228
Once this PR has been reviewed and has the lgtm label, please assign mimowo for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 30, 2025
@k8s-ci-robot
Contributor

Hi @Bobbins228. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jan 30, 2025

netlify bot commented Jan 30, 2025

Deploy Preview for kubernetes-sigs-kueue canceled.

🔨 Latest commit: 00a6bc6
🔍 Latest deploy log: https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/679cac69bfe3a10009ec2b16

@mimowo
Contributor

mimowo commented Jan 30, 2025

@Bobbins228 being able to execute the whole Deployment on a dedicated worker cluster would be great. One con I see is that scaling will probably not work.

One question that comes to my mind first: how do we discriminate which mode of operation a user wants to use (via Pods or via the full workload)? Should we discriminate based on a Deployment annotation, or do you have some alternative ideas?

@Bobbins228
Contributor Author

@mimowo

One question that comes to my mind first: how do we discriminate which mode of operation a user wants to use (via Pods or via the full workload)? Should we discriminate based on a Deployment annotation, or do you have some alternative ideas?

Can you elaborate a bit more on this point? Thanks

@mimowo
Contributor

mimowo commented Jan 30, 2025

Let me try - I might also be missing something here :) .

So, with #4034 I believe Deployments are sort of supported, with workloads at the level of Pods. On worker clusters we only have the Pods created. This is the "first" mode of operation. With this mode of operation I assume scaling the Deployment works OOTB.

The "second" mode of operation is to create a workload at the Deployment level on the management cluster. Then the entire Deployment could be copied onto the worker cluster and manage the Pods there. IIUC, in this mode of operation scaling would not work OOTB (but we may check, and add an e2e test). I suppose we would need extra code to update the Deployment on a worker.

Now, IIUC, with this PR you enable the "second" mode of operation. However, it may have a downside compared to (1). So I'm wondering if a user could still have an option to choose (1) if they need scaling? However, we could probably make scaling work with (2) as well, so maybe we don't need two modes...

@Bobbins228
Contributor Author

@mimowo
For now we are looking into mode 1; we can create a follow-on PR for mode 2 if that is acceptable?

Just off the top of my head, for the mode 2 PR we could check the owner references of Pods and ensure all Pods created through Deployments are scheduled on a specific cluster (sketched below). We could potentially allow a user the option to change which mode they want to use through a "MultiKueue" configuration parameter in the Kueue manager config. WDYT?
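
For the owner-reference idea, a hypothetical sketch (not code from this PR) of how Pods created through a Deployment could be detected:

// Hypothetical helper (imports: context, appsv1 "k8s.io/api/apps/v1",
// corev1 "k8s.io/api/core/v1", metav1 "k8s.io/apimachinery/pkg/apis/meta/v1",
// "sigs.k8s.io/controller-runtime/pkg/client").
// Walks Pod -> ReplicaSet -> Deployment via controller owner references.
func ownedByDeployment(ctx context.Context, c client.Client, pod *corev1.Pod) (bool, error) {
	owner := metav1.GetControllerOf(pod)
	if owner == nil || owner.Kind != "ReplicaSet" {
		return false, nil
	}
	rs := &appsv1.ReplicaSet{}
	if err := c.Get(ctx, client.ObjectKey{Namespace: pod.Namespace, Name: owner.Name}, rs); err != nil {
		return false, err
	}
	rsOwner := metav1.GetControllerOf(rs)
	return rsOwner != nil && rsOwner.Kind == "Deployment", nil
}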

@mimowo
Contributor

mimowo commented Jan 30, 2025

For now we are looking into mode 1,

I see, I was confused by this line https://github.com/kubernetes-sigs/kueue/pull/4103/files#diff-5c2c085d3754c50f6e899f937ef8113a2fc81685d7130b5bf88d1a37d2f26d67R58. Can you elaborate on why we need support for prebuilt workloads for Deployments to make it work in "mode 1"?

we can create a follow on PR for mode 2 if that is acceptable?

I think mode 2 would make sense too, but we need a bit more detail, so I would suggest a KEP to discuss the details, such as the configuration for mode 1 vs mode 2. It would be ideal if you could support it with some use cases you have. One thing is making sure that all Pods land on the same worker, I suppose. Also, scalability could be better, as we don't need to create the dummy Pods in the management cluster.

We could potentially allow a user the option to change which mode they want to use through a "MultiKueue" configuration parameter in the Kueue manager config. WDYT?

Yeah, I think we could start with the global config option. Making it granular per workload seems possible, but could get messy.

@Bobbins228
Contributor Author

@mimowo

I see, I was confused by this line https://github.com/kubernetes-sigs/kueue/pull/4103/files#diff-5c2c085d3754c50f6e899f937ef8113a2fc81685d7130b5bf88d1a37d2f26d67R58. Can you elaborate on why we need support for prebuilt workloads for Deployments to make it work in "mode 1"?

My bad, I have removed the added validation.

I think mode 2 would make sense too, but we may get a bit more details, so I would suggest a KEP to discuss the details, such as configuration for mode1 vs mode2. And ideal if you could support it with some use-cases you have. One thing is making sure that all Pods land on the same worker I suppose. Also, scalability could be better as we don't need to create the dummy pods in the management cluster.

A KEP would be great; I can make an issue for it.

Yeah, I think we could start with the global config option. Making it granular per workload seems possible, but could get messy.

+1, we would also need a better naming convention than "mode 1", "mode 2" :)

Updating this PR to cover just the e2e tests for mode 1. I am going to make it more robust by also including a test for scaling up the Deployment and ensuring more Workloads are created, along the lines of the sketch below.
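
A rough sketch of that scale-up step, assuming the suite's existing k8sManagerClient, deployment and managerNs variables plus the kueue and ptr imports; the replica and Workload counts are illustrative:

ginkgo.By("Scaling up the Deployment and checking that additional Workloads are created", func() {
	gomega.Eventually(func(g gomega.Gomega) {
		createdDeployment := &appsv1.Deployment{}
		g.Expect(k8sManagerClient.Get(ctx, client.ObjectKeyFromObject(deployment), createdDeployment)).To(gomega.Succeed())
		createdDeployment.Spec.Replicas = ptr.To[int32](4)
		g.Expect(k8sManagerClient.Update(ctx, createdDeployment)).To(gomega.Succeed())
	}).Should(gomega.Succeed())

	gomega.Eventually(func(g gomega.Gomega) {
		workloads := &kueue.WorkloadList{}
		g.Expect(k8sManagerClient.List(ctx, workloads, client.InNamespace(managerNs.Name))).To(gomega.Succeed())
		// In "mode 1" there is one pod-integration Workload per Deployment Pod,
		// so scaling to 4 replicas should eventually yield 4 Workloads in this namespace.
		g.Expect(workloads.Items).To(gomega.HaveLen(4))
	}).Should(gomega.Succeed())
})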

@Bobbins228 Bobbins228 changed the title [WIP] Deployment support for MultiKueue [WIP] Deployment e2e tests for MultiKueue Jan 30, 2025
@Bobbins228 Bobbins228 changed the title [WIP] Deployment e2e tests for MultiKueue Deployment e2e tests for MultiKueue Jan 30, 2025
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. and removed do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Jan 30, 2025
@mimowo
Contributor

mimowo commented Jan 30, 2025

cc @mszadkow Could you make the first pass?

@mszadkow
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 30, 2025

for _, pod := range pods.Items {
// Ensure all 4 local pods should be in "Running" phase
gomega.Eventually(func(g gomega.Gomega) {
Contributor

I think the Pods shouldn't be scheduled on the management cluster,
thus you need to observe their status on the worker.

Contributor

In order to know which worker they will land on, it's good to request resources of a certain type and amount that are available on only one of the workers.

Contributor Author

Unfortunately this does not work with Pods created through Deployments.
I tested this by setting the Deployment resources to 1.5 CPUs, expecting that all Pods would be admitted to worker1, but this was not the case.

Kueue will schedule the Pods on any cluster with the necessary quota available, per mode 1 discussed above.
Luckily there is another way to know which Pods are scheduled where, as the Pods are assigned a nodeName of either kind-worker1-control-plane or kind-worker2-control-plane on the remote clusters. This method would be a bit convoluted, so I am looking for a better option.
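
Roughly what that would look like (a sketch only, assuming the suite's k8sWorker1Client/k8sWorker2Client and that the Pod keeps the same name and namespace on the remote cluster):

// Find which worker cluster a Deployment Pod actually landed on by asking each
// worker client for it; on the remote kind clusters the scheduled Pod carries a
// nodeName like "kind-worker1-control-plane" or "kind-worker2-control-plane".
findWorkerForPod := func(key client.ObjectKey) (client.Client, string) {
	for _, wc := range []client.Client{k8sWorker1Client, k8sWorker2Client} {
		remotePod := &corev1.Pod{}
		if err := wc.Get(ctx, key, remotePod); err == nil {
			return wc, remotePod.Spec.NodeName
		}
	}
	return nil, ""
}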

})

pods := &corev1.PodList{}
gomega.Expect(k8sManagerClient.List(ctx, pods, client.InNamespace(managerNs.Namespace),
Contributor

this should happen on the worker cluster

// Local pods should be in "Running" phase
gomega.Eventually(func(g gomega.Gomega) {
createdPod := &corev1.Pod{}
g.Expect(k8sManagerClient.Get(ctx, client.ObjectKey{Namespace: pod.Namespace, Name: pod.Name}, createdPod)).To(gomega.Succeed())
Contributor

this should happen on the worker cluster

ginkgo.By("Waiting for the deployment to get status updates", func() {
gomega.Eventually(func(g gomega.Gomega) {
createdDeployment := appsv1.Deployment{}
g.Expect(k8sManagerClient.Get(ctx, client.ObjectKeyFromObject(deployment), &createdDeployment)).To(gomega.Succeed())
Contributor

that's correct, we should see the status update on the Deployment in the manager cluster

@Bobbins228
Contributor Author

@mszadkow Thanks for the review, I'll get right on it

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jan 31, 2025
@mszadkow
Contributor

I understand that there is no "finish line" for a Deployment, the Pods just keep running.
That's a new type of testing in e2e; I guess it's fine.
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 31, 2025
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: a4fd4cdbbdea1f8ffe5c30a480274251ebbc63eb

Comment on lines +347 to +352
pods = &corev1.PodList{}
gomega.Expect(k8sManagerClient.List(ctx, pods, client.InNamespace(managerNs.Namespace),
client.MatchingLabels(deployment.Spec.Selector.MatchLabels))).To(gomega.Succeed())

// Check all worker pods are in a running phase.
ensurePodWorkloadsRunning(pods, *managerNs, multiKueueAc)
Contributor

nit: actually, instead of the comment, could you wrap this in a ginkgo.By("Check all worker pods are in a running phase")?

Contributor Author

Within the function there is a ginkgo.By("Ensure Pod Workloads are created and Pods are Running on the worker cluster").
The comment is just for dev readability.

Contributor

Ah, I see, but still something feels off with the code being split outside and inside the helper function.

I would like to consider the following: move this code inside the function:

	pods = &corev1.PodList{}
	gomega.Expect(k8sManagerClient.List(ctx, pods, client.InNamespace(managerNs.Namespace),
		client.MatchingLabels(deployment.Spec.Selector.MatchLabels))).To(gomega.Succeed())

but keep the ginkgo.By outside. WDYT?

Contributor

(or make the ginkgo.By as the first line of the helper function)

Contributor Author

I like the first option better; it will make things more uniform. Thanks @mimowo
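
For reference, a sketch of that first option; the helper signature and parameter types are guesses based on the surrounding test code:

ginkgo.By("Check all worker pods are in a running phase", func() {
	ensurePodWorkloadsRunning(deployment, *managerNs, multiKueueAc)
})

// The helper now lists the Deployment's Pods itself; the ginkgo.By stays at the call site.
func ensurePodWorkloadsRunning(deployment *appsv1.Deployment, ns corev1.Namespace, ac *kueue.AdmissionCheck) {
	pods := &corev1.PodList{}
	gomega.Expect(k8sManagerClient.List(ctx, pods, client.InNamespace(ns.Name),
		client.MatchingLabels(deployment.Spec.Selector.MatchLabels))).To(gomega.Succeed())
	// ... existing per-Pod Workload and Running-phase checks on the worker cluster ...
}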

client.MatchingLabels(deployment.Spec.Selector.MatchLabels))).To(gomega.Succeed())

// Check all worker pods are in a running phase.
ensurePodWorkloadsRunning(pods, *managerNs, multiKueueAc)
Contributor

Maybe you could also add one more ginkgo.By - delete the Deployment and verify the Pods are gone?
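
For example (a sketch with the same assumptions as the rest of the test):

ginkgo.By("Deleting the Deployment and verifying the Pods are gone", func() {
	gomega.Expect(k8sManagerClient.Delete(ctx, deployment)).To(gomega.Succeed())
	gomega.Eventually(func(g gomega.Gomega) {
		pods := &corev1.PodList{}
		g.Expect(k8sManagerClient.List(ctx, pods, client.InNamespace(managerNs.Name),
			client.MatchingLabels(deployment.Spec.Selector.MatchLabels))).To(gomega.Succeed())
		g.Expect(pods.Items).To(gomega.BeEmpty())
	}).Should(gomega.Succeed())
})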

Comment on lines +887 to +892
// By checking the assigned cluster we can discern which client to use
admissionCheckMessage := workload.FindAdmissionCheck(createdLeaderWorkload.Status.AdmissionChecks, multiKueueAc.Name).Message
workerCluster := k8sWorker1Client
if strings.Contains(admissionCheckMessage, "worker2") {
workerCluster = k8sWorker2Client
}
Contributor

@mimowo Jan 31, 2025

nit: Right, this will work, but it relies on the assumption that we have only 2 workers. I would suggest introducing a helper function in the utils which gets the worker name, for reusability.

EDIT: for now maybe we can use a regex? Later I would suggest we add a dedicated API field, but that would be a separate enhancement.

Contributor Author

Regex sounds good, but in terms of pairing up the retrieved worker name with the appropriate worker client I am at a bit of a loss.

Unless we have a map of cluster names to clients that we can compare the given cluster name against?

Contributor

Ah, I see, yeah, we can add a map which we initialize along with the clients.
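
A sketch of how that could look; the map keys and the message matching below are assumptions based on the current two-worker e2e setup:

// Initialized alongside the worker clients in the e2e suite setup.
var workerClients = map[string]client.Client{
	"worker1": k8sWorker1Client,
	"worker2": k8sWorker2Client,
}

// The admission check message currently mentions the worker cluster name, so a
// simple pattern is enough for now; a dedicated API field would replace this later.
var workerNameRe = regexp.MustCompile(`worker\d+`)

// Usage: workerCluster := workerClientForAdmissionCheckMessage(admissionCheckMessage)
func workerClientForAdmissionCheckMessage(message string) client.Client {
	// Returns nil if the message does not mention a known worker.
	return workerClients[workerNameRe.FindString(message)]
}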

Comment on lines +299 to +304
pods := &corev1.PodList{}
gomega.Expect(k8sManagerClient.List(ctx, pods, client.InNamespace(managerNs.Namespace),
client.MatchingLabels(deployment.Spec.Selector.MatchLabels))).To(gomega.Succeed())

// Check all worker pods are in a running phase.
ensurePodWorkloadsRunning(pods, *managerNs, multiKueueAc)
Contributor

Also wrap this in a ginkgo.By step.

for _, pod := range pods.Items { // We want to test that all deployment pods have workloads.
createdLeaderWorkload := &kueue.Workload{}
wlLookupKey := types.NamespacedName{Name: workloadpod.GetWorkloadNameForPod(pod.Name, pod.UID), Namespace: managerNs.Name}
gomega.Eventually(func(g gomega.Gomega) {
Contributor

This Eventually does not seem needed, right?

Contributor

@mimowo left a comment

LGTM, just nits
