
Add e2e tests for manageJobsWithoutQueueName #4112

Open · wants to merge 1 commit into main from managejobswithoutqueuename-e2e

Conversation

@kaisoz (Contributor) commented Jan 30, 2025

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

This PR adds e2e tests for the manageJobsWithoutQueueName: true configuration. This is one of the tasks required to close #3767

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

The test suite sets manageJobsWithoutQueueName: true during the setup phase of the Describe block (just before running the test specs). It then initiates a rollout of the controller Deployment to ensure that the new configuration is picked up by the pods. To confirm that the rollout has started, the setup function waits for the Deployment to report the condition Available=False.

To guarantee that this condition is reached, the rollout strategy is set to Recreate, so the pods are terminated before being recreated, which briefly drives the Available condition to False (see the DeploymentConditionType docs).
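As an illustrative sketch of the strategy change described above (not the PR's actual code; the namespace and Deployment name are the upstream defaults, and the helper name is made up):

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// SetRecreateStrategy switches the controller Deployment to the Recreate
// strategy, so all pods are terminated before new ones are created and the
// Available condition flips to False during the rollout.
func SetRecreateStrategy(ctx context.Context, c client.Client) error {
	deployment := &appsv1.Deployment{}
	key := types.NamespacedName{Namespace: "kueue-system", Name: "kueue-controller-manager"}
	if err := c.Get(ctx, key, deployment); err != nil {
		return err
	}
	deployment.Spec.Strategy = appsv1.DeploymentStrategy{Type: appsv1.RecreateDeploymentStrategyType}
	return c.Update(ctx, deployment)
}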

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. labels Jan 30, 2025
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 30, 2025
netlify bot commented Jan 30, 2025

Deploy Preview for kubernetes-sigs-kueue canceled.

🔨 Latest commit: 7b42d2f
🔍 Latest deploy log: https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/679c6e6a62b88c0008ae4d74

@kaisoz (Contributor, Author) commented Jan 30, 2025

/assign @mimowo

PTAL! Thanks! 😊

@kaisoz force-pushed the managejobswithoutqueuename-e2e branch from 8cbae20 to 49bd6c4 on January 30, 2025 at 23:39
@k8s-ci-robot (Contributor) commented

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kaisoz
Once this PR has been reviewed and has the lgtm label, please ask for approval from mimowo. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kaisoz force-pushed the managejobswithoutqueuename-e2e branch from 49bd6c4 to 7b42d2f on January 31, 2025 at 06:32
@mimowo (Contributor) left a comment

Thank you for the PR! The new testing suite will be great for making sure the configuration works; it is really tricky and has required lots of manual testing until now.

My main hesitation is about the way we run the new suite; as pointed out in the comment, I would be leaning towards a dedicated CI job. However, let me collect more feedback. I'm OK with approaches (2.) or (3.) (from the comment) as an interim step.

@mbobrovskyi @mszadkow @dgrove-oss PTAL

func ApplyKueueConfiguration(ctx context.Context, k8sClient client.Client, kueueCfg *configapi.Configuration) {
configMap := &corev1.ConfigMap{}
kcmKey := types.NamespacedName{Namespace: "kueue-system", Name: "kueue-manager-config"}
config, _ := yaml.Marshal(kueueCfg)

don't silence errors
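For illustration, a sketch of the helper with the error surfaced instead of dropped (the ConfigMap data key and the Get/Update tail are assumptions, since the quoted snippet is truncated; Gomega assertions work because the helper runs under a Ginkgo spec):

func ApplyKueueConfiguration(ctx context.Context, k8sClient client.Client, kueueCfg *configapi.Configuration) {
	configMap := &corev1.ConfigMap{}
	kcmKey := types.NamespacedName{Namespace: "kueue-system", Name: "kueue-manager-config"}
	config, err := yaml.Marshal(kueueCfg)
	gomega.Expect(err).NotTo(gomega.HaveOccurred()) // fail the spec instead of silencing the error
	gomega.Expect(k8sClient.Get(ctx, kcmKey, configMap)).To(gomega.Succeed())
	configMap.Data["controller_manager_config.yaml"] = string(config) // assumed data key
	gomega.Expect(k8sClient.Update(ctx, configMap)).To(gomega.Succeed())
}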

Comment on lines +125 to +129
// Ensure that the rollout has started by waiting for the deployment to be unavailable
g.Expect(deployment.Status.Conditions).To(gomega.ContainElement(gomega.BeComparableTo(
appsv1.DeploymentCondition{Type: appsv1.DeploymentAvailable, Status: corev1.ConditionFalse},
cmpopts.IgnoreFields(appsv1.DeploymentCondition{}, "Reason", "Message", "LastUpdateTime", "LastTransitionTime")),
))

I would suggest not checking Available=False, because it could get racy if the restart takes little time.
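A sketch of a less racy alternative: rather than asserting the transient Available=False state, wait until the Deployment controller has observed the new generation and the replacement pods are available again (key and deployment are assumed from the surrounding test setup):

gomega.Eventually(func(g gomega.Gomega) {
	g.Expect(k8sClient.Get(ctx, key, deployment)).To(gomega.Succeed())
	// The Deployment controller has observed the updated spec...
	g.Expect(deployment.Status.ObservedGeneration).To(gomega.BeNumerically(">=", deployment.Generation))
	// ...and all replicas run the new template and are available again.
	g.Expect(deployment.Status.UpdatedReplicas).To(gomega.Equal(*deployment.Spec.Replicas))
	g.Expect(deployment.Status.AvailableReplicas).To(gomega.Equal(*deployment.Spec.Replicas))
}, util.LongTimeout, util.Interval).Should(gomega.Succeed())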

"sigs.k8s.io/kueue/test/util"
)

var _ = ginkgo.Describe("ManageJobsWithoutQueueName", ginkgo.Ordered, func() {

We need to make sure this is part of a suite.
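For example, a standard Ginkgo entry point that the new specs could be compiled into (a sketch; the test and suite names are illustrative):

import (
	"testing"

	"github.com/onsi/ginkgo/v2"
	"github.com/onsi/gomega"
)

// TestCustomConfigsE2E wires the specs in this package into a runnable suite.
func TestCustomConfigsE2E(t *testing.T) {
	gomega.RegisterFailHandler(ginkgo.Fail)
	ginkgo.RunSpecs(t, "CustomConfigs e2e suite")
}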

config := util.GetKueueConfiguration(ctx, k8sClient)
config.ManageJobsWithoutQueueName = true
util.ApplyKueueConfiguration(ctx, k8sClient, config)
util.RestartKueueController(ctx, k8sClient)

Actually, I'm not sure what the best approach is here: restarting Kueue adds overhead, and the tests need to be sequential.

Generally, I see three approaches to test different configurations:

  1. Have a dedicated CI job (as we currently have for singlecluster and multicluster)
  2. Have two e2e targets which create independent kind clusters, also sequentially. Here is a good example of how the first setup for kueue and multikueue was done: https://github.com/kubernetes-sigs/kueue/pull/1360/files#diff-76ed074a9305c04054cdebb9e9aad2d818052b07091de1f20cad0bbac34ffb52
  3. Restart Kueue (as done here)

Also, (3.) is not great when we want to run the tests in isolation, because it would always restart Kueue on the cluster, which adds extra overhead.

So, my preference is ultimately (1.) I think, with (2.) and potentially (3.) as interim steps. Let me check if others have an opinion @mbobrovskyi @mszadkow @dgrove-oss @PBundyra ?


I think (1.) + (2.) is the better solution. We just need to create a separate kustomize file for this case with a patch changing the manageJobsWithoutQueueName field.

@mimowo (Contributor) commented Jan 31, 2025

OTOH we may want to test more e2e configurations in the future.

So I have an idea to mix the strategies and have a new suite, customconfigs, where we could test custom configurations and restart Kueue between them. I believe in practice each configuration will have only a handful of tests specific to it, and the main body of tests will still be in the singlecluster suite. This way we would avoid creating a CI job per config, while also offloading the main run. WDYT?

For now I would suggest putting the e2e tests in this PR under e2e/customconfig/managedwithoutqueuename. WDYT?

@@ -151,6 +154,15 @@ run-test-tas-e2e-%: FORCE
./hack/e2e-test.sh
$(PROJECT_DIR)/bin/ginkgo-top -i $(ARTIFACTS)/$@/e2e.json > $(ARTIFACTS)/$@/e2e-top.yaml

run-test-queuename-e2e-%: K8S_VERSION = $(@:run-test-queuename-e2e-%=%)
run-test-queuename-e2e-%: FORCE
@echo Running e2e for k8s ${K8S_VERSION}

Suggested change
@echo Running e2e for k8s ${K8S_VERSION}
@echo Running without queue name e2e for k8s ${K8S_VERSION}

@@ -151,6 +154,15 @@ run-test-tas-e2e-%: FORCE
./hack/e2e-test.sh
$(PROJECT_DIR)/bin/ginkgo-top -i $(ARTIFACTS)/$@/e2e.json > $(ARTIFACTS)/$@/e2e-top.yaml

run-test-queuename-e2e-%: K8S_VERSION = $(@:run-test-queuename-e2e-%=%)

Suggested change
run-test-queuename-e2e-%: K8S_VERSION = $(@:run-test-queuename-e2e-%=%)
run-test-withoutqueuename-e2e-%: K8S_VERSION = $(@:run-test-queuename-e2e-%=%)

Maybe like this?

@echo Running e2e for k8s ${K8S_VERSION}
E2E_KIND_VERSION="kindest/node:v$(K8S_VERSION)" KIND_CLUSTER_NAME=$(KIND_CLUSTER_NAME) CREATE_KIND_CLUSTER=$(CREATE_KIND_CLUSTER) \
ARTIFACTS="$(ARTIFACTS)/$@" IMAGE_TAG=$(IMAGE_TAG) GINKGO_ARGS="$(GINKGO_ARGS)" \
JOBSET_VERSION=$(JOBSET_VERSION) \

Do we need to prepare jobset?

Comment on lines +38 to +39
visibilityClient visibilityv1beta1.VisibilityV1beta1Interface
impersonatedVisibilityClient visibilityv1beta1.VisibilityV1beta1Interface

Suggested change
visibilityClient visibilityv1beta1.VisibilityV1beta1Interface
impersonatedVisibilityClient visibilityv1beta1.VisibilityV1beta1Interface

Comment on lines +57 to +58
visibilityClient = util.CreateVisibilityClient("")
impersonatedVisibilityClient = util.CreateVisibilityClient("system:serviceaccount:kueue-system:default")

Suggested change
visibilityClient = util.CreateVisibilityClient("")
impersonatedVisibilityClient = util.CreateVisibilityClient("system:serviceaccount:kueue-system:default")

util.ExpectObjectToBeDeleted(ctx, k8sClient, defaultRf, true)
})

ginkgo.It("should suspend it", func() {

Can we join it with "should unsuspend it"? It looks like the logic is the same.
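A sketch of what the merged spec could look like, reusing the lookup keys and helpers quoted elsewhere in this review (util.Timeout is assumed to be the suite's default short timeout):

ginkgo.It("should suspend it, then unsuspend it once a queue name is set", func() {
	ginkgo.By("checking the job is suspended while it has no queue name", func() {
		gomega.Eventually(func(g gomega.Gomega) {
			g.Expect(k8sClient.Get(ctx, jobLookupKey, createdJob)).To(gomega.Succeed())
			g.Expect(ptr.Deref(createdJob.Spec.Suspend, false)).To(gomega.BeTrue())
		}, util.Timeout, util.Interval).Should(gomega.Succeed())
	})
	ginkgo.By("setting the queue name and expecting the job to unsuspend", func() {
		gomega.Eventually(func(g gomega.Gomega) {
			g.Expect(k8sClient.Get(ctx, jobLookupKey, createdJob)).To(gomega.Succeed())
			createdJob.Labels["kueue.x-k8s.io/queue-name"] = "main"
			g.Expect(k8sClient.Update(ctx, createdJob)).To(gomega.Succeed())
		}, util.Timeout, util.Interval).Should(gomega.Succeed())
		gomega.Eventually(func(g gomega.Gomega) {
			g.Expect(k8sClient.Get(ctx, jobLookupKey, createdJob)).To(gomega.Succeed())
			g.Expect(ptr.Deref(createdJob.Spec.Suspend, false)).To(gomega.BeFalse())
		}, util.Timeout, util.Interval).Should(gomega.Succeed())
	})
})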


ginkgo.It("should suspend it", func() {
var testJob *batchv1.Job
ginkgo.By("creating a job without queue name", func() {

Could you please emphasize that the job is unsuspended?

}
createdJob.Labels["kueue.x-k8s.io/queue-name"] = "main"
return k8sClient.Update(ctx, createdJob)
}, util.LongTimeout, util.Interval).Should(gomega.Succeed())

Do we need long timeouts here and below?

gomega.Eventually(func(g gomega.Gomega) {
g.Expect(k8sClient.Get(ctx, jobLookupKey, createdJob)).Should(gomega.Succeed())
g.Expect(ptr.Deref(createdJob.Spec.Suspend, false)).To(gomega.BeFalse())
}, util.LongTimeout, util.Interval).Should(gomega.Succeed())

ditto

gomega.Eventually(func(g gomega.Gomega) {
g.Expect(k8sClient.Get(ctx, wlLookupKey, createdWorkload)).Should(gomega.Succeed())
g.Expect(createdWorkload.Status.Admission).ShouldNot(gomega.BeNil())
}, util.LongTimeout, util.Interval).Should(gomega.Succeed())

ditto


waitForAvailableStart := time.Now()
util.WaitForKueueAvailability(ctx, k8sClient)
util.WaitForJobSetAvailability(ctx, k8sClient)

Do we need JobSet in this suite?

@mbobrovskyi (Contributor) commented Jan 31, 2025

> OTOH we may want to test more e2e configurations in the future.
>
> So I have an idea to mix the strategies and have a new suite, customconfigs, where we could test custom configurations and restart Kueue between them. I believe in practice each configuration will have only a handful of tests specific to it, and the main body of tests will still be in the singlecluster suite. This way we would avoid creating a CI job per config, while also offloading the main run. WDYT?
>
> For now I would suggest putting the e2e tests in this PR under e2e/customconfig/managedwithoutqueuename. WDYT?

If we need custom configurations with the ability to run different setups, I suggest moving Kueue's creation and deletion logic into the Go code instead of running it in the bash script. We can install Kueue in BeforeSuite and delete it in AfterSuite, allowing each test suite to run with the required configuration.
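A rough sketch of that idea; util.InstallKueue and util.UninstallKueue are hypothetical helpers (only util.WaitForKueueAvailability appears in the PR):

var _ = ginkgo.BeforeSuite(func() {
	cfg := defaultKueueConfiguration() // hypothetical: the suite's base configuration
	cfg.ManageJobsWithoutQueueName = true
	util.InstallKueue(ctx, k8sClient, cfg) // hypothetical: applies the manifests and config
	util.WaitForKueueAvailability(ctx, k8sClient)
})

var _ = ginkgo.AfterSuite(func() {
	util.UninstallKueue(ctx, k8sClient) // hypothetical: deletes the Kueue installation
})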

@mimowo (Contributor) commented Jan 31, 2025

Can you elaborate a bit on why? It might be a good idea, but I'm not sure how it relates to this PR. Also, would it affect the singlecluster and multicluster suites?

Also, is this an alternative to ideas (1.)-(3.), or does it fit within one of them?

EDIT: IIUC, moving the kubectl apply into BeforeSuite introduces overhead, so I would prefer to avoid it for the main singlecluster suite, which is often run against an already running cluster with Kueue.

@PBundyra (Contributor) commented

> OTOH we may want to test more e2e configurations in the future.
>
> So I have an idea to mix the strategies and have a new suite, customconfigs, where we could test custom configurations and restart Kueue between them. I believe in practice each configuration will have only a handful of tests specific to it, and the main body of tests will still be in the singlecluster suite. This way we would avoid creating a CI job per config, while also offloading the main run. WDYT?
>
> For now I would suggest putting the e2e tests in this PR under e2e/customconfig/managedwithoutqueuename. WDYT?

+1

Labels
cncf-cla: yes · kind/cleanup · release-note-none · size/L
Development

Successfully merging this pull request may close these issues.

Add higher-level of testing for "queue-name" handling in pod-based workloads