Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update AppWrappers to v0.21.1 #588

Merged
merged 1 commit into from
Jul 15, 2024

Conversation

dgrove-oss
Copy link
Collaborator

Highlights:

  • Update to Kueue 0.7.1
  • Remove ChildAdmissionController only needed by Kueue 0.6
  • Flexible configuration of Kueue's GenericJob Reconciller

Full Changelog:
project-codeflare/appwrapper@v0.20.2...v0.21.0

Copy link
Collaborator

@KPostOffice KPostOffice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all look sane to me. Do you have any clue about the failing e2e test. Could it be due to some change in logic in the AW controller?

@dgrove-oss
Copy link
Collaborator Author

At first glance in the logs, it looks like the AppWrapper+RayCluster workload is pending because a GPU isn't available, but I haven't analyzed fully to figure out if that is just a transitory problem that might resolve once the concurrently running test that is using the GPU first finishes..

10T14:54:39.771940512Z","logger":"scheduler","caller":"scheduler/scheduler.go:263","msg":"Workload requires preemption, but there are no candidate workloads allowed for preemption","workload":{"name":"appwrapper-raycluster-af4b6","namespace":"test-ns-vpd2b"},"clusterQueue":{"name":"e2e-cluster-queue"},"preemption":{"reclaimWithinCohort":"Never","borrowWithinCohort":{"policy":"Never"},"withinClusterQueue":"Never"}}
{"level":"Level(-2)","ts":"2024-07-10T14:54:39.771986932Z","logger":"scheduler","caller":"scheduler/scheduler.go:619","msg":"Workload re-queued","workload":{"name":"appwrapper-raycluster-af4b6","namespace":"test-ns-vpd2b"},"clusterQueue":{"name":"e2e-cluster-queue"},"queue":{"name":"lq-vbsxg","namespace":"test-ns-vpd2b"},"requeueReason":"","added":true}
{"level":"Level(-2)","ts":"2024-07-10T14:54:39.772046413Z","logger":"cluster-queue-reconciler","caller":"core/clusterqueue_controller.go:330","msg":"Got generic event","obj":{"name":"appwrapper-raycluster-af4b6","namespace":"test-ns-vpd2b"},"kind":"/, Kind="}
{"level":"debug","ts":"2024-07-10T14:54:39.780801696Z","logger":"events","caller":"recorder/recorder.go:104","msg":"couldn't assign flavors to pod set raycluster-0-1: insufficient unused quota for nvidia.com/gpu in flavor default-flavor, 1 more needed","type":"Normal","object":{"kind":"Workload","namespace":"test-ns-vpd2b","name":"appwrapper-raycluster-af4b6","uid":"fd964893-f987-47b1-9858-7001fa2c6a89","apiVersion":"kueue.x-k8s.io/v1beta1","resourceVersion":"3691"},"reason":"Pending"}

@sutaakar
Copy link
Contributor

I have already seen this intermittently, seems caused by AppWrapper workload not being admitted:

2024-07-10 14:54:39  | 2024-07-10 14:54:39  | appwrapper-raycluster-af4b6.17e0e1c1b799975a  | -          | Normal  | Pending          | couldn't assign flavors to pod set raycluster-0-1: insufficient unused quota for nvidia.com/gpu in flavor default-flavor, 1 more needed  | 
2024-07-10 14:54:39  | 2024-07-10 14:54:39  | raycluster.17e0e1c1b7086150                   | -          | Normal  | CreatedWorkload  | Created Workload: test-ns-vpd2b/appwrapper-raycluster-af4b6      

Though I wasn't able to reproduce it yet.
Will try to get some more context to that failure (LocalQueue status, Workloads and such).

@dgrove-oss
Copy link
Collaborator Author

Going to hold because back leveling Go to 1.22.2 is more trouble that it is worth.

@sutaakar
Copy link
Contributor

e2e test instability is addressed in #591

Highlights:
  * Update to Kueue 0.7.1
  * Remove ChildAdmissionController only needed by Kueue 0.6
  * Flexible configuration of Kueue's GenericJob Reconciller
  * Correctly handle wrapped resources that use generateName

Full Changelog:
  project-codeflare/appwrapper@v0.20.2...v0.21.1
@dgrove-oss dgrove-oss changed the title Update AppWrappers to v0.21.0 Update AppWrappers to v0.21.1 Jul 12, 2024
replace github.com/project-codeflare/appwrapper v0.21.1 => github.com/project-codeflare/appwrapper v0.21.2-0.20240712173553-5b007c947b37

replace sigs.k8s.io/kueue v0.7.1 => github.com/opendatahub-io/kueue v0.7.0-odh-test

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not thrilled with these replace directives, but we can't have the go.version at 1.22.2 and depend directly on the upstream kueue and appwrapper releases which are at 1.22.4 without breaking all the go tooling.

Copy link

openshift-ci bot commented Jul 15, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sutaakar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit 03d5bee into project-codeflare:main Jul 15, 2024
8 checks passed
@dgrove-oss dgrove-oss deleted the bump-aw branch July 15, 2024 14:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants