Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] Add an e2e test for K8s Job submitter failures #2688

Conversation

simotw
Copy link
Contributor

@simotw simotw commented Dec 26, 2024

Why are these changes needed?

Add an test for #2579.

  1. Create a RayJob
  2. Wait for the Pod created by the K8s Job submitter to be ready
  3. Let the Pod to run for a while (e.g. 15s)
  4. Delete the Pod
  5. The K8s Job should create another Pod
  6. The RayJob should eventually succeed.
  7. The RayJobDeployment eventually complete.
    (Add this according to the issue [Bug] RayJob falsely marked as "Running" when driver fails #2154)

CleanShot 2024-12-26 at 15 15 00@2x

Related issue number

Close #2655

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@simotw simotw force-pushed the feat/add-e2e-test-for-k8s-job-submitter-failures branch from cabcdc0 to 9b85e43 Compare December 26, 2024 06:50
@simotw simotw force-pushed the feat/add-e2e-test-for-k8s-job-submitter-failures branch from a88d325 to 0ddad63 Compare December 26, 2024 07:18
@simotw simotw marked this pull request as ready for review December 26, 2024 07:27
@kevin85421
Copy link
Member

cc @rueian would you mind reviewing this PR? Thanks!

@simotw simotw requested a review from rueian December 28, 2024 03:25
@rueian
Copy link
Contributor

rueian commented Dec 28, 2024

LGTM.

I just also come up with a new idea of not relying on the sleep operation but using the named actor for this e2e test: the job script can exit with an manual error after registering a named actor to Ray and later being restart by k8s it can exit normally as the named actor is already registered.

This way we don’t need to wait for 15 seconds to delete the pod and 1 more minute for job completion.

What do you think @simotw, @kevin85421? Maybe this can be another PR.

@rueian
Copy link
Contributor

rueian commented Dec 28, 2024

Sorry, please ignore the above idea. It doesn’t make submitter fail.

@kevin85421 kevin85421 merged commit 9d25660 into ray-project:master Dec 28, 2024
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature] Add an e2e test for K8s Job submitter failures
3 participants