Add examples for three existing failure policy actions. #601

jedwins1998 · 2024-06-10T22:23:29Z

Add examples for each of the following failure policy actions:

FailJobSet,
RestartJobSet,
RestartJobSetAndIgnoreMaxRestarts.

Fixes #600.
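
For orientation, a failure policy exercising the three actions might look roughly like the fragment below. This is a sketch, not the contents of the example files in this PR: the replicated job name `leader` is hypothetical, and the sketch assumes rules are evaluated in order with only the first matching rule applied.

```yaml
# Sketch of a JobSet spec.failurePolicy; not copied from the PR's example files.
failurePolicy:
  maxRestarts: 3
  rules:
  # Fail the whole JobSet immediately if the (hypothetical) leader job fails.
  - action: FailJobSet
    targetReplicatedJobs:
    - leader
  # Restart without counting against maxRestarts when a job failed
  # because its pod failure policy was triggered.
  - action: RestartJobSetAndIgnoreMaxRestarts
    onJobFailureReasons:
    - PodFailurePolicy
  # Any other job failure restarts the JobSet and counts against maxRestarts.
  - action: RestartJobSet
```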

k8s-ci-robot · 2024-06-10T22:23:38Z

Hi @jedwins1998. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

netlify · 2024-06-10T22:23:45Z

✅ Deploy Preview for kubernetes-sigs-jobset canceled.

| Name | Link |
| --- | --- |
| 🔨 Latest commit | `d097262` |
| 🔍 Latest deploy log | https://app.netlify.com/sites/kubernetes-sigs-jobset/deploys/66e1adb57580330007cc62ff |

danielvegamyhre · 2024-06-10T22:36:40Z

/ok-to-test

ahg-g · 2024-08-31T20:14:26Z

Is this ready to merge?

danielvegamyhre · 2024-08-31T22:08:40Z

> Is this ready to merge?

As of last week, Giuseppe (AI Infra team in GKE) was taking over this PR.

jedwins1998 · 2024-09-04T18:35:46Z

> > Is this ready to merge?
>
> As of last week, Giuseppe (AI Infra team in GKE) was taking over this PR.

I spoke with Giuseppe and I'll finish up this PR since I started it.

jedwins1998 · 2024-09-04T22:18:44Z

> > > Is this ready to merge?
>
> > As of last week, Giuseppe (AI Infra team in GKE) was taking over this PR.
>
> I spoke with Giuseppe and I'll finish up this PR since I started it.

I added an example similar to a host maintenance event. I also added short descriptions of the expected behavior in each example. I consider this PR ready to merge now.

kannon92 · 2024-09-05T00:22:10Z

examples/failure-policy/host-maintenance-event-model.yaml

+      - action: RestartJobSetAndIgnoreMaxRestarts
+        onJobFailureReasons:
+        - PodFailurePolicy
+      # The JobSet is restarted as normal when the leader job fails and the above rule is not matched.


What does it mean for this to restart with maxRestarts 0? It would fail right away, right?

jedwins1998 · 2024-09-11

Yes, that is correct. It would fail right away.
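
The behavior discussed in this thread can be sketched with the fragment below. It assumes `maxRestarts: 0`, as the question implies; the surrounding fields of the example file are not reproduced here.

```yaml
failurePolicy:
  # With zero restarts allowed, any failure handled by a counted restart
  # fails the JobSet right away.
  maxRestarts: 0
  rules:
  # Failures caused by a triggered pod failure policy restart the JobSet
  # without counting against maxRestarts.
  - action: RestartJobSetAndIgnoreMaxRestarts
    onJobFailureReasons:
    - PodFailurePolicy
```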

kannon92 · 2024-09-05T00:22:57Z

examples/failure-policy/onjobfailurereasons-present.yaml

+  failurePolicy:
+    maxRestarts: 3
+    rules:
+      # The JobSet will restart and unlimited number of times when the


Suggested change

-      # The JobSet will restart and unlimited number of times when the
+      # The JobSet will restart an unlimited number of times when the

danielvegamyhre · 2024-09-08T16:57:27Z

examples/failure-policy/host-maintenance-event-model.yaml

+                  echo "$i"
+                  sleep 1
+                done
+        podFailurePolicy:


Can you add a comment here explaining this pod failure policy will trigger on host maintenance events when pods are evicted from the affected nodes, thus failing with a condition type of DisruptionTarget?

Pod failure policy is a fairly new, advanced Job API feature that many users won't be familiar with.
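
For readers new to the feature, a Job-level pod failure policy of the kind described above might look like the following sketch. It uses the upstream Kubernetes Job API field `podFailurePolicy.rules[].onPodConditions`. The exact rule in the example file may differ, and the container name and image are placeholders.

```yaml
# Fragment of a Job spec (inside replicatedJobs[].template.spec in a JobSet).
podFailurePolicy:
  rules:
  # Fail the Job when a pod is disrupted, for example evicted during host
  # maintenance. The resulting Job failure reason is PodFailurePolicy.
  - action: FailJob
    onPodConditions:
    - type: DisruptionTarget
template:
  spec:
    # A pod failure policy requires restartPolicy: Never.
    restartPolicy: Never
    containers:
    - name: main          # placeholder
      image: bash:latest  # placeholder
```

A JobSet failure policy rule with `onJobFailureReasons: [PodFailurePolicy]` can then react specifically to Jobs failed this way.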

ahg-g · 2024-09-13T23:12:23Z

/retest

danielvegamyhre · 2024-09-20T01:44:53Z

/lgtm
/approve

k8s-ci-robot · 2024-09-20T01:44:59Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danielvegamyhre, jedwins1998

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [danielvegamyhre]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

danielvegamyhre · 2024-09-20T01:45:16Z

Going to make a couple of small changes to this in a follow up

k8s-ci-robot requested review from danielvegamyhre and kannon92 June 10, 2024 22:23

k8s-ci-robot added the cncf-cla: yes (the PR's author has signed the CNCF CLA) and needs-ok-to-test (the PR requires an org member to verify it is safe to test) labels Jun 10, 2024

k8s-ci-robot added the size/L (the PR changes 100-499 lines, ignoring generated files) label Jun 10, 2024

k8s-ci-robot added the ok-to-test (a non-member PR verified by an org member as safe to test) label and removed the needs-ok-to-test label Jun 10, 2024

googs1025 reviewed Jun 11, 2024

examples/failure-policy/failjobset-action.yaml

danielvegamyhre reviewed Jun 11, 2024

examples/failure-policy/restartjobsetandignoremaxrestarts-action.yaml

jedwins1998 force-pushed the adding-configurable-failure-policy-examples branch from 37f5a92 to c08778c Compare June 13, 2024 22:06

Justin Edwins added 4 commits June 25, 2024 21:29

Add examples for three existing failure policy actions.

fd29fa0

Add examples for each of the following failure policy actions: 1. FailJobSet, 2. RestartJobSet, 3. RestartJobSetAndIgnoreMaxRestarts.

Add example for configurable failure policy using a rule with onJobFailureReasons present.

f19acf0

Correct the name of the jobset in 'examples/failure-policy/onjobfailurereasons-present.yaml'.

13bc476

Add example using onJobFailureReasons with the selected reason being PodFailurePolicy.

dad8964

jedwins1998 force-pushed the adding-configurable-failure-policy-examples branch from 721c42f to dad8964 Compare June 25, 2024 21:30

danielvegamyhre self-assigned this Jul 1, 2024

danielvegamyhre reviewed Jul 2, 2024

examples/failure-policy/failjobset-action.yaml

examples/failure-policy/restartjobsetandignoremaxrestarts-action.yaml

googs1025 mentioned this pull request Jul 27, 2024

chore: use symbolic link instead of directory #630

Merged

danielvegamyhre added the tide/merge-method-squash (the PR should be squashed by tide when it merges) label Aug 10, 2024

Justin Edwins added 3 commits September 4, 2024 15:04

Merge branch 'kubernetes-sigs:main' into adding-configurable-failure-policy-examples

aa983cc

Add example similar to a host maintenance event.

c740191

Add short descriptions of expected behavior in examples.

4be32d6

kannon92 reviewed Sep 5, 2024

danielvegamyhre reviewed Sep 8, 2024

Justin Edwins added 2 commits September 11, 2024 14:40

Fix grammatical error.

1736c95

Add commment describing host maintenance example.

d097262

k8s-ci-robot added the lgtm ("Looks good to me"; the PR is ready to be merged) label Sep 20, 2024

k8s-ci-robot added the approved (the PR has been approved by an approver from all required OWNERS files) label Sep 20, 2024

k8s-ci-robot merged commit 665bc42 into kubernetes-sigs:main Sep 20, 2024
13 checks passed
