Log job failure even when it does not cause a change in task state. #6169
base: 8.3.x
Conversation
(force-pushed 73714c8 → 8f20ab0)
Unfortunately, I don't think it's this simple.
- I think this diff means that polled log messages for task failure will go back to being duplicated.
- This only covers failure, but submission failure also has retries so may require similar treatment.
- I think the failures before retries are exhausted will now get logged at CRITICAL level rather than INFO.
I think that you have a particular closed issue in mind, but I can't find it... Can you point it out to me?
I think that submission failure is already handled correctly - it certainly is in the simplistic case where you feed it
These are logged at critical - and I think they should be?
This would be consistent with submit failure...
(force-pushed 2c7e480 → 3cedf2f)
No, I'm not thinking of the other log message duplication issue. The change made here bypassed logic that was used for suppressing duplicate log messages (see 8f20ab0#diff-d6de42ef75ecc801c26a6be3a9dc4885b64db89d32bce6f07e319595257b9b2eL930). However, in your more recent "fix" commit, you have put this back the way it was before: 3cedf2f#diff-d6de42ef75ecc801c26a6be3a9dc4885b64db89d32bce6f07e319595257b9b2eR930
This does not apply to submit failure, because submit failure will always log a critical warning through the
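The duplicate-message suppression being discussed can be illustrated in miniature. This is a generic sketch using the standard library's `logging.Filter`, not the actual Cylc logic at the linked diff line; the class name and keying are illustrative assumptions:

```python
import logging


class DedupFilter(logging.Filter):
    """Drop a log record if it exactly repeats the previous one.

    Hypothetical sketch of duplicate-log-message suppression; Cylc's
    real suppression logic lives in the scheduler and is keyed
    differently.
    """

    def __init__(self) -> None:
        super().__init__()
        self._last = None

    def filter(self, record: logging.LogRecord) -> bool:
        key = (record.levelno, record.getMessage())
        if key == self._last:
            return False  # suppress exact repeat
        self._last = key
        return True  # allow through and remember it
```

Attached to a handler, this lets the first "job failed" message through but swallows an identical follow-up, which is the behaviour the bypassed code path was protecting.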
(force-pushed 3cedf2f → 1341355)
(Test failures)
(force-pushed 1341355 → ce0498e)
@oliver-sanders & @MetRonnie: Sorry, but I made a total hash of rebasing this and may have introduced some regressions. :( I think that I've been through all your comments, but please double check.

The first commit doesn't appear to be associated with your GH account?
(force-pushed 080d88e → 7df99a5)
I think that I've messed up the setup of NewSpiceVDI.

You'll either need to add a mailmap entry for the new setup, or use
(force-pushed 4ff1fde → ee0770f, amending the commit message: "…e. when there is a retry set up).")
(force-pushed ee0770f → 502b63b)
Fixed - I'm not adding a
@@ -0,0 +1 @@
Ensure that job failure is logged, even when the presence of retries causes the task not to change state.
The task does change state in this case, but to waiting rather than failed.
@wxtim - I made a duplicate attempt at this without realizing you'd already worked on it, sorry. I thought I'd found a small bug with no associated issue. #6401 My bad, but having done it we might as well compare approaches. Both work, but mine is simpler (a one-liner) and I think the result is more consistent between submission and execution failure - see example below. So my feeling is, we should use my branch, but cherry-pick your integration test to it. Would you agree?

```ini
[scheduling]
    [[graph]]
        R1 = """
            a & b
        """
[runtime]
    [[a]]
        script = """
            cylc broadcast -n a -s "script = true" $CYLC_WORKFLOW_ID
            cylc broadcast -n b -s "platform = " $CYLC_WORKFLOW_ID
            false
        """
        execution retry delays = PT5S
    [[b]]
        platform = fake
        submission retry delays = PT5S
```

Log comparison (left me, right you):
@hjoliver The CRITICAL level is probably too much though? Surely WARNING is the right level?

Perhaps, but @wxtim's approach still leaves submit-fail (with a retry) at CRITICAL - hence my consistency comment above. Why treat the two differently? The level is arguable. I think it's OK to log the actual job or job submission failure as critical, but have the workflow then handle it automatically.

I think that the correct level is error. @hjoliver - does your PR fall victim to any of Oliver's comments from #6169 (review)?
If execution/submission retry delays are configured, then execution/submission failures (respectively) are expected to occur. Therefore it is not a CRITICAL message to log. Only if the retries are exhausted should it be a CRITICAL level message? |
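The level-selection rule argued for in this thread could be sketched as follows. This is a hypothetical helper, not Cylc's actual implementation; the function name and the `retries_remaining` parameter are illustrative assumptions:

```python
import logging

logging.basicConfig(format="%(levelname)s: %(message)s")
LOG = logging.getLogger("scheduler")


def log_job_failure(task_id: str, retries_remaining: int) -> None:
    """Log a job failure, demoting the level while retries remain.

    Hypothetical sketch: while retries are configured, the failure is
    expected and will be handled automatically, so log at WARNING;
    only once retries are exhausted is the failure CRITICAL.
    """
    if retries_remaining > 0:
        LOG.warning(
            "job failed: %s (will retry, %d attempt(s) left)",
            task_id, retries_remaining,
        )
    else:
        LOG.critical("job failed: %s (retries exhausted)", task_id)
```

Under this scheme, execution and submission failures would be treated identically, which addresses the consistency concern raised above.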
Closes #6151

Check List
- I have read `CONTRIBUTING.md` and added my name as a Code Contributor.
- Dependency changes applied to both `setup.cfg` (and `conda-environment.yml` if present).
- `CHANGES.md` entry included if this is a change that can affect users.
- PR opened against the relevant `?.?.x` branch.