Log job failure even when it does not cause a change in task state. #6169

Open
wants to merge 1 commit into 8.3.x
Conversation

@wxtim (Member) commented Jun 25, 2024

Closes #6151

Check List

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Applied any dependency changes to both setup.cfg (and conda-environment.yml if present).
  • Tests are included (or explain why tests are not needed).
  • CHANGES.md entry included if this is a change that can affect users.
  • Cylc-Doc pull request opened if required at cylc/cylc-doc/pull/XXXX.
  • If this is a bug fix, PR should be raised against the relevant ?.?.x branch.

@wxtim wxtim force-pushed the fix.task_fail_not_logged_if_retries branch from 73714c8 to 8f20ab0 on June 25, 2024 13:57
@wxtim wxtim self-assigned this Jun 25, 2024
@wxtim wxtim added the "bug" and "small" labels Jun 25, 2024
@wxtim wxtim added this to the 8.3.1 milestone Jun 25, 2024
@wxtim wxtim linked an issue Jun 25, 2024 that may be closed by this pull request
@oliver-sanders (Member) left a comment

Unfortunately, I don't think it's this simple.

  • I think this diff means that polled log messages for task failure will go back to being duplicated.
  • This only covers failure, but submission failure also has retries so may require similar treatment.
  • I think the failures before retries are exhausted will now get logged at CRITICAL level rather than INFO.

@wxtim wxtim marked this pull request as draft June 26, 2024 14:45
@oliver-sanders oliver-sanders modified the milestones: 8.3.1, 8.3.x Jun 27, 2024
@wxtim (Member, Author) commented Jul 15, 2024

> I think this diff means that polled log messages for task failure will go back to being duplicated.

I think that you have a particular closed issue in mind, but I can't find it... Can you point it out to me?

> This only covers failure, but submission failure also has retries so may require similar treatment.

I think that submission failure is already handled correctly. It certainly is in the simple case: if you feed it platform=garbage, you get a submission-failed message at CRITICAL level:

CRITICAL - [1/bar/01:preparing] submission failed
CRITICAL - [1/bar/02:preparing] submission failed
CRITICAL - [1/bar/03:preparing] submission failed

These are logged at critical - and I think they should be?

> I think the failures before retries are exhausted will now get logged at CRITICAL level rather than INFO.

This would be consistent with submit failure...

@wxtim wxtim force-pushed the fix.task_fail_not_logged_if_retries branch from 2c7e480 to 3cedf2f on July 15, 2024 10:19
@wxtim wxtim marked this pull request as ready for review July 15, 2024 10:20
@oliver-sanders (Member) commented:

>> I think this diff means that polled log messages for task failure will go back to being duplicated.

> I think that you have a particular closed issue in mind, but I can't find it... Can you point it out to me?

No, I'm not thinking of the other log message duplication issue.

The change made here bypassed logic that was used for suppressing duplicate log messages (the `f'{FAIL_MESSAGE_PREFIX}ERR'` bit):

8f20ab0#diff-d6de42ef75ecc801c26a6be3a9dc4885b64db89d32bce6f07e319595257b9b2eL930

However, in your more recent "fix" commit, you have put this back the way it was before:

3cedf2f#diff-d6de42ef75ecc801c26a6be3a9dc4885b64db89d32bce6f07e319595257b9b2eR930
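
For illustration only, the kind of guard being discussed looks roughly like the sketch below. This is not the actual task_events_mgr code: the value of FAIL_MESSAGE_PREFIX, the cache, and the function name are simplified stand-ins. The idea is that a failure message is logged only the first time it is seen for a given job, so the same failure reported again by polling is suppressed:

import logging

LOG = logging.getLogger("sketch")

# Stand-in for the prefix used to build job failure messages.
FAIL_MESSAGE_PREFIX = "job failed: "

# Messages already logged, keyed by (job identity, message).
_already_logged = set()

def log_failure_once(job_id, signal="ERR"):
    """Log a job failure unless the same message was already logged.

    A repeat of an already-handled failure (e.g. reported again by
    polling) is ignored - the duplicate suppression referred to above.
    """
    message = f"{FAIL_MESSAGE_PREFIX}{signal}"
    key = (job_id, message)
    if key in _already_logged:
        return  # duplicate: do not log again
    _already_logged.add(key)
    LOG.critical("[%s] %s", job_id, message)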

(A comment from oliver-sanders was marked as off-topic.)

@oliver-sanders oliver-sanders dismissed their stale review July 15, 2024 12:32

because I'm blind

@wxtim wxtim marked this pull request as draft July 16, 2024 15:36
@wxtim (Member, Author) commented Jul 22, 2024

This does not apply to submit failure, because submit failure always logs a CRITICAL message through the jobs-submit command:

WARNING - platform: mulberry - Could not connect to mymachine.
    * mymachine has been added to the list of unreachable hosts
    * remote-init will retry if another host is available.
ERROR - [jobs-submit cmd] (remote init)
    [jobs-submit ret_code] 1
CRITICAL - [1/bar/02:preparing] submission failed
INFO - [1/bar/02:preparing] => waiting
WARNING - [1/bar:waiting] retrying in PT1S (after 2024-07-22T14:51:37+01:00)

@wxtim wxtim force-pushed the fix.task_fail_not_logged_if_retries branch from 3cedf2f to 1341355 on July 22, 2024 13:59
@wxtim wxtim marked this pull request as ready for review July 22, 2024 13:59
@MetRonnie (Member) left a comment

(Test failures)

@wxtim wxtim force-pushed the fix.task_fail_not_logged_if_retries branch from 1341355 to ce0498e on July 26, 2024 09:50
@wxtim wxtim requested a review from MetRonnie July 26, 2024 10:14
changes.d/fix.6169.md (review thread: outdated, resolved)
cylc/flow/task_events_mgr.py (review thread: resolved)
cylc/flow/task_events_mgr.py (review thread: outdated, resolved)
@MetRonnie MetRonnie removed the request for review from markgrahamdawson July 31, 2024 11:39
@wxtim wxtim marked this pull request as ready for review October 3, 2024 10:56
@wxtim (Member, Author) commented Oct 3, 2024

@oliver-sanders & @MetRonnie : Sorry, but I made a total hash of rebasing this and may have introduced some regressions. :(

I think that I've been through all your comments, but please double check.

@wxtim wxtim marked this pull request as draft October 3, 2024 15:29
changes.d/6169.fix (review thread: outdated, resolved)
@wxtim wxtim marked this pull request as ready for review October 3, 2024 16:05
@oliver-sanders (Member) commented:

The first commit doesn't appear to be associated with your GH account?

@oliver-sanders oliver-sanders modified the milestones: 8.3.x, 8.3.5 Oct 4, 2024
@wxtim wxtim force-pushed the fix.task_fail_not_logged_if_retries branch from 080d88e to 7df99a5 on October 4, 2024 13:18
@wxtim (Member, Author) commented Oct 4, 2024

> The first commit doesn't appear to be associated with your GH account?

I think that I've messed up the setup of NewSpiceVDI.

@oliver-sanders (Member) commented:

You'll either need to add a mailmap entry for the new setup, or use git rebase to amend the commit to change the author, otherwise the check-contributors action will fail when next run.

@wxtim wxtim force-pushed the fix.task_fail_not_logged_if_retries branch 2 times, most recently from 4ff1fde to ee0770f on October 9, 2024 08:10
@wxtim wxtim force-pushed the fix.task_fail_not_logged_if_retries branch from ee0770f to 502b63b on October 9, 2024 08:13
@wxtim (Member, Author) commented Oct 9, 2024

> You'll either need to add a mailmap entry for the new setup, or use git rebase to amend the commit to change the author, otherwise the check-contributors action will fail when next run.

Fixed - I'm not adding a metoffice.gov.uk email address to my history.

@oliver-sanders oliver-sanders requested review from hjoliver and removed request for MetRonnie October 14, 2024 09:56
@oliver-sanders (Member) commented:

@hjoliver, swapping you in as a reviewer, as you have also attempted this in #6401 and might want to compare approaches.

@oliver-sanders oliver-sanders modified the milestones: 8.3.5, 8.3.6 Oct 14, 2024
@@ -0,0 +1 @@
Ensure that job failure is logged, even when the presence of retries causes the task not to change state.
Review comment (Member):

The task does change state in this case, but to waiting rather than failed.

@hjoliver (Member) commented Oct 22, 2024

@wxtim - I made a duplicate attempt at this without realizing you'd already worked on it, sorry. I thought I'd found a small bug with no associated issue. #6401

My bad, but having done it we might as well compare approaches. Both work, but mine is simpler (a one-liner) and I think the result is more consistent between submission and execution failure - see example below.

So my feeling is, we should use my branch, but cherry-pick your integration test to it. Would you agree?

[scheduling]
    [[graph]]
        R1 = """
            a & b
        """
[runtime]
    [[a]]
        # Fails on the first attempt (false), after broadcasting settings
        # intended to make the retries of both tasks succeed.
        script = """
            cylc broadcast -n a -s "script = true" $CYLC_WORKFLOW_ID
            cylc broadcast -n b -s "platform = " $CYLC_WORKFLOW_ID
            false
        """
        execution retry delays = PT5S
    [[b]]
        # Submission fails (the platform does not exist), then retries.
        platform = fake
        submission retry delays = PT5S

Log comparison (left me, right you):

[screenshot: log comparison omitted]

@MetRonnie (Member) commented:

@hjoliver The CRITICAL level is probably too much though? Surely WARNING is the right level?

@hjoliver (Member) commented Oct 22, 2024

> @hjoliver The CRITICAL level is probably too much though? Surely WARNING is the right level?

Perhaps, but @wxtim's approach still leaves submit-fail (with a retry) at CRITICAL - hence my consistency comment above. Why treat the two differently?

The level is arguable. I think it's OK to log the actual job or job submission failure as critical, but have the workflow then handle it automatically.

@wxtim (Member, Author) commented Oct 23, 2024

I think that the correct level is ERROR.

@hjoliver - does your PR fall victim to any of Oliver's comments from #6169 (review)?

@MetRonnie (Member) commented:

If execution/submission retry delays are configured, then execution/submission failures (respectively) are expected to occur. Therefore it is not a CRITICAL message to log. Only if the retries are exhausted should it be a CRITICAL level message?
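
As a rough sketch of that policy (not the Cylc implementation; log_job_failure and retries_remaining are hypothetical names), the severity would be demoted while a retry is still due and raised only on the final failure:

import logging

LOG = logging.getLogger("sketch")

def log_job_failure(task_id, retries_remaining):
    """Log a job failure, demoting the severity while retries remain."""
    if retries_remaining:
        # An expected failure: a configured retry will handle it.
        LOG.warning("[%s] job failed - will retry", task_id)
    else:
        # Retries exhausted: this failure needs attention.
        LOG.critical("[%s] job failed", task_id)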

Labels
bug (Something is wrong :(), small
Development

Successfully merging this pull request may close these issues.

scheduler: task failure not logged when retries are configured
4 participants