Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updated workflow sweeper to use workflowOffsetTimeout for in progress tasks #227

Conversation

danmiller192
Copy link
Contributor

Pull Request type

  • Bugfix
  • Feature
  • Refactoring (no functional changes, no api changes)
  • Build related changes
  • WHOSUSING.md
  • Other (please describe):

NOTE: Please remember to run ./gradlew spotlessApply to fix any format violations.

Changes in this PR

Preface: Currently when the sweeper processes a workflow and the task is in progress the sweeper picks up the task timeout and queues a message on the decider queue with this time to next process the workflow. This can have the effect of leaving workflows in limbo for long periods with their tasks not being scheduled until that timeout is reached.

This fix pr addresses this by changing unack method to just use the workflowOffsetTimeout. This means that by default the workflow will be reprocessed in 30 seconds rather than the default 60 mins.

Describe alternative implementation you have considered

I have looked at ways to push the workflow to the decider but due to the asynchronous nature this can on occasion fall out of step.

@v1r3n
Copy link
Collaborator

v1r3n commented Aug 6, 2024

@danmiller192 the current approach ensures that the workflows are not sweeped unless they need to be. Changing this will have implications in the systems with large no. of running workflows.

@danmiller192
Copy link
Contributor Author

danmiller192 commented Aug 6, 2024

@v1r3n totally agree, I think it would be better to avoid the extra load. I did test with some large workflows running about 600 workflows per min and it didn't seem to cause any performance issues. When running that volume without the fix I did see significant volume of tasks that would get stuck waiting. Overall I think I was intending this to be a workaround rather than a fix. I will try and see if I can find another way to address the issue.

@v1r3n
Copy link
Collaborator

v1r3n commented Sep 2, 2024

@danmiller192 what would be the advantage of sweeping a workflow before the task timeout?

@danmiller192 danmiller192 force-pushed the workflow_postpone_duration_fix branch from 91a50a9 to 6f8f421 Compare October 21, 2024 18:29
@danmiller192 danmiller192 force-pushed the workflow_postpone_duration_fix branch 2 times, most recently from 88b8071 to f695e0d Compare November 19, 2024 16:59
Updated workflow sweeper to use the workflowOffsetTimeout rather than the task timeout when the task is in progress.
@danmiller192 danmiller192 force-pushed the workflow_postpone_duration_fix branch 2 times, most recently from 9552f6f to 0c56867 Compare December 12, 2024 20:26
@danmiller192 danmiller192 force-pushed the workflow_postpone_duration_fix branch from 0c56867 to 8325245 Compare December 12, 2024 20:31
@danmiller192
Copy link
Contributor Author

I have updated the proposal to use a config value MaxPostponeDurationSeconds to allow for tasks that may have a long and varied response time. Creating a new PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants