Feature request: documentation for retrying failed work #142

Open
ryanwalls opened this issue Mar 15, 2016 · 5 comments

Comments

@ryanwalls
Contributor

I can't seem to figure out the best way to handle a few scenarios. Maybe some beefed up docs would help?

Some scenarios:
a) I want to restart a failed/terminated workflow from the last activity that failed, perhaps after deploying new code.

b) I want to back off when retrying a specific activity that fails. E.g., I schedule an activity from the decider, the activity fails with a specific error, and I want to retry it later.

I think the ContinueWorkflowDecision and the ManagedContinuationsInterceptor are related to this problem, but I can't figure out how to apply them.

@sclasen
Owner

sclasen commented Mar 15, 2016

Hi @ryanwalls

For failed/terminated workflows it depends on what caused the termination. Was it SWF-terminated or user-terminated?

SWF will terminate workflows that have gone longer than the StartToClose timeout, and it will also terminate workflows that have more than 25k events.

We typically have a long/always running workflow/activity that watches the ListClosedWorkflowExecutions endpoint and restarts SWF-terminated workflows. This doesn't fix everything in all cases though.

Can you give a little more detail here?
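A minimal sketch of the selection step such a watcher might run. It is not the maintainers' implementation: `closedExecution`, `closedBySWF`, and `selectRestarts` are hypothetical names, the `Cause` field is assumed to have been read from each execution's history, and treating timeouts and event-limit terminations as the restartable cases is an assumed policy.

```go
package main

import "fmt"

// closedExecution is a hypothetical stand-in for one closed execution,
// plus the termination cause (assumed to be read from its history).
type closedExecution struct {
	WorkflowID  string
	CloseStatus string // e.g. "COMPLETED", "TERMINATED", "TIMED_OUT"
	Cause       string // e.g. "OPERATOR_INITIATED", "EVENT_LIMIT_EXCEEDED"
}

// closedBySWF reports whether the service, rather than a user, closed the
// execution. The policy here is an assumption made for illustration.
func closedBySWF(e closedExecution) bool {
	switch e.CloseStatus {
	case "TIMED_OUT":
		return true
	case "TERMINATED":
		return e.Cause == "EVENT_LIMIT_EXCEEDED"
	default:
		return false
	}
}

// selectRestarts returns the IDs of executions the watcher should restart.
func selectRestarts(closed []closedExecution) []string {
	var ids []string
	for _, e := range closed {
		if closedBySWF(e) {
			ids = append(ids, e.WorkflowID)
		}
	}
	return ids
}

func main() {
	closed := []closedExecution{
		{WorkflowID: "job-1", CloseStatus: "COMPLETED"},
		{WorkflowID: "job-2", CloseStatus: "TERMINATED", Cause: "EVENT_LIMIT_EXCEEDED"},
		{WorkflowID: "job-3", CloseStatus: "TERMINATED", Cause: "OPERATOR_INITIATED"},
	}
	for _, id := range selectRestarts(closed) {
		fmt.Println("restart", id)
	}
}
```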

For backoff, one way you can achieve this is to simply use a timer. You can set an arbitrary payload in the Control field of the timer, which could even be a serialized *swf.ScheduleActivityTaskDecision.

When the timer fires, handle the event, read the ScheduleActivityTaskDecision back out of the Control field, and submit it.

@ryanwalls
Contributor Author

Thanks @sclasen. For the first case, these are workflows that have failed for various unforeseen reasons. Currently we have the workflow fail when an activity fails. Perhaps the solution to this problem is to make the workflow itself more robust and just keep the workflow execution open instead of failing it anytime we see an error?

Even with it more robust, let's imagine we want to give the user a way to manually cancel/terminate a workflow execution, and then be able to restart it later. Seems like with all the SerializedState we could restart a workflow from where it last was running, instead of restarting from the beginning?

Thanks for the idea of the timer. Was already leaning that way, just didn't know if there was some other mechanism that you guys recommend. Timer will probably work great for our case.

@sclasen
Owner

sclasen commented Mar 15, 2016

In our current approach, we never terminate or fail workflows from the FSM itself. We have logic to retry everything inside the FSM when activity or other failures happen.

Rarely, we need to externally terminate a workflow, and restart it. You can definitely do that by reading the full SerializedState out of the terminated workflow and setting it as the Input to the new execution. The workflow will be restarted in the proper state.

You can even edit the state data inside the SerializedState if necessary before restarting.

However, your FSM might not start doing anything, as you'd need to do 'something' in an OnStarted handler to handle the restart. So it does take some up front work to be able to smoothly handle restarts.

The ManagedContinuationsInterceptor does basically this. When a workflow reaches a certain age or a certain number of events, the interceptor attempts to continue the workflow on each decision task. It checks whether anything is in flight and, if not, continues the workflow, using the SerializedState as the input to the continued workflow.

Doing this, either manually or via the interceptor, results in SerializedState.StateVersion being incremented properly across continuations/restarts as well.

@ryanwalls
Contributor Author

Interesting. Yeah, we're using SWF as a processing pipeline. We drop in work, it does a lot of calculation steps, outputs results, and finishes. So our workflows are less like services than yours, it sounds.

Will play around with the ManagedContinuationsInterceptor and querying the closed executions. Will let you know if I develop anything I can contribute back.

@ryanwalls
Contributor Author

@sclasen So I'm almost done implementing this... one question. In the OnTimerFired decider, the last history event has an event type of EventTypeTimerFired, which has the StartedEventId. How do I get the actual timer scheduled event (and therefore access to its control field)?

This must be an obvious thing that I'm just missing....

EDIT before I even posted....

Figured it out while I was writing this, but will leave here for the next person.

The correlator has the timer info. So just use https://godoc.org/github.com/sclasen/swfsm/fsm#EventCorrelator.TimerInfo.
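For the next person, a sketch of the idea behind that lookup: remember each timer's control when it starts, then resolve the fired event through its StartedEventId. This is a self-contained illustration, not swfsm's implementation. `historyEvent`, `timerInfo`, `correlator`, `track`, and `lookup` are hypothetical stand-ins, so check the linked godoc for the real types and signature.

```go
package main

import "fmt"

// historyEvent is a simplified, hypothetical stand-in for an SWF history event.
type historyEvent struct {
	EventID        int64
	EventType      string
	TimerID        string
	Control        string // set on TimerStarted
	StartedEventID int64  // set on TimerFired
}

// timerInfo is what the correlator remembers about a started timer.
type timerInfo struct {
	TimerID string
	Control string
}

// correlator indexes started timers by the ID of their TimerStarted event.
type correlator struct {
	timers map[int64]timerInfo
}

// track records timer details as history events are replayed.
func (c *correlator) track(e historyEvent) {
	if e.EventType != "TimerStarted" {
		return
	}
	if c.timers == nil {
		c.timers = make(map[int64]timerInfo)
	}
	c.timers[e.EventID] = timerInfo{TimerID: e.TimerID, Control: e.Control}
}

// lookup resolves a TimerFired event to the info of the timer that started it.
func (c *correlator) lookup(fired historyEvent) (timerInfo, bool) {
	info, ok := c.timers[fired.StartedEventID]
	return info, ok
}

func main() {
	c := &correlator{}
	c.track(historyEvent{EventID: 5, EventType: "TimerStarted", TimerID: "retry-1", Control: `{"attempt":2}`})

	fired := historyEvent{EventID: 9, EventType: "TimerFired", TimerID: "retry-1", StartedEventID: 5}
	if info, ok := c.lookup(fired); ok {
		fmt.Println("control:", info.Control)
	}
}
```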
