Initial checkin - Step Function cross-execution concurrency control pattern #49

Open
wants to merge 5 commits into
base: main
@junguo junguo commented Aug 29, 2023

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.


Description

This example demonstrates cross-execution concurrency control for AWS Step Functions workflows, using the ListExecutions API (https://docs.aws.amazon.com/step-functions/latest/apireference/API_ListExecutions.html).

Within a single flow, one can use a Map or Distributed Map state to control how many concurrent flows are launched within the same execution. However, there are use cases where one may want to limit the number of concurrent executions of the same workflow, for example because of downstream API limits or tasks that require human intervention.

This is Issue #52

Implementation

Concurrency Controller function:

The concurrency controller Lambda function checks, for a given Step Functions state machine ARN, the current number of executions using the ListExecutions API. It then compares that count against a preset concurrency threshold, a static value stored in SSM Parameter Store (for simplicity), and returns a “proceed” or “wait” flag.
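A minimal sketch of that check, not the code in this PR. It assumes boto3-style clients are passed in (for example `boto3.client("stepfunctions")` and `boto3.client("ssm")`), that only `RUNNING` executions count toward the limit, and that "proceed" means strictly below the threshold. The function names and the parameter name are hypothetical.

```python
from typing import Any, Dict, Optional


def count_running_executions(sfn: Any, state_machine_arn: str) -> int:
    """Count RUNNING executions, following nextToken pagination."""
    total = 0
    token: Optional[str] = None
    while True:
        kwargs: Dict[str, Any] = {
            "stateMachineArn": state_machine_arn,
            "statusFilter": "RUNNING",  # assumption: only running executions count
        }
        if token:
            kwargs["nextToken"] = token
        page = sfn.list_executions(**kwargs)
        total += len(page["executions"])
        token = page.get("nextToken")
        if not token:
            return total


def read_threshold(ssm: Any, parameter_name: str) -> int:
    """Read the static concurrency threshold from SSM Parameter Store."""
    return int(ssm.get_parameter(Name=parameter_name)["Parameter"]["Value"])


def decide(sfn: Any, ssm: Any, state_machine_arn: str, parameter_name: str) -> str:
    """Return "proceed" if the running count is below the threshold, else "wait"."""
    running = count_running_executions(sfn, state_machine_arn)
    threshold = read_threshold(ssm, parameter_name)
    # Assumption: strictly-below comparison. The count is only a snapshot,
    # so the result is best effort.
    return "proceed" if running < threshold else "wait"
```

Passing the clients in as arguments keeps the decision logic testable without AWS access.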

Other considerations

  • A single SAM template creates all resources
  • Test runner: a Lambda function that generates test messages to SQS (e.g., 1k - 10k)
  • SQS triggers the concurrency controller Lambda function, with a batch size of 1 and maximum concurrency set to 4 (to avoid ThrottlingException on the API call and race conditions)
  • A random delay of up to 1 second (jitter) is introduced when listExecutions is called, to avoid race conditions
  • Concurrency threshold is set to 10 in SSM Parameter Store
  • The listExecutions() API is eventually consistent and its results are best effort (no SLA) → concurrency can occasionally exceed the threshold value
  • Concurrency can be tracked using CloudWatch Logs Insights
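The SQS-triggered path with jitter could look roughly like the sketch below. It is an illustration under assumptions, not the handler in this PR: `decide` and `start` are hypothetical callables supplied by the caller (the concurrency check and the call that starts an execution), and raising on "wait" so that the message returns to the queue after its visibility timeout is an assumed design choice.

```python
import json
import random
import time
from typing import Any, Callable, Dict


def handle_event(
    event: Dict[str, Any],
    decide: Callable[[], str],           # hypothetical: returns "proceed" or "wait"
    start: Callable[[Any], None],        # hypothetical: starts one execution
    sleep: Callable[[float], None] = time.sleep,
    max_jitter_s: float = 1.0,
) -> int:
    """Process an SQS event (batch size 1, so normally a single record)."""
    started = 0
    for record in event["Records"]:
        # Random delay of up to max_jitter_s seconds before the check.
        sleep(random.uniform(0.0, max_jitter_s))
        if decide() != "proceed":
            # Assumption: failing the invocation leaves the message on the
            # queue, so it is retried after the visibility timeout.
            raise RuntimeError("concurrency threshold reached")
        start(json.loads(record["body"]))
        started += 1
    return started
```

The jitter only spreads the checks out in time; the check and the start are still two separate steps.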

kitsunde commented Dec 5, 2023

This is a check-then-act race condition: multiple executions will check the current execution count, believe it's below the threshold, and schedule theirs, going above the threshold value.
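A deterministic toy model of that interleaving, for illustration only. It uses no AWS calls; the numbers (9 running, threshold 10, 4 concurrent checkers) are chosen to match the settings described in the PR, and the function names are hypothetical.

```python
from typing import List


def interleaved_check_then_act(running: List[str], threshold: int, checkers: int) -> int:
    """Worst case: every checker reads the count before any of them acts."""
    observed = [len(running) for _ in range(checkers)]  # all checks happen first
    for i, seen in enumerate(observed):
        if seen < threshold:          # each checker trusts its stale snapshot
            running.append(f"new-{i}")
    return len(running)


def serialized_check_then_act(running: List[str], threshold: int, checkers: int) -> int:
    """Ideal case: each check and act happens as one indivisible step."""
    for i in range(checkers):
        if len(running) < threshold:
            running.append(f"new-{i}")
    return len(running)


already_running = [f"exec-{i}" for i in range(9)]
print(interleaved_check_then_act(list(already_running), 10, 4))  # 13
print(serialized_check_then_act(list(already_running), 10, 4))   # 10
```

With a threshold of 10, the interleaved case ends at 13 running executions, while the serialized case stops at 10.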

Jitter doesn't solve race conditions; it just makes them harder to find. Here, 4 concurrent executions will sometimes trigger a race condition depending on how the dice rolls fall. Jitter is really for traffic shaping.

You're better off using DynamoDB to control the concurrency; it supports conditional writes, which will solve the issue.
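One way the DynamoDB suggestion might look, as a sketch under assumptions rather than a reviewed design: a single counter item is incremented with a conditional `update_item`, so the check and the increment happen atomically. The table layout (`pk` key, `in_flight` attribute) and the function names are hypothetical, and `ddb` is assumed to be a low-level `boto3.client("dynamodb")`.

```python
from typing import Any


def try_acquire(ddb: Any, table: str, lock_name: str, limit: int) -> bool:
    """Atomically take one slot; return False if the limit is already reached."""
    try:
        ddb.update_item(
            TableName=table,
            Key={"pk": {"S": lock_name}},                 # hypothetical key schema
            UpdateExpression="ADD in_flight :one",
            # The write only succeeds while the counter is below the limit.
            ConditionExpression="attribute_not_exists(in_flight) OR in_flight < :limit",
            ExpressionAttributeValues={
                ":one": {"N": "1"},
                ":limit": {"N": str(limit)},
            },
        )
        return True
    except ddb.exceptions.ConditionalCheckFailedException:
        return False


def release(ddb: Any, table: str, lock_name: str) -> None:
    """Give one slot back; the condition keeps the counter from going negative."""
    ddb.update_item(
        TableName=table,
        Key={"pk": {"S": lock_name}},
        UpdateExpression="ADD in_flight :minus_one",
        ConditionExpression="in_flight > :zero",
        ExpressionAttributeValues={
            ":minus_one": {"N": "-1"},
            ":zero": {"N": "0"},
        },
    )
```

A caller would start an execution only when `try_acquire` returns `True`, and something would need to call `release` when the execution ends. How to do that reliably (including for failed or timed-out executions) is not covered by this sketch.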
