
Retry bulk export requests that yield 5xx server errors #336

Merged: mikix merged 1 commit into main from mikix/bulk-retry on Aug 7, 2024

Conversation

@mikix (Contributor) commented on Aug 6, 2024

Retry each request up to four times (for a total of five requests), with an exponential backoff of 1, 2, 4, and 8 minutes (15 minutes of waiting in total).

This should hopefully help when dealing with flaky EHRs.

Fixes: #334
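To spell out the schedule (a quick illustrative calculation, not code from this PR): the first request goes out immediately, and each retry waits twice as long as the previous one.

```python
# Illustrative only: print the retry schedule described above.
delays_minutes = [1, 2, 4, 8]  # wait before attempts 2 through 5
for attempt, wait in enumerate(delays_minutes, start=2):
    print(f"attempt {attempt}: after waiting {wait} minute(s)")
print(f"total waiting across 5 attempts: {sum(delays_minutes)} minutes")
```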

Checklist

  • Consider if documentation (like in docs/) needs to be updated
  • Consider if tests should be added

Comment on lines 262 to 266:

- client_base_url = re.sub(r"/Patient/?$", "/", client_base_url)
- client_base_url = re.sub(r"/Group/[^/]+/?$", "/", client_base_url)
+ client_base_url = re.sub(r"/Patient($|/.*)", "/", client_base_url)
+ client_base_url = re.sub(r"/Group/[^/]+($|/.*)", "/", client_base_url)
@mikix (Author):

Unrelated, but now that we support input URLs with arguments like /$export?_type=xxx, I discovered that this regex, which tries to find the server root for authentication purposes, needed to handle that better.
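For context, a minimal standalone sketch of what the updated substitutions do (the helper name and example URLs are made up; only the two regexes come from the diff): they strip a trailing Patient or Group export path, including any /$export?... suffix, back down to the server root used for authentication.

```python
import re

def strip_export_path(url: str) -> str:
    # Hypothetical helper mirroring the two substitutions in the diff above.
    url = re.sub(r"/Patient($|/.*)", "/", url)
    url = re.sub(r"/Group/[^/]+($|/.*)", "/", url)
    return url

# Both of these now reduce to the server root, query arguments and all:
print(strip_export_path("https://ehr.example.com/fhir/Patient/$export?_type=Condition"))
print(strip_export_path("https://ehr.example.com/fhir/Group/my-group/$export?_type=Condition"))
# -> https://ehr.example.com/fhir/
```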

mikix force-pushed the mikix/bulk-retry branch from 9882449 to baf297a (August 6, 2024 19:43)
mikix marked this pull request as ready for review (August 6, 2024 19:43)
mikix force-pushed the mikix/bulk-retry branch from baf297a to 2beccef (August 6, 2024 19:45)
# These times are extremely generous - partly because we can afford to be
# as a long-running async task and partly because EHR servers seem prone to
# outages that clear up after a bit.
error_retry_minutes = [1, 2, 4, 8] # and then raise
A reviewer (Contributor) commented:

Is there any value in allowing a user to specify the number of retries?

@mikix (Author) replied:

So far, I've just put retrying on the "application" side of the http code - i.e. not directly in FhirClient, but rather in the BulkExporter code.

Partly because I wanted to give myself the freedom to be a bit inflexible with it for now. Like... I dunno, is the user really going to want to specify the number of retries? Feels too nitty-gritty - if after 15 minutes a server still has (recoverable) problems, I'd rather just hear about it and then add another hard-coded retry.

The other odd thing about this retry code is that it's (intentionally) so slow. Most "exponential backoff" code out there is like, "wait 2 seconds, then 4, then 20 seconds" and I'm out here waiting 8 minutes.

So when I was thinking of building flexibility into FhirClient for number of retries and how long, just to use it in one place with weird values, I figured - just build the inflexible version first.

If we want retrying elsewhere in the ETL later, we can push this down a layer into FhirClient, with more flexibility. Does that sound OK?

The reviewer replied:

Yeah, that's perfectly reasonable.


# Calculate how long to wait, with a basic exponential backoff for errors.
if num_errors:
default_delay = error_retry_minutes[num_errors - 1] * 60
A reviewer (Contributor) commented:
Nit: should there be one unified backoff scheme? Should it live in one place?
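For illustration, here is a rough sketch of how the delay table and the calculation above could combine into a retry loop. This is not the actual BulkExporter code: the exception class and function names are invented, and the real code presumably only falls back to this default_delay when the server does not supply its own wait time.

```python
import asyncio

class TransientServerError(Exception):
    """Hypothetical stand-in for a retryable 5xx response."""

# Mirrors the snippet above; everything else in this sketch is illustrative.
error_retry_minutes = [1, 2, 4, 8]  # and then raise

async def request_with_retries(send_request):
    num_errors = 0
    while True:
        try:
            return await send_request()
        except TransientServerError:
            num_errors += 1
            if num_errors > len(error_retry_minutes):
                raise  # fifth attempt failed too; give up
            # Same calculation as in the diff: index into the table, convert to seconds.
            default_delay = error_retry_minutes[num_errors - 1] * 60
            await asyncio.sleep(default_delay)
```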

github-actions bot commented on Aug 6, 2024

☂️ Python Coverage

current status: ✅

Overall Coverage

| Lines | Covered | Coverage | Threshold | Status |
|-------|---------|----------|-----------|--------|
| 3300  | 3206    | 97%      | 97%       | 🟢     |

New Files

No new covered files...

Modified Files

| File                                    | Coverage | Status |
|-----------------------------------------|----------|--------|
| cumulus_etl/fhir/fhir_client.py         | 100%     | 🟢     |
| cumulus_etl/loaders/fhir/bulk_export.py | 100%     | 🟢     |
| TOTAL                                   | 100%     | 🟢     |

updated for commit: 58bfa03 by action🐍

mikix force-pushed the mikix/bulk-retry branch 2 times, most recently from 532ae18 to 9944b71 (August 7, 2024 12:54)
mikix force-pushed the mikix/bulk-retry branch from 9944b71 to 58bfa03 (August 7, 2024 13:00)
mikix merged commit d5662b9 into main on Aug 7, 2024 (3 checks passed)
mikix deleted the mikix/bulk-retry branch (August 7, 2024 13:21)
Development

Successfully merging this pull request may close these issues:

  • Make bulk exporting more robust to transient errors

2 participants