
Retry bulk export requests that yield 5xx server errors #336

Merged: mikix merged 1 commit into main from mikix/bulk-retry on Aug 7, 2024

Conversation

@mikix (Contributor) commented on Aug 6, 2024

Retry each request up to four times (for a total of five requests), with an exponential backoff of 1, 2, 4, and 8 minutes (15 minutes of waiting in total).

This should hopefully help when dealing with flaky EHRs.

Fixes: #334
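To spell out the schedule (a quick illustrative calculation, not code from this PR): the first request goes out immediately, and each retry waits twice as long as the previous one.

```python
# Illustrative only: print the retry schedule described above.
delays_minutes = [1, 2, 4, 8]  # wait before attempts 2 through 5
for attempt, wait in enumerate(delays_minutes, start=2):
    print(f"attempt {attempt}: after waiting {wait} minute(s)")
print(f"total waiting across 5 attempts: {sum(delays_minutes)} minutes")
```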

Checklist

  • Consider if documentation (like in docs/) needs to be updated
  • Consider if tests should be added

Comment on lines 262 to 266:

- client_base_url = re.sub(r"/Patient/?$", "/", client_base_url)
- client_base_url = re.sub(r"/Group/[^/]+/?$", "/", client_base_url)
+ client_base_url = re.sub(r"/Patient($|/.*)", "/", client_base_url)
+ client_base_url = re.sub(r"/Group/[^/]+($|/.*)", "/", client_base_url)
@mikix (Author):

Unrelated, but now that we support input URLs with arguments like /$export?_type=xxx, I discovered that this regex, which tries to find the server root for authentication purposes, needed to handle that better.
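For context, a minimal standalone sketch of what the updated substitutions do (the helper name and example URLs are made up; only the two regexes come from the diff): they strip a trailing Patient or Group export path, including any /$export?... suffix, back down to the server root used for authentication.

```python
import re

def strip_export_path(url: str) -> str:
    # Hypothetical helper mirroring the two substitutions in the diff above.
    url = re.sub(r"/Patient($|/.*)", "/", url)
    url = re.sub(r"/Group/[^/]+($|/.*)", "/", url)
    return url

# Both of these now reduce to the server root, query arguments and all:
print(strip_export_path("https://ehr.example.com/fhir/Patient/$export?_type=Condition"))
print(strip_export_path("https://ehr.example.com/fhir/Group/my-group/$export?_type=Condition"))
# -> https://ehr.example.com/fhir/
```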

mikix force-pushed the mikix/bulk-retry branch from 9882449 to baf297a (August 6, 2024 19:43)
mikix marked this pull request as ready for review (August 6, 2024 19:43)
mikix force-pushed the mikix/bulk-retry branch from baf297a to 2beccef (August 6, 2024 19:45)
# These times are extremely generous - partly because we can afford to be
# as a long-running async task and partly because EHR servers seem prone to
# outages that clear up after a bit.
error_retry_minutes = [1, 2, 4, 8] # and then raise
A reviewer (Contributor) commented:

Is there any value in allowing a user to specify the number of retries?

@mikix (Author) replied:

So far, I've just put retrying on the "application" side of the http code - i.e. not directly in FhirClient, but rather in the BulkExporter code.

Partly because I wanted to give myself the freedom to be a bit inflexible with it for now. Like... I dunno, is the user really going to want to specify the number of retries? Feels too nitty-gritty - if after 15 minutes a server still has (recoverable) problems, I'd rather just hear about it and then add another hard-coded retry.

The other odd thing about this retry code is that it's (intentionally) so slow. Most "exponential backoff" code out there is like, "wait 2 seconds, then 4, then 20 seconds" and I'm out here waiting 8 minutes.

So when I was thinking of building flexibility into FhirClient for number of retries and how long, just to use it in one place with weird values, I figured - just build the inflexible version first.

If we want retrying elsewhere in the ETL later, we can push this down a layer into FhirClient, with more flexibility. Does that sound OK?

The reviewer replied:

Yeah, that's perfectly reasonable.


# Calculate how long to wait, with a basic exponential backoff for errors.
if num_errors:
default_delay = error_retry_minutes[num_errors - 1] * 60
A reviewer (Contributor) commented:
Nit: should there be one unified backoff scheme? Should it live in one place?
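For illustration, here is a rough sketch of how the delay table and the calculation above could combine into a retry loop. This is not the actual BulkExporter code: the exception class and function names are invented, and the real code presumably only falls back to this default_delay when the server does not supply its own wait time.

```python
import asyncio

class TransientServerError(Exception):
    """Hypothetical stand-in for a retryable 5xx response."""

# Mirrors the snippet above; everything else in this sketch is illustrative.
error_retry_minutes = [1, 2, 4, 8]  # and then raise

async def request_with_retries(send_request):
    num_errors = 0
    while True:
        try:
            return await send_request()
        except TransientServerError:
            num_errors += 1
            if num_errors > len(error_retry_minutes):
                raise  # fifth attempt failed too; give up
            # Same calculation as in the diff: index into the table, convert to seconds.
            default_delay = error_retry_minutes[num_errors - 1] * 60
            await asyncio.sleep(default_delay)
```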

github-actions bot commented on Aug 6, 2024

☂️ Python Coverage

current status: ✅

Overall Coverage

| Lines | Covered | Coverage | Threshold | Status |
|-------|---------|----------|-----------|--------|
| 3300  | 3206    | 97%      | 97%       | 🟢     |

New Files

No new covered files...

Modified Files

| File                                    | Coverage | Status |
|-----------------------------------------|----------|--------|
| cumulus_etl/fhir/fhir_client.py         | 100%     | 🟢     |
| cumulus_etl/loaders/fhir/bulk_export.py | 100%     | 🟢     |
| TOTAL                                   | 100%     | 🟢     |

updated for commit: 58bfa03 by action🐍

mikix force-pushed the mikix/bulk-retry branch 2 times, most recently from 532ae18 to 9944b71 (August 7, 2024 12:54)
mikix force-pushed the mikix/bulk-retry branch from 9944b71 to 58bfa03 (August 7, 2024 13:00)
mikix merged commit d5662b9 into main on Aug 7, 2024 (3 checks passed)
mikix deleted the mikix/bulk-retry branch (August 7, 2024 13:21)
Development

Successfully merging this pull request may close these issues:

  • Make bulk exporting more robust to transient errors

2 participants