Retry bulk export requests that yield 5xx server errors #336
Conversation
cumulus_etl/fhir/fhir_client.py (outdated diff)
-    client_base_url = re.sub(r"/Patient/?$", "/", client_base_url)
-    client_base_url = re.sub(r"/Group/[^/]+/?$", "/", client_base_url)
+    client_base_url = re.sub(r"/Patient($|/.*)", "/", client_base_url)
+    client_base_url = re.sub(r"/Group/[^/]+($|/.*)", "/", client_base_url)
Unrelated, but I discovered that now that we support input URLs with arguments like /$export?_type=xxx, this regex, which tries to find the root URL for authentication purposes, needed to handle that better.
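A minimal standalone sketch of what the new patterns do (the helper name and example URL here are illustrative, not from the project's code):

```python
import re

def derive_client_base_url(url: str) -> str:
    """Strip a trailing /Patient or /Group/<id> suffix, plus anything after it."""
    url = re.sub(r"/Patient($|/.*)", "/", url)
    url = re.sub(r"/Group/[^/]+($|/.*)", "/", url)
    return url

# The old patterns (/Patient/?$ and /Group/[^/]+/?$) missed URLs that continue
# past the Group/Patient segment, like a full export URL with arguments:
print(derive_client_base_url("https://ehr.example.com/fhir/Group/my-group/$export?_type=Patient"))
# -> https://ehr.example.com/fhir/
```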
# These times are extremely generous - partly because we can afford to be
# as a long-running async task and partly because EHR servers seem prone to
# outages that clear up after a bit.
error_retry_minutes = [1, 2, 4, 8]  # and then raise
is there any value in allowing a user to specify the number of retries?
So far, I've just put retrying on the "application" side of the http code - i.e. not directly in FhirClient, but rather in the BulkExporter code.
Partly because I wanted to give myself the freedom to be a bit inflexible with it for now. Like... I dunno, is the user really going to want to specify the number of retries? Feels too nitty-gritty - if after 15 minutes a server still has (recoverable) problems, I'd rather just hear about it and then add another hard-coded retry.
The other odd thing about this retry code is that it's (intentionally) so slow. Most "exponential backoff" code out there is like, "wait 2 seconds, then 4, then 20 seconds" and I'm out here waiting 8 minutes.
So when I was thinking of building flexibility into FhirClient for number of retries and how long, just to use it in one place with weird values, I figured - just build the inflexible version first.
If we want retrying elsewhere in the ETL later, we can push this down a layer into FhirClient, with more flexibility. Does that sound OK?
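For context, a rough self-contained sketch of that application-side approach (the function name and httpx usage here are assumptions for illustration, not the actual BulkExporter code):

```python
import asyncio
import httpx

error_retry_minutes = [1, 2, 4, 8]  # and then raise

async def fetch_with_retries(client: httpx.AsyncClient, url: str) -> httpx.Response:
    """Retry 5xx responses on the application side, waiting 1/2/4/8 minutes."""
    num_errors = 0
    while True:
        response = await client.get(url)
        if response.status_code < 500:
            return response
        num_errors += 1
        if num_errors > len(error_retry_minutes):
            response.raise_for_status()  # out of retries - surface the server error
        await asyncio.sleep(error_retry_minutes[num_errors - 1] * 60)
```

That gives a total of five requests over roughly fifteen minutes of waiting before the error is raised.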
yeah that's perfectly reasonable
# Calculate how long to wait, with a basic exponential backoff for errors.
if num_errors:
    default_delay = error_retry_minutes[num_errors - 1] * 60
nit - should there be one unified backoff scheme? Should it live in one place?
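Purely as an illustration of that suggestion, the schedule could live in one small shared helper that every retrying call site consumes (the name here is hypothetical):

```python
from collections.abc import Iterator

def backoff_delays(minutes: tuple[int, ...] = (1, 2, 4, 8)) -> Iterator[float]:
    """Yield each retry delay in seconds; once exhausted, the caller gives up."""
    for wait in minutes:
        yield wait * 60

# Usage sketch: a caller retries once per yielded delay and raises after the
# generator runs out, keeping the schedule itself defined in exactly one place.
```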
☂️ Python Coverage bot: overall coverage report (no new covered files; modified-files table omitted).
Force-pushed from 532ae18 to 9944b71.
Retry each request four times (for a total of five requests) in an exponential backoff of 1, 2, 4, and 8 minutes (total of 15 minutes). This should hopefully help when dealing with flaky EHRs.
Fixes: #334
Checklist
- Consider if documentation (in docs/) needs to be updated