[BACK-2633] soft failure when running dexcom tasks #656

Merged
merged 12 commits into from
Aug 30, 2023

@jh-bate jh-bate commented Aug 22, 2023

Allow "soft failure" for dexcom task runner

  • only implemented for dexcom tasks
  • allow the task to be "retried" three times
  • on the third retry the task either passes or is actually failed and not retried again

FYI context from Darin:

I think, IIRC, what I would recommend is to:
• Add a new dynamic property of the Data field of task.TaskCreate (I think) that captures an "error count" (or "retry count" or similar).
• Only invoke tsk.AppendError if the "error count" exceeds some maximum. Otherwise, just exit that task run, but do not mark the task as failed.
• Add this functionality to fetch.Runner.Run in dexcom/fetch/runner.go at line 130 to allow a number of refresh token failures before truly failing the task.
• You'll probably want to add similar functionality throughout fetch.TaskRunner.Run in dexcom/fetch/runner.go in lines 193-213 to be more tolerant of any/all intermittent Dexcom API-related failures.
• Reset the error count upon a fully successful task run.
This should allow for any transient Dexcom API failures to not cause the task to immediately fail. You'll likely need to play around with the maximum "error count" to find a value that detects true failures without causing a large number of unnecessary requests. Perhaps start at a value of 5.
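The recommendation above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `Task` type, the `retryCountField` key, the `maxRetries` value, and the helpers `errorOrRetry` and `resetRetries` are all hypothetical stand-ins. The sketch assumes that a task is hard-failed only once the count exceeds the maximum, so with a maximum of 3 the fourth consecutive failure fails the task.

```go
package main

import "errors"

// Task is a hypothetical stand-in for the platform task type; only the
// fields needed for this sketch are modeled.
type Task struct {
	Data   map[string]interface{}
	Errors []error
	Failed bool
}

// retryCountField and maxRetries are assumed names and values.
const (
	retryCountField = "retryCount"
	maxRetries      = 3
)

// errTransient is a sample error used to exercise the sketch.
var errTransient = errors.New("transient Dexcom API failure")

// AppendError records the error and marks the task as failed (hard failure).
func (t *Task) AppendError(err error) {
	t.Errors = append(t.Errors, err)
	t.Failed = true
}

// errorOrRetry increments the retry count stored in Data and hard-fails the
// task only once the count exceeds maxRetries. It reports whether the task
// was failed.
func errorOrRetry(t *Task, err error) bool {
	if t.Data == nil {
		t.Data = map[string]interface{}{}
	}
	count, _ := t.Data[retryCountField].(int) // missing or non-int counts as 0
	count++
	t.Data[retryCountField] = count
	if count > maxRetries {
		t.AppendError(err)
		return true
	}
	return false
}

// resetRetries clears the count after a fully successful task run.
func resetRetries(t *Task) {
	delete(t.Data, retryCountField)
}
```

Calling `errorOrRetry` on every failure path and `resetRetries` at the end of a fully successful run keeps the count tied to consecutive failures rather than lifetime failures.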

jh-bate changed the title from "soft failure when running dexcom tasks" to "[BACK-2633] soft failure when running dexcom tasks" Aug 22, 2023
jh-bate marked this pull request as ready for review August 22, 2023 03:30
jh-bate requested a review from tjotala August 22, 2023 03:32
tjotala requested a review from toddkazakov August 22, 2023 14:44
  if taskRunner, tErr := NewTaskRunner(r, tsk); tErr != nil {
-     tsk.AppendError(errors.Wrap(tErr, "unable to create task runner"))
+     ErrorOrRetryTask(tsk, errors.Wrap(tErr, "unable to create task runner"))
Contributor

If this block is executed, it means that there is a bug in the code. It's better not to fail the task here, but to keep retrying (until the bug is eventually fixed).

  } else if tErr = taskRunner.Run(ctx); tErr != nil {
-     tsk.AppendError(errors.Wrap(tErr, "unable to run task runner"))
+     ErrorOrRetryTask(tsk, errors.Wrap(tErr, "unable to run task runner"))
Contributor

I think this is the only place where we should fail the task if retries are exceeded.

  } else if tErr = taskRunner.Run(ctx); tErr != nil {
-     tsk.AppendError(errors.Wrap(tErr, "unable to run task runner"))
+     ErrorOrRetryTask(tsk, errors.Wrap(tErr, "unable to run task runner"))
Contributor

We should log the error here, otherwise we won't know why a task has been retried.

Contributor Author

The error is logged when we fail the task. Originally, not all errors within taskRunner.Run caused a task failure.


  if serverSessionToken, sErr := r.AuthClient().ServerSessionToken(); sErr != nil {
-     tsk.AppendError(errors.Wrap(sErr, "unable to get server session token"))
+     ErrorOrRetryTask(tsk, errors.Wrap(sErr, "unable to get server session token"))
Contributor

We shouldn't fail the task if we can't obtain a session token from our authentication service, otherwise we'll fail all tasks if the auth service is down for whatever reason. Retries should be unlimited too.
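The distinction drawn in the review comments (retry internal failures without limit, count only Dexcom run failures against the retry budget) could be sketched like this. The `failureKind` type, its constants, `maxDexcomRetries`, and `recordFailure` are hypothetical names chosen for this illustration, and the threshold of 3 is an assumption.

```go
package main

// failureKind classifies why a task run did not complete.
type failureKind int

const (
	// internalFailure covers our own bugs or outages, such as being unable
	// to create the task runner or to obtain a server session token.
	internalFailure failureKind = iota
	// dexcomFailure covers errors returned while running the task runner.
	dexcomFailure
)

// maxDexcomRetries is an assumed retry budget.
const maxDexcomRetries = 3

// recordFailure returns the updated retry count and whether the task should
// be hard-failed. Internal failures never consume the budget, so they are
// retried indefinitely; Dexcom failures fail the task once the budget is
// exceeded.
func recordFailure(kind failureKind, retries int) (newRetries int, fail bool) {
	if kind == internalFailure {
		return retries, false
	}
	retries++
	return retries, retries > maxDexcomRetries
}
```

With this split, an outage of the authentication service leaves every task in a retryable state instead of failing all of them at once.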

tjotala removed their request for review August 28, 2023 20:55
jh-bate changed the base branch from dexcom-connection to master August 28, 2023 23:02

jh-bate commented Aug 29, 2023

/deploy qa1 task

tidebot commented Aug 29, 2023

jh-bate updated values.yaml file in qa1

tidebot commented Aug 29, 2023

jh-bate updated flux policies file in qa1

tidebot commented Aug 29, 2023

jh-bate deployed platform dexcom-soft-failure branch to qa1 namespace

if ok {
    count++
    t.Data[dexcomTaskRetryField] = count
}
Contributor

Is it worth resetting this to 1 if it's not an int?
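A sketch of the reset suggested in this comment follows. It treats a missing or non-numeric value as "no previous retries" and returns 1. It also accepts `int32`, `int64`, and `float64`, on the assumption that the stored value may come back as a different numeric type after a round trip through a data store (for example, `encoding/json` decodes numbers into `float64` when the target is `interface{}`). The field value `"retryCount"` and the helper name `nextRetryCount` are hypothetical.

```go
package main

// dexcomTaskRetryField names the key in the task data; the value used here
// is an assumption for this sketch.
const dexcomTaskRetryField = "retryCount"

// nextRetryCount returns the count to store after one more failure. Any
// value that is absent or not a recognized numeric type resets the count
// to 1.
func nextRetryCount(data map[string]interface{}) int {
	switch v := data[dexcomTaskRetryField].(type) {
	case int:
		return v + 1
	case int32:
		return int(v) + 1
	case int64:
		return int(v) + 1
	case float64:
		return int(v) + 1
	default:
		return 1
	}
}
```

The caller would then store the result with `t.Data[dexcomTaskRetryField] = nextRetryCount(t.Data)`, so a corrupted value cannot leave the count stuck.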

jh-bate dismissed toddkazakov’s stale review August 30, 2023 01:11

Changes were made and have now been reviewed by others.

jh-bate commented Aug 30, 2023

/deploy qa2 task

tidebot commented Aug 30, 2023

jh-bate updated values.yaml file in qa2

tidebot commented Aug 30, 2023

jh-bate updated flux policies file in qa2

tidebot commented Aug 30, 2023

jh-bate deployed platform dexcom-soft-failure branch to qa2 namespace

jh-bate mentioned this pull request Aug 30, 2023
jh-bate merged commit a7f9eba into master Aug 30, 2023
3 checks passed
jh-bate deleted the dexcom-soft-failure branch September 18, 2023 23:36
5 participants