
Problem with application status #195

Open
herodes1991 opened this issue Nov 15, 2022 · 5 comments

@herodes1991
Contributor

We just detected a problem with the application status result. We have a user who set a smaller readiness initial delay than the liveness one, and the app constantly finishes in a FAILED status. What do you think about using the greater value between readiness and liveness?

`app_spec.health_checks.readiness.initial_delay_seconds`
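
For context, the user's healthchecks look roughly like this (a sketch assuming fiaas config v3; the probe paths are made up):

```yaml
version: 3
healthchecks:
  liveness:
    http:
      path: /healthz          # made-up path
    initial_delay_seconds: 30 # larger than the readiness delay
  readiness:
    http:
      path: /ready            # made-up path
    initial_delay_seconds: 3  # smaller than the liveness delay
```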

@oyvindio
Member

> We just detected a problem with the application status result. We have a user who set a smaller readiness initial delay than the liveness one, and the app constantly finishes in a FAILED status.

I think some more details might be necessary to understand what the problem is in this case. Does the deployment rollout complete successfully at the Kubernetes level? How are the healthchecks for this application configured? Is it possible to create an example application configuration/application resource which can reproduce the issue?

If the deployment rollout does not complete successfully, the healthchecks/readiness probe for the application might need to be adjusted. If the rollout completes successfully but takes longer than the ReadyCheck timeout because of external factors such as image pull or pod scheduling delays, one option could be to increase the ready-check-timeout-multiplier config flag. Note that this will increase the effective ReadyCheck timeout for all applications deployed by the fiaas-deploy-daemon instance.
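
To illustrate the trade-off, here is the arithmetic as I understand it. The formula below is my assumption based on the numbers that come up later in this thread (timeout = readiness initial delay × multiplier × replicas), not the actual fiaas-deploy-daemon code:

```python
# Illustrative only; the formula is an assumption based on the numbers
# discussed in this thread, not the actual fiaas-deploy-daemon code.
readiness_initial_delay = 3  # seconds, per the application in this issue
replicas = 1

for multiplier in (30, 60, 90):
    timeout = readiness_initial_delay * multiplier * replicas
    print(f"multiplier={multiplier} -> effective ReadyCheck timeout {timeout}s")
```

Raising the multiplier lengthens the timeout for this app, but also for every other app deployed by the same instance.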

> What do you think about using the greater value between readiness and liveness?

ReadyCheck uses initial_delay_seconds from the readiness probe because during a deployment rollout it is necessary to wait for the readiness probes for each pod to go to Success to ensure that an appropriate number of pods are available at all times. My understanding is that since the default state of liveness probes is Success, the deployment controller does not wait for the liveness probe initial_delay_seconds during rollout. Liveness probes can only influence the result of a rollout by failing.
ReadyCheck tries to determine whether the deployment rollout was successful within a timeout based on an estimate of how long a rollout might take to complete. If liveness probe initial_delay_seconds isn't a factor in how much time rollouts take in practice, then I'm not sure that using it to calculate the ReadyCheck timeout is an ideal solution.
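
A minimal sketch of the behaviour I'm describing (names and structure are illustrative, not the actual fiaas-deploy-daemon implementation):

```python
# Sketch of the ReadyCheck behaviour described above; illustrative only.
import time
from datetime import datetime, timedelta

def ready_check(replicas, readiness_initial_delay_seconds, timeout_multiplier,
                rollout_complete):
    # The deadline is derived from the readiness probe's initial delay only;
    # the liveness probe's initial delay plays no part in rollout duration.
    seconds = replicas * readiness_initial_delay_seconds * timeout_multiplier
    deadline = datetime.now() + timedelta(seconds=seconds)
    while datetime.now() < deadline:
        if rollout_complete():  # all pods' readiness probes report Success
            return "SUCCESS"
        time.sleep(5)
    return "FAILED"
```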

@herodes1991
Contributor Author

herodes1991 commented Nov 17, 2022

Yes, the user set initial_delay_seconds for the liveness probe to 30 (with some retrying) and for the readiness probe to 3. We have configured ready-check-timeout-multiplier as 30 for the whole namespace. The application was configured with 1 replica.
What happened?
The app needs more than 90 seconds to become healthy (approximately 120 seconds), but the fiaas application status becomes FAILED after 90 seconds (30 * 3) 😔

@mortenlj
Member

> The app needs more than 90 seconds to become healthy (approximately 120 minutes)

Is this a typo, or does the app need 120 minutes to become ready to serve requests?

If so, you need to tell them to go back and build better apps. That kind of thing is not something that can or should be solved in the platform; it needs to be solved in application code.

@herodes1991
Contributor Author

herodes1991 commented Nov 17, 2022

Oh, yes, it was a typo, it was 120 seconds 😅

@oyvindio
Member

> Yes, the user set initial_delay_seconds for the liveness probe to 30 (with some retrying) and for the readiness probe to 3. We have configured ready-check-timeout-multiplier as 30 for the whole namespace. The application was configured with 1 replica.
> What happened?
> The app needs more than 90 seconds to become healthy (approximately 120 seconds), but the fiaas application status becomes FAILED after 90 seconds (30 * 3) 😔

If you mean that it takes approximately 120 seconds before the readiness probe is successful (often or always), then it sounds to me like the application would benefit from a longer initial_delay_seconds on the readiness probe to accommodate the time it actually takes to become ready. This should allow more time for the rollout to complete at the Kubernetes level, and would also increase the effective timeout in ReadyCheck. Does that seem like a reasonable solution, or is there a reason why the readiness initial delay can't be increased?
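
For example (again just a sketch, assuming fiaas config v3; the path is made up), something along these lines should cover the ~120 seconds the app actually needs:

```yaml
healthchecks:
  readiness:
    http:
      path: /ready             # made-up path
    initial_delay_seconds: 120 # roughly the time the app takes to become ready
```

With ready-check-timeout-multiplier still at 30, this would also raise the effective ReadyCheck timeout well above the observed startup time.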
