
Problem with application status #195

Open
herodes1991 opened this issue Nov 15, 2022 · 5 comments

@herodes1991
Contributor

We just detected a problem with the application status result. We have a user who set a smaller readiness initial delay than the liveness one, and the app constantly finishes in a FAILED status. What do you think about using the greater value between readiness and liveness?

`app_spec.health_checks.readiness.initial_delay_seconds`
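
For context, the user's healthchecks look roughly like this (a sketch assuming fiaas config v3; the probe paths are made up):

```yaml
version: 3
healthchecks:
  liveness:
    http:
      path: /healthz          # made-up path
    initial_delay_seconds: 30 # larger than the readiness delay
  readiness:
    http:
      path: /ready            # made-up path
    initial_delay_seconds: 3  # smaller than the liveness delay
```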

@oyvindio
Member

> We just detected a problem with the application status result. We have a user who set a smaller readiness initial delay than the liveness one, and the app constantly finishes in a FAILED status.

I think some more details might be necessary to understand what the problem is in this case. Does the deployment rollout complete successfully at the Kubernetes level? How are the healthchecks for this application configured? Is it possible to create an example application configuration/application resource which can reproduce the issue?

If the deployment rollout does not complete successfully, the healthchecks/readiness probe for the application might need to be adjusted. If the rollout completes successfully but takes longer than the ReadyCheck timeout because of external factors such as image pull or pod scheduling delays, one option could be to increase the ready-check-timeout-multiplier config flag. Note that this will increase the effective ReadyCheck timeout for all applications deployed by the fiaas-deploy-daemon instance.
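
To illustrate the trade-off, here is the arithmetic as I understand it. The formula below is my assumption based on the numbers that come up later in this thread (timeout = readiness initial delay × multiplier × replicas), not the actual fiaas-deploy-daemon code:

```python
# Illustrative only; the formula is an assumption based on the numbers
# discussed in this thread, not the actual fiaas-deploy-daemon code.
readiness_initial_delay = 3  # seconds, per the application in this issue
replicas = 1

for multiplier in (30, 60, 90):
    timeout = readiness_initial_delay * multiplier * replicas
    print(f"multiplier={multiplier} -> effective ReadyCheck timeout {timeout}s")
```

Raising the multiplier lengthens the timeout for this app, but also for every other app deployed by the same instance.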

> What do you think about using the greater value between readiness and liveness?

ReadyCheck uses initial_delay_seconds from the readiness probe because during a deployment rollout it is necessary to wait for the readiness probes for each pod to go to Success to ensure that an appropriate number of pods are available at all times. My understanding is that since the default state of liveness probes is Success, the deployment controller does not wait for the liveness probe initial_delay_seconds during rollout. Liveness probes can only influence the result of a rollout by failing.
ReadyCheck tries to determine whether the deployment rollout was successful within a timeout based on an estimate of how long a rollout might take to complete. If liveness probe initial_delay_seconds isn't a factor in how much time rollouts take in practice, then I'm not sure that using it to calculate the ReadyCheck timeout is an ideal solution.
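
A minimal sketch of the behaviour I'm describing (names and structure are illustrative, not the actual fiaas-deploy-daemon implementation):

```python
# Sketch of the ReadyCheck behaviour described above; illustrative only.
import time
from datetime import datetime, timedelta

def ready_check(replicas, readiness_initial_delay_seconds, timeout_multiplier,
                rollout_complete):
    # The deadline is derived from the readiness probe's initial delay only;
    # the liveness probe's initial delay plays no part in rollout duration.
    seconds = replicas * readiness_initial_delay_seconds * timeout_multiplier
    deadline = datetime.now() + timedelta(seconds=seconds)
    while datetime.now() < deadline:
        if rollout_complete():  # all pods' readiness probes report Success
            return "SUCCESS"
        time.sleep(5)
    return "FAILED"
```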

@herodes1991
Contributor Author

herodes1991 commented Nov 17, 2022

Yes, the user set initial_delay_seconds for the liveness probe to 30 (with some retrying) and for the readiness probe to 3. We have configured ready-check-timeout-multiplier as 30 for the whole namespace. The application was configured with 1 replica.
What happened?
The app needs more than 90 seconds to become healthy (approximately 120 seconds), but the fiaas application status becomes FAILED after 90 seconds (30 * 3) 😔

@mortenlj
Member

> The app needs more than 90 seconds to become healthy (approximately 120 minutes)

Is this a typo, or does the app need 120 minutes to become ready to serve requests?

If so, you need to tell them to go back and build better apps. That kind of thing is not something that can or should be solved in the platform; it needs to be solved in application code.

@herodes1991
Contributor Author

herodes1991 commented Nov 17, 2022

Oh, yes, it was a typo, it was 120 seconds 😅

@oyvindio
Member

> Yes, the user set initial_delay_seconds for the liveness probe to 30 (with some retrying) and for the readiness probe to 3. We have configured ready-check-timeout-multiplier as 30 for the whole namespace. The application was configured with 1 replica.
> What happened?
> The app needs more than 90 seconds to become healthy (approximately 120 seconds), but the fiaas application status becomes FAILED after 90 seconds (30 * 3) 😔

If you mean that it takes approximately 120 seconds before the readiness probe is successful (often or always), then it sounds to me like the application would benefit from a longer initial_delay_seconds on the readiness probe to accommodate the time it actually takes to become ready. This should allow more time for the rollout to complete at the Kubernetes level, and would also increase the effective timeout in ReadyCheck. Does that seem like a reasonable solution, or is there a reason why the readiness initial delay can't be increased?
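
For example (again just a sketch, assuming fiaas config v3; the path is made up), something along these lines should cover the ~120 seconds the app actually needs:

```yaml
healthchecks:
  readiness:
    http:
      path: /ready             # made-up path
    initial_delay_seconds: 120 # roughly the time the app takes to become ready
```

With ready-check-timeout-multiplier still at 30, this would also raise the effective ReadyCheck timeout well above the observed startup time.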
