cc.api_post_start_healthcheck_timeout_in_seconds doesn't seem to matter #230

Open
sethboyles opened this issue Mar 8, 2022 · 1 comment

Comments

@sethboyles
Member

Issue

The cc.api_post_start_healthcheck_timeout_in_seconds field doesn't seem to have any effect.

Context

The cc.api_post_start_healthcheck_timeout_in_seconds config field seems to imply that post-start will fail if CCNG fails to report healthy within that time frame:

cc.api_post_start_healthcheck_timeout_in_seconds:
  default: 60
  description: "Maximum time (in seconds) for cloud_controller_ng to report healthy"

However, we pushed a change to CCNG that slept for 10 hours before starting the Thin server (in the runner), and this timeout was not triggered. Instead, 20 minutes passed until both ccng_monit_http_healthcheck and nginx_cc failed.

Task 277 | 22:03:30 | L starting jobs: api/abf448d3-74e1-4086-80c2-130814373e14 (0) (canary)
Task 277 | 22:03:57 | Updating instance scheduler: scheduler/7d611e5a-1ada-41a4-b811-49b5dbcb2b2f (0) (canary) (00:02:09)
Task 277 | 22:23:32 | Updating instance api: api/abf448d3-74e1-4086-80c2-130814373e14 (0) (canary) (00:21:44)
                    L Error: 'api/abf448d3-74e1-4086-80c2-130814373e14 (0)' is not running after update. Review logs for failed jobs: ccng_monit_http_healthcheck, nginx_cc
Task 277 | 22:23:32 | Error: 'api/abf448d3-74e1-4086-80c2-130814373e14 (0)' is not running after update. Review logs for failed jobs: ccng_monit_http_healthcheck, nginx_cc

So this check in the post-start script:

wait_for_server_to_become_healthy "https://<%= discover_external_ip %>:${PORT}/healthz" "<%= p("cc.api_post_start_healthcheck_timeout_in_seconds") %>"

doesn't seem to matter anymore.
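
For illustration, a helper like wait_for_server_to_become_healthy can be thought of as a polling loop with a deadline. The following is only a hypothetical sketch, not the capi-release implementation: the names wait_for_healthy_sketch and probe_health, the curl flags, and the one-second polling interval are all assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a health-check wait loop; NOT the actual
# capi-release helper. All names and flags here are assumptions.

# Probe the health endpoint once. Kept as a separate function so it
# can be swapped out (for example, in tests).
probe_health() {
  local url="$1"
  # -k: skip TLS verification (assumption), -s: silent,
  # -f: treat HTTP error statuses as failures
  curl -ksf --max-time 5 -o /dev/null "$url"
}

# Poll the URL until it reports healthy or the timeout elapses.
# Returns 0 on success and 1 on timeout.
wait_for_healthy_sketch() {
  local url="$1"
  local timeout_seconds="$2"
  local deadline=$(( SECONDS + timeout_seconds ))

  while (( SECONDS < deadline )); do
    if probe_health "$url"; then
      return 0
    fi
    sleep 1
  done

  echo "server at ${url} did not become healthy within ${timeout_seconds}s" >&2
  return 1
}
```

A loop like this can only enforce its timeout if the script containing it actually runs, so if the update fails before post-start is reached, the configured value never comes into play.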

However, when trying this experiment on older versions of CAPI, we observe that the deploy fails in the post-start script after the configured time.

We believe this regression was introduced in this PR: https://github.com/cloudfoundry/capi-release/pull/195/files, which reconfigured the monit dependencies between the processes.

What is not clear to us is whether this regression has reintroduced the issue that prompted the introduction of that post-start check in the first place: #125.

Steps to Reproduce

Add a long sleep at this line: https://github.com/cloudfoundry/cloud_controller_ng/blob/main/lib%2Fcloud_controller%2Frunner.rb#L88 and then deploy.

Expected result

The post-start script fails after the configured cc.api_post_start_healthcheck_timeout_in_seconds value.

Current result

Deploy fails after 20 minutes.

Possible Fix

Not sure. Is this something we need to address?

@philippthun
Member

To me, the check in the post-start script seems to have been a workaround for the wrong dependency chain. Now the update lifecycle fails at step 4 (monit start), rendering the workaround (i.e. the check in step 5) obsolete. So from my point of view, the post-start script could be adjusted and the config property removed.


3 participants