The cc.api_post_start_healthcheck_timeout_in_seconds field doesn't seem to have any effect.
Context
The cc.api_post_start_healthcheck_timeout_in_seconds config field seems to imply that the post-start script will fail if CCNG fails to report healthy within that time frame:
description: "Maximum time (in seconds) for cloud_controller_ng to report healthy"
However, we pushed a change to CCNG that slept for 10 hours before starting the thin server (in the runner), and this timeout was not triggered. Instead, 20 minutes passed before both ccng_monit_http_healthcheck and nginx_cc failed.
Task 277 | 22:03:30 | L starting jobs: api/abf448d3-74e1-4086-80c2-130814373e14 (0) (canary)
Task 277 | 22:03:57 | Updating instance scheduler: scheduler/7d611e5a-1ada-41a4-b811-49b5dbcb2b2f (0) (canary) (00:02:09)
Task 277 | 22:23:32 | Updating instance api: api/abf448d3-74e1-4086-80c2-130814373e14 (0) (canary) (00:21:44)
L Error: 'api/abf448d3-74e1-4086-80c2-130814373e14 (0)' is not running after update. Review logs for failed jobs: ccng_monit_http_healthcheck, nginx_cc
Task 277 | 22:23:32 | Error: 'api/abf448d3-74e1-4086-80c2-130814373e14 (0)' is not running after update. Review logs for failed jobs: ccng_monit_http_healthcheck, nginx_cc
What we are not clear on is whether this regression reintroduced the issue that prompted the post-start check in the first place: #125.
To me, the check in the post-start script seems to have been a workaround for the wrong dependency chain. Now the update lifecycle fails at step 4 (monit start), rendering the workaround (i.e. the check in step 5) obsolete. So from my point of view the post-start script could be adjusted and the config property removed.
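The reasoning in the comment above can be illustrated with a toy model. This is only a sketch: the function names run_update, monit_start_and_wait, and post_start are hypothetical stand-ins, and the model assumes (as the comment describes) that the lifecycle steps run strictly in order and stop at the first failure. It does not reproduce actual BOSH behavior or timings.

```shell
# Toy model: run lifecycle steps in order and stop at the first failure.
# Each argument is the name of a function standing in for one step.
run_update() {
  for step in "$@"; do
    echo "running: $step"
    if ! "$step"; then
      echo "failed at: $step"
      return 1
    fi
  done
}

# Hypothetical stand-in for step 4: processes never reach "running".
monit_start_and_wait() { return 1; }

# Hypothetical stand-in for step 5: the post-start health check.
post_start() { echo "post-start health check"; }

run_update monit_start_and_wait post_start || echo "update aborted before post-start"
```

In this model post_start is never invoked, so any timeout configured inside it cannot be what ends the deploy.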
Issue
The cc.api_post_start_healthcheck_timeout_in_seconds field doesn't seem to have any effect.
Context
The cc.api_post_start_healthcheck_timeout_in_seconds config field seems to imply that the post-start script will fail if CCNG fails to report healthy within that time frame:
capi-release/jobs/cloud_controller_ng/spec, lines 349 to 351 in 6f0f64c
However, we pushed a change to CCNG that slept for 10 hours before starting the thin server (in the runner), and this timeout was not triggered. Instead, 20 minutes passed before both ccng_monit_http_healthcheck and nginx_cc failed.
So it seems this check in the post-start script (capi-release/jobs/cloud_controller_ng/templates/post-start.sh.erb, line 71 in 6f0f64c) doesn't matter anymore.
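For reference, a timeout-bounded health check of the kind the property description suggests might look roughly like the following. This is a minimal sketch and not the actual contents of post-start.sh.erb: the function name wait_for_healthy, the probe command passed as arguments, and the one-second polling interval are all assumptions made for illustration.

```shell
# Hypothetical sketch: poll a probe until it succeeds or the timeout expires.
# Usage: wait_for_healthy TIMEOUT_SECONDS PROBE_COMMAND [ARGS...]
wait_for_healthy() {
  timeout="$1"
  shift
  deadline=$(( $(date +%s) + $timeout ))

  while [ "$(date +%s)" -lt "$deadline" ]; do
    # The probe is any command that exits 0 once the service is healthy.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep 1
  done

  echo "service did not report healthy within ${timeout} seconds" >&2
  return 1
}
```

A check like this can only fail the deploy if the script containing it actually runs, which is the behavior in question here.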
However, trying this experiment on older versions of CAPI, we observe that the deploy fails in the post-start script after the configured time.
We believe this regression was introduced in this PR: https://github.com/cloudfoundry/capi-release/pull/195/files, which reconfigured the monit dependencies between the processes.
What we are not clear on is whether this regression reintroduced the issue that prompted the post-start check in the first place: #125.
Steps to Reproduce
Add a long sleep to this line: https://github.com/cloudfoundry/cloud_controller_ng/blob/main/lib%2Fcloud_controller%2Frunner.rb#L88 and deploy
Expected result
The post-start script to fail after the configured cc.api_post_start_healthcheck_timeout_in_seconds value.
Current result
Deploy fails after 20 minutes.
Possible Fix
Not sure, is this something we need to address?