cc.api_post_start_healthcheck_timeout_in_seconds doesn't seem to matter #230

sethboyles · 2022-03-08T23:08:04Z

Issue

The cc.api_post_start_healthcheck_timeout_in_seconds field doesn't seem to have any effect.

Context

The cc.api_post_start_healthcheck_timeout_in_seconds config field seems to imply that post-start will fail if CCNG fails to report healthy within that time frame:

capi-release/jobs/cloud_controller_ng/spec

Lines 349 to 351 in 6f0f64c

    
    cc.api_post_start_healthcheck_timeout_in_seconds:
      default: 60
      description: "Maximum time (in seconds) for cloud_controller_ng to report healthy"

However, we pushed a change to CCNG that slept for 10 hours before starting the Thin server (in the runner), and this timeout was not triggered. Instead, 20 minutes passed until both ccng_monit_http_healthcheck and nginx_cc failed.

Task 277 | 22:03:30 | L starting jobs: api/abf448d3-74e1-4086-80c2-130814373e14 (0) (canary)
Task 277 | 22:03:57 | Updating instance scheduler: scheduler/7d611e5a-1ada-41a4-b811-49b5dbcb2b2f (0) (canary) (00:02:09)
Task 277 | 22:23:32 | Updating instance api: api/abf448d3-74e1-4086-80c2-130814373e14 (0) (canary) (00:21:44)
                    L Error: 'api/abf448d3-74e1-4086-80c2-130814373e14 (0)' is not running after update. Review logs for failed jobs: ccng_monit_http_healthcheck, nginx_cc
Task 277 | 22:23:32 | Error: 'api/abf448d3-74e1-4086-80c2-130814373e14 (0)' is not running after update. Review logs for failed jobs: ccng_monit_http_healthcheck, nginx_cc

So it seems this check in the post start script:

capi-release/jobs/cloud_controller_ng/templates/post-start.sh.erb

Line 71 in 6f0f64c

    
    wait_for_server_to_become_healthy "https://<%= discover_external_ip %>:${PORT}/healthz" "<%= p("cc.api_post_start_healthcheck_timeout_in_seconds") %>"

doesn't seem to matter anymore.
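For context, a timeout-bounded health poll of this kind can be sketched roughly as below. This is only an illustration of the general pattern, not the actual wait_for_server_to_become_healthy helper from capi-release: the function name wait_until_healthy, the one-second polling interval, and the HOST variable in the usage comment are all assumptions made for the example.

```shell
# Hypothetical sketch, NOT the real capi-release helper.
# Usage: wait_until_healthy <timeout_seconds> <check command...>
# Re-runs the check command about once per second until it succeeds
# or the timeout elapses. Returns 0 on success, 1 on timeout.
wait_until_healthy() {
  timeout_seconds="$1"
  shift
  deadline=$(( $(date +%s) + timeout_seconds ))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    # Run the caller-supplied check; discard its output.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep 1
  done
  return 1
}

# Example call (HOST and PORT are placeholders for this sketch):
#   wait_until_healthy 60 curl --fail --silent --insecure "https://${HOST}:${PORT}/healthz"
```

With a loop like this, a server that never comes up would make post-start fail roughly timeout_seconds after the script begins, which is the behavior we expected to see, provided the deploy actually reaches the post-start step.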

However, when trying this experiment on older versions of CAPI, we observe that the deploy fails in the post-start script after the configured time.

We believe this regression was introduced in this PR: https://github.com/cloudfoundry/capi-release/pull/195/files, which reconfigured the monit dependencies between the processes.

What we are not sure about is whether this regression reintroduces the issue that prompted the post-start check in the first place: #125.

Steps to Reproduce

Add a long sleep to this line: https://github.com/cloudfoundry/cloud_controller_ng/blob/main/lib/cloud_controller/runner.rb#L88 and deploy.

Expected result

The post-start script fails after the configured cc.api_post_start_healthcheck_timeout_in_seconds value.

Current result

Deploy fails after 20 minutes.

Possible Fix

Not sure. Is this something we need to address?

philippthun · 2022-03-21T12:37:34Z

To me, the check in the post-start script seems to have been a workaround for the wrong dependency chain. Now the update lifecycle fails at step 4 (monit start), rendering the workaround (i.e. the check in step 5) obsolete. So from my point of view, the post-start script could be adjusted and the config property removed.

cf-gitbot added the unscheduled label Mar 8, 2022

sethboyles mentioned this issue Mar 31, 2022

Use post_bbr_healthcheck_timeout_in_seconds for backup and restore #235

Merged

3 tasks
