TCP health checks should be enabled on CAPVCD LBs #2711

vxav · 2023-08-09T08:54:05Z

By default, CAPVCD creates the load balancer for the Kubernetes API with a TCP health check enabled, but the CPI does not do so for Services of type LoadBalancer.

This is an issue because we only run 2 NGINX controllers and the svc is set to externalTrafficPolicy: Local, so customers were seeing failed requests. This was fixed by enabling the TCP health check on the LB.
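To illustrate why the missing health check hurts in this setup, here is a rough sketch. It assumes the LB spreads requests evenly over all nodes in its pool and that, with externalTrafficPolicy: Local, only nodes hosting an ingress controller pod answer. The function name and the node count of 5 are made up for illustration; only the 2 controllers come from the setup described above.

```python
def reachable_fraction(total_nodes: int, nodes_with_endpoint: int, health_check: bool) -> float:
    """Fraction of requests that land on a node hosting a local endpoint.

    Assumption: the LB distributes requests evenly over its pool members.
    """
    if health_check:
        # Members that fail the check are taken out of rotation,
        # so traffic only goes to nodes that can answer.
        return 1.0 if nodes_with_endpoint > 0 else 0.0
    # Without a check, every node stays in rotation, including
    # those with no local endpoint for the Service.
    return nodes_with_endpoint / total_nodes


# Hypothetical cluster: 5 workers, 2 of them running an NGINX controller.
print(reachable_fraction(5, 2, health_check=False))
print(reachable_fraction(5, 2, health_check=True))
```

Under these assumptions, the share of failed requests grows with every node that does not run a controller, which would match the symptom customers reported.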

https://gigantic.slack.com/archives/CE92C4BST/p1691484403096619

vxav · 2023-08-10T07:24:56Z

It is automatically set if protocol=TCP (which it is), and I validated the behaviour with a test service.

https://github.com/vmware/cloud-provider-for-cloud-director/blob/main/pkg/vcdsdk/gateway.go#L779

I created a new test cluster with NGINX and it has the TCP check enabled...

vxav · 2023-08-10T08:56:46Z

It turns out the TCP check is removed when adding or removing nodes (the only actions I tested), and probably whenever the LB is reconciled in general.
https://kubernetes.slack.com/archives/C04JFT7GDGR/p1691657759485699?thread_ts=1691491800.665469&cid=C04JFT7GDGR

I created an issue upstream: vmware/cloud-provider-for-cloud-director#281

Arun assigned it to someone.

vxav · 2023-11-09T20:11:48Z

We need to test it with CPI 1.4.0
Edit: same behaviour with 1.4.1

anvddriesch · 2024-07-17T07:07:03Z

I have a hypothesis for the bug where health checks are removed from LBs in VCD.
It seems that every time we go through the update, health monitor is set to nil unless a protocol is specified: https://github.com/vmware/cloud-provider-for-cloud-director/blob/2d6e9efaf23df037d4cb1d58484bf04b3f2b8935/pkg/vcdsdk/gateway.go#L938-L941
So it seems likely that this is where the removal could theoretically happen.
I checked and originally the protocol comes from here where we map ports: https://github.com/vmware/cloud-provider-for-cloud-director/blob/2d6e9efaf23df037d4cb1d58484bf04b3f2b8935/pkg/ccm/loadbalancer.go#L166-L168
Interestingly the value is not from the protocol field (default TCP) but from the appProtocol field which does not have a default and can be nil.
I need to try and create a cluster with a loadbalancer service to see whether something happens to that value.
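The difference between the two fields can be sketched as follows. This is a simplified model and not the CPI code: `ServicePort` and `health_monitor_from` are hypothetical names, and the "nil unless a value is set" behaviour is the hypothesis stated above, not a confirmed reading of gateway.go.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServicePort:
    """Simplified stand-in for a Kubernetes Service port entry."""
    name: str
    port: int
    protocol: str = "TCP"               # Kubernetes defaults this field to TCP
    app_protocol: Optional[str] = None  # no default; may be unset


def health_monitor_from(protocol_value: Optional[str]) -> Optional[dict]:
    """Hypothetical update path: no protocol value means no health monitor."""
    if not protocol_value:
        return None
    return {"type": protocol_value}


port = ServicePort(name="http", port=80)

# Reading from appProtocol: the field is unset, so the monitor would be dropped.
print(health_monitor_from(port.app_protocol))

# Reading from protocol: the default is always present.
print(health_monitor_from(port.protocol))
```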

vxav · 2024-07-17T09:24:54Z

Nice find! That looks promising 🙂. Let me know if you need a test manifest to create a cluster and I can show you what it looks like in VCD.

anvddriesch · 2024-07-17T13:59:53Z

Alright so rather than the appProtocol not being set at all and thus resulting in removal of the health monitor, it looks a little different.
It appears that the TCP health check is removed during upgrade because TCP will never be set here https://github.com/vmware/cloud-provider-for-cloud-director/blob/2d6e9efaf23df037d4cb1d58484bf04b3f2b8935/pkg/vcdsdk/gateway.go#L940
We already know this value comes from the appProtocol and this one is identical to the name of the port, so for the http one it is HTTP and for the https one it is HTTPS.
We seem to replace the existing TCP health check for this reason. In order to fix it we would need to set the value to protocol instead. However, we need to make sure that this does not break something else.

anvddriesch · 2024-07-18T09:05:09Z

I was a bit confused that the whole thing is removed and not replaced by HTTP/HTTPS type when testing. However, the reason for this is here: https://github.com/vmware/cloud-provider-for-cloud-director/blob/2d6e9efaf23df037d4cb1d58484bf04b3f2b8935/pkg/vcdsdk/gateway.go#L751-L752
Before the pool is formed we check if the HM type is TCP and if it is not, it will be discarded. So this whole thing can never work.
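Put together, the two steps can be modelled roughly like this. It is a sketch of the behaviour described in this thread, not the upstream implementation: `build_monitor` and `monitor_for_pool` are hypothetical names, and the monitor is reduced to a plain dict.

```python
from typing import Optional


def build_monitor(protocol_value: Optional[str]) -> Optional[dict]:
    # Hypothetical update step: a monitor is only created when a value is given.
    if not protocol_value:
        return None
    return {"type": protocol_value}


def monitor_for_pool(monitor: Optional[dict]) -> Optional[dict]:
    # Hypothetical pool step: any monitor that is not of type TCP is discarded.
    if monitor is not None and monitor["type"] == "TCP":
        return monitor
    return None


# Values as they would arrive from appProtocol (HTTP, HTTPS)
# versus from the protocol field (TCP).
for value in ("HTTP", "HTTPS", "TCP"):
    print(value, "->", monitor_for_pool(build_monitor(value)))
```

In this model, a value of HTTP or HTTPS never produces a monitor on the pool, which is consistent with the check disappearing entirely instead of being replaced by an HTTP/HTTPS one.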

Now the interesting twist is that trying to change the appProtocol in the service to tcp did not result in a success, either. The monitor was still removed. So I need to check further.

anvddriesch · 2024-07-22T07:27:47Z

I deployed this fix, giantswarm/cloud-provider-for-cloud-director#4, and it appears that TCP health checks are no longer removed when a machine deployment is scaled.
It's a small change so it may be accepted upstream.
Edit: vmware/cloud-provider-for-cloud-director#376

vxav · 2024-07-25T12:09:35Z

Very nice 💪
Can we update the controller in glasgow to test it?

anvddriesch · 2024-07-25T13:19:03Z

Yeah sure :) do we need to do something to update it or will it be rolled out with the new release?

vxav · 2024-07-25T13:40:47Z

I think the easiest is to change the image on the deployment because otherwise we need to create a branch of the app collection and use that branch which is a bit complex for this.

anvddriesch · 2024-07-29T07:08:54Z

I have edited the deployment in glasgow.
The logs look okay. However, for some reason the status for the ingress nginx service loadbalancer is empty.
I'll try to find out how to check for this on the infrastructure side.

vxav · 2024-07-29T07:28:30Z

Do you not need to update the cluster app to push the cpi change? https://github.com/giantswarm/giantswarm-management-clusters/blob/d81b3db2878084916419cb397732919982e13794/management-clusters/glasgow/cluster-app-manifests.yaml#L113

anvddriesch · 2024-07-29T14:02:01Z

Good news: I scaled glasgow up and that was enough for the health monitors to come back :)

anvddriesch · 2024-08-01T10:06:59Z

Closing since it's fixed on our side.

vxav added the team/rocket Team Rocket label Aug 9, 2023

vxav self-assigned this Aug 14, 2023

teemow added this to Roadmap Nov 20, 2023

teemow moved this to Blocked / Waiting ⛔️ in Roadmap Nov 20, 2023

gawertm moved this from Blocked / Waiting ⛔️ to Up Next ➡️ in Roadmap May 14, 2024

gawertm moved this from Up Next ➡️ to Backlog 📦 in Roadmap Jul 11, 2024

gawertm assigned anvddriesch Jul 15, 2024

gawertm moved this from Backlog 📦 to Up Next ➡️ in Roadmap Jul 15, 2024

architectbot added the team/bigmac Team BigMac label Jul 15, 2024

anvddriesch moved this from Up Next ➡️ to In Progress ⛏️ in Roadmap Jul 16, 2024

anvddriesch moved this from In Progress ⛏️ to Blocked / Waiting ⛔️ in Roadmap Jul 31, 2024

anvddriesch closed this as completed Aug 1, 2024

github-project-automation bot moved this from Blocked / Waiting ⛔️ to Done ✅ in Roadmap Aug 1, 2024
