Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TCP health checks should be enabled on CAPVCD LBs #2711

Closed
vxav opened this issue Aug 9, 2023 · 15 comments
Closed

TCP health checks should be enabled on CAPVCD LBs #2711

vxav opened this issue Aug 9, 2023 · 15 comments
Assignees
Labels
team/bigmac Team BigMac team/rocket Team Rocket

Comments

@vxav
Copy link

vxav commented Aug 9, 2023

By default CAPVCD creates the load balancer for the kube API with TCP health check enabled but the CPI doesn't for services type load balancer.

This is an issue as we only run 2 NGINX controllers and the svc is set to externalTrafficPolicy: local so the customers was seeing failed requests. This was fixed by enabling TCP health check on the LB.

https://gigantic.slack.com/archives/CE92C4BST/p1691484403096619

@vxav vxav added the team/rocket Team Rocket label Aug 9, 2023
@vxav
Copy link
Author

vxav commented Aug 10, 2023

It is automatically set if protocol=TCP (which it is and I validated the behaviour with a test service.

https://github.com/vmware/cloud-provider-for-cloud-director/blob/main/pkg/vcdsdk/gateway.go#L779

Created a new test cluster with Nginx and it has tcp check enabled...

@vxav
Copy link
Author

vxav commented Aug 10, 2023

Turns out the TCP check is removed when adding/removing nodes (only actions tested) but probably when the LB is reconciled in general.
https://kubernetes.slack.com/archives/C04JFT7GDGR/p1691657759485699?thread_ts=1691491800.665469&cid=C04JFT7GDGR

I create an issue upstream : vmware/cloud-provider-for-cloud-director#281

Arun assigned it to someone.

@vxav vxav self-assigned this Aug 14, 2023
@vxav
Copy link
Author

vxav commented Nov 9, 2023

We need to test it with CPI 1.4.0
Edit: same behaviour with 1.4.1

@teemow teemow added this to Roadmap Nov 20, 2023
@teemow teemow moved this to Blocked / Waiting ⛔️ in Roadmap Nov 20, 2023
@gawertm gawertm moved this from Blocked / Waiting ⛔️ to Up Next ➡️ in Roadmap May 14, 2024
@gawertm gawertm moved this from Up Next ➡️ to Backlog 📦 in Roadmap Jul 11, 2024
@gawertm gawertm moved this from Backlog 📦 to Up Next ➡️ in Roadmap Jul 15, 2024
@architectbot architectbot added the team/bigmac Team BigMac label Jul 15, 2024
@anvddriesch anvddriesch moved this from Up Next ➡️ to In Progress ⛏️ in Roadmap Jul 16, 2024
@anvddriesch
Copy link

anvddriesch commented Jul 17, 2024

I have a hypothesis for the bug where health checks are removed from lbs in VCD.
It seems that every time we go through the update, health monitor is set to nil unless a protocol is specified: https://github.com/vmware/cloud-provider-for-cloud-director/blob/2d6e9efaf23df037d4cb1d58484bf04b3f2b8935/pkg/vcdsdk/gateway.go#L938-L941
So it seems likely that this is where the removal could theoretically happen.
I checked and originally the protocol comes from here where we map ports: https://github.com/vmware/cloud-provider-for-cloud-director/blob/2d6e9efaf23df037d4cb1d58484bf04b3f2b8935/pkg/ccm/loadbalancer.go#L166-L168
Interestingly the value is not from the protocol field (default TCP) but from the appProtocol field which does not have a default and can be nil.
I need to try and create a cluster with a loadbalancer service to see whether something happens to that value.

@vxav
Copy link
Author

vxav commented Jul 17, 2024

Nice find! That looks promising 🙂. Let me know if you need a test manifest to create a cluster and I can show you what it looks like in VCD.

@anvddriesch
Copy link

Alright so rather than the appProtocol not being set at all and thus resulting in removal of the health monitor, it looks a little different.
It appears that the TCP health check is removed during upgrade because TCP will never be set here https://github.com/vmware/cloud-provider-for-cloud-director/blob/2d6e9efaf23df037d4cb1d58484bf04b3f2b8935/pkg/vcdsdk/gateway.go#L940
We already know this value comes from the appProtocol and this one is identical to the name of the port, so for the http one it is HTTP and for the https one it is HTTPS.
We seem to replace the existing TCP health check for this reason. In order to fix it we would need to set the value to protocol instead. However, we need to make sure that this does not break something else.

@anvddriesch
Copy link

I was a bit confused that the whole thing is removed and not replaced by HTTP/HTTPS type when testing. However, the reason for this is here: https://github.com/vmware/cloud-provider-for-cloud-director/blob/2d6e9efaf23df037d4cb1d58484bf04b3f2b8935/pkg/vcdsdk/gateway.go#L751-L752
Before the pool is formed we check if the HM type is TCP and if it is not, it will be discarded. So this whole thing can never work.

Now the interesting twist is that trying to change the appProtocol in the service to tcp did not result in a success, either. The monitor was still removed. So I need to check further.

@anvddriesch
Copy link

anvddriesch commented Jul 22, 2024

I deployed this fix giantswarm/cloud-provider-for-cloud-director#4 and it appears that tcp health checks are not removed anymore when machine deployment is scaled.
It's a small change so it may be accepted upstream.
edit:vmware/cloud-provider-for-cloud-director#376

@vxav
Copy link
Author

vxav commented Jul 25, 2024

Very nice 💪
Can we update the controller ing glasgow to test it ?

@anvddriesch
Copy link

Yeah sure :) do we need to do something to update it or will it be rolled out with the new release?

@vxav
Copy link
Author

vxav commented Jul 25, 2024

I think the easiest is to change the image on the deployment because otherwise we need to create a branch of the app collection and use that branch which is a bit complex for this.

@anvddriesch
Copy link

I have edited the deployment in glasgow.
The logs look okay. However, for some reason the status for the ingress nginx service loadbalancer is empty.
I'll try to find out how to check for this on the infrastructure side.

@vxav
Copy link
Author

vxav commented Jul 29, 2024

@anvddriesch
Copy link

good news - I scaled glasgow up and that was enough for the health monitors to come back :)

@anvddriesch anvddriesch moved this from In Progress ⛏️ to Blocked / Waiting ⛔️ in Roadmap Jul 31, 2024
@anvddriesch
Copy link

closing since its fixed from our side

@github-project-automation github-project-automation bot moved this from Blocked / Waiting ⛔️ to Done ✅ in Roadmap Aug 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
team/bigmac Team BigMac team/rocket Team Rocket
Projects
Archived in project
Development

No branches or pull requests

3 participants