Community Note
Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request. Searching for pre-existing feature requests helps us consolidate datapoints for identical requirements into a single place, thank you!
Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request.
If you are interested in working on this issue or have submitted a pull request, please leave a comment.
Overview of the Issue
When attempting to test/implement Consul on Kubernetes failover for multi-cluster, AdminPartition-enabled clusters, two issues were uncovered regarding the failover behavior of a cluster that participates in both an AdminPartition and additional peered cluster memberships:
(Undocumented) When using the SamenessGroup CRD with the spec.includeLocal: true option (and not listing the local partition as a member in spec.members), there is a conflict in outbound listener cluster creation due to the following error:
error adding listener '127.0.0.1:15001': filter chain '' has the same matching rules defined as ''
This is currently undocumented and difficult to troubleshoot when configuring either of the two available options for sameness group failover.
Workaround:
Do not use the spec.includeLocal option; instead, explicitly list the local partition in the list of members available for failover. This allows the outbound listener clusters to be created, as there are no longer any conflicts between the listener filter_chain_match entries that get generated.
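As an illustration of the workaround, a minimal sketch of a SamenessGroup CRD that lists the local partition explicitly instead of setting includeLocal. The group name and the partition/peer names here are assumptions for this example, not values from the report:

apiVersion: consul.hashicorp.com/v1alpha1
kind: SamenessGroup
metadata:
  name: default-sameness        # hypothetical name
spec:
  defaultForFailover: true
  # includeLocal is intentionally omitted; the local partition is listed
  # explicitly as the first member instead.
  members:
    - partition: default        # local partition (assumed)
    - partition: dev            # local admin partition (assumed)
    - peer: dc2                 # peered cluster (assumed)
    - peer: dc3                 # peered cluster (assumed)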
(Bug) With ACLs enabled, the local cluster mesh-gateway is unable to create the required AdminPartition failover cluster for a given service because the default ACL policy permissions are lacking. The mesh-gateway ACL policy can be corrected to grant the permissions needed for local, non-default admin partition failover to work properly.
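A minimal sketch, assuming Consul Enterprise ACL syntax, of the kind of partition-scoped read rules such a policy could include; the exact rules and the mesh-gateway defaults shown here are assumptions for illustration, not the policy generated by consul-k8s or the policy from this report:

mesh = "write"
peering = "read"

# Allow the mesh gateway to discover services and nodes in every local
# partition and namespace so it can build failover clusters for
# sameness-group members in non-default admin partitions (assumed requirement).
partition_prefix "" {
  peering = "read"
  namespace_prefix "" {
    service_prefix "" {
      policy = "read"
    }
    node_prefix "" {
      policy = "read"
    }
  }
}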
Reproduction Steps
Deploy multi-cluster testing lab: https://github.com/hashicorp-support/consul-k3d-multicluster
git clone https://github.com/hashicorp-support/consul-k3d-multicluster.git
make install-release
make fake-services
make sameness-groups
Test proper failover by scaling the backend service down in dc1: dc1 -> dev (partition) -> dc2 (peer) -> dc3 (peer)
Note: Failover tests will fail as we're using the ProxyDefaults prioritizeByLocality.mode: failover option to have services prefer local instances over remote services (a sketch of this ProxyDefaults setting follows these steps).
k --context k3d-c1 -n consul scale deploy/backend --replicas=0
The failover test fails because the dc1 cluster thinks it has a healthy failover target in the dc1 dev admin partition; however, the mesh-gateway cannot properly form the dev partition's cluster.
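For reference, a minimal sketch of a ProxyDefaults CRD carrying the prioritizeByLocality setting mentioned in the note above; the meshGateway mode shown is an assumption for this example:

apiVersion: consul.hashicorp.com/v1alpha1
kind: ProxyDefaults
metadata:
  name: global
spec:
  meshGateway:
    mode: local              # assumed gateway mode for this example
  prioritizeByLocality:
    mode: failover           # prefer local instances, then fail over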
Logs
Before correcting the policy, the following entries appear in the upstream service's Envoy (dataplane) trace logs, demonstrating the lack of cluster creation despite knowledge of the failover cluster:
2024-09-05T19:29:14.766Z+00:00 [trace] envoy.filter(25) tls inspector: new connection accepted
2024-09-05T19:29:14.766Z+00:00 [trace] envoy.misc(25) TcpListener accepted 1 new connections.
2024-09-05T19:29:14.766Z+00:00 [trace] envoy.filter(25) onFileEvent: 1
2024-09-05T19:29:14.766Z+00:00 [trace] envoy.filter(25) recv returned: 1001
2024-09-05T19:29:14.766Z+00:00 [trace] envoy.filter(25) tls inspector: recv: 1001
2024-09-05T19:29:14.766Z+00:00 [trace] envoy.filter(25) tls:onALPN(), ALPN: http/1.1
2024-09-05T19:29:14.767Z+00:00 [debug] envoy.filter(25) tls:onServerName(), requestedServerName: backend.consul.dev.dc1.internal-v1.6b7d59e1-70f9-79be-b54b-3c8e979b57d3.consul
2024-09-05T19:29:14.767Z+00:00 [debug] envoy.filter(25) [Tags: "ConnectionId":"332"] new tcp proxy session
2024-09-05T19:29:14.767Z+00:00 [trace] envoy.connection(25) [Tags: "ConnectionId":"332"] readDisable: disable=true disable_count=0 state=0 buffer_length=0
2024-09-05T19:29:14.767Z+00:00 [trace] envoy.filter(25) [Tags: "ConnectionId":"332"] sni_cluster: new connection with server name backend.consul.dev.dc1.internal-v1.6b7d59e1-70f9-79be-b54b-3c8e979b57d3.consul
2024-09-05T19:29:14.767Z+00:00 [debug] envoy.filter(25) [Tags: "ConnectionId":"332"] Cluster not found backend.consul.dev.dc1.internal-v1.6b7d59e1-70f9-79be-b54b-3c8e979b57d3.consul and no on demand cluster set.
Expected behavior
The cluster should form for local partitions trusted as defined by the SamenessGroup configuration, which requires some change to the ACL policy created for the Mesh Gateway when admin partitions are enabled.
Mesh Gateway Cluster for Failover AdminPartition Created:
Environment details
Tested on a local k3d multicluster environment with:
x4 k3d Clusters on Kubernetes v1.29.6
Consul Enterprise + Consul K8s/DP
hashicorp/consul-enterprise: 1.18.2-ent
hashicorp/consul-k8s-control-plane: 1.4.6
hashicorp/consul-dataplane: 1.4.3
Additional Context
This Issue/Bug Request is to accomplish two main things:
Correct the Mesh Gateway ACL policy implementation via consul-k8s-control-plane so that when AdminPartitions are enabled, proper policies are in place to account for multi-cluster service-mesh failover scenarios.
Update the documentation surrounding the configuration of cluster peering failover scenarios and best practices for implementing them.