Consul on K8s: Peering Sameness Group Failover with AdminPartitions #4308

natemollica-nm opened this issue Sep 5, 2024 · 0 comments
Labels: type/bug (Something isn't working)

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request. Searching for pre-existing feature requests helps us consolidate datapoints for identical requirements into a single place, thank you!
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.

Overview of the Issue

[Image: k3d multi-cluster lab topology]

When testing Consul on Kubernetes failover across multiple clusters with AdminPartitions enabled, two issues were uncovered in the failover behavior of a cluster that participates in both a local AdminPartition and additional peered clusters:

  1. (Undocumented) When using the SamenessGroup CRD with the spec.includeLocal: true option (and not listing the local partition as a member in spec.members), there is a conflict during outbound listener cluster creation, resulting in the following error:
error adding listener '127.0.0.1:15001': filter chain '' has the same matching rules defined as ''
  • This is currently undocumented and not easy to troubleshoot when SamenessGroup failover is configured with one of the two available options

Workaround:

  • Do not use the spec.includeLocal option; instead, explicitly list the local partition in spec.members alongside the other failover members. This allows the outbound listener clusters to be created, since there are no longer conflicting listener filter_chain_match entries (a sketch of this form follows below).
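A minimal sketch of the explicit-members form, for illustration only. The group, partition, and peer names below are assumptions (they depend on how the partitions and peerings were established) and are not taken from the lab repository:

apiVersion: consul.hashicorp.com/v1alpha1
kind: SamenessGroup
metadata:
  name: backend-failover          # illustrative name
spec:
  defaultForFailover: true
  # includeLocal intentionally omitted; the local partition is listed explicitly instead
  members:
    - partition: default          # local partition (dc1)
    - partition: dev              # local non-default admin partition
    - peer: dc2-default           # peered cluster (peer name assumed)
    - peer: dc3-default           # peered cluster (peer name assumed)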
  1. (Bug) With ACLs enabled, the local cluster's mesh gateway is unable to create the required AdminPartition failover cluster for a given service because the default ACL policy permissions are insufficient. They default to:
mesh = "write"
peering = "read"
partition_prefix "" {
  peering = "read"
}
namespace "default" {
  service "mesh-gateway" {
     policy = "write"
  }
}
namespace_prefix "" {
  node_prefix "" {
  	policy = "read"
  }
  service_prefix "" {
     policy = "read"
  }
}

Workaround:

The mesh-gateway ACL policy can be corrected to the policy below to make failover to a local non-default admin partition work properly:

mesh = "write"
peering = "read"
partition_prefix "" {
  peering = "read"
  namespace_prefix "" {
    node_prefix "" {
      policy = "read"
    }
    service_prefix "" {
      policy = "read"
    }
  }
}
namespace "default" {
  service "mesh-gateway" {
    policy = "write"
  }
}

Summary of required policy changes

--- non-working-policy.hcl	2024-09-05 13:02:16
+++ working-policy.hcl	2024-09-05 13:03:42
@@ -2,17 +2,17 @@
 peering = "read"
 partition_prefix "" {
   peering = "read"
+  namespace_prefix "" {
+    node_prefix "" {
+      policy = "read"
+    }
+    service_prefix "" {
+      policy = "read"
+    }
+  }
 }
 namespace "default" {
   service "mesh-gateway" {
     policy = "write"
   }
 }
-namespace_prefix "" {
-  node_prefix "" {
-    policy = "read"
-  }
-  service_prefix "" {
-    policy = "read"
-  }
-}
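Until consul-k8s-control-plane generates the corrected rules itself, the policy can be patched out-of-band with the Consul CLI. A rough sketch follows; the policy name is an assumption (consul-k8s names the mesh gateway policy per installation), so list the policies first and substitute the real name:

# Find the mesh gateway ACL policy created by consul-k8s (exact name varies by install)
consul acl policy list -format=json | jq -r '.[].Name'

# Overwrite its rules with the corrected policy above (saved locally as working-policy.hcl)
consul acl policy update -name "mesh-gateway-policy" -rules @working-policy.hcl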

Reproduction Steps

Deploy multi-cluster testing lab: https://github.com/hashicorp-support/consul-k3d-multicluster

  • Clone repository: git clone https://github.com/hashicorp-support/consul-k3d-multicluster.git
  • Run: make install-release
  • Run: make fake-services
  • Run: make sameness-groups

Test proper failover by scaling the backend service down in dc1; the expected failover order is dc1 -> dev (partition) -> dc2 (peer) -> dc3 (peer).

Note: The failover tests will fail because we're using the ProxyDefaults prioritizeByLocality.mode: failover option to have services prefer local services over remote services.

kubectl --context k3d-c1 -n consul scale deploy/backend --replicas=0

The failover test fails because the dc1 cluster believes it has a healthy failover target in the dc1 dev admin partition; however, the mesh gateway cannot properly form the dev partition's cluster.
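One way to see where traffic actually lands after the scale-down is to query the downstream fake-service and inspect its upstream_calls output. The deployment name and port below are assumptions about the lab manifests; adjust them to match the repository:

# Port-forward the downstream fake-service (name and port assumed) and check which upstream answers
kubectl --context k3d-c1 -n consul port-forward deploy/frontend 9090:9090 &
curl -s localhost:9090 | jq '.upstream_calls'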

Logs

Before correcting the ACL policy, the upstream service's Envoy (dataplane) trace logs show the following, demonstrating that Envoy knows the failover cluster name but the cluster was never created:

2024-09-05T19:29:14.766Z+00:00 [trace] envoy.filter(25) tls inspector: new connection accepted
2024-09-05T19:29:14.766Z+00:00 [trace] envoy.misc(25) TcpListener accepted 1 new connections.
2024-09-05T19:29:14.766Z+00:00 [trace] envoy.filter(25) onFileEvent: 1
2024-09-05T19:29:14.766Z+00:00 [trace] envoy.filter(25) recv returned: 1001
2024-09-05T19:29:14.766Z+00:00 [trace] envoy.filter(25) tls inspector: recv: 1001
2024-09-05T19:29:14.766Z+00:00 [trace] envoy.filter(25) tls:onALPN(), ALPN: http/1.1
2024-09-05T19:29:14.767Z+00:00 [debug] envoy.filter(25) tls:onServerName(), requestedServerName: backend.consul.dev.dc1.internal-v1.6b7d59e1-70f9-79be-b54b-3c8e979b57d3.consul
2024-09-05T19:29:14.767Z+00:00 [debug] envoy.filter(25) [Tags: "ConnectionId":"332"] new tcp proxy session
2024-09-05T19:29:14.767Z+00:00 [trace] envoy.connection(25) [Tags: "ConnectionId":"332"] readDisable: disable=true disable_count=0 state=0 buffer_length=0
2024-09-05T19:29:14.767Z+00:00 [trace] envoy.filter(25) [Tags: "ConnectionId":"332"] sni_cluster: new connection with server name backend.consul.dev.dc1.internal-v1.6b7d59e1-70f9-79be-b54b-3c8e979b57d3.consul
2024-09-05T19:29:14.767Z+00:00 [debug] envoy.filter(25) [Tags: "ConnectionId":"332"] Cluster not found backend.consul.dev.dc1.internal-v1.6b7d59e1-70f9-79be-b54b-3c8e979b57d3.consul and no on demand cluster set.

Expected behavior

The cluster should be created for local partitions trusted as defined by the SamenessGroup configuration, which requires a change to the ACL policy created for the mesh gateway when admin partitions are enabled.
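The listing below appears to be output from the mesh gateway's Envoy admin /clusters endpoint. A hedged way to collect it after applying the corrected policy, assuming the default consul-dataplane Envoy admin port (19000) and a mesh gateway deployment named consul-mesh-gateway (both depend on Helm values):

kubectl --context k3d-c1 -n consul port-forward deploy/consul-mesh-gateway 19000:19000 &
curl -s localhost:19000/clusters | grep 'dev.dc1.internal'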

Mesh Gateway Cluster for Failover AdminPartition Created:

dev.dc1.internal-v1.6b7d59e1-70f9-79be-b54b-3c8e979b57d3.consul::observability_name::dev.dc1.internal-v1.6b7d59e1-70f9-79be-b54b-3c8e979b57d3.consul
dev.dc1.internal-v1.6b7d59e1-70f9-79be-b54b-3c8e979b57d3.consul::outlier::success_rate_average::-1
dev.dc1.internal-v1.6b7d59e1-70f9-79be-b54b-3c8e979b57d3.consul::outlier::success_rate_ejection_threshold::-1
dev.dc1.internal-v1.6b7d59e1-70f9-79be-b54b-3c8e979b57d3.consul::outlier::local_origin_success_rate_average::-1
dev.dc1.internal-v1.6b7d59e1-70f9-79be-b54b-3c8e979b57d3.consul::outlier::local_origin_success_rate_ejection_threshold::-1
dev.dc1.internal-v1.6b7d59e1-70f9-79be-b54b-3c8e979b57d3.consul::default_priority::max_connections::1024
dev.dc1.internal-v1.6b7d59e1-70f9-79be-b54b-3c8e979b57d3.consul::default_priority::max_pending_requests::1024
dev.dc1.internal-v1.6b7d59e1-70f9-79be-b54b-3c8e979b57d3.consul::default_priority::max_requests::1024
dev.dc1.internal-v1.6b7d59e1-70f9-79be-b54b-3c8e979b57d3.consul::default_priority::max_retries::3
dev.dc1.internal-v1.6b7d59e1-70f9-79be-b54b-3c8e979b57d3.consul::high_priority::max_connections::1024
dev.dc1.internal-v1.6b7d59e1-70f9-79be-b54b-3c8e979b57d3.consul::high_priority::max_pending_requests::1024
dev.dc1.internal-v1.6b7d59e1-70f9-79be-b54b-3c8e979b57d3.consul::high_priority::max_requests::1024
dev.dc1.internal-v1.6b7d59e1-70f9-79be-b54b-3c8e979b57d3.consul::high_priority::max_retries::3
dev.dc1.internal-v1.6b7d59e1-70f9-79be-b54b-3c8e979b57d3.consul::added_via_api::true
dev.dc1.internal-v1.6b7d59e1-70f9-79be-b54b-3c8e979b57d3.consul::172.19.0.4:30101::cx_active::1
dev.dc1.internal-v1.6b7d59e1-70f9-79be-b54b-3c8e979b57d3.consul::172.19.0.4:30101::cx_connect_fail::0
dev.dc1.internal-v1.6b7d59e1-70f9-79be-b54b-3c8e979b57d3.consul::172.19.0.4:30101::cx_total::1
dev.dc1.internal-v1.6b7d59e1-70f9-79be-b54b-3c8e979b57d3.consul::172.19.0.4:30101::rq_active::1
dev.dc1.internal-v1.6b7d59e1-70f9-79be-b54b-3c8e979b57d3.consul::172.19.0.4:30101::rq_error::0
dev.dc1.internal-v1.6b7d59e1-70f9-79be-b54b-3c8e979b57d3.consul::172.19.0.4:30101::rq_success::0
dev.dc1.internal-v1.6b7d59e1-70f9-79be-b54b-3c8e979b57d3.consul::172.19.0.4:30101::rq_timeout::0
dev.dc1.internal-v1.6b7d59e1-70f9-79be-b54b-3c8e979b57d3.consul::172.19.0.4:30101::rq_total::1
dev.dc1.internal-v1.6b7d59e1-70f9-79be-b54b-3c8e979b57d3.consul::172.19.0.4:30101::hostname::
dev.dc1.internal-v1.6b7d59e1-70f9-79be-b54b-3c8e979b57d3.consul::172.19.0.4:30101::health_flags::healthy
dev.dc1.internal-v1.6b7d59e1-70f9-79be-b54b-3c8e979b57d3.consul::172.19.0.4:30101::weight::1
dev.dc1.internal-v1.6b7d59e1-70f9-79be-b54b-3c8e979b57d3.consul::172.19.0.4:30101::region::
dev.dc1.internal-v1.6b7d59e1-70f9-79be-b54b-3c8e979b57d3.consul::172.19.0.4:30101::zone::
dev.dc1.internal-v1.6b7d59e1-70f9-79be-b54b-3c8e979b57d3.consul::172.19.0.4:30101::sub_zone::
dev.dc1.internal-v1.6b7d59e1-70f9-79be-b54b-3c8e979b57d3.consul::172.19.0.4:30101::canary::false
dev.dc1.internal-v1.6b7d59e1-70f9-79be-b54b-3c8e979b57d3.consul::172.19.0.4:30101::priority::0
dev.dc1.internal-v1.6b7d59e1-70f9-79be-b54b-3c8e979b57d3.consul::172.19.0.4:30101::success_rate::-1
dev.dc1.internal-v1.6b7d59e1-70f9-79be-b54b-3c8e979b57d3.consul::172.19.0.4:30101::local_origin_success_rate::-1

Environment details

Tested on a local k3d multicluster environment with:

  • x4 k3d Clusters on Kubernetes v1.29.6
  • Consul Enterprise + Consul K8s/DP
    • hashicorp/consul-enterprise: 1.18.2-ent
    • hashicorp/consul-k8s-control-plane: 1.4.6
    • hashicorp/consul-dataplane: 1.4.3

Additional Context

This Issue/Bug Request is to accomplish two main things:

  • Correct the mesh gateway ACL policy implementation in consul-k8s-control-plane so that, when AdminPartitions are enabled, proper policies are in place to account for multi-cluster service-mesh failover scenarios
  • Update the documentation surrounding the configuration of Cluster Peering failover scenarios and best practices for implementing them