
Service with custom Endpoints is unreachable in some cases. #3205

Open
robdewit opened this issue Feb 17, 2025 · 1 comment
@robdewit

What happened:
When I create a Service with custom Endpoints using an address outside the AWS IP ranges, the Service is unreachable in some cases: some Pods can reach the target and some can't. This behavior is stable. I've found no correlation with nodes; sometimes a Pod on a specific node can connect to the Service while another Pod on the same node can't. If I log in on the node, it can reach the Service both via the Service address and via the target address in the Endpoints resource. A request from a failing Pod directly to the target address also succeeds.

Tracing the traffic of a failing Pod with tcpdump shows only outgoing SYN packets. I noticed that successful and failing Pods use IP addresses from different interfaces, so I suspect the problem is linked to the Pod IP address, the Service address, and the combination of those with the per-interface routing tables.
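The trace above can be reproduced roughly like this on the node hosting a failing Pod; the Pod IP below is a placeholder, not a value from the report:

```shell
# Sketch only: capture traffic for one failing Pod across all node interfaces.
# POD_IP is an assumed placeholder -- replace it with the failing Pod's IP.
POD_IP=10.0.42.17
sudo tcpdump -nn -i any "host ${POD_IP} and tcp port 443"
# A failing Pod shows repeated outgoing SYNs with no SYN-ACK coming back,
# which points at return traffic being dropped or misrouted.
```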

Attach logs
File name: eks_i-01840548d3701fc8d_2025-02-17_1018-UTC_0.7.8.tar.gz

What you expected to happen:
A working connection to the external address specified in the custom Endpoints.

How to reproduce it (as minimally and precisely as possible):
Create a custom Service+Endpoints:

apiVersion: v1
kind: Service
metadata:
  name: TEST
  namespace: ns1
spec:
  ports:
  - name: https
    port: 443
    protocol: TCP
    targetPort: 443
---
apiVersion: v1
kind: Endpoints
metadata:
  name: TEST
  namespace: ns1
subsets:
- addresses:
  - ip: EXTERNAL_ADDRESS
  ports:
  - port: 443
    protocol: TCP
    name: https

Fire up some Deployment or DaemonSet and try to curl the Service. Some Pods will succeed, some will fail.
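A spot-check across the Pods of such a test workload could look like the sketch below. The label `app=curl-test` is an assumption about how the test Deployment is labeled; `TEST` and `ns1` are the placeholders from the manifest above:

```shell
# Sketch only: curl the Service from every pod of an assumed test workload
# (label app=curl-test is hypothetical) and print per-pod results.
for pod in $(kubectl get pods -n ns1 -l app=curl-test -o name); do
  echo -n "${pod}: "
  kubectl exec -n ns1 "${pod#pod/}" -- \
    curl -sk --max-time 5 -o /dev/null -w '%{http_code}\n' \
    https://TEST.ns1.svc.cluster.local/ || echo "FAILED"
done
```

With the reported behavior, the output is a stable mix of successes and timeouts that does not correlate with the node a Pod runs on.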

Anything else we need to know?:

  • The iptables config created by kube-proxy is identical on all nodes.
  • There are no blocking security groups or routing ACLs, which is also shown by the fact that some Pods succeed and connections made directly from the nodes succeed.
  • The longer nodes run, the worse the problem seems to get.
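The first bullet can be checked per node like this; the grep pattern relies on kube-proxy annotating its NAT chains with a `namespace/name` comment, and `ns1/TEST` is the placeholder Service from the manifest above:

```shell
# Sketch only: dump the kube-proxy NAT rules for the Service on each node
# and diff the output between nodes. kube-proxy names its chains
# KUBE-SVC-<hash>/KUBE-SEP-<hash>, so filter by the namespace/name comment.
sudo iptables-save -t nat | grep 'ns1/TEST'
```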

Environment:
CNI v1.19.2-eksbuild.5
AMI amazon-eks-node-al2023-x86_64-standard-1.32-v20250203

@robdewit robdewit added the bug label Feb 17, 2025
@robdewit (Author)

Solving my own issue; the answer was somewhat hidden in the documentation. Leaving this here for others to find:

Apparently, in an environment with an AWS Direct Connect setup where Kubernetes Pods live in private subnets that are also routed through a NAT gateway, we need to run:

kubectl set env daemonset -n kube-system aws-node AWS_VPC_K8S_CNI_EXTERNALSNAT=true

As documented here: https://docs.aws.amazon.com/eks/latest/userguide/external-snat.html
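The fix above can be verified on the aws-node DaemonSet before rolling Pods; the grep is just a sketch against the variable name documented at the link above:

```shell
# Sketch only: confirm the env var landed on the CNI DaemonSet.
kubectl set env daemonset -n kube-system aws-node --list \
  | grep AWS_VPC_K8S_CNI_EXTERNALSNAT
# With external SNAT enabled, Pod traffic to non-VPC destinations keeps the
# Pod's own source IP instead of being SNATed to the primary ENI address,
# so return traffic over Direct Connect can route back to the right interface.
```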

As we have direct routes to our Direct Connect network and Pods were perfectly capable of connecting to real external addresses, I did not think the section "Enable outbound internet access for Pods" applied to our setup, especially because the Pods associated with addresses on the primary interface could connect just fine.
