
Pods with security groups cannot resolve/too slow to resolve domain names #3126

Open
uyilmaz opened this issue Nov 28, 2024 · 7 comments

uyilmaz commented Nov 28, 2024

What happened:

I have this following setup, using "security groups for pods" and "prefix delegation":

  • POD_SECURITY_GROUP_ENFORCING_MODE is set to "standard"
  • ENABLE_PREFIX_DELEGATION is set to "true"
  • AWS_VPC_K8S_CNI_EXTERNALSNAT is set to "false"
  • node type is "r6g.medium"
  • pods that have a security group assigned also have a network policy assigned
  • 2 nodes

Pods without a security group work normally. Pods with a security group, however, resolve DNS names so slowly that I first thought they couldn't resolve at all; after numerous retries I get a few successful lookups. For example, curl example.com times out with "could not resolve address" most of the time.
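The slowness is easy to quantify by timing a lookup from inside an affected pod and comparing it against an unaffected one (the pod and namespace names below are placeholders, not values from my cluster):

```shell
# Time DNS resolution from inside a pod that has a security group attached.
# "affected-pod" and "mynamespace" are placeholder names.
kubectl -n mynamespace exec affected-pod -- sh -c 'time nslookup example.com'

# Compare against a pod without a SecurityGroupPolicy:
kubectl -n mynamespace exec unaffected-pod -- sh -c 'time nslookup example.com'
```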

Environment:

  • Kubernetes version (use kubectl version): v1.31.2-eks-7f9249a
  • CNI Version: v1.18.6-eksbuild.1 (aws-network-policy-agent:v1.1.4-eksbuild.1)
  • OS (e.g: cat /etc/os-release): Amazon Linux 2
  • Kernel (e.g. uname -a): Linux ip-x-x-xxx-xx.ap-northeast-1.compute.internal x.xx.xxx-xxx.xxx.amzn2.aarch64 #1 SMP Tue Oct 22 16:38:25 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux

I have another cluster with the following setup that does not have the same problem (the differences are the prefix delegation setting, the node type, and the node count):

  • POD_SECURITY_GROUP_ENFORCING_MODE is set to "standard"
  • ENABLE_PREFIX_DELEGATION is set to "false"
  • AWS_VPC_K8S_CNI_EXTERNALSNAT is set to "false"
  • node type is "m6g.xlarge"
  • pods that have a security group assigned also have a network policy assigned
  • 1 node

Environment:

  • Kubernetes version (use kubectl version): v1.28.15-eks-7f9249a
  • CNI Version: v1.15.4-eksbuild.1
  • OS (e.g: cat /etc/os-release): Amazon Linux 2
yash97 (Contributor) commented Dec 4, 2024

Does this only happen during the initial phase of pod creation, or is it consistent? It would really help if you generated a log bundle and sent it to the email address provided here: https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/troubleshooting.md#collecting-node-level-tech-support-bundle-for-offline-troubleshooting. Thanks
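For reference, the bundle that guide asks for is generated by running the CNI support script directly on the affected worker node (the path below is the one documented in the VPC CNI troubleshooting guide):

```shell
# Run on the worker node itself (not inside a pod); requires root.
# Collects CNI logs and node networking state into a tarball for the bundle.
sudo bash /opt/cni/bin/aws-cni-support.sh
```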

uyilmaz (Author) commented Dec 4, 2024

Thanks for answering,

Does this only happen during the initial phase of pod creation, or is it consistent?

It is consistent, as in, it doesn't get better over time. Pods with security groups attached run for days and are still very slow to resolve DNS.

I'll try to send the log bundle.

Edit: I've sent the two bundles for the two nodes to the email address provided in the guide.

orsenthil (Member) commented

@uyilmaz - Did you notice this change after an upgrade, or did it start after some point in time? (Note: the VPC Resource Controller is managed on the AWS side.) If this problem was not present previously but started showing up now, do you know the timeline?

yash97 (Contributor) commented Dec 4, 2024

Can you also share the network policy that gets applied to the pods with a security group?

uyilmaz (Author) commented Dec 5, 2024

@uyilmaz - Did you notice this change after an upgrade, or did it start after some point in time? (Note: the VPC Resource Controller is managed on the AWS side.) If this problem was not present previously but started showing up now, do you know the timeline?

I first updated the cluster from version 1.28 to 1.31, and the CNI version from v1.15.4-eksbuild.1 to v1.18.6-eksbuild.1 (aws-network-policy-agent:v1.1.4-eksbuild.1). At that time it was using m6g.xlarge nodes and it didn't have this problem.

Then I changed the node type to r6g.medium and enabled prefix delegation. After that I began to experience the problem. If I delete a pod's SecurityGroupPolicy, it starts resolving DNS normally.
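For context, the SecurityGroupPolicy I'm deleting is a custom resource of roughly this shape (the names and the security group ID below are placeholders, not the actual values from my cluster):

```yaml
apiVersion: vpcresources.k8s.aws/v1beta1
kind: SecurityGroupPolicy
metadata:
  name: security-group-policy-xxx
  namespace: mynamespace
spec:
  podSelector:
    matchLabels:
      mypod: xxx
  securityGroups:
    groupIds:
      - sg-0123456789abcdef0   # placeholder pod security group ID
```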

Can you also share the network policy that gets applied to the pods with a security group?

Here is the network policy of a pod that experiences the problem:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  creationTimestamp: "2024-11-21T07:19:11Z"
  finalizers:
  - networking.k8s.aws/resources
  generation: 5
  name: network-policy-xxx
  namespace: mynamespace
  resourceVersion: "169997767"
  uid: 2f3c82ee-6b96-478a-991e-1d396aeca33e
spec:
  egress:
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 169.254.0.0/16
  ingress:
  - from:
    - podSelector:
        matchLabels:
          mypod: xxx
    - ipBlock:
        cidr: 10.9.196.0/23
    - ipBlock:
        cidr: 10.9.198.0/23
    - ipBlock:
        cidr: 10.9.128.0/20
    - ipBlock:
        cidr: 10.9.0.0/20
  podSelector:
    matchLabels:
      mypod: xxx
  policyTypes:
  - Ingress
  - Egress

The CIDRs in the ingress block are the internet-facing subnets of the EKS cluster, plus a couple of subnets inside the VPC that I wanted to allow access from. As I said before, without the SecurityGroupPolicy it works normally.

uyilmaz (Author) commented Jan 9, 2025

I updated my second cluster to 1.31 and the same problem began to occur there as well, even though ENABLE_PREFIX_DELEGATION is set to false.

uyilmaz (Author) commented Feb 12, 2025

I think I've figured out why this happens.

I have two worker nodes in my cluster and two CoreDNS pods, both running on the first node. The second node itself can reach the CoreDNS pods by pod IP, but pods on the second node with a security group attached cannot. When I added a rule to the worker node security group (eks-remoteAccess-xxx) allowing TCP/UDP DNS traffic from my pod's security group, the pod became able to resolve DNS without issue. But I can't keep adding new rules for every new pod that has a different security group, so I don't know what the best solution is.
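The workaround rule can be sketched with the AWS CLI (the group IDs below are placeholders); the catch is that it has to be repeated for every distinct pod security group:

```shell
# Allow DNS (UDP and TCP port 53) from the pod security group into the
# worker node security group. Both group IDs are placeholders.
NODE_SG=sg-0aaaaaaaaaaaaaaaa   # eks-remoteAccess-xxx node security group
POD_SG=sg-0bbbbbbbbbbbbbbbb    # security group attached to the pod

for proto in udp tcp; do
  aws ec2 authorize-security-group-ingress \
    --group-id "$NODE_SG" \
    --protocol "$proto" \
    --port 53 \
    --source-group "$POD_SG"
done
```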
