
Pods with security groups cannot resolve/too slow to resolve domain names #3126

Open
uyilmaz opened this issue Nov 28, 2024 · 7 comments

uyilmaz commented Nov 28, 2024

What happened:

I have this following setup, using "security groups for pods" and "prefix delegation":

  • POD_SECURITY_GROUP_ENFORCING_MODE is set to "standard"
  • ENABLE_PREFIX_DELEGATION is set to "true"
  • AWS_VPC_K8S_CNI_EXTERNALSNAT is set to "false"
  • node type is "r6g.medium"
  • pods that have a security group assigned also have a network policy assigned
  • 2 nodes

Pods without a security group work normally. Pods with a security group, however, resolve DNS names so slowly that I first thought they couldn't resolve at all; after numerous retries I get a few successful lookups. For example, curl example.com times out with "could not resolve address" most of the time.
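The slowness is easy to quantify by timing a lookup from inside an affected pod and comparing it against an unaffected one (the pod and namespace names below are placeholders, not values from my cluster):

```shell
# Time DNS resolution from inside a pod that has a security group attached.
# "affected-pod" and "mynamespace" are placeholder names.
kubectl -n mynamespace exec affected-pod -- sh -c 'time nslookup example.com'

# Compare against a pod without a SecurityGroupPolicy:
kubectl -n mynamespace exec unaffected-pod -- sh -c 'time nslookup example.com'
```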

Environment:

  • Kubernetes version (use kubectl version): v1.31.2-eks-7f9249a
  • CNI Version: v1.18.6-eksbuild.1 (aws-network-policy-agent:v1.1.4-eksbuild.1)
  • OS (e.g: cat /etc/os-release): Amazon Linux 2
  • Kernel (e.g. uname -a): Linux ip-x-x-xxx-xx.ap-northeast-1.compute.internal x.xx.xxx-xxx.xxx.amzn2.aarch64 #1 SMP Tue Oct 22 16:38:25 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux

I have another cluster with the following setup that does not have the same problem (the differences are the prefix delegation setting, the node type, and the node count):

  • POD_SECURITY_GROUP_ENFORCING_MODE is set to "standard"
  • ENABLE_PREFIX_DELEGATION is set to "false"
  • AWS_VPC_K8S_CNI_EXTERNALSNAT is set to "false"
  • node type is "m6g.xlarge"
  • pods that have a security group assigned also have a network policy assigned
  • 1 node

Environment:

  • Kubernetes version (use kubectl version): v1.28.15-eks-7f9249a
  • CNI Version: v1.15.4-eksbuild.1
  • OS (e.g: cat /etc/os-release): Amazon Linux 2
yash97 (Contributor) commented Dec 4, 2024

Does this only happen during the initial phase of pod creation, or is it consistent? It would really help if you generated a log bundle and sent it to the email address provided here: https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/troubleshooting.md#collecting-node-level-tech-support-bundle-for-offline-troubleshooting. Thanks
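For reference, the bundle that guide asks for is generated by running the CNI support script directly on the affected worker node (the path below is the one documented in the VPC CNI troubleshooting guide):

```shell
# Run on the worker node itself (not inside a pod); requires root.
# Collects CNI logs and node networking state into a tarball for the bundle.
sudo bash /opt/cni/bin/aws-cni-support.sh
```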

uyilmaz (Author) commented Dec 4, 2024

Thanks for answering,

Does this only happen during the initial phase of pod creation, or is it consistent?

It is consistent, as in, it doesn't get better over time. Pods with security groups attached run for days and are still very slow to resolve DNS.

I'll try to send the log bundle.

Edit: I've sent the two bundles for the two nodes to the email address provided in the guide.

orsenthil (Member) commented

@uyilmaz - Did you notice this change after an upgrade, or did it start after some point in time? (Note: the VPC Resource Controller is managed on the AWS side.) If this problem was not present previously but started showing up now, do you know the timeline?

yash97 (Contributor) commented Dec 4, 2024

Can you also share the network policy that gets applied to the pods with a security group?

uyilmaz (Author) commented Dec 5, 2024

@uyilmaz - Did you notice this change after an upgrade, or did it start after some point in time? (Note: the VPC Resource Controller is managed on the AWS side.) If this problem was not present previously but started showing up now, do you know the timeline?

I first updated the cluster from version 1.28 to 1.31, and the CNI version from v1.15.4-eksbuild.1 to v1.18.6-eksbuild.1 (aws-network-policy-agent:v1.1.4-eksbuild.1). At that time it was using m6g.xlarge nodes and it didn't have this problem.

Then I changed the node type to r6g.medium and enabled prefix delegation. After that I began to experience the problem. If I delete a pod's SecurityGroupPolicy, it starts resolving DNS normally.
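For context, the SecurityGroupPolicy I'm deleting is a custom resource of roughly this shape (the names and the security group ID below are placeholders, not the actual values from my cluster):

```yaml
apiVersion: vpcresources.k8s.aws/v1beta1
kind: SecurityGroupPolicy
metadata:
  name: security-group-policy-xxx
  namespace: mynamespace
spec:
  podSelector:
    matchLabels:
      mypod: xxx
  securityGroups:
    groupIds:
      - sg-0123456789abcdef0   # placeholder pod security group ID
```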

Can you also share the network policy that gets applied to the pods with a security group?

Here is the network policy of a pod that experiences the problem:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  creationTimestamp: "2024-11-21T07:19:11Z"
  finalizers:
  - networking.k8s.aws/resources
  generation: 5
  name: network-policy-xxx
  namespace: mynamespace
  resourceVersion: "169997767"
  uid: 2f3c82ee-6b96-478a-991e-1d396aeca33e
spec:
  egress:
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 169.254.0.0/16
  ingress:
  - from:
    - podSelector:
        matchLabels:
          mypod: xxx
    - ipBlock:
        cidr: 10.9.196.0/23
    - ipBlock:
        cidr: 10.9.198.0/23
    - ipBlock:
        cidr: 10.9.128.0/20
    - ipBlock:
        cidr: 10.9.0.0/20
  podSelector:
    matchLabels:
      mypod: xxx
  policyTypes:
  - Ingress
  - Egress

The CIDRs in the ingress block are the internet-facing subnets of the EKS cluster, plus a couple of subnets inside the VPC that I wanted to allow access from. As I said before, without the SecurityGroupPolicy it works normally.

uyilmaz (Author) commented Jan 9, 2025

I updated my second cluster to 1.31 and the same problem began to occur there as well, even though ENABLE_PREFIX_DELEGATION is set to false.

uyilmaz (Author) commented Feb 12, 2025

I think I've figured out why this happens.

I have two worker nodes in my cluster and two CoreDNS pods, both running on the first node. The second node itself can reach the CoreDNS pods by pod IP, but pods on the second node with a security group attached cannot. When I added a rule to the worker node security group (eks-remoteAccess-xxx) allowing TCP/UDP DNS traffic from my pod's security group, the pod became able to resolve DNS without issue. But I can't keep adding new rules for every new pod that has a different security group, so I don't know what the best solution is.
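The workaround rule can be sketched with the AWS CLI (the group IDs below are placeholders); the catch is that it has to be repeated for every distinct pod security group:

```shell
# Allow DNS (UDP and TCP port 53) from the pod security group into the
# worker node security group. Both group IDs are placeholders.
NODE_SG=sg-0aaaaaaaaaaaaaaaa   # eks-remoteAccess-xxx node security group
POD_SG=sg-0bbbbbbbbbbbbbbbb    # security group attached to the pod

for proto in udp tcp; do
  aws ec2 authorize-security-group-ingress \
    --group-id "$NODE_SG" \
    --protocol "$proto" \
    --port 53 \
    --source-group "$POD_SG"
done
```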
