
[mizar][local][Scale-out][2TPx2RPx2worker] Intermittently failed to connect to tenant's service: Connection timed out #1410

Closed
q131172019 opened this issue Mar 14, 2022 · 6 comments


q131172019 commented Mar 14, 2022

What happened:

Following the test plan at https://github.com/CentaurusInfra/arktos/wiki/Mizar-Arktos-Integration-Release-2022-0130-Test-Plan, while testing connectivity to tenant bbb's services, the error "Failed to connect to <service-name>.<ns>.svc.cluster.local port 80: Connection timed out" intermittently occurs during 11-round for-loop tests against the following three services of tenant bbb:

http://service-bbb.default.svc.cluster.local
http://service-bbb-ns-bbb.ns-bbb.svc.cluster.local
http://service-1bbb-ns-1bbb.ns-1bbb.svc.cluster.local

Error:

curl: (28) Failed to connect to service-bbb-ns-bbb.ns-bbb.svc.cluster.local port 80: Connection timed out
curl: (28) Failed to connect to service-1bbb-ns-1bbb.ns-1bbb.svc.cluster.local port 80: Connection timed out
curl: (28) Failed to connect to service-bbb.default.svc.cluster.local port 80: Connection timed out

What you expected to happen:
Connectivity to the tenant's services should succeed in every round of the for-loop tests.

How to reproduce it (as minimally and precisely as possible):
The environment is a local scale-out cluster with 2 TPs x 2 RPs x 2 workers and cniplugin=mizar:

TP1: ip-172-31-4-135
TP2: ip-172-31-1-189

RP1: ip-172-31-2-248
RP1-Worker1: ip-172-31-8-138
RP1-Worker2: ip-172-31-3-8

RP2: ip-172-31-9-41
RP2-Worker1: ip-172-31-10-32
RP2-Worker2: ip-172-31-3-16

  1. Create tenant bbb
cat ~/TMP/mizar/tenant-bbb.yaml
apiVersion: v1
kind: Tenant
metadata:
  name: bbb
spec:
  storageClusterId: "1"
./cluster/kubectl.sh apply -f ~/TMP/mizar/tenant-bbb.yaml
  2. Create namespaces and services
cat ~/TMP/mizar/service.bbb.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: ns-bbb
  tenant: bbb
---
apiVersion: v1
kind: Namespace
metadata:
  name: ns-1bbb
  tenant: bbb
---
apiVersion: v1
kind: Service
metadata:
  name: service-bbb
  tenant: bbb
spec:
  selector:
    app: nginx-bbb
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: service-bbb-ns-bbb
  namespace: ns-bbb
  tenant: bbb
spec:
  selector:
    app: nginx-bbb-ns-bbb
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: service-1bbb-ns-1bbb
  namespace: ns-1bbb
  tenant: bbb
spec:
  selector:
    app: nginx-1bbb-ns-1bbb
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80

./cluster/kubectl.sh apply -f ~/TMP/mizar/service.bbb.yaml
  3. Create three deployments: one in the default namespace and one in each of the two non-default namespaces
 cat ~/TMP/mizar/deployment-vpc-1-without-annotation.bbb.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    app: nginx-bbb
  name: ying-nginx
  tenant: bbb
spec:
  replicas: 4
  selector:
    matchLabels:
      app: nginx-bbb
  template:
    metadata:
      labels:
        app: nginx-bbb
    spec:
      containers:
      - args:
        image: nginx
        name: nginx
        ports:
        - containerPort: 80
./cluster/kubectl.sh apply -f  /home/ubuntu/TMP/mizar/deployment-vpc-1-without-annotation.bbb.yaml
cat ~/TMP/mizar/deployment-vpc-1-without-annotation.1bbb.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    app: nginx-1bbb-ns-1bbb
  name: ying-nginx-1bbb
  namespace: ns-1bbb
  tenant: bbb
spec:
  replicas: 4
  selector:
    matchLabels:
      app: nginx-1bbb-ns-1bbb
  template:
    metadata:
      labels:
        app: nginx-1bbb-ns-1bbb
    spec:
      containers:
      - args:
        image: nginx
        name: nginx
        ports:
        - containerPort: 80
./cluster/kubectl.sh apply -f  /home/ubuntu/TMP/mizar/deployment-vpc-1-without-annotation.1bbb.yaml
cat ~/TMP/mizar/deployment-vpc-1-without-annotation.bbb-ns-bbb.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    app: nginx-bbb-ns-bbb
  name: ying-nginx-ns-bbb
  namespace: ns-bbb
  tenant: bbb
spec:
  replicas: 4
  selector:
    matchLabels:
      app: nginx-bbb-ns-bbb
  template:
    metadata:
      labels:
        app: nginx-bbb-ns-bbb
    spec:
      containers:
      - args:
        image: nginx
        name: nginx
        ports:
        - containerPort: 80
./cluster/kubectl.sh apply -f  /home/ubuntu/TMP/mizar/deployment-vpc-1-without-annotation.bbb-ns-bbb.yaml
  4. Verify tenant bbb's services, pods, endpoints, and networks
./cluster/kubectl.sh get services,pods,endpoints,networks --all-namespaces --tenant bbb -o wide
NAMESPACE     NAME                           TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                  AGE    SELECTOR
default       service/kubernetes-default     ClusterIP   10.0.18.81     <none>        443/TCP                  175m   k8s-app=kubernetes-default
default       service/service-bbb            ClusterIP   10.0.42.103    <none>        80/TCP                   175m   app=nginx-bbb
kube-system   service/kube-dns-default       ClusterIP   10.0.161.194   <none>        53/UDP,53/TCP,9153/TCP   175m   k8s-app=kube-dns-default
ns-1bbb       service/service-1bbb-ns-1bbb   ClusterIP   10.0.14.68     <none>        80/TCP                   175m   app=nginx-1bbb-ns-1bbb
ns-bbb        service/service-bbb-ns-bbb     ClusterIP   10.0.233.224   <none>        80/TCP                   175m   app=nginx-bbb-ns-bbb

NAMESPACE     NAME                                                   HASHKEY               READY   STATUS    RESTARTS   AGE    IP          NODE              NOMINATED NODE   READINESS GATES
default       pod/ying-nginx-75d7ffbdf5-dxsw4                        7801603933432683107   1/1     Running   0          175m   11.1.0.9    ip-172-31-9-41    <none>           <none>
default       pod/ying-nginx-75d7ffbdf5-j5bdv                        7959235677620137315   1/1     Running   0          175m   11.1.0.7    ip-172-31-9-41    <none>           <none>
default       pod/ying-nginx-75d7ffbdf5-szdz8                        8671668936846641207   1/1     Running   0          175m   11.1.0.8    ip-172-31-2-248   <none>           <none>
default       pod/ying-nginx-75d7ffbdf5-trsks                        3990837428738084677   1/1     Running   0          175m   11.1.0.11   ip-172-31-2-248   <none>           <none>
kube-system   pod/coredns-default-ip-172-31-4-135-766ccf6dd4-nhlkf   6494031457945463924   1/1     Running   0          175m   11.1.0.2    ip-172-31-9-41    <none>           <none>
ns-1bbb       pod/ying-nginx-1bbb-7564746584-9lhb9                   4455628064646431817   1/1     Running   0          175m   11.1.0.23   ip-172-31-2-248   <none>           <none>
ns-1bbb       pod/ying-nginx-1bbb-7564746584-clp72                   5334503734615445185   1/1     Running   0          175m   11.1.0.28   ip-172-31-9-41    <none>           <none>
ns-1bbb       pod/ying-nginx-1bbb-7564746584-gpx6h                   7804682202936687398   1/1     Running   0          175m   11.1.0.25   ip-172-31-2-248   <none>           <none>
ns-1bbb       pod/ying-nginx-1bbb-7564746584-k5c9q                   5568712160644925527   1/1     Running   0          175m   11.1.0.27   ip-172-31-9-41    <none>           <none>
ns-bbb        pod/ying-nginx-ns-bbb-d7c94dc8-4ddv8                   3657409591343863453   1/1     Running   0          174m   11.1.0.44   ip-172-31-9-41    <none>           <none>
ns-bbb        pod/ying-nginx-ns-bbb-d7c94dc8-ct52r                   4209897216720530062   1/1     Running   0          174m   11.1.0.40   ip-172-31-2-248   <none>           <none>
ns-bbb        pod/ying-nginx-ns-bbb-d7c94dc8-kjg7f                   4909923852601513393   1/1     Running   0          174m   11.1.0.42   ip-172-31-9-41    <none>           <none>
ns-bbb        pod/ying-nginx-ns-bbb-d7c94dc8-sgv7g                   75157043574428387     1/1     Running   0          174m   11.1.0.38   ip-172-31-2-248   <none>           <none>

NAMESPACE     NAME                             ENDPOINTS                                            AGE    SERVICEGROUPID
default       endpoints/service-bbb            11.1.0.11:80,11.1.0.7:80,11.1.0.8:80 + 1 more...     175m
kube-system   endpoints/kube-dns-default       11.1.0.2:53,11.1.0.2:53,11.1.0.2:9153                175m
ns-1bbb       endpoints/service-1bbb-ns-1bbb   11.1.0.23:80,11.1.0.25:80,11.1.0.27:80 + 1 more...   175m
ns-bbb        endpoints/service-bbb-ns-bbb     11.1.0.38:80,11.1.0.40:80,11.1.0.42:80 + 1 more...   175m

NAMESPACE   NAME                                   TYPE    VPC                   PHASE   DNS
            network.arktos.futurewei.com/default   mizar   bbb-default-network   Ready   10.0.161.194
  5. Log in to a specific nginx container on a specific RP/worker node with crictl and test connectivity to the three services for 11 rounds (a failure-tally sketch follows the example loop).
    For example:
ubuntu@ip-172-31-2-248:~/go/src/arktos$ sudo crictl exec -ti 77e73d50009de /bin/bash
root@ying-nginx-75d7ffbdf5-szdz8:/# for i in {1..11}; do
  echo "http://service-bbb-ns-bbb.ns-bbb.svc.cluster.local - $i"
  echo "========================================================================="
  curl http://service-bbb-ns-bbb.ns-bbb.svc.cluster.local; sleep 2
  echo ""; echo "http://service-bbb.default.svc.cluster.local -- $i"
  echo "========================================================================="
  curl http://service-bbb.default.svc.cluster.local; sleep 2
  echo ""; echo "http://service-1bbb-ns-1bbb.ns-1bbb.svc.cluster.local - $i"
  echo "========================================================================="
  curl http://service-1bbb-ns-1bbb.ns-1bbb.svc.cluster.local; sleep 2
  echo ""
done
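
To tally failures across all nginx containers on a node without manual bookkeeping, a helper along the following lines can be run on each RP/worker. This is a minimal sketch, assuming crictl is available on the node and curl is present in the nginx image; the container filter and counting logic are not part of the original test plan.

#!/usr/bin/env bash
# Hypothetical failure-tally sketch: for every nginx container on this node,
# run the three-service curl loop and count connection failures.
SERVICES="service-bbb.default service-bbb-ns-bbb.ns-bbb service-1bbb-ns-1bbb.ns-1bbb"
ROUNDS=11

for cid in $(sudo crictl ps --name nginx -q); do
  fails=0
  for i in $(seq 1 "$ROUNDS"); do
    for svc in $SERVICES; do
      sudo crictl exec "$cid" \
        curl -s -o /dev/null --connect-timeout 10 "http://${svc}.svc.cluster.local" \
        || fails=$((fails + 1))
      sleep 2
    done
  done
  echo "container ${cid}: ${fails} failures of $((ROUNDS * 3)) tests"
done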

Reference: Failure summary in 12 nginx containers

RP1: 3 containers

029c5f5a076cc:   6 failures of 33 tests
5793130c4447a:   3 failures of 33 tests
9efe40c768626:   1 failure of 33 tests

RP2: 1 container

64921759a0452: 1 failure of 33 tests

RP1-Worker1: 1 container

3be590a5b2e22: 3 failures of 33 tests

RP1-Worker2: 2 containers

c5d5ffffb193f: 3 failures of 33 tests
b9c1bd669c687: 3 failures of 33 tests

RP2-Worker1: 3 containers

a75b9eccf9b19: 3 failures of 33 tests
a506f91fdbfe5: 1 failure of 33 tests
e15a3982fe5c3: 3 failures of 33 tests

RP2-Worker2: 2 containers

9b40ab1e47109: 2 failures of 33 tests
7472046c5c92e: 3 failures of 33 tests

Anything else we need to know?:

The error "failed to connect to <service-name.ns.svc,cluster.local> port 80: Connection timed out" intermittently happens in [mizar][local][Scale-out][2TPx2RP] clusters as well.

Environment:

  • Arktos version (use kubectl version):
  • Cloud provider or hardware configuration: AWS
  • OS (e.g: cat /etc/os-release): Ubuntu 20.04
  • Kernel (e.g. uname -a): 5.13.0-1017-aws
  • Install tools: Mizar local Scale-out 2TPx2RPx2worker
  • Network plugin and version (if this is a network-related bug):
  • Others:

q131172019 commented Mar 14, 2022

In the [mizar][local][Scale-out][2TPx2RPx2worker] AWS Ubuntu 20.04 environment, the issue was reproduced by two engineers.

curl 10.0.12.172 (cluster IP of the service) -- 4 failures of 20 tests

for-loop test:
for i in {1..20}; do echo "http://10.0.12.172 - $i"; echo "========================================================================="; curl http://10.0.12.172;sleep 2; echo ""; done

The service is backed by 4 pods; curl to each pod IP directly:
11.1.0.42 : 0 failure of 20 tests
11.1.0.43 : 0 failure of 20 tests
11.1.0.38 : 0 failure of 20 tests
11.1.0.40 : 0 failure of 20 tests
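
The per-pod-IP check can be done with essentially the same loop pointed at each pod IP instead of the service's cluster IP. A minimal sketch, run from inside a client pod, is below; the loop is an assumption, not the exact commands that were run.

for ip in 11.1.0.42 11.1.0.43 11.1.0.38 11.1.0.40; do
  fails=0
  for i in {1..20}; do
    # curl the pod IP directly, bypassing the service's cluster IP
    curl -s -o /dev/null --connect-timeout 10 "http://${ip}" || fails=$((fails + 1))
    sleep 2
  done
  echo "${ip} : ${fails} failures of 20 tests"
done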


h-w-chen commented Mar 14, 2022

The same symptom was reproduced in a kube-up environment (2 TPs, 2 RPs, each RP with 2 workers):
curl by pod IP works well;
curl by service IP shows the intermittent "Connection timed out" issue.

@vinaykul
Member

@phudtran If you have the time, can you take a look?


phudtran commented Mar 16, 2022

I believe it is related to this issue that @Hong-Chang looked into a while back.


q131172019 commented Mar 17, 2022

Hong and I narrowed down the issue in a local 2x2x2 scale-out development environment on AWS Ubuntu 20.04. With each deployment scaled down to 1 replica, so that each service maps to exactly 1 pod, we found that when we log in to a pod and connect to the service that maps back to that same pod, the connection hangs 100% of the time.

ubuntu@ip-172-31-4-135:~/go/src/arktos$ ./cluster/kubectl.sh get services,pods,endpoints,networks --all-namespaces --tenant bbb -o wide
NAMESPACE     NAME                           TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                  AGE   SELECTOR
default       service/kubernetes-default     ClusterIP   10.0.216.149   <none>        443/TCP                  82s   k8s-app=kubernetes-default
default       service/service-bbb            ClusterIP   10.0.104.1     <none>        80/TCP                   73s   app=nginx-bbb
kube-system   service/kube-dns-default       ClusterIP   10.0.42.255    <none>        53/UDP,53/TCP,9153/TCP   82s   k8s-app=kube-dns-default
ns-1bbb       service/service-1bbb-ns-1bbb   ClusterIP   10.0.217.56    <none>        80/TCP                   73s   app=nginx-1bbb-ns-1bbb
ns-bbb        service/service-bbb-ns-bbb     ClusterIP   10.0.228.105   <none>        80/TCP                   73s   app=nginx-bbb-ns-bbb

NAMESPACE     NAME                                                   HASHKEY               READY   STATUS    RESTARTS   AGE   IP          NODE              NOMINATED NODE   READINESS GATES
default       pod/ying-nginx-75d7ffbdf5-2b865                        470882719710773571    1/1     Running   0          33s   11.1.0.6    ip-172-31-10-32   <none>           <none>
kube-system   pod/coredns-default-ip-172-31-4-135-766ccf6dd4-9p98m   4332501775353507365   1/1     Running   0          82s   11.1.0.2    ip-172-31-2-248   <none>           <none>
ns-1bbb       pod/ying-nginx-1bbb-7564746584-lbkl8                   9106484152452311463   1/1     Running   0          13s   11.1.0.16   ip-172-31-9-41    <none>           <none>
ns-bbb        pod/ying-nginx-ns-bbb-d7c94dc8-z8lvv                   6774781819657113760   1/1     Running   0          22s   11.1.0.10   ip-172-31-8-138   <none>           <none>

NAMESPACE     NAME                             ENDPOINTS                               AGE   SERVICEGROUPID
default       endpoints/service-bbb            11.1.0.6:80                             73s
kube-system   endpoints/kube-dns-default       11.1.0.2:53,11.1.0.2:53,11.1.0.2:9153   82s
ns-1bbb       endpoints/service-1bbb-ns-1bbb   11.1.0.16:80                            73s
ns-bbb        endpoints/service-bbb-ns-bbb     11.1.0.10:80                            73s

NAMESPACE   NAME                                   TYPE    VPC                   PHASE   DNS
            network.arktos.futurewei.com/default   mizar   bbb-default-network   Ready   10.0.42.255

10.0.217.56 (service-1bbb-ns-1bbb.ns-1bbb.svc.cluster.local) -- pod/ying-nginx-1bbb-7564746584-lbkl8
10.0.104.1 (service-bbb.default.svc.cluster.local) -- pod/ying-nginx-75d7ffbdf5-2b865
10.0.228.105 (service-bbb-ns-bbb.ns-bbb.svc.cluster.local) -- pod/ying-nginx-ns-bbb-d7c94dc8-z8lvv

When we log in to the pod ying-nginx-1bbb-7564746584-lbkl8 (namespace ns-1bbb, IP 11.1.0.16, on ip-172-31-9-41), it fails to connect to its own service IP 10.0.217.56 (service-1bbb-ns-1bbb.ns-1bbb.svc.cluster.local). But when connecting to the other two service IPs, 10.0.104.1 and 10.0.228.105, 20 times each, all connections succeed.

root@ying-nginx-1bbb-7564746584-lbkl8:/# curl 10.0.217.56
curl: (28) Failed to connect to 10.0.217.56 port 80: Connection timed out
root@ying-nginx-1bbb-7564746584-lbkl8:/# curl 10.0.104.1
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
html { color-scheme: light dark; }
body { width: 35em; margin: 0 auto;
font-family: Tahoma, Verdana, Arial, sans-serif; }
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>

<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>

<p><em>Thank you for using nginx.</em></p>
</body>
</html>

root@ying-nginx-1bbb-7564746584-lbkl8:/# curl 10.0.228.105
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
html { color-scheme: light dark; }
body { width: 35em; margin: 0 auto;
font-family: Tahoma, Verdana, Arial, sans-serif; }
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>

<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>

<p><em>Thank you for using nginx.</em></p>
</body>
</html>

@Hong-Chang
Contributor

I did some further investigation. This issue is not related to checksum calculation; I don't see any log mentioning a checksum error.

Later I figured out it's a scenario that Mizar missed: a pod connecting to a service whose endpoint is that exact same pod.

Furthermore, this issue must have existed in Mizar all along; we just hadn't noticed it. It's not related to Arktos: I reproduced the exact same issue with Kubernetes + Mizar, so this issue doesn't belong in the arktos repo.

I created a new issue in the mizar repo and am closing this one. The new issue is CentaurusInfra/mizar#648: connectivity issue when pod A sends traffic to a service that points back to pod A.
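
For anyone who wants to confirm the hairpin scenario on a plain Kubernetes + Mizar cluster, a minimal sketch follows; the deployment name and image are placeholders, not the exact commands used above, and the original repro used the YAML shown earlier in this issue.

# Hypothetical minimal repro: a single-replica deployment behind a service,
# then curl that service's cluster IP from inside the same pod.
kubectl create deployment hairpin-test --image=nginx --replicas=1
kubectl expose deployment hairpin-test --port=80 --target-port=80

POD=$(kubectl get pods -l app=hairpin-test -o jsonpath='{.items[0].metadata.name}')
SVC_IP=$(kubectl get svc hairpin-test -o jsonpath='{.spec.clusterIP}')

# Expected to hang or time out while the hairpin case is unhandled,
# whereas curling any other service from this pod succeeds.
kubectl exec "$POD" -- curl -s --connect-timeout 10 "http://${SVC_IP}"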
