
[mizar][local][Scale-out][2TPx2RPx2worker] Intermittently failed to connect to tenant's service: Connection timed out #1410

Closed
q131172019 opened this issue Mar 14, 2022 · 6 comments


q131172019 commented Mar 14, 2022

What happened:

Following the test plan at https://github.com/CentaurusInfra/arktos/wiki/Mizar-Arktos-Integration-Release-2022-0130-Test-Plan, while testing connectivity to tenant bbb's services, the error "Failed to connect to <service-name>.<ns>.svc.cluster.local port 80: Connection timed out" intermittently occurs during 11-round for-loop tests against the following three services of tenant bbb:

http://service-bbb.default.svc.cluster.local
http://service-bbb-ns-bbb.ns-bbb.svc.cluster.local
http://service-1bbb-ns-1bbb.ns-1bbb.svc.cluster.local

Error:

curl: (28) Failed to connect to service-bbb-ns-bbb.ns-bbb.svc.cluster.local port 80: Connection timed out
curl: (28) Failed to connect to service-1bbb-ns-1bbb.ns-1bbb.svc.cluster.local port 80: Connection timed out
curl: (28) Failed to connect to service-bbb.default.svc.cluster.local port 80: Connection timed out

What you expected to happen:
Connectivity to the tenant's services should succeed in every round of the for-loop tests.

How to reproduce it (as minimally and precisely as possible):
The environment is a local scale-out cluster with 2 TPs x 2 RPs x 2 workers and cniplugin=mizar:

TP1: ip-172-31-4-135
TP2: ip-172-31-1-189

RP1: ip-172-31-2-248
RP1-Worker1: ip-172-31-8-138
RP1-Worker2: ip-172-31-3-8

RP2: ip-172-31-9-41
RP2-Worker1: ip-172-31-10-32
RP2-Worker2: ip-172-31-3-16

  1. Create tenant bbb
cat ~/TMP/mizar/tenant-bbb.yaml
apiVersion: v1
kind: Tenant
metadata:
  name: bbb
spec:
  storageClusterId: "1"
./cluster/kubectl.sh apply -f ~/TMP/mizar/tenant-bbb.yaml
  2. Create namespaces and services
cat ~/TMP/mizar/service.bbb.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: ns-bbb
  tenant: bbb
---
apiVersion: v1
kind: Namespace
metadata:
  name: ns-1bbb
  tenant: bbb
---
apiVersion: v1
kind: Service
metadata:
  name: service-bbb
  tenant: bbb
spec:
  selector:
    app: nginx-bbb
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: service-bbb-ns-bbb
  namespace: ns-bbb
  tenant: bbb
spec:
  selector:
    app: nginx-bbb-ns-bbb
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: service-1bbb-ns-1bbb
  namespace: ns-1bbb
  tenant: bbb
spec:
  selector:
    app: nginx-1bbb-ns-1bbb
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80

./cluster/kubectl.sh apply -f ~/TMP/mizar/service.bbb.yaml
  3. Create three deployments: one in the default namespace and one in each of the two non-default namespaces
 cat ~/TMP/mizar/deployment-vpc-1-without-annotation.bbb.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    app: nginx-bbb
  name: ying-nginx
  tenant: bbb
spec:
  replicas: 4
  selector:
    matchLabels:
      app: nginx-bbb
  template:
    metadata:
      labels:
        app: nginx-bbb
    spec:
      containers:
      - args:
        image: nginx
        name: nginx
        ports:
        - containerPort: 80
./cluster/kubectl.sh apply -f  /home/ubuntu/TMP/mizar/deployment-vpc-1-without-annotation.bbb.yaml
cat ~/TMP/mizar/deployment-vpc-1-without-annotation.1bbb.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    app: nginx-1bbb-ns-1bbb
  name: ying-nginx-1bbb
  namespace: ns-1bbb
  tenant: bbb
spec:
  replicas: 4
  selector:
    matchLabels:
      app: nginx-1bbb-ns-1bbb
  template:
    metadata:
      labels:
        app: nginx-1bbb-ns-1bbb
    spec:
      containers:
      - args:
        image: nginx
        name: nginx
        ports:
        - containerPort: 80
./cluster/kubectl.sh apply -f  /home/ubuntu/TMP/mizar/deployment-vpc-1-without-annotation.1bbb.yaml
cat ~/TMP/mizar/deployment-vpc-1-without-annotation.bbb-ns-bbb.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    app: nginx-bbb-ns-bbb
  name: ying-nginx-ns-bbb
  namespace: ns-bbb
  tenant: bbb
spec:
  replicas: 4
  selector:
    matchLabels:
      app: nginx-bbb-ns-bbb
  template:
    metadata:
      labels:
        app: nginx-bbb-ns-bbb
    spec:
      containers:
      - args:
        image: nginx
        name: nginx
        ports:
        - containerPort: 80
./cluster/kubectl.sh apply -f  /home/ubuntu/TMP/mizar/deployment-vpc-1-without-annotation.bbb-ns-bbb.yaml
  4. Verify tenant bbb's services, pods, endpoints, and networks
./cluster/kubectl.sh get services,pods,endpoints,networks --all-namespaces --tenant bbb -o wide
NAMESPACE     NAME                           TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                  AGE    SELECTOR
default       service/kubernetes-default     ClusterIP   10.0.18.81     <none>        443/TCP                  175m   k8s-app=kubernetes-default
default       service/service-bbb            ClusterIP   10.0.42.103    <none>        80/TCP                   175m   app=nginx-bbb
kube-system   service/kube-dns-default       ClusterIP   10.0.161.194   <none>        53/UDP,53/TCP,9153/TCP   175m   k8s-app=kube-dns-default
ns-1bbb       service/service-1bbb-ns-1bbb   ClusterIP   10.0.14.68     <none>        80/TCP                   175m   app=nginx-1bbb-ns-1bbb
ns-bbb        service/service-bbb-ns-bbb     ClusterIP   10.0.233.224   <none>        80/TCP                   175m   app=nginx-bbb-ns-bbb

NAMESPACE     NAME                                                   HASHKEY               READY   STATUS    RESTARTS   AGE    IP          NODE              NOMINATED NODE   READINESS GATES
default       pod/ying-nginx-75d7ffbdf5-dxsw4                        7801603933432683107   1/1     Running   0          175m   11.1.0.9    ip-172-31-9-41    <none>           <none>
default       pod/ying-nginx-75d7ffbdf5-j5bdv                        7959235677620137315   1/1     Running   0          175m   11.1.0.7    ip-172-31-9-41    <none>           <none>
default       pod/ying-nginx-75d7ffbdf5-szdz8                        8671668936846641207   1/1     Running   0          175m   11.1.0.8    ip-172-31-2-248   <none>           <none>
default       pod/ying-nginx-75d7ffbdf5-trsks                        3990837428738084677   1/1     Running   0          175m   11.1.0.11   ip-172-31-2-248   <none>           <none>
kube-system   pod/coredns-default-ip-172-31-4-135-766ccf6dd4-nhlkf   6494031457945463924   1/1     Running   0          175m   11.1.0.2    ip-172-31-9-41    <none>           <none>
ns-1bbb       pod/ying-nginx-1bbb-7564746584-9lhb9                   4455628064646431817   1/1     Running   0          175m   11.1.0.23   ip-172-31-2-248   <none>           <none>
ns-1bbb       pod/ying-nginx-1bbb-7564746584-clp72                   5334503734615445185   1/1     Running   0          175m   11.1.0.28   ip-172-31-9-41    <none>           <none>
ns-1bbb       pod/ying-nginx-1bbb-7564746584-gpx6h                   7804682202936687398   1/1     Running   0          175m   11.1.0.25   ip-172-31-2-248   <none>           <none>
ns-1bbb       pod/ying-nginx-1bbb-7564746584-k5c9q                   5568712160644925527   1/1     Running   0          175m   11.1.0.27   ip-172-31-9-41    <none>           <none>
ns-bbb        pod/ying-nginx-ns-bbb-d7c94dc8-4ddv8                   3657409591343863453   1/1     Running   0          174m   11.1.0.44   ip-172-31-9-41    <none>           <none>
ns-bbb        pod/ying-nginx-ns-bbb-d7c94dc8-ct52r                   4209897216720530062   1/1     Running   0          174m   11.1.0.40   ip-172-31-2-248   <none>           <none>
ns-bbb        pod/ying-nginx-ns-bbb-d7c94dc8-kjg7f                   4909923852601513393   1/1     Running   0          174m   11.1.0.42   ip-172-31-9-41    <none>           <none>
ns-bbb        pod/ying-nginx-ns-bbb-d7c94dc8-sgv7g                   75157043574428387     1/1     Running   0          174m   11.1.0.38   ip-172-31-2-248   <none>           <none>

NAMESPACE     NAME                             ENDPOINTS                                            AGE    SERVICEGROUPID
default       endpoints/service-bbb            11.1.0.11:80,11.1.0.7:80,11.1.0.8:80 + 1 more...     175m
kube-system   endpoints/kube-dns-default       11.1.0.2:53,11.1.0.2:53,11.1.0.2:9153                175m
ns-1bbb       endpoints/service-1bbb-ns-1bbb   11.1.0.23:80,11.1.0.25:80,11.1.0.27:80 + 1 more...   175m
ns-bbb        endpoints/service-bbb-ns-bbb     11.1.0.38:80,11.1.0.40:80,11.1.0.42:80 + 1 more...   175m

NAMESPACE   NAME                                   TYPE    VPC                   PHASE   DNS
            network.arktos.futurewei.com/default   mizar   bbb-default-network   Ready   10.0.161.194
  5. Log in to a specific nginx container on a specific RP/worker node with crictl and test connectivity to the three services for 11 rounds (a failure-tally sketch follows the example loop).
    For example:
ubuntu@ip-172-31-2-248:~/go/src/arktos$ sudo crictl exec -ti 77e73d50009de /bin/bash
root@ying-nginx-75d7ffbdf5-szdz8:/# for i in {1..11}; do
  echo "http://service-bbb-ns-bbb.ns-bbb.svc.cluster.local - $i"
  echo "========================================================================="
  curl http://service-bbb-ns-bbb.ns-bbb.svc.cluster.local; sleep 2
  echo ""; echo "http://service-bbb.default.svc.cluster.local -- $i"
  echo "========================================================================="
  curl http://service-bbb.default.svc.cluster.local; sleep 2
  echo ""; echo "http://service-1bbb-ns-1bbb.ns-1bbb.svc.cluster.local - $i"
  echo "========================================================================="
  curl http://service-1bbb-ns-1bbb.ns-1bbb.svc.cluster.local; sleep 2
  echo ""
done
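
To tally failures across all nginx containers on a node without manual bookkeeping, a helper along the following lines can be run on each RP/worker. This is a minimal sketch, assuming crictl is available on the node and curl is present in the nginx image; the container filter and counting logic are not part of the original test plan.

#!/usr/bin/env bash
# Hypothetical failure-tally sketch: for every nginx container on this node,
# run the three-service curl loop and count connection failures.
SERVICES="service-bbb.default service-bbb-ns-bbb.ns-bbb service-1bbb-ns-1bbb.ns-1bbb"
ROUNDS=11

for cid in $(sudo crictl ps --name nginx -q); do
  fails=0
  for i in $(seq 1 "$ROUNDS"); do
    for svc in $SERVICES; do
      sudo crictl exec "$cid" \
        curl -s -o /dev/null --connect-timeout 10 "http://${svc}.svc.cluster.local" \
        || fails=$((fails + 1))
      sleep 2
    done
  done
  echo "container ${cid}: ${fails} failures of $((ROUNDS * 3)) tests"
done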

Reference: Failure summary in 12 nginx containers

RP1: 3 containers

029c5f5a076cc:   6 failures of 33 tests
5793130c4447a:   3 failures of 33 tests
9efe40c768626:   1 failure of 33 tests

RP2: 1 container

64921759a0452: 1 failure of 33 tests

RP1-Worker1: 1 container

3be590a5b2e22: 3 failures of 33 tests

RP1-Worker2: 2 containers

c5d5ffffb193f: 3 failures of 33 tests
b9c1bd669c687: 3 failures of 33 tests

RP2-Worker1: 3 containers

a75b9eccf9b19: 3 failures of 33 tests
a506f91fdbfe5: 1 failure of 33 tests
e15a3982fe5c3: 3 failures of 33 tests

RP2-Worker2: 2 containers

9b40ab1e47109: 2 failures of 33 tests
7472046c5c92e: 3 failures of 33 tests

Anything else we need to know?:

The error "failed to connect to <service-name.ns.svc,cluster.local> port 80: Connection timed out" intermittently happens in [mizar][local][Scale-out][2TPx2RP] clusters as well.

Environment:

  • Arktos version (use kubectl version):
  • Cloud provider or hardware configuration: AWS
  • OS (e.g: cat /etc/os-release): Ubuntu 20.04
  • Kernel (e.g. uname -a): 5.13.0-1017-aws
  • Install tools: Mizar local Scale-out 2TPx2RPx2worker
  • Network plugin and version (if this is a network-related bug):
  • Others:

q131172019 commented Mar 14, 2022

In the [mizar][local][Scale-out][2TPx2RPx2worker] AWS Ubuntu 20.04 environment, the issue was reproduced by two engineers.

curl 10.0.12.172 (cluster IP of the service) -- 4 failures of 20 tests

for-loop test:
for i in {1..20}; do echo "http://10.0.12.172 - $i"; echo "========================================================================="; curl http://10.0.12.172;sleep 2; echo ""; done

The service is backed by 4 pods; curl to each pod IP directly:
11.1.0.42 : 0 failure of 20 tests
11.1.0.43 : 0 failure of 20 tests
11.1.0.38 : 0 failure of 20 tests
11.1.0.40 : 0 failure of 20 tests
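
The per-pod-IP check can be done with essentially the same loop pointed at each pod IP instead of the service's cluster IP. A minimal sketch, run from inside a client pod, is below; the loop is an assumption, not the exact commands that were run.

for ip in 11.1.0.42 11.1.0.43 11.1.0.38 11.1.0.40; do
  fails=0
  for i in {1..20}; do
    # curl the pod IP directly, bypassing the service's cluster IP
    curl -s -o /dev/null --connect-timeout 10 "http://${ip}" || fails=$((fails + 1))
    sleep 2
  done
  echo "${ip} : ${fails} failures of 20 tests"
done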


h-w-chen commented Mar 14, 2022

The same symptom was reproduced in a kube-up environment (2 TPs, 2 RPs, each RP with 2 workers):
curl by pod IP works well;
curl by service IP shows the intermittent "Connection timed out" issue.

@vinaykul
Member

@phudtran If you have the time, can you take a look?


phudtran commented Mar 16, 2022

I believe it is related to this issue that @Hong-Chang looked into a while back.


q131172019 commented Mar 17, 2022

Hong and I narrowed down the issue in a local 2x2x2 scale-out development environment on AWS Ubuntu 20.04. With each deployment scaled down to 1 replica, so that each service maps to exactly 1 pod, we found that when we log in to a pod and connect to the service that maps back to that same pod, the connection hangs 100% of the time.

ubuntu@ip-172-31-4-135:~/go/src/arktos$ ./cluster/kubectl.sh get services,pods,endpoints,networks --all-namespaces --tenant bbb -o wide
NAMESPACE     NAME                           TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                  AGE   SELECTOR
default       service/kubernetes-default     ClusterIP   10.0.216.149   <none>        443/TCP                  82s   k8s-app=kubernetes-default
default       service/service-bbb            ClusterIP   10.0.104.1     <none>        80/TCP                   73s   app=nginx-bbb
kube-system   service/kube-dns-default       ClusterIP   10.0.42.255    <none>        53/UDP,53/TCP,9153/TCP   82s   k8s-app=kube-dns-default
ns-1bbb       service/service-1bbb-ns-1bbb   ClusterIP   10.0.217.56    <none>        80/TCP                   73s   app=nginx-1bbb-ns-1bbb
ns-bbb        service/service-bbb-ns-bbb     ClusterIP   10.0.228.105   <none>        80/TCP                   73s   app=nginx-bbb-ns-bbb

NAMESPACE     NAME                                                   HASHKEY               READY   STATUS    RESTARTS   AGE   IP          NODE              NOMINATED NODE   READINESS GATES
default       pod/ying-nginx-75d7ffbdf5-2b865                        470882719710773571    1/1     Running   0          33s   11.1.0.6    ip-172-31-10-32   <none>           <none>
kube-system   pod/coredns-default-ip-172-31-4-135-766ccf6dd4-9p98m   4332501775353507365   1/1     Running   0          82s   11.1.0.2    ip-172-31-2-248   <none>           <none>
ns-1bbb       pod/ying-nginx-1bbb-7564746584-lbkl8                   9106484152452311463   1/1     Running   0          13s   11.1.0.16   ip-172-31-9-41    <none>           <none>
ns-bbb        pod/ying-nginx-ns-bbb-d7c94dc8-z8lvv                   6774781819657113760   1/1     Running   0          22s   11.1.0.10   ip-172-31-8-138   <none>           <none>

NAMESPACE     NAME                             ENDPOINTS                               AGE   SERVICEGROUPID
default       endpoints/service-bbb            11.1.0.6:80                             73s
kube-system   endpoints/kube-dns-default       11.1.0.2:53,11.1.0.2:53,11.1.0.2:9153   82s
ns-1bbb       endpoints/service-1bbb-ns-1bbb   11.1.0.16:80                            73s
ns-bbb        endpoints/service-bbb-ns-bbb     11.1.0.10:80                            73s

NAMESPACE   NAME                                   TYPE    VPC                   PHASE   DNS
            network.arktos.futurewei.com/default   mizar   bbb-default-network   Ready   10.0.42.255

10.0.217.56 (service-1bbb-ns-1bbb.ns-1bbb.svc.cluster.local) -- pod/ying-nginx-1bbb-7564746584-lbkl8
10.0.104.1 (service-bbb.default.svc.cluster.local) -- pod/ying-nginx-75d7ffbdf5-2b865
10.0.228.105 (service-bbb-ns-bbb.ns-bbb.svc.cluster.local) -- pod/ying-nginx-ns-bbb-d7c94dc8-z8lvv

When we log in to the pod ying-nginx-1bbb-7564746584-lbkl8 (namespace ns-1bbb, IP 11.1.0.16, on ip-172-31-9-41), it fails to connect to its own service IP 10.0.217.56 (service-1bbb-ns-1bbb.ns-1bbb.svc.cluster.local). But when connecting to the other two service IPs, 10.0.104.1 and 10.0.228.105, 20 times each, all connections succeed.

root@ying-nginx-1bbb-7564746584-lbkl8:/# curl 10.0.217.56
curl: (28) Failed to connect to 10.0.217.56 port 80: Connection timed out
root@ying-nginx-1bbb-7564746584-lbkl8:/# curl 10.0.104.1
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
html { color-scheme: light dark; }
body { width: 35em; margin: 0 auto;
font-family: Tahoma, Verdana, Arial, sans-serif; }
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>

<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>

<p><em>Thank you for using nginx.</em></p>
</body>
</html>

root@ying-nginx-1bbb-7564746584-lbkl8:/# curl 10.0.228.105
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
html { color-scheme: light dark; }
body { width: 35em; margin: 0 auto;
font-family: Tahoma, Verdana, Arial, sans-serif; }
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>

<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>

<p><em>Thank you for using nginx.</em></p>
</body>
</html>

@Hong-Chang
Contributor

I did some further investigation. This issue is not related to checksum calculation; I don't see any log mentioning a checksum error.

Later I figured out it's a scenario that Mizar missed: a pod connecting to a service whose endpoint is that exact same pod.

Furthermore, this issue must have existed in Mizar all along; we just hadn't noticed it. It's not related to Arktos: I reproduced the exact same issue with Kubernetes + Mizar, so this issue doesn't belong in the arktos repo.

I created a new issue in the mizar repo and am closing this one. The new issue is CentaurusInfra/mizar#648: connectivity issue when pod A sends traffic to a service that points back to pod A.
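
For anyone who wants to confirm the hairpin scenario on a plain Kubernetes + Mizar cluster, a minimal sketch follows; the deployment name and image are placeholders, not the exact commands used above, and the original repro used the YAML shown earlier in this issue.

# Hypothetical minimal repro: a single-replica deployment behind a service,
# then curl that service's cluster IP from inside the same pod.
kubectl create deployment hairpin-test --image=nginx --replicas=1
kubectl expose deployment hairpin-test --port=80 --target-port=80

POD=$(kubectl get pods -l app=hairpin-test -o jsonpath='{.items[0].metadata.name}')
SVC_IP=$(kubectl get svc hairpin-test -o jsonpath='{.spec.clusterIP}')

# Expected to hang or time out while the hairpin case is unhandled,
# whereas curling any other service from this pod succeeds.
kubectl exec "$POD" -- curl -s --connect-timeout 10 "http://${SVC_IP}"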
