
[Question] Workloads on a restarted edge node do not start when there is no connection to the apiserver #351

Closed
lclxg opened this issue Jun 11, 2021 · 9 comments

@lclxg

lclxg commented Jun 11, 2021

What happened: I shut down the master node and restarted the edge node, but the workloads are not in Running status.

What you expected to happen: the workloads return to Running status after the edge node restarts.

How to reproduce it (as minimally and precisely as possible): shut down the master node and restart the edge node.

Anything else we need to know?:

Environment:

  • OpenYurt version: v0.4
  • Kubernetes version (use kubectl version): 1.20
  • OS (e.g: cat /etc/os-release): CentOS 8.3
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

others
/kind question

@lclxg added the kind/question label on Jun 11, 2021
@rambohe-ch
Member

@lclxg Thank you for filing this issue.
We may need more information to find the root cause of this question.

  1. Are none of the pods running, or are some pods (for example, yurthub) running?
  2. Would you like to upload detailed kubelet logs so that we can locate why the pods cannot be started? (A sketch of how to collect them follows below.)
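
For example, the kubelet logs can usually be collected on the edge node like this (a sketch, assuming kubelet runs as a systemd unit named "kubelet"):

    # Sketch: dump recent kubelet logs on a systemd-managed edge node
    journalctl -u kubelet --no-pager --since "1 hour ago" > kubelet.log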

@lclxg
Author

lclxg commented Jun 12, 2021

yurthub is running

@lclxg
Author

lclxg commented Jun 12, 2021

The kubelet logs are as follows:
6月 12 14:06:33 m-k8s5 kubelet[1870]: I0612 14:06:33.130471 1870 plugin_manager.go:112] The desired_state_of_world populator (plugin watcher) starts
6月 12 14:06:33 m-k8s5 kubelet[1870]: I0612 14:06:33.130487 1870 plugin_manager.go:114] Starting Kubelet Plugin Manager
6月 12 14:06:33 m-k8s5 kubelet[1870]: E0612 14:06:33.130495 1870 eviction_manager.go:260] eviction manager: failed to get summary stats: failed to get node info: node "m-k8s5" not found
6月 12 14:06:33 m-k8s5 kubelet[1870]: W0612 14:06:33.130916 1870 docker_sandbox.go:402] failed to read pod IP from plugin/docker: networkPlugin cni failed on the status hook for pod "mytomcat-57c6db8849-c6zhp_lxg": CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "c52fdd43b7108a348b7523bfe768e81c846b042573f75c63601b70bb45bb4fcc"
6月 12 14:06:33 m-k8s5 kubelet[1870]: W0612 14:06:33.132710 1870 docker_sandbox.go:402] failed to read pod IP from plugin/docker: networkPlugin cni failed on the status hook for pod "mytomcat-57c6db8849-c6zhp_lxg": CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "1117eebb306c2842ad972b9d9e27638da63796c8a01d2206e74cd3ccb17eb395"
6月 12 14:06:33 m-k8s5 kubelet[1870]: W0612 14:06:33.139650 1870 docker_sandbox.go:402] failed to read pod IP from plugin/docker: networkPlugin cni failed on the status hook for pod "ud-test-hangzhou-6vpgr-5f6c9f69b8-9k54p_default": CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "e7fe990a598f15433ff11d19d3663ba759acc6ceae5b99793a55f5e6bb7f3b07"
6月 12 14:06:33 m-k8s5 kubelet[1870]: W0612 14:06:33.144441 1870 docker_sandbox.go:402] failed to read pod IP from plugin/docker: networkPlugin cni failed on the status hook for pod "lxgtest-7d5bff8bfc-xn4df_lxgtest": CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "5736760e92a0a90d76e2d14a8427991a93a8d1e63be25f06fbff706bb7c2c7bd"
6月 12 14:06:33 m-k8s5 kubelet[1870]: W0612 14:06:33.145981 1870 docker_sandbox.go:402] failed to read pod IP from plugin/docker: networkPlugin cni failed on the status hook for pod "lxgtest-7d5bff8bfc-xn4df_lxgtest": CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "f7e30e4b6b5a5f1d3034292ca9f8710d17c19d54606c9191a33df4bcc07f17a7"
6月 12 14:06:33 m-k8s5 kubelet[1870]: E0612 14:06:33.152479 1870 cni.go:387] Error deleting lxg_lxgtest1-7d5bff8bfc-nknnn/325f8719c172ea2cbfeb884f89e2bba5306a93284f69c9aa2d8c55584d533262 from network calico/cni0: error getting ClusterInformation: Get "https://[10.233.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 10.233.0.1:443: connect: connection refused
6月 12 14:06:33 m-k8s5 kubelet[1870]: W0612 14:06:33.152664 1870 docker_sandbox.go:402] failed to read pod IP from plugin/docker: networkPlugin cni failed on the status hook for pod "ud-test-hangzhou-6vpgr-5f6c9f69b8-6vpxt_default": CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "d50d50230013a714ac80fbf78cd391db07fc31caf6ef134d49f114c385e0c8b7"
6月 12 14:06:33 m-k8s5 kubelet[1870]: E0612 14:06:33.153193 1870 remote_runtime.go:143] StopPodSandbox "325f8719c172ea2cbfeb884f89e2bba5306a93284f69c9aa2d8c55584d533262" from runtime service failed: rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod "lxgtest1-7d5bff8bfc-nknnn_lxg" network: error getting ClusterInformation: Get "https://[10.233.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 10.233.0.1:443: connect: connection refused
6月 12 14:06:33 m-k8s5 kubelet[1870]: E0612 14:06:33.153248 1870 kuberuntime_gc.go:165] Failed to remove sandbox "325f8719c172ea2cbfeb884f89e2bba5306a93284f69c9aa2d8c55584d533262": rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod "lxgtest1-7d5bff8bfc-nknnn_lxg" network: error getting ClusterInformation: Get "https://[10.233.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 10.233.0.1:443: connect: connection refused
6月 12 14:06:33 m-k8s5 kubelet[1870]: W0612 14:06:33.154869 1870 cni.go:333] CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "b270f530c783cbad47b0c40b69371347a8003091857466c4e72d57dbba94195b"
6月 12 14:06:33 m-k8s5 kubelet[1870]: W0612 14:06:33.159667 1870 docker_sandbox.go:402] failed to read pod IP from plugin/docker: networkPlugin cni failed on the status hook for pod "ud-test-hangzhou-6vpgr-5f6c9f69b8-2n2l7_default": CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "958a6325fd38489c88b7584419650c334ea01c56bf3aa58f2b100b8227531dda"
6月 12 14:06:33 m-k8s5 kubelet[1870]: E0612 14:06:33.165561 1870 kubelet.go:2240] node "m-k8s5" not found
6月 12 14:06:33 m-k8s5 kubelet[1870]: E0612 14:06:33.166014 1870 controller.go:144] failed to ensure lease exists, will retry in 400ms, error: Get "http://127.0.0.1:10261/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/m-k8s5?timeout=10s": dial tcp 127.0.0.1:10261: connect: connection refused
6月 12 14:06:33 m-k8s5 kubelet[1870]: W0612 14:06:33.179570 1870 docker_sandbox.go:402] failed to read pod IP from plugin/docker: networkPlugin cni failed on the status hook for pod "ud-test-hangzhou-6vpgr-5f6c9f69b8-h8bkf_default": CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "42ac3cefdcbfc0a677e12c29fcf659c77a133c663cad95911334640a13fe0081"
6月 12 14:06:33 m-k8s5 kubelet[1870]: W0612 14:06:33.186253 1870 docker_sandbox.go:402] failed to read pod IP from plugin/docker: networkPlugin cni failed on the status hook for pod "ud-test-hangzhou-6vpgr-5f6c9f69b8-jc4bn_default": CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "1bfacdd7e439aaa4b9a3279979829f93560f1e9582df8badf503832544325c91"
6月 12 14:06:33 m-k8s5 kubelet[1870]: W0612 14:06:33.200927 1870 docker_sandbox.go:402] failed to read pod IP from plugin/docker: networkPlugin cni failed on the status hook for pod "ud-test-hangzhou-6vpgr-5f6c9f69b8-tng8q_default": CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "28d79cc5155ab23fe628814e55a12493f1236b57e760d2103a0e2a3033ea6ad7"
6月 12 14:06:33 m-k8s5 kubelet[1870]: E0612 14:06:33.201591 1870 cni.go:387] Error deleting default_ud-test-hangzhou-6vpgr-5f6c9f69b8-2zfb8/b270f530c783cbad47b0c40b69371347a8003091857466c4e72d57dbba94195b from network calico/cni0: error getting ClusterInformation: Get "https://[10.233.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 10.233.0.1:443: connect: connection refused
6月 12 14:06:33 m-k8s5 kubelet[1870]: E0612 14:06:33.204952 1870 remote_runtime.go:143] StopPodSandbox "b270f530c783cbad47b0c40b69371347a8003091857466c4e72d57dbba94195b" from runtime service failed: rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod "ud-test-hangzhou-6vpgr-5f6c9f69b8-2zfb8_default" network: error getting ClusterInformation: Get "https://[10.233.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 10.233.0.1:443: connect: connection refused
6月 12 14:06:33 m-k8s5 kubelet[1870]: E0612 14:06:33.204975 1870 kuberuntime_gc.go:165] Failed to remove sandbox "b270f530c783cbad47b0c40b69371347a8003091857466c4e72d57dbba94195b": rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod "ud-test-hangzhou-6vpgr-5f6c9f69b8-2zfb8_default" network: error getting ClusterInformation: Get "https://[10.233.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 10.233.0.1:443: connect: connection refused
6月 12 14:06:33 m-k8s5 kubelet[1870]: W0612 14:06:33.206375 1870 cni.go:333] CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "f7e30e4b6b5a5f1d3034292ca9f8710d17c19d54606c9191a33df4bcc07f17a7"
6月 12 14:06:33 m-k8s5 kubelet[1870]: W0612 14:06:33.208127 1870 docker_sandbox.go:402] failed to read pod IP from plugin/docker: networkPlugin cni failed on the status hook for pod "ud-test-hangzhou-6vpgr-5f6c9f69b8-lfdqk_default": CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "c8ec8901fcc793eae90f004ff9cbfb37daf6e0f2f457d5a4d0e8405a78a7bf30"
6月 12 14:06:33 m-k8s5 kubelet[1870]: E0612 14:06:33.240674 1870 cni.go:387] Error deleting lxgtest_lxgtest-7d5bff8bfc-xn4df/f7e30e4b6b5a5f1d3034292ca9f8710d17c19d54606c9191a33df4bcc07f17a7 from network calico/cni0: error getting ClusterInformation: Get "https://[10.233.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 10.233.0.1:443: connect: connection refused

@rambohe-ch
Member

@lclxg Thank you for uploading the detailed logs.

  • reason:
    Pods on the node cannot start because the CNI network plugin (calico) is not ready, and calico cannot run because its connection to 10.233.0.1:443 is refused.

  • solution:
    We need to configure calico so that it accesses kube-apiserver through yurthub. The yurthub proxy address is http://127.0.0.1:10261, so calico can be restarted using yurthub's local cache when the cloud-edge network is disconnected. (An illustrative kubeconfig sketch follows.)
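
For illustration only, a minimal kubeconfig that sends apiserver traffic through the local yurthub proxy could look like the sketch below; how calico actually picks up its apiserver address depends on your calico installation and is not prescribed here:

    # Hypothetical kubeconfig sketch: reach kube-apiserver via the local yurthub proxy
    # (no credentials are set here; this only illustrates the server address).
    apiVersion: v1
    kind: Config
    clusters:
    - name: yurthub
      cluster:
        server: http://127.0.0.1:10261
    contexts:
    - name: default
      context:
        cluster: yurthub
    current-context: default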

@lclxg
Author

lclxg commented Jun 15, 2021

> Thank you for uploading the detailed logs.
>
>   • reason:
>     Pods on the node cannot start because the CNI network plugin (calico) is not ready, and calico cannot run because its connection to 10.233.0.1:443 is refused.
>   • solution:
>     We need to configure calico so that it accesses kube-apiserver through yurthub. The yurthub proxy address is http://127.0.0.1:10261, so calico can be restarted using yurthub's local cache when the cloud-edge network is disconnected.

@rambohe-ch Thank you for your reply.

calico is a DaemonSet. The pod running on the master node does not need to switch to the yurthub proxy, while the pods running on the edge nodes need to change their apiserver address to the yurthub proxy. Does the project support this configuration? I could not find the code.

@rambohe-ch
Member

rambohe-ch commented Jun 17, 2021

> @rambohe-ch Thank you for your reply.
>
> calico is a DaemonSet. The pod running on the master node does not need to switch to the yurthub proxy, while the pods running on the edge nodes need to change their apiserver address to the yurthub proxy. Does the project support this configuration? I could not find the code.

For now we need to configure the calico daemonset in a similar way to the following setting (the snippet below shows the analogous configuration for kube-proxy).

      containers:
      - name: kube-proxy
        command:
        - "/bin/sh"
        - "-c"
        - |
          set -x
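          # When the edge-hub (yurthub) manifest exists on this node, use the local
          # yurthub proxy as the apiserver endpoint; otherwise use the regular kubeconfig.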
          if [ -e /etc/kubernetes/manifests/edge-hub-dp.yaml ]; then
              /usr/local/bin/kube-proxy --master=http://127.0.0.1:10261 --cluster-cidr={{.CIDR}} --hostname-override=$(NODE_NAME) {{if .ProxyMode }} --proxy-mode={{.ProxyMode}} {{end}}
          else
              /usr/local/bin/kube-proxy --kubeconfig=/var/lib/kube-proxy/kubeconfig.conf --cluster-cidr={{.CIDR}} --hostname-override=$(NODE_NAME) {{if .ProxyMode }} --proxy-mode={{.ProxyMode}} {{end}}
          fi
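
A quick way to sanity-check that the proxy address configured above is reachable from the node is to send a standard apiserver request through it (a sketch, assuming the default yurthub http port 10261 shown above):

    # Sketch: ask the apiserver for its version through the local yurthub proxy
    curl -s http://127.0.0.1:10261/version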

We plan to solve this challenge in OpenYurt v0.5.0 without configuring the calico daemonset, by injecting the yurthub endpoint info into pods automatically. If you want to contribute this feature, please let me know.

@lclxg
Author

lclxg commented Jun 17, 2021

@rambohe-ch It's a great plan. I'd be glad to contribute.

@rambohe-ch
Member

> @rambohe-ch It's a great plan. I'd be glad to contribute.

@lclxg Would you like to apply to become a member of the openyurt community? Then we can discuss this feature in an openyurt member group before submitting the feature proposal.

You can open an issue in the openyurtio/community repo. An example issue: openyurtio/community#18

@stale

stale bot commented Dec 30, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label on Dec 30, 2021
stale bot closed this as completed on Jan 6, 2022