ovnkube-node crash loops when trying to restart #4654

Closed
dceara opened this issue Aug 27, 2024 · 8 comments
Labels: feature/user-defined-network-segmentation, kind/bug, lifecycle/stale

Comments

@dceara
Contributor

dceara commented Aug 27, 2024

What happened?

On a freshly started kind cluster (multi-network and network segmentation enabled):

$ ./kind.sh -ds -ic -mne -nse

Delete an ovnkube-node pod:

$ oc delete pod -n ovn-kubernetes $ovnk

The new ovnkube-node pod fails and crash loops because it fails to start the node network controller:

$ oc get pod -n ovn-kubernetes
NAME                                     READY   STATUS             RESTARTS      AGE
ovnkube-control-plane-589c64c694-p4bsw   1/1     Running            0             17h
ovnkube-identity-794d5bb9dd-9m74d        1/1     Running            0             17h
ovnkube-node-8kkdv                       6/6     Running            0             17h
ovnkube-node-nq2m7                       6/6     Running            0             17h
ovnkube-node-qvvcq                       5/6     CrashLoopBackOff   2 (23s ago)   3m28s
ovs-node-6fkft                           1/1     Running            0             17h
ovs-node-h4l5k                           1/1     Running            0             17h
ovs-node-qxqzz                           1/1     Running 
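
To pick the failing pod out of a listing like the one above without reading it by eye, something along these lines may help. It is only a sketch: the function name `crashlooping_pods` is made up for illustration, and it assumes the default table layout of `oc get pod` (NAME in column 1, STATUS in column 3).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ovn-kubernetes): print the names of pods
# whose STATUS column is CrashLoopBackOff.
# Reads `oc get pod` (or `kubectl get pod`) table output on stdin.
crashlooping_pods() {
    # NR > 1 skips the header row. NAME is field 1 and STATUS is field 3;
    # both come before the RESTARTS column, which may contain spaces
    # (e.g. "2 (23s ago)") and would shift any later fields.
    awk 'NR > 1 && $3 == "CrashLoopBackOff" { print $1 }'
}

# Example (needs a running cluster):
#   oc get pod -n ovn-kubernetes | crashlooping_pods
```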

Logs of the ovnkube-node pod (full logs attached):

F0827 09:07:04.311697  375355 ovnkube.go:137] failed to run ovnkube: failed to start node network controller: failed to start default node network controller: unable to add gateway IP route for subnet: 10.96.0.0/16, route manager: failed to add route ({Ifindex: 12 Dst: 10.96.0.0/16 Src: 169.254.0.2 Gw: 169.254.0.4 Flags: [] Table: 0 Realm: 0}): failed to apply route ({Ifindex: 12 Dst: 10.96.0.0/16 Src: 169.254.0.2 Gw: 169.254.0.4 Flags: [] Table: 254 Realm: 0}): failed to add route (gw: 169.254.0.4, subnet 10.96.0.0/16, mtu 1400, src IP 169.254.0.2): file exists

ovnk-logs.txt
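
The trailing "file exists" is the standard error text for EEXIST, which the kernel returns when a route being added collides with one that is already installed. One possible explanation, not confirmed anywhere in this thread, is that a matching route was left behind on the node from before the pod was deleted. As a rough aid for checking that, the sketch below pulls the route parameters out of the fatal log line; the function name `route_from_error` is hypothetical, and the pattern assumes the exact message format shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read ovnkube log lines on stdin and, for each
# "failed to add route (gw: ..., subnet ..., mtu ..., src IP ...): file exists"
# message, print the route as "<subnet> via <gw> src <src> mtu <mtu>".
route_from_error() {
    # \( and \) match literal parentheses in extended regex mode (-E).
    # Lines that do not match are suppressed by -n; matches are printed by /p.
    sed -nE 's/.*failed to add route \(gw: ([^,]+), subnet ([^,]+), mtu ([0-9]+), src IP ([^)]+)\): file exists.*/\2 via \1 src \4 mtu \3/p'
}

# Example (assumes the attached log was saved as ovnk-logs.txt):
#   route_from_error < ovnk-logs.txt
```

With the extracted subnet in hand, running `ip route show 10.96.0.0/16` on the affected node would show whether an entry is already present. On a kind cluster the nodes are containers, so this would presumably be run through the container runtime, for example `podman exec ovn-worker ip route show 10.96.0.0/16`, assuming the node container carries the node's name.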

What did you expect to happen?

The new ovnkube-node pod should come up without issues.

How can we reproduce it (as minimally and precisely as possible)?

Described above.

Anything else we need to know?

No response

@dceara dceara added the kind/bug and feature/user-defined-network-segmentation labels Aug 27, 2024
@tssurya tssurya self-assigned this Aug 27, 2024
@tssurya
Contributor

tssurya commented Aug 27, 2024

Looks like it might be related to recent changes.
We should fix this before getting the ds merge done.

@trozet trozet assigned trozet and unassigned tssurya Aug 27, 2024
@trozet
Contributor

trozet commented Aug 27, 2024

It looks to me like this is fixed by #4652

I deleted ovnk pods multiple times and am not seeing the issue. Feel free to reopen if it happens again.

@trozet trozet closed this as completed Aug 27, 2024
@dceara
Contributor Author

dceara commented Aug 27, 2024

I just tried on master:

# git log
commit 24108b821289b9b7ae410a9dffee8b1fcabbb24a (HEAD -> master, origin/master, origin/HEAD)
Merge: 1179e4d58 9baca6621
Author: Tim Rozet <[email protected]>
Date:   Tue Aug 27 12:04:19 2024 -0400

    Merge pull request #4652 from trozet/serialize_NAD_startup
    
    Serializes Network Manager Start up

And I get the same crash.

I started kind with:

./kind.sh -ds -ic -mne -nse

Then I deleted the ovnkube-node pod corresponding to ovn-worker:

# ovnk=$(oc get pod -n ovn-kubernetes -o wide | grep ovnkube-node | grep 'ovn-worker ' | awk '{print $1}')
# oc delete pod -n ovn-kubernetes $ovnk

ovnkube fails in the same way:

F0827 18:41:40.589014    3240 ovnkube.go:137] failed to run ovnkube: failed to start node network controller: failed to start default node network controller: unable to add gateway IP route for subnet: 10.96.0.0/16, route manager: failed to add route ({Ifindex: 9 Dst: 10.96.0.0/16 Src: 169.254.0.2 Gw: 169.254.0.4 Flags: [] Table: 0 Realm: 0}): failed to apply route ({Ifindex: 9 Dst: 10.96.0.0/16 Src: 169.254.0.2 Gw: 169.254.0.4 Flags: [] Table: 254 Realm: 0}): failed to add route (gw: 169.254.0.4, subnet 10.96.0.0/16, mtu 1400, src IP 169.254.0.2): file exists

I'm not sure it's relevant but I'm using podman on that machine.
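
For anyone retrying the reproduction steps above, the pod lookup can be made a little more defensive. The sketch below is an alternative to the `grep | grep | awk` pipeline, not what was actually run: the function name `ovnkube_node_pod_on` is hypothetical, and it assumes the node name appears as its own whitespace-separated field in `oc get pod -o wide` output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `oc get pod -o wide` output on stdin and print
# the ovnkube-node pod scheduled on the node named in $1.
ovnkube_node_pod_on() {
    awk -v node="$1" '
        # Only consider ovnkube-node pods (skips ovs-node, control plane, ...).
        $1 ~ /^ovnkube-node-/ {
            # Scan every field for an exact node-name match rather than
            # relying on a fixed column: RESTARTS values like "2 (23s ago)"
            # contain spaces and shift the columns that follow.
            for (i = 2; i <= NF; i++) {
                if ($i == node) { print $1; break }
            }
        }'
}

# Example (needs a running cluster):
#   ovnk=$(oc get pod -n ovn-kubernetes -o wide | ovnkube_node_pod_on ovn-worker)
#   oc delete pod -n ovn-kubernetes "$ovnk"
```

The exact field comparison keeps `ovn-worker` from also matching `ovn-worker2`, which is what the trailing space in the original `grep 'ovn-worker '` was guarding against.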

@dceara dceara reopened this Aug 27, 2024
@martinkennelly
Contributor

I couldn't replicate the failure on main. Using docker.

@dceara
Contributor Author

dceara commented Aug 29, 2024

> I couldn't replicate the failure on main. Using docker.

I couldn't replicate the failure on main with docker either. Originally I was using podman, will try again.

@dceara
Contributor Author

dceara commented Aug 29, 2024

@martinkennelly I moved back to using podman (just removed docker and installed podman and podman-docker) and now I get the same crash loop when deleting the ovnkube-node pod.


This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the lifecycle/stale label Oct 31, 2024

github-actions bot commented Nov 9, 2024

This issue was closed because it has been stalled for 5 days with no activity.

@github-actions github-actions bot closed this as not planned Nov 9, 2024

4 participants