
[BUG] libovsdb keeps reconnecting to NorthDB every 4 seconds #4465

Open
cmdy opened this issue Sep 2, 2024 · 4 comments
Labels
performance Anything that can make Kube-OVN faster

Comments


cmdy commented Sep 2, 2024

Kube-OVN Version

v1.12.21

Kubernetes Version

v1.28.11

Operation-system/Kernel Version

"CentOS Linux 7 (Core)"
6.6.35-2406.el7.bzl.x86_64

Description

libovsdb keeps reconnecting to OVN_Northbound every 4 seconds, which causes "not connected" errors.

I0902 16:57:06.030515       7 client.go:333] "trying to connect" logger="libovsdb" db="nbdb" database="OVN_Northbound" endpoint="tcp:[10.106.32.60]:6641"
I0902 16:57:06.052681       7 client.go:266] "successfully connected" logger="libovsdb" db="nbdb" database="OVN_Northbound" endpoint="tcp:[10.106.32.60]:6641" sid="d8dbc307-0bdf-45c8-b8c0-2433e829e4d6"
I0902 16:57:06.052732       7 client.go:288] "reconnected - restarting monitors" logger="libovsdb" db="nbdb" database="OVN_Northbound"
I0902 16:57:10.053805       7 client.go:333] "trying to connect" logger="libovsdb" db="nbdb" database="OVN_Northbound" endpoint="tcp:[10.106.32.60]:6641"
I0902 16:57:10.080624       7 client.go:266] "successfully connected" logger="libovsdb" db="nbdb" database="OVN_Northbound" endpoint="tcp:[10.106.32.60]:6641" sid="d8dbc307-0bdf-45c8-b8c0-2433e829e4d6"
I0902 16:57:10.080645       7 client.go:288] "reconnected - restarting monitors" logger="libovsdb" db="nbdb" database="OVN_Northbound"
I0902 16:57:14.081880       7 client.go:333] "trying to connect" logger="libovsdb" db="nbdb" database="OVN_Northbound" endpoint="tcp:[10.106.32.60]:6641"
I0902 16:57:14.109087       7 client.go:266] "successfully connected" logger="libovsdb" db="nbdb" database="OVN_Northbound" endpoint="tcp:[10.106.32.60]:6641" sid="d8dbc307-0bdf-45c8-b8c0-2433e829e4d6"
I0902 16:57:14.109110       7 client.go:288] "reconnected - restarting monitors" logger="libovsdb" db="nbdb" database="OVN_Northbound"
I0902 16:57:18.130676       7 client.go:333] "trying to connect" logger="libovsdb" db="nbdb" database="OVN_Northbound" endpoint="tcp:[10.106.32.60]:6641"
I0902 16:57:18.164363       7 client.go:266] "successfully connected" logger="libovsdb" db="nbdb" database="OVN_Northbound" endpoint="tcp:[10.106.32.60]:6641" sid="d8dbc307-0bdf-45c8-b8c0-2433e829e4d6"
I0902 16:57:18.164404       7 client.go:288] "reconnected - restarting monitors" logger="libovsdb" db="nbdb" database="OVN_Northbound"
I0902 16:57:22.166024       7 client.go:333] "trying to connect" logger="libovsdb" db="nbdb" database="OVN_Northbound" endpoint="tcp:[10.106.32.60]:6641"
I0902 16:57:22.182297       7 client.go:266] "successfully connected" logger="libovsdb" db="nbdb" database="OVN_Northbound" endpoint="tcp:[10.106.32.60]:6641" sid="d8dbc307-0bdf-45c8-b8c0-2433e829e4d6"
I0902 16:57:22.182323       7 client.go:288] "reconnected - restarting monitors" logger="libovsdb" db="nbdb" database="OVN_Northbound"
I0902 16:57:26.182808       7 client.go:333] "trying to connect" logger="libovsdb" db="nbdb" database="OVN_Northbound" endpoint="tcp:[10.106.32.60]:6641"
I0902 16:57:26.202551       7 client.go:266] "successfully connected" logger="libovsdb" db="nbdb" database="OVN_Northbound" endpoint="tcp:[10.106.32.60]:6641" sid="d8dbc307-0bdf-45c8-b8c0-2433e829e4d6"
I0902 16:57:26.202568       7 client.go:288] "reconnected - restarting monitors" logger="libovsdb" db="nbdb" database="OVN_Northbound"
I0902 16:57:30.203127       7 client.go:333] "trying to connect" logger="libovsdb" db="nbdb" database="OVN_Northbound" endpoint="tcp:[10.106.32.60]:6641"
I0902 16:57:30.242040       7 client.go:266] "successfully connected" logger="libovsdb" db="nbdb" database="OVN_Northbound" endpoint="tcp:[10.106.32.60]:6641" sid="d8dbc307-0bdf-45c8-b8c0-2433e829e4d6"
I0902 16:57:30.242066       7 client.go:288] "reconnected - restarting monitors" logger="libovsdb" db="nbdb" database="OVN_Northbound"
I0902 16:57:34.243638       7 client.go:333] "trying to connect" logger="libovsdb" db="nbdb" database="OVN_Northbound" endpoint="tcp:[10.106.32.60]:6641"
I0902 16:57:34.258753       7 client.go:266] "successfully connected" logger="libovsdb" db="nbdb" database="OVN_Northbound" endpoint="tcp:[10.106.32.60]:6641" sid="d8dbc307-0bdf-45c8-b8c0-2433e829e4d6"
I0902 16:57:34.258785       7 client.go:288] "reconnected - restarting monitors" logger="libovsdb" db="nbdb" database="OVN_Northbound"
I0902 16:57:38.267088       7 client.go:333] "trying to connect" logger="libovsdb" db="nbdb" database="OVN_Northbound" endpoint="tcp:[10.106.32.60]:6641"
I0902 16:57:38.446440       7 client.go:266] "successfully connected" logger="libovsdb" db="nbdb" database="OVN_Northbound" endpoint="tcp:[10.106.32.60]:6641" sid="d8dbc307-0bdf-45c8-b8c0-2433e829e4d6"
I0902 16:57:38.446483       7 client.go:288] "reconnected - restarting monitors" logger="libovsdb" db="nbdb" database="OVN_Northbound"
I0902 16:57:42.447479       7 client.go:333] "trying to connect" logger="libovsdb" db="nbdb" database="OVN_Northbound" endpoint="tcp:[10.106.32.60]:6641"
I0902 16:57:43.150550       7 client.go:266] "successfully connected" logger="libovsdb" db="nbdb" database="OVN_Northbound" endpoint="tcp:[10.106.32.60]:6641" sid="d8dbc307-0bdf-45c8-b8c0-2433e829e4d6"
I0902 16:57:43.150576       7 client.go:288] "reconnected - restarting monitors" logger="libovsdb" db="nbdb" database="OVN_Northbound"
I0902 16:57:47.159898       7 client.go:333] "trying to connect" logger="libovsdb" db="nbdb" database="OVN_Northbound" endpoint="tcp:[10.106.32.60]:6641"
I0902 16:57:47.177111       7 client.go:266] "successfully connected" logger="libovsdb" db="nbdb" database="OVN_Northbound" endpoint="tcp:[10.106.32.60]:6641" sid="d8dbc307-0bdf-45c8-b8c0-2433e829e4d6"
I0902 16:57:47.177131       7 client.go:288] "reconnected - restarting monitors" logger="libovsdb" db="nbdb" database="OVN_Northbound"
I0902 16:57:51.177921       7 client.go:333] "trying to connect" logger="libovsdb" db="nbdb" database="OVN_Northbound" endpoint="tcp:[10.106.32.60]:6641"
I0902 16:57:51.195456       7 client.go:266] "successfully connected" logger="libovsdb" db="nbdb" database="OVN_Northbound" endpoint="tcp:[10.106.32.60]:6641" sid="d8dbc307-0bdf-45c8-b8c0-2433e829e4d6"
I0902 16:57:51.195480       7 client.go:288] "reconnected - restarting monitors" logger="libovsdb" db="nbdb" database="OVN_Northbound"
I0902 16:57:55.246549       7 client.go:333] "trying to connect" logger="libovsdb" db="nbdb" database="OVN_Northbound" endpoint="tcp:[10.106.32.60]:6641"
I0902 16:57:55.330913       7 client.go:266] "successfully connected" logger="libovsdb" db="nbdb" database="OVN_Northbound" endpoint="tcp:[10.106.32.60]:6641" sid="d8dbc307-0bdf-45c8-b8c0-2433e829e4d6"
I0902 16:57:55.330941       7 client.go:288] "reconnected - restarting monitors" logger="libovsdb" db="nbdb" database="OVN_Northbound"
I0902 16:57:59.331337       7 client.go:333] "trying to connect" logger="libovsdb" db="nbdb" database="OVN_Northbound" endpoint="tcp:[10.106.32.60]:6641"
I0902 16:57:59.355130       7 client.go:266] "successfully connected" logger="libovsdb" db="nbdb" database="OVN_Northbound" endpoint="tcp:[10.106.32.60]:6641" sid="d8dbc307-0bdf-45c8-b8c0-2433e829e4d6"
I0902 16:57:59.355150       7 client.go:288] "reconnected - restarting monitors" logger="libovsdb" db="nbdb" database="OVN_Northbound"
I0902 16:58:03.356031       7 client.go:333] "trying to connect" logger="libovsdb" db="nbdb" database="OVN_Northbound" endpoint="tcp:[10.106.32.60]:6641"
I0902 16:58:03.381473       7 client.go:266] "successfully connected" logger="libovsdb" db="nbdb" database="OVN_Northbound" endpoint="tcp:[10.106.32.60]:6641" sid="d8dbc307-0bdf-45c8-b8c0-2433e829e4d6"
I0902 16:58:03.381495       7 client.go:288] "reconnected - restarting monitors" logger="libovsdb" db="nbdb" database="OVN_Northbound"
I0902 16:58:07.430566       7 client.go:333] "trying to connect" logger="libovsdb" db="nbdb" database="OVN_Northbound" endpoint="tcp:[10.106.32.60]:6641"
I0902 16:58:07.452988       7 client.go:266] "successfully connected" logger="libovsdb" db="nbdb" database="OVN_Northbound" endpoint="tcp:[10.106.32.60]:6641" sid="d8dbc307-0bdf-45c8-b8c0-2433e829e4d6"
I0902 16:58:07.453051       7 client.go:288] "reconnected - restarting monitors" logger="libovsdb" db="nbdb" database="OVN_Northbound"
I0902 16:58:11.453964       7 client.go:333] "trying to connect" logger="libovsdb" db="nbdb" database="OVN_Northbound" endpoint="tcp:[10.106.32.60]:6641"
I0902 16:58:11.483360       7 client.go:266] "successfully connected" logger="libovsdb" db="nbdb" database="OVN_Northbound" endpoint="tcp:[10.106.32.60]:6641" sid="d8dbc307-0bdf-45c8-b8c0-2433e829e4d6"
I0902 16:58:11.483381       7 client.go:288] "reconnected - restarting monitors" logger="libovsdb" db="nbdb" database="OVN_Northbound"
I0902 17:01:18.479974       7 pod.go:947] handle delete pod test-ns/subnet-fake-18-7fddbfdcdb-hltnj
E0902 17:01:18.487229       7 ovn-nb-logical_router_policy.go:96] get policy priority 29250 match ip4.src == 6.104.121.168 in logical router ovn-cluster: not connected
E0902 17:01:18.487258       7 pod.go:1036] failed to delete static route, get policy priority 29250 match ip4.src == 6.104.121.168 in logical router ovn-cluster: not connected
E0902 17:01:18.487294       7 pod.go:467] error syncing 'test-ns/subnet-fake-15-77ff47ddc5-ntgf5': get policy priority 29250 match ip4.src == 6.104.121.168 in logical router ovn-cluster: not connected, requeuing
I0902 17:01:18.487306       7 pod.go:947] handle delete pod test-ns/subnet-fake-18-7fddbfdcdb-6ttrk
I0902 17:01:18.487346       7 pod.go:947] handle delete pod test-ns/subnet-fake-15-77ff47ddc5-j4p2n
E0902 17:01:18.489319       7 ovn-nb-logical_router_policy.go:96] get policy priority 29250 match ip4.src == 6.104.122.148 in logical router ovn-cluster: not connected
E0902 17:01:18.489338       7 pod.go:1036] failed to delete static route, get policy priority 29250 match ip4.src == 6.104.122.148 in logical router ovn-cluster: not connected
E0902 17:01:18.489359       7 pod.go:467] error syncing 'test-ns/subnet-fake-15-77ff47ddc5-wl8tb': get policy priority 29250 match ip4.src == 6.104.122.148 in logical router ovn-cluster: not connected, requeuing
I0902 17:01:18.489369       7 pod.go:947] handle delete pod test-ns/subnet-fake-18-7fddbfdcdb-kfjcv
E0902 17:01:18.489624       7 ovn-nb-logical_router_policy.go:96] get policy priority 29250 match ip4.src == 6.104.123.60 in logical router ovn-cluster: not connected
E0902 17:01:18.489650       7 pod.go:1036] failed to delete static route, get policy priority 29250 match ip4.src == 6.104.123.60 in logical router ovn-cluster: not connected
E0902 17:01:18.489674       7 pod.go:467] error syncing 'test-ns/subnet-fake-15-77ff47ddc5-t95pf': get policy priority 29250 match ip4.src == 6.104.123.60 in logical router ovn-cluster: not connected, requeuing
I0902 17:01:18.490520       7 pod.go:947] handle delete pod test-ns/subnet-fake-18-7fddbfdcdb-tkw56
E0902 17:01:18.496668       7 ovn-nb-logical_router_policy.go:96] get policy priority 29250 match ip4.src == 6.104.122.131 in logical router ovn-cluster: not connected
E0902 17:01:18.496692       7 pod.go:1036] failed to delete static route, get policy priority 29250 match ip4.src == 6.104.122.131 in logical router ovn-cluster: not connected
E0902 17:01:18.496720       7 pod.go:467] error syncing 'test-ns/subnet-fake-15-77ff47ddc5-zvjh2': get policy priority 29250 match ip4.src == 6.104.122.131 in logical router ovn-cluster: not connected, requeuing
I0902 17:01:18.496731       7 pod.go:947] handle delete pod test-ns/subnet-fake-18-7fddbfdcdb-8wsvj
E0902 17:01:18.497290       7 ovn-nb-logical_router_policy.go:96] get policy priority 29250 match ip4.src == 6.104.123.78 in logical router ovn-cluster: not connected
E0902 17:01:18.497309       7 pod.go:1036] failed to delete static route, get policy priority 29250 match ip4.src == 6.104.123.78 in logical router ovn-cluster: not connected
E0902 17:01:18.497333       7 pod.go:467] error syncing 'test-ns/subnet-fake-15-77ff47ddc5-gckfn': get policy priority 29250 match ip4.src == 6.104.123.78 in logical router ovn-cluster: not connected, requeuing
E0902 17:01:18.497584       7 ovn-nb-logical_router_policy.go:96] get policy priority 29250 match ip4.src == 6.104.146.46 in logical router ovn-cluster: not connected
E0902 17:01:18.497609       7 pod.go:1036] failed to delete static route, get policy priority 29250 match ip4.src == 6.104.146.46 in logical router ovn-cluster: not connected
E0902 17:01:18.497637       7 pod.go:467] error syncing 'test-ns/subnet-fake-18-7fddbfdcdb-wknxh': get policy priority 29250 match ip4.src == 6.104.146.46 in logical router ovn-cluster: not connected, requeuing
I0902 17:01:18.497774       7 pod.go:947] handle delete pod test-ns/subnet-fake-18-7fddbfdcdb-8v9j7
E0902 17:01:18.497957       7 ovn-nb-logical_router_policy.go:96] get policy priority 29250 match ip4.src == 6.104.125.150 in logical router ovn-cluster: not connected
E0902 17:01:18.497971       7 pod.go:1036] failed to delete static route, get policy priority 29250 match ip4.src == 6.104.125.150 in logical router ovn-cluster: not connected
E0902 17:01:18.497991       7 pod.go:467] error syncing 'test-ns/subnet-fake-15-77ff47ddc5-74gcb': get policy priority 29250 match ip4.src == 6.104.125.150 in logical router ovn-cluster: not connected, requeuing
I0902 17:01:18.497992       7 pod.go:947] handle delete pod test-ns/subnet-fake-18-7fddbfdcdb-9xxgs
E0902 17:01:18.498319       7 ovn-nb-logical_router_policy.go:96] get policy priority 29250 match ip4.src == 6.104.120.108 in logical router ovn-cluster: not connected
E0902 17:01:18.498335       7 pod.go:1036] failed to delete static route, get policy priority 29250 match ip4.src == 6.104.120.108 in logical router ovn-cluster: not connected
E0902 17:01:18.498357       7 pod.go:467] error syncing 'test-ns/subnet-fake-15-77ff47ddc5-qsqv7': get policy priority 29250 match ip4.src == 6.104.120.108 in logical router ovn-cluster: not connected, requeuing

The database cluster status is normal:

d8db
Name: OVN_Northbound
Cluster ID: fed4 (fed4cd9e-b9aa-403a-841a-b25a3a90b7ab)
Server ID: d8db (d8dbc307-0bdf-45c8-b8c0-2433e829e4d6)
Address: tcp:[10.106.32.60]:6643
Status: cluster member
Role: leader
Term: 178
Leader: self
Vote: self

Last Election started 3278210 ms ago, reason: leadership_transfer
Last Election won: 3278209 ms ago
Election timer: 5000
Log: [244234, 245204]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: ->88d8 ->c318 <-88d8 <-c318
Disconnections: 0
Servers:
    88d8 (88d8 at tcp:[10.106.32.59]:6643) next_index=245204 match_index=245203 last msg 2350 ms ago
    d8db (d8db at tcp:[10.106.32.60]:6643) (self) next_index=245203 match_index=245203
    c318 (c318 at tcp:[10.106.32.61]:6643) next_index=245204 match_index=245203 last msg 2350 ms ago
status: ok
e316
Name: OVN_Southbound
Cluster ID: c07f (c07f7308-d993-489e-b236-2645c348e8c6)
Server ID: e316 (e316219f-bf5e-4016-892c-4cc63c2be185)
Address: tcp:[10.106.32.61]:6644
Status: cluster member
Role: leader
Term: 25612
Leader: self
Vote: self

Last Election started 20159889 ms ago, reason: timeout
Last Election won: 20159889 ms ago
Election timer: 5000
Log: [37999, 40031]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: ->e633 ->b2c6 <-b2c6 <-e633
Disconnections: 4
Servers:
    e633 (e633 at tcp:[10.106.32.60]:6644) next_index=40031 match_index=40030 last msg 1522 ms ago
    b2c6 (b2c6 at tcp:[10.106.32.59]:6644) next_index=40031 match_index=40030 last msg 1522 ms ago
    e316 (e316 at tcp:[10.106.32.61]:6644) (self) next_index=38004 match_index=40030
status: ok

Steps To Reproduce

Delete a large number of pods (about 80,000 pods).

Current Behavior

The controller is unable to connect to OVN_Northbound; deleting logical router policies/routes fails with "not connected".

Expected Behavior

The controller can connect to OVN_Northbound normally.

@cmdy cmdy added the bug Something isn't working label Sep 2, 2024
@zhangzujian zhangzujian added performance Anything that can make Kube-OVN faster and removed bug Something isn't working labels Sep 3, 2024
zhangzujian (Member) commented:

Could you please upgrade kube-ovn to the latest v1.12.22 and then set the kube-ovn-controller image tag to v1.12.23-x86? You also need to add two arguments to kube-ovn-controller after setting the image tag:

args:
  - /kube-ovn/start-controller.sh
  - --ovsdb-con-timeout=30         # seconds to wait for the ovsdb connection before timing out
  - --ovsdb-inactivity-timeout=180 # seconds to wait before the ovsdb inactivity check times out
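
For reference, a minimal sketch of where these arguments could be added in the kube-ovn-controller Deployment. The deployment name, namespace, and image repository below are assumed defaults of a standard install; only the image tag and the two flags come from the suggestion above:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-ovn-controller   # assumed default name
  namespace: kube-system      # assumed default namespace
spec:
  template:
    spec:
      containers:
        - name: kube-ovn-controller
          image: kubeovn/kube-ovn:v1.12.23-x86   # tag from the suggestion above
          args:
            - /kube-ovn/start-controller.sh
            - --ovsdb-con-timeout=30
            - --ovsdb-inactivity-timeout=180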

cmdy (Author) commented Sep 6, 2024

Could you please upgrade kube-ovn to the latest v1.12.22 and then set the kube-ovn-controller image tag to v1.12.23-x86? You also need to add two arguments to kube-ovn-controller after setting the image tag:

args:
  - /kube-ovn/start-controller.sh
  - --ovsdb-con-timeout=30         # seconds to wait for the ovsdb connection before timing out
  - --ovsdb-inactivity-timeout=180 # seconds to wait before the ovsdb inactivity check times out

Yes, I tested it and found that the reason for this problem is that the monitor timeout is too short during reconnection.

const timeout = 3 * time.Second // hard-coded; monitor setup must finish within this window or the client reconnects
...
	options := []client.Option{
		client.WithReconnect(timeout, &backoff.ConstantBackOff{Interval: time.Second}),
		client.WithLeaderOnly(true),
		client.WithLogger(&logger),
	}

The timeout here is hard-coded to 3 seconds.
After I increased the timeout, the problem was resolved. I will update to the latest version for further testing later.
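
For illustration only, here is a minimal, self-contained sketch (not the actual Kube-OVN code) of how the hard-coded value could instead be driven by a flag, in the spirit of the --ovsdb-con-timeout option mentioned above. The flag handling, the endpoint, and the empty client model are placeholders:

package main

import (
	"context"
	"flag"
	"log"
	"time"

	"github.com/cenkalti/backoff/v4"
	"github.com/ovn-org/libovsdb/client"
	"github.com/ovn-org/libovsdb/model"
)

func main() {
	// Hypothetical flag mirroring --ovsdb-con-timeout; Kube-OVN's real flag
	// handling lives elsewhere.
	conTimeout := flag.Int("ovsdb-con-timeout", 30, "seconds to wait for ovsdb connection/monitor setup")
	flag.Parse()
	timeout := time.Duration(*conTimeout) * time.Second

	// Empty client model for brevity; a real client registers the
	// OVN_Northbound tables it monitors.
	dbModel, err := model.NewClientDBModel("OVN_Northbound", map[string]model.Model{})
	if err != nil {
		log.Fatal(err)
	}

	options := []client.Option{
		client.WithEndpoint("tcp:10.106.32.60:6641"),
		// Configurable timeout instead of the hard-coded 3s, so monitor
		// setup on a large database can finish before the client gives up
		// and reconnects.
		client.WithReconnect(timeout, &backoff.ConstantBackOff{Interval: time.Second}),
		client.WithLeaderOnly(true),
	}

	nbClient, err := client.NewOVSDBClient(dbModel, options...)
	if err != nil {
		log.Fatal(err)
	}
	if err := nbClient.Connect(context.Background()); err != nil {
		log.Fatal(err)
	}
	defer nbClient.Close()
}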

zhangzujian (Member) commented:

The timeout here is hard-coded to 3 seconds.
After I increased the timeout, the problem was resolved.

@cmdy What's your final timeout setting?

cmdy (Author) commented Sep 6, 2024

The timeout here is hard-coded to 3 seconds.
After I increased the timeout, the problem was resolved.

@cmdy What's your final timeout setting?

I set it to 12 seconds with about 100,000 (10w) LogicalRouterPolicies, and the program can reconnect normally and continue working, but I think 30s is a reasonable value: our ultimate goal is 3k nodes, and in that scenario 100 subnets will produce 300,000+ (30w+) LogicalRouterPolicies. However, I have not tested that scenario yet, because other issues make adding nodes too slow (I will open a new issue for that).
By the way, this problem occurs 100% of the time when ovn-central has a leader change.
