Server is not restarted after restarting kubelet #79

Open
wengyao04 opened this issue Aug 10, 2023 · 5 comments
Labels
bug Something isn't working

Comments

@wengyao04

wengyao04 commented Aug 10, 2023

Hi, we deployed k8s-rdma-shared-dev-plugin (artprod.dev.bloomberg.com/ds/yweng14/nvidia/cloud-native/k8s-rdma-shared-dev-plugin:v1.3.2) on our clusters and found that the socket (ib.sock) is not recreated after kubelet is restarted. We added some debug messages to the Restart() function and found that it blocks at rs.stop <- true

// Restart restarts the plugin server
func (rs *resourceServer) Restart() error {
	log.Printf("restarting %s device plugin server...", rs.resourceName)
	// Debug output; "else if" avoids calling GetServer() on a nil connector.
	if rs.rsConnector == nil {
		fmt.Println("HPC Test line 225 rs.rsConnector is nil")
	} else if rs.rsConnector.GetServer() == nil {
		fmt.Println("HPC test line 228 rs.rsConnector.GetServer() is nil")
	}

	if rs.rsConnector == nil || rs.rsConnector.GetServer() == nil {
		return fmt.Errorf("grpc server instance not found for %s", rs.resourceName)
	}

	fmt.Println("HPC test line 235 start stop connector and delete server")

	rs.rsConnector.Stop()
	fmt.Println("HPC test line 238 succeeds to stop rsConnector")
	rs.rsConnector.DeleteServer()
	fmt.Println("HPC test line 240 succeeds to delete server")

	// Send terminate signal to ListAndWatch().
	// This is the send that never returns in our test.
	rs.stop <- true

	fmt.Println("HPC test line 245 start resource server")

	return rs.Start()
}

In our logs we also see:

2023/08/10 19:52:47 ListAndWatch stream close: context canceled

I think the block is caused by this code:
https://github.com/Mellanox/k8s-rdma-shared-dev-plugin/blob/master/pkg/resources/server.go#L294-L299
When kubelet is restarted, the stream context is canceled. ListAndWatch then prints the message above and returns nil, so nothing is left to receive from the stop channel and the send in Restart() blocks.

After we removed the context check (lines 294-299), the restart is no longer blocked and ib.sock is recreated. We also checked how the NVIDIA GPU device plugin implements ListAndWatch; it does not check context.Done().

I think issue #74 may also be related to this one.

@adrianchiris adrianchiris added the bug Something isn't working label Aug 13, 2023
@adrianchiris
Collaborator

Which K8s version are you using?
Does the following path exist on your system: /var/lib/kubelet/plugins_registry?

Can you provide the device plugin logs?

@wengyao04
Author

wengyao04 commented Aug 14, 2023

Hi @adrianchiris
Q: Which K8s version are you using?
A: We use 1.23

Q: Does the following path exist on your system: /var/lib/kubelet/plugins_registry?
A: Our kubelet path is /var/lib/kubelet, but we don't have /var/lib/kubelet/plugins_registry.

$ ls /var/lib/kubelet
device-plugins  pki
$ ls /var/lib/kubelet/device-plugins
DEPRECATION  ib.sock  kubelet_internal_checkpoint  kubelet.sock  nvidia.sock

Q: Can you provide the device plugin logs?
A: When we start rdma-device-plugin, we see the following logs, and ib.sock is created.

2023/08/14 02:33:17 starting rdma/ib device plugin endpoint at: ib.sock
2023/08/14 02:33:17 rdma/ib device plugin endpoint started serving
2023/08/14 02:33:17 All servers started.
2023/08/14 02:33:17 Listening for term signals
2023/08/14 02:33:17 Starting OS watcher.
2023/08/14 02:33:17 Updating "rdma/ib" devices
2023/08/14 02:33:17 exposing "1000" devices

Then I manually restarted kubelet:

sudo systemctl restart kubelet

The logs after the restart are below. The server is not restarted, and ib.sock is not recreated.

2023/08/14 02:34:17 discovering host network devices
2023/08/14 02:34:17 DiscoverHostDevices(): device found: 0000:0b:00.0   02              Intel Corporation       Ethernet Controller 10G X550T
2023/08/14 02:34:17 DiscoverHostDevices(): device found: 0000:18:00.0   02              Mellanox Technolo...    MT2910 Family [ConnectX-7]
2023/08/14 02:34:17 DiscoverHostDevices(): device found: 0000:29:00.0   02              Mellanox Technolo...    MT2910 Family [ConnectX-7]
2023/08/14 02:34:17 DiscoverHostDevices(): device found: 0000:29:00.1   02              Mellanox Technolo...    MT2910 Family [ConnectX-7]
2023/08/14 02:34:17 DiscoverHostDevices(): device found: 0000:40:00.0   02              Mellanox Technolo...    MT2910 Family [ConnectX-7]
2023/08/14 02:34:17 DiscoverHostDevices(): device found: 0000:4f:00.0   02              Mellanox Technolo...    MT2910 Family [ConnectX-7]
2023/08/14 02:34:17 DiscoverHostDevices(): device found: 0000:5e:00.0   02              Mellanox Technolo...    MT2910 Family [ConnectX-7]
2023/08/14 02:34:17 DiscoverHostDevices(): device found: 0000:82:00.0   02              Intel Corporation       Ethernet Controller E810-C for QSFP
2023/08/14 02:34:17 DiscoverHostDevices(): device found: 0000:82:00.1   02              Intel Corporation       Ethernet Controller E810-C for QSFP
2023/08/14 02:34:17 DiscoverHostDevices(): device found: 0000:9a:00.0   02              Mellanox Technolo...    MT2910 Family [ConnectX-7]
2023/08/14 02:34:17 DiscoverHostDevices(): device found: 0000:aa:00.0   02              Mellanox Technolo...    MT2910 Family [ConnectX-7]
2023/08/14 02:34:17 DiscoverHostDevices(): device found: 0000:aa:00.1   02              Mellanox Technolo...    MT2910 Family [ConnectX-7]
2023/08/14 02:34:17 DiscoverHostDevices(): device found: 0000:c0:00.0   02              Mellanox Technolo...    MT2910 Family [ConnectX-7]
2023/08/14 02:34:17 DiscoverHostDevices(): device found: 0000:ce:00.0   02              Mellanox Technolo...    MT2910 Family [ConnectX-7]
2023/08/14 02:34:17 DiscoverHostDevices(): device found: 0000:dc:00.0   02              Mellanox Technolo...    MT2910 Family [ConnectX-7]
2023/08/14 02:34:17 error creating new device: "missing RDMA device spec for device 0000:0b:00.0, RDMA device \"issm\" not found"
2023/08/14 02:34:17 no changes to devices for "rdma/ib"
2023/08/14 02:34:17 exposing "1000" devices
2023/08/14 02:35:17 discovering host network devices
2023/08/14 02:35:17 DiscoverHostDevices(): device found: 0000:0b:00.0   02              Intel Corporation       Ethernet Controller 10G X550T
2023/08/14 02:35:17 DiscoverHostDevices(): device found: 0000:18:00.0   02              Mellanox Technolo...    MT2910 Family [ConnectX-7]
2023/08/14 02:35:17 DiscoverHostDevices(): device found: 0000:29:00.0   02              Mellanox Technolo...    MT2910 Family [ConnectX-7]
2023/08/14 02:35:17 DiscoverHostDevices(): device found: 0000:29:00.1   02              Mellanox Technolo...    MT2910 Family [ConnectX-7]
2023/08/14 02:35:17 DiscoverHostDevices(): device found: 0000:40:00.0   02              Mellanox Technolo...    MT2910 Family [ConnectX-7]
2023/08/14 02:35:17 DiscoverHostDevices(): device found: 0000:4f:00.0   02              Mellanox Technolo...    MT2910 Family [ConnectX-7]
2023/08/14 02:35:17 DiscoverHostDevices(): device found: 0000:5e:00.0   02              Mellanox Technolo...    MT2910 Family [ConnectX-7]
2023/08/14 02:35:17 DiscoverHostDevices(): device found: 0000:82:00.0   02              Intel Corporation       Ethernet Controller E810-C for QSFP
2023/08/14 02:35:17 DiscoverHostDevices(): device found: 0000:82:00.1   02              Intel Corporation       Ethernet Controller E810-C for QSFP
2023/08/14 02:35:17 DiscoverHostDevices(): device found: 0000:9a:00.0   02              Mellanox Technolo...    MT2910 Family [ConnectX-7]
2023/08/14 02:35:17 DiscoverHostDevices(): device found: 0000:aa:00.0   02              Mellanox Technolo...    MT2910 Family [ConnectX-7]
2023/08/14 02:35:17 DiscoverHostDevices(): device found: 0000:aa:00.1   02              Mellanox Technolo...    MT2910 Family [ConnectX-7]
2023/08/14 02:35:17 DiscoverHostDevices(): device found: 0000:c0:00.0   02              Mellanox Technolo...    MT2910 Family [ConnectX-7]
2023/08/14 02:35:17 DiscoverHostDevices(): device found: 0000:ce:00.0   02              Mellanox Technolo...    MT2910 Family [ConnectX-7]
2023/08/14 02:35:17 DiscoverHostDevices(): device found: 0000:dc:00.0   02              Mellanox Technolo...    MT2910 Family [ConnectX-7]
2023/08/14 02:35:17 error creating new device: "missing RDMA device spec for device 0000:0b:00.0, RDMA device \"issm\" not found"
2023/08/14 02:35:17 no changes to devices for "rdma/ib"
2023/08/14 02:35:17 exposing "1000" devices

@adrianchiris
Collaborator

Ack, so the plugin will use the old way to register with kubelet and write resource sockets.

Please check #82; it should solve the issue.

@wengyao04
Author

Hi @adrianchiris, thank you very much for helping fix this issue. The PR is merged; could we have a new release?

@adrianchiris
Collaborator

v1.4.0 is out, please check :)
