[Bug]: GPU Stats suddenly vanishing #558

Open
Tsubajashi opened this issue Feb 11, 2025 · 6 comments
Labels
bug Something isn't working

Comments

@Tsubajashi

Description

as per this comment: #262 (comment)

The short log I get doesn't mention a GPU at all. I just restarted the service on a Linux machine (Ubuntu 24.04 LTS), running the binary, to see whether any particular error is listed.

Ideally it should show two RTX 4090 GPUs (at least on this machine).

As already mentioned, I did see the entries for a few days, but they suddenly vanished. I don't know where to start debugging this, especially since it happens across multiple devices.

If it's relevant: I run the hub in a Docker container on a public server, and use Tailscale to reach the agents in my homelab.

Expected Behavior

To see all my GPUs listed per client (agent)

Steps to Reproduce

Sadly, I'm not sure. The GPU stats suddenly disappeared without a trace.

OS / Architecture

Ubuntu 24.04 / AMD64

Beszel version

0.9.1

Installation method

Docker

Configuration

Hub Logs

beszel  | 2025/02/11 22:37:29 Server started at http://0.0.0.0:8090
beszel  | ├─ REST API:  http://0.0.0.0:8090/api/
beszel  | └─ Dashboard: http://0.0.0.0:8090/_/

Agent Logs

Feb 12 00:09:33 DESKTOP-ERJTEE9 systemd[1]: Started beszel-agent.service - Beszel Agent Service.

Feb 12 00:09:33 DESKTOP-ERJTEE9 beszel-agent[8279]: 2025/02/12 00:09:33 INFO Detected root device name=sdc

Feb 12 00:09:33 DESKTOP-ERJTEE9 beszel-agent[8279]: 2025/02/12 00:09:33 INFO Detected network interface name=eth0 sent=146496149 recv=95863462

Feb 12 00:09:33 DESKTOP-ERJTEE9 beszel-agent[8279]: 2025/02/12 00:09:33 INFO Detected network interface name=tailscale0 sent=2426518 recv=218951

Feb 12 00:09:33 DESKTOP-ERJTEE9 beszel-agent[8279]: 2025/02/12 00:09:33 INFO Starting SSH server address=:45876
@Tsubajashi Tsubajashi added the bug Something isn't working label Feb 11, 2025
@Tsubajashi Tsubajashi changed the title from "[Bug]: GPU Stats suddenly vanishing1" to "[Bug]: GPU Stats suddenly vanishing" Feb 11, 2025
@henrygd
Owner

henrygd commented Feb 12, 2025

Thanks, very strange since it was working previously and we haven't released a new version.

If you have multiple machines with GPUs, did they all vanish or was it only one machine?

Please try running the agent with env var LOG_LEVEL=debug. This will print a lot more information. Check if the GPUs are actually in your stats.

If there's a problem initializing the GPU functionality, it should print DEBUG GPU err=<error>. Let me know if you see that.
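For anyone who wants to check a long debug log for that line without reading it by eye, here is a minimal Python sketch. It only assumes the line contains `DEBUG GPU err=` followed by a quoted or unquoted message; the function name `find_gpu_errors` is made up for illustration, and the sample lines are shortened from the agent log posted later in this thread.

```python
import re

# Matches the agent's debug line for GPU initialisation failures, e.g.
#   ... DEBUG GPU err="no GPU found - install nvidia-smi or rocm-smi"
GPU_ERR = re.compile(r'DEBUG GPU err=(?:"([^"]*)"|(\S+))')


def find_gpu_errors(log_lines):
    """Return the error text of every `DEBUG GPU err=...` line."""
    errors = []
    for line in log_lines:
        match = GPU_ERR.search(line)
        if match:
            # Group 1 holds a quoted message, group 2 an unquoted one.
            quoted, bare = match.group(1), match.group(2)
            errors.append(quoted if quoted is not None else bare)
    return errors


sample = [
    '2025/02/12 01:04:23 INFO Detected root device name=sdc',
    '2025/02/12 01:04:23 DEBUG GPU err="no GPU found - install nvidia-smi or rocm-smi"',
    '2025/02/12 01:04:23 DEBUG Getting stats',
]
print(find_gpu_errors(sample))
```

To run it against a real log, pass the lines of the service journal (for example the output of `journalctl -u beszel-agent --no-pager`) to `find_gpu_errors` instead of `sample`.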

Also please run the command below for a minute. Make sure the formatting is consistent and it doesn't quit on its own. Also paste one of the lines here so I can see.

nvidia-smi -l 4 --query-gpu=index,name,temperature.gpu,memory.used,memory.total,utilization.gpu,power.draw --format=csv,noheader,nounits

@Tsubajashi
Author

Tsubajashi commented Feb 12, 2025

logs from the agent:

Feb 12 01:04:23 DESKTOP-ERJTEE9 beszel-agent[28216]: 2025/02/12 01:04:23 DEBUG 0.9.1
Feb 12 01:04:23 DESKTOP-ERJTEE9 beszel-agent[28216]: 2025/02/12 01:04:23 DEBUG Not monitoring ZFS ARC err="open /proc/spl/kstat/zfs/arcstats: no such file or directory"
Feb 12 01:04:23 DESKTOP-ERJTEE9 beszel-agent[28216]: 2025/02/12 01:04:23 DEBUG Disk partitions="[{\"device\":\"/dev/sdc\",\"mountpoint\":\"/\",\"fstype\":\"ext4\",\"opts\":>
Feb 12 01:04:23 DESKTOP-ERJTEE9 beszel-agent[28216]: 2025/02/12 01:04:23 DEBUG Disk I/O diskstats="map[sda:{\"readCount\":1133,\"mergedReadCount\":400,\"writeCount\":0,\"me>
Feb 12 01:04:23 DESKTOP-ERJTEE9 beszel-agent[28216]: 2025/02/12 01:04:23 INFO Detected root device name=sdc
Feb 12 01:04:23 DESKTOP-ERJTEE9 beszel-agent[28216]: 2025/02/12 01:04:23 INFO Detected network interface name=eth0 sent=148841509 recv=97477311
Feb 12 01:04:23 DESKTOP-ERJTEE9 beszel-agent[28216]: 2025/02/12 01:04:23 INFO Detected network interface name=tailscale0 sent=2922534 recv=363514
Feb 12 01:04:23 DESKTOP-ERJTEE9 beszel-agent[28216]: 2025/02/12 01:04:23 DEBUG GPU err="no GPU found - install nvidia-smi or rocm-smi"
Feb 12 01:04:23 DESKTOP-ERJTEE9 beszel-agent[28216]: 2025/02/12 01:04:23 DEBUG Getting stats
Feb 12 01:04:23 DESKTOP-ERJTEE9 beszel-agent[28216]: 2025/02/12 01:04:23 DEBUG Temperature sensors=[]
Feb 12 01:04:23 DESKTOP-ERJTEE9 beszel-agent[28216]: 2025/02/12 01:04:23 DEBUG sysinfo data="{Hostname:DESKTOP-ERJTEE9 KernelVersion:5.15.167.4-microsoft-standard-WSL2 Core>
Feb 12 01:04:23 DESKTOP-ERJTEE9 beszel-agent[28216]: 2025/02/12 01:04:23 DEBUG System stats data="{Stats:{Cpu:6.25 MaxCpu:0 Mem:61.32 MemUsed:1.65 MemPct:2.68 MemBuffCache:>
Feb 12 01:04:23 DESKTOP-ERJTEE9 beszel-agent[28216]: 2025/02/12 01:04:23 DEBUG Docker stats data="[0xc000404150 0xc00031e070 0xc000098930 0xc000233490 0xc000404070 0xc00023>
Feb 12 01:04:23 DESKTOP-ERJTEE9 beszel-agent[28216]: 2025/02/12 01:04:23 DEBUG Extra filesystems data=map[]
Feb 12 01:04:23 DESKTOP-ERJTEE9 beszel-agent[28216]: 2025/02/12 01:04:23 DEBUG Stats data="{Stats:{Cpu:6.25 MaxCpu:0 Mem:61.32 MemUsed:1.65 MemPct:2.68 MemBuffCache:0.87 Me>
Feb 12 01:04:23 DESKTOP-ERJTEE9 beszel-agent[28216]: 2025/02/12 01:04:23 INFO Starting SSH server address=:45876
Feb 12 01:05:14 DESKTOP-ERJTEE9 beszel-agent[28216]: 2025/02/12 01:05:14 DEBUG Getting stats
Feb 12 01:05:14 DESKTOP-ERJTEE9 beszel-agent[28216]: 2025/02/12 01:05:14 DEBUG Temperature sensors=[]
Feb 12 01:05:14 DESKTOP-ERJTEE9 beszel-agent[28216]: 2025/02/12 01:05:14 DEBUG sysinfo data="{Hostname:DESKTOP-ERJTEE9 KernelVersion:5.15.167.4-microsoft-standard-WSL2 Core>
Feb 12 01:05:14 DESKTOP-ERJTEE9 beszel-agent[28216]: 2025/02/12 01:05:14 DEBUG System stats data="{Stats:{Cpu:0.2 MaxCpu:0 Mem:61.32 MemUsed:1.64 MemPct:2.68 MemBuffCache:1>
Feb 12 01:05:14 DESKTOP-ERJTEE9 beszel-agent[28216]: 2025/02/12 01:05:14 DEBUG Docker stats data="[0xc000404150 0xc00031e070 0xc000098930 0xc000233490 0xc000404070 0xc00023>
Feb 12 01:05:14 DESKTOP-ERJTEE9 beszel-agent[28216]: 2025/02/12 01:05:14 DEBUG Extra filesystems data=map[]

nvidia-smi log:

tsubajashi@DESKTOP-ERJTEE9:~$ nvidia-smi -l 4 --query-gpu=index,name,temperature.gpu,memory.used,memory.total,utilization.gpu,power.draw --format=csv,noheader,nounits
0, NVIDIA GeForce RTX 4090, 36, 1676, 24564, 1, 70.14
1, NVIDIA GeForce RTX 4090, 27, 2, 24564, 0, 19.93
0, NVIDIA GeForce RTX 4090, 36, 1676, 24564, 0, 69.52
1, NVIDIA GeForce RTX 4090, 27, 2, 24564, 0, 20.04
0, NVIDIA GeForce RTX 4090, 36, 1676, 24564, 0, 70.23
1, NVIDIA GeForce RTX 4090, 27, 2, 24564, 0, 20.46
0, NVIDIA GeForce RTX 4090, 36, 1676, 24564, 0, 70.60
1, NVIDIA GeForce RTX 4090, 27, 2, 24564, 0, 19.47
0, NVIDIA GeForce RTX 4090, 36, 1676, 24564, 0, 70.51
1, NVIDIA GeForce RTX 4090, 27, 2, 24564, 0, 19.45
0, NVIDIA GeForce RTX 4090, 36, 1676, 24564, 1, 70.39
1, NVIDIA GeForce RTX 4090, 27, 2, 24564, 0, 19.57
0, NVIDIA GeForce RTX 4090, 36, 1661, 24564, 0, 70.58
1, NVIDIA GeForce RTX 4090, 27, 2, 24564, 0, 18.99
0, NVIDIA GeForce RTX 4090, 36, 1661, 24564, 4, 71.12
1, NVIDIA GeForce RTX 4090, 27, 2, 24564, 0, 19.72
0, NVIDIA GeForce RTX 4090, 36, 1661, 24564, 0, 70.41
1, NVIDIA GeForce RTX 4090, 27, 2, 24564, 0, 20.84
0, NVIDIA GeForce RTX 4090, 36, 1661, 24564, 0, 71.13
1, NVIDIA GeForce RTX 4090, 27, 2, 24564, 0, 20.51
0, NVIDIA GeForce RTX 4090, 36, 1661, 24564, 0, 70.32
1, NVIDIA GeForce RTX 4090, 27, 2, 24564, 0, 20.92
0, NVIDIA GeForce RTX 4090, 36, 1654, 24564, 0, 69.83
1, NVIDIA GeForce RTX 4090, 27, 2, 24564, 0, 20.25
0, NVIDIA GeForce RTX 4090, 36, 1648, 24564, 0, 69.35
1, NVIDIA GeForce RTX 4090, 27, 2, 24564, 0, 19.97
0, NVIDIA GeForce RTX 4090, 36, 1658, 24564, 1, 69.93
1, NVIDIA GeForce RTX 4090, 27, 2, 24564, 0, 20.77
0, NVIDIA GeForce RTX 4090, 36, 1685, 24564, 0, 69.98
1, NVIDIA GeForce RTX 4090, 27, 2, 24564, 0, 21.10
0, NVIDIA GeForce RTX 4090, 36, 1659, 24564, 0, 70.97
1, NVIDIA GeForce RTX 4090, 27, 2, 24564, 0, 20.39
0, NVIDIA GeForce RTX 4090, 36, 1667, 24564, 2, 69.98
1, NVIDIA GeForce RTX 4090, 27, 2, 24564, 0, 20.80
0, NVIDIA GeForce RTX 4090, 36, 1680, 24564, 0, 70.14
1, NVIDIA GeForce RTX 4090, 27, 2, 24564, 0, 19.74
0, NVIDIA GeForce RTX 4090, 36, 1666, 24564, 0, 69.43
1, NVIDIA GeForce RTX 4090, 27, 2, 24564, 0, 20.13
0, NVIDIA GeForce RTX 4090, 36, 1663, 24564, 0, 69.76
1, NVIDIA GeForce RTX 4090, 27, 2, 24564, 0, 21.07
0, NVIDIA GeForce RTX 4090, 36, 1667, 24564, 1, 69.90
1, NVIDIA GeForce RTX 4090, 27, 2, 24564, 0, 20.66
0, NVIDIA GeForce RTX 4090, 36, 1663, 24564, 0, 70.41
1, NVIDIA GeForce RTX 4090, 27, 2, 24564, 0, 19.65
0, NVIDIA GeForce RTX 4090, 36, 1663, 24564, 0, 70.14
1, NVIDIA GeForce RTX 4090, 27, 2, 24564, 0, 20.42
0, NVIDIA GeForce RTX 4090, 36, 1663, 24564, 0, 70.43
1, NVIDIA GeForce RTX 4090, 27, 2, 24564, 0, 20.59
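One way to confirm that the formatting of these rows is consistent is to parse them strictly. The sketch below is not how the agent reads the output (the thread does not show that); it is only an illustration that checks each row has the seven fields requested by `--query-gpu`. The class name `GpuSample`, the function `parse_rows`, and the unit suffixes in the field names are my own choices, with the units assumed from nvidia-smi's usual conventions (MiB, watts, degrees Celsius, percent).

```python
import csv
from dataclasses import dataclass

# index, name, temperature.gpu, memory.used, memory.total, utilization.gpu, power.draw
FIELDS = 7


@dataclass
class GpuSample:
    index: int
    name: str
    temperature_c: float
    memory_used_mib: float
    memory_total_mib: float
    utilization_pct: float
    power_draw_w: float


def parse_rows(text):
    """Parse `--format=csv,noheader,nounits` rows into GpuSample objects.

    Raises ValueError for a row that does not have exactly seven fields
    or whose numeric fields are not numbers.
    """
    samples = []
    for row in csv.reader(text.splitlines(), skipinitialspace=True):
        if not row:
            continue  # ignore blank lines between samples
        if len(row) != FIELDS:
            raise ValueError(f"expected {FIELDS} fields, got {len(row)}: {row}")
        index, name, *numbers = (field.strip() for field in row)
        samples.append(GpuSample(int(index), name, *map(float, numbers)))
    return samples


output = """\
0, NVIDIA GeForce RTX 4090, 36, 1676, 24564, 1, 70.14
1, NVIDIA GeForce RTX 4090, 27, 2, 24564, 0, 19.93
"""
for sample in parse_rows(output):
    print(sample.index, sample.name, sample.memory_used_mib, sample.power_draw_w)
```

Every row pasted above parses cleanly under this check, which suggests the nvidia-smi output itself is not the problem on this machine.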

So yeah... the agent suddenly doesn't find the GPUs, even though nvidia-smi works from my shell. I'll try to get data from another agent where this suddenly began happening.

I guess we can keep it to the Ubuntu one for now, as I'm not exactly sure how to check the logs of a service running on Windows (at least there's no simple way).
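Since nvidia-smi runs fine interactively but the agent reports "no GPU found - install nvidia-smi or rocm-smi", one thing that may be worth ruling out is that the service sees a different PATH than the login shell. That is an assumption on my part, not something confirmed in this thread, and how the agent actually looks for the tools is not shown here. The sketch below just reports where each tool resolves on the current PATH; `locate_tools` is a made-up helper name.

```python
import os
import shutil

# The two tools named in the agent's error message.
TOOLS = ("nvidia-smi", "rocm-smi")


def locate_tools(path=None):
    """Map each tool name to its resolved location on `path`, or None.

    With path=None, shutil.which searches the PATH of the current process.
    """
    return {tool: shutil.which(tool, path=path) for tool in TOOLS}


if __name__ == "__main__":
    print("PATH =", os.environ.get("PATH", ""))
    for tool, location in locate_tools().items():
        print(f"{tool}: {location or 'not found on PATH'}")
```

Running it once from the interactive shell and once under the same environment as the agent service would show whether the two resolve the tools differently.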

@cyicz123

I'm not sure whether I've run into the same problem. I'm new to Beszel and just installed it. I installed the hub on a public server and then installed the agent on two A100 servers. Both agents were installed with the script for the binary version, not the Docker version. However, I don't see any GPU information. I added Environment="GPU=true" to the service configuration file on one server, but after restarting the agent service there is still no GPU information. If you need more information, I'm happy to provide it.

Image

@ItsNoted

Having the same issue. I can't seem to get GPU info from a Pop!_OS machine with a 2080 Ti installed, even though the card works fine with nvidia-smi.

@henrygd
Owner

henrygd commented Feb 12, 2025

I started a discussion in #563 for anyone having problems with GPU stats.

There's a small program there which should help figure out what's going wrong.

@henrygd henrygd moved this to Bug backlog in Beszel Roadmap Feb 16, 2025
@henrygd
Owner

henrygd commented Feb 18, 2025

Update with possible solution here: #563 (reply in thread)

Projects
Status: Bug backlog

4 participants