Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ERROR Error encoding stats err="json: unsupported value: NaN" #280

Open
Toudahl opened this issue Nov 10, 2024 · 9 comments
Open

ERROR Error encoding stats err="json: unsupported value: NaN" #280

Toudahl opened this issue Nov 10, 2024 · 9 comments
Labels
troubleshooting Maybe bug, maybe not

Comments

@Toudahl
Copy link

Toudahl commented Nov 10, 2024

Hi,

One of my agents , running in docker is reporting some errors.

I am not sure how to go about debugging them it, or even fixing it.

2024/11/09 17:02:35 INFO Detected root device name=mmcblk0p2
2024/11/09 17:02:35 INFO Detected network interface name=eth0 sent=818916 recv=989631
2024/11/09 17:02:35 INFO Starting SSH server address=:45876
2024/11/09 17:15:14 ERROR Error encoding stats err="json: unsupported value: NaN"
2024/11/09 17:15:14 ERROR Error encoding stats err="json: unsupported value: NaN"
2024/11/09 17:22:35 ERROR Error encoding stats err="json: unsupported value: NaN"
2024/11/09 17:22:35 ERROR Error encoding stats err="json: unsupported value: NaN"
2024/11/09 17:26:56 ERROR Error encoding stats err="json: unsupported value: NaN"
2024/11/09 17:26:56 ERROR Error encoding stats err="json: unsupported value: NaN"
2024/11/09 17:51:05 ERROR Error encoding stats err="json: unsupported value: NaN"
2024/11/09 17:51:05 ERROR Error encoding stats err="json: unsupported value: NaN"
2024/11/09 17:59:25 ERROR Error encoding stats err="json: unsupported value: NaN"
2024/11/09 17:59:25 ERROR Error encoding stats err="json: unsupported value: NaN"
2024/11/09 18:05:25 ERROR Error encoding stats err="json: unsupported value: NaN"
2024/11/09 18:05:25 ERROR Error encoding stats err="json: unsupported value: NaN"
2024/11/09 18:05:44 ERROR Error encoding stats err=EOF

@henrygd
Copy link
Owner

henrygd commented Nov 10, 2024

Please add LOG_LEVEL=debug to the agent.

What Docker version and device / OS are you using?

@henrygd henrygd added the troubleshooting Maybe bug, maybe not label Nov 10, 2024
@Toudahl
Copy link
Author

Toudahl commented Nov 10, 2024

I have enabled debug logging, and I am waiting for the error to reappear.

I am running docker 24.0.6, debian 11 on a rpi3b+.

The agent is version 0.7.2.

Apologies for not including this before.

@Toudahl
Copy link
Author

Toudahl commented Nov 10, 2024

2024/11/10 22:27:17 DEBUG Getting stats
2024/11/10 22:27:17 DEBUG Temperature sensors="[{\"sensorKey\":\"cpu_thermal\",\"temperature\":62.3,\"sensorHigh\":0,\"sensorCritical\":110}]"
2024/11/10 22:27:17 DEBUG sysinfo data="{Hostname:rpi3bp-01 KernelVersion:6.1.21-v8+ Cores:4 Threads:4 CpuModel:Cortex-A53 Uptime:107421 Cpu:0.33 MemPct:19.53 DiskPct:21.68 Bandwidth:0 AgentVersion:0.7.2}"
2024/11/10 22:27:17 DEBUG System stats data="{Stats:{Cpu:0.33 MaxCpu:0 Mem:0.89 MemUsed:0.17 MemPct:19.53 MemBuffCache:0.49 MemZfsArc:0 Swap:0.1 SwapUsed:0 DiskTotal:14.32 DiskUsed:2.97 DiskPct:21.68 DiskReadPs:0.05 DiskWritePs:0 MaxDiskReadPs:0 MaxDiskWritePs:0 NetworkSent:0 NetworkRecv:0 MaxNetworkSent:0 MaxNetworkRecv:0 Temperatures:map[cpu_thermal:62.3] ExtraFs:map[]} Info:{Hostname:rpi3bp-01 KernelVersion:6.1.21-v8+ Cores:4 Threads:4 CpuModel:Cortex-A53 Uptime:107421 Cpu:0.33 MemPct:19.53 DiskPct:21.68 Bandwidth:0 AgentVersion:0.7.2} Containers:[]}"
2024/11/10 22:27:17 DEBUG Getting stats
2024/11/10 22:27:17 DEBUG Temperature sensors="[{\"sensorKey\":\"cpu_thermal\",\"temperature\":62.3,\"sensorHigh\":0,\"sensorCritical\":110}]"
2024/11/10 22:27:17 DEBUG sysinfo data="{Hostname:rpi3bp-01 KernelVersion:6.1.21-v8+ Cores:4 Threads:4 CpuModel:Cortex-A53 Uptime:107421 Cpu:66.67 MemPct:19.53 DiskPct:21.68 Bandwidth:0.16 AgentVersion:0.7.2}"
2024/11/10 22:27:17 DEBUG System stats data="{Stats:{Cpu:66.67 MaxCpu:0 Mem:0.89 MemUsed:0.17 MemPct:19.53 MemBuffCache:0.49 MemZfsArc:0 Swap:0.1 SwapUsed:0 DiskTotal:14.32 DiskUsed:2.97 DiskPct:21.68 DiskReadPs:0 DiskWritePs:0 MaxDiskReadPs:0 MaxDiskWritePs:0 NetworkSent:0.12 NetworkRecv:0.04 MaxNetworkSent:0 MaxNetworkRecv:0 Temperatures:map[cpu_thermal:62.3] ExtraFs:map[]} Info:{Hostname:rpi3bp-01 KernelVersion:6.1.21-v8+ Cores:4 Threads:4 CpuModel:Cortex-A53 Uptime:107421 Cpu:66.67 MemPct:19.53 DiskPct:21.68 Bandwidth:0.16 AgentVersion:0.7.2} Containers:[]}"
2024/11/10 22:27:18 DEBUG Docker stats data="[0x40000e64d0 0x400006a310]"
2024/11/10 22:27:18 DEBUG Extra filesystems data=map[]
2024/11/10 22:27:18 DEBUG Docker stats data="[0x400006a310 0x40000e64d0]"
2024/11/10 22:27:18 DEBUG Extra filesystems data=map[]
2024/11/10 22:27:18 ERROR Error encoding stats err="json: unsupported value: NaN"
2024/11/10 22:27:18 ERROR Error encoding stats err="json: unsupported value: NaN"

@henrygd
Copy link
Owner

henrygd commented Nov 10, 2024

Thanks. Is it really logging everything twice or was that just a timing coincidence when you started it up?

Can you please run docker stats --no-stream and check if it shows zero values for anything?

I'd recommend upgrading Docker if possible. There's a known bug in 24 that I had to specifically add a workaround for. I don't know if related but it's a good idea to update either way.

@Toudahl
Copy link
Author

Toudahl commented Nov 11, 2024

I have now updated to docker 27 - i must have forgotten this rpi when i updated my cluster.
The stats show 0% cpu for beszel, and 0 in network I/O for both containers running on this pi.

I have this in my /boot/cmdline.txt
console=serial0,115200 console=tty1 root=PARTUUID=202246d5-02 rootfstype=ext4 fsck.repair=yes rootwait cgroup_enable=cpuset cgroup_enable=memory cgroup_memory=1

The logs are showing up twice, both in my syslog server - and when i inspect the logs manually using the docker command. It is happening consistently, not just close to a startup of the agent.

@henrygd
Copy link
Owner

henrygd commented Nov 11, 2024

The issue may be that multiple instances of the hub are requesting metrics from the agent at the same time.

There's no lock coordination between multiple instances atm. Try to make sure only one total instance is running.

@Toudahl
Copy link
Author

Toudahl commented Nov 11, 2024

There is only one hub running.
image

On the hub i am actually seeing this

infrastructure_beszel.1.a6vox3xn33lp@rpi4-02    | 2024/11/11 19:10:38 failed to get active systems err context canceled; failed query: SELECT `@@__invalidCollectionModelOrIdentifier`.* FROM `@@__invalidCollectionModelOrIdentifier` WHERE status = 'up'

@henrygd
Copy link
Owner

henrygd commented Nov 11, 2024

I think this is the same issue as #216 actually. You can also reference this thread about swarm: #17.

It seems to be possible to get working properly, but I haven't set up swarm yet to figure it out.

With swarm, it would be better for the agents to use an internal timer and send stats to the hub, rather than how it's set up currently where the hub requests from the agent. I want to add that option but it's not going to be in the very near future.

I think what's going wrong is that the hub connections are not being handled by the correct nodes, and sometimes one node is being assigned two connections.

Re: failed to get active systems error - that's the first time I've seen someone hit that. Are you using the arm7 image for the hub by any chance? There's an open issue about that image having a problem with the DB (#133) but I haven't been able to troubleshoot it because I don't have an arm7 device.

It may go away if you can figure out a working configuration with the swarm nodes, otherwise you may want to delete the beszel_data folder and start over if it continues to happen.

@Toudahl
Copy link
Author

Toudahl commented Nov 13, 2024

Sorry for the slow reply time.

Some more details of my system.

I have 1 swarm, with 4 nodes.
These are all running debian 11, and docker 27. The hosts them self are 3 x rpi4 and 1 x rpi4b
This is where the screengrab is from. As you can see it is running 1 hub, and 4 agents using host network.
The hub is connecting to the agents through host ip. So no docker networking involved there.

The original problem i posted about takes places on a separate pi, with details as described.

So aside from the agent running on the rpi3b+, the hub and other agents are running on arm.

Based on this information, I do not think it is likely that the problem is caused by swarm. But if you wish to test it your self, you can easily make a swarm "cluster" with just one node. Perhaps you can trigger the issues there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
troubleshooting Maybe bug, maybe not
Projects
None yet
Development

No branches or pull requests

2 participants