Fan 2 Unable to Properly Read Nvidia GPU Temperature on Clevo NP60SND #75

dijia1124 · 2024-04-09T20:02:36Z

Machine: Metabox Prime-16 (rebranded Clevo NP60SND / Tuxedo Gemini 16 Gen 2 / XMG FOCUS 16 E23)
CPU: 13900HX
GPU: RTX 4060 Laptop
Number of Fans: 2
OS: Arch Linux
BIOS: 1.07.18RTR4a from XMG
EC: 1.07.07TR3 from XMG

tailor_hwcaps:

[OK] Module version: "0.3.9\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"
[OK] Device interface ID: "clevo_acpi\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"
[ERR] Model ID: NotAvailable
[OK] Available ODM performance profiles: ["quiet", "power_saving", "entertainment", "performance"]
[OK] Default ODM performance profile: "performance"
[OK] Number of fans: 2
[OK] Fan temperatures [°C]: [44, 20]
[OK] Fan speeds [%]: [30, 22]
[OK] Fan min speed [%]: 20
[OK] Webcam enabled: true
[INFO] TDP control is not available
[OK] Number of LED devices: 1
[OK] LED device number: 0
[OK] LED device name: "platform:tuxedo_keyboard"
[OK] LED device function: "kbd_backlight"
[OK] LED mode: Rgb
[OK] LED device color: Ok(Color { r: 192, g: 97, b: 203 })

The problem arises with Fan 2, which is designated for the GPU. During regular use, when the GPU workload is minimal, the Nvidia GPU seems to enter a low-power state (probably Optimus-related power management). In this state, the GPU temperature often cannot be read correctly, so Fan 2 runs at the wrong speed based on erroneous temperature data.

Upon observation, when I execute the command "nvidia-smi" to display Nvidia GPU information, Fan 2 is able to read the correct GPU temperature and adjust its speed accordingly. However, after some time, it tends to revert to reading incorrect temperature data, leading to Fan 2 operating inefficiently again.

It's unclear where the incorrect GPU temperature reading originates from, but it seems to be related to the GPU entering a low-power state during light usage.

For instance, in the hwcaps output above, the temperature reading for Fan 2 stays at 20 degrees for an extended period. When I run "nvidia-smi", it reports the correct GPU temperature of 35 degrees. If I then re-run hwcaps, the Fan 2 temperature has updated to the correct 35 degrees. However, the correction is short-lived: Fan 2 reverts to showing 20 degrees after a brief period.

In such scenarios, when the CPU is under heavy load, the GPU temperature can also rise due to the heat generated by the CPU. However, since the GPU is in an idle state, the GPU fan fails to obtain the latest temperature reading, resulting in diminished overall thermal performance.

Conversely, in other instances, the GPU temperature may be relatively low. Nonetheless, Fan 2 registers an anomalous value, such as 59 degrees. Following the predefined fan curve settings, Fan 2 initiates frantic spinning despite the actual GPU temperature being much lower.

Any insights or solutions to this issue would be greatly appreciated. Thank you!

AaronErhardt · 2024-04-09T20:11:45Z

Thanks for reporting this and already having a closer look at the problem. I think this goes beyond the scope of tuxedo-rs though. We just use what we get from the kernel driver and if that data is incorrect, that's a problem of the driver.

We could however offer a solution for ignoring certain fans entirely to allow users to use the default fan clock speeds.

dijia1124 · 2024-04-09T20:23:55Z

Thank you for your prompt response and understanding. As you mentioned, the issue seems to lie within the driver domain. Offering a way to ignore certain fans entirely within the tailord project sounds like a step in the right direction.

However, considering the specific behavior of my machine, the default fan control strategy implemented by the XMG embedded controller tends to prioritize quiet operation, resulting in insufficient airflow within the chassis during regular/light usage. This leads to fan speeds lower than optimal on the GPU side, causing elevated temperatures, particularly noticeable with SSD temperatures reaching around 60 degrees Celsius under low loads.

Given this scenario, I'm wondering if it might be feasible to incorporate a feature alongside the "ignoring certain fans" functionality, allowing users to specify a minimum fan speed for certain fans. This could help address the airflow imbalance issue experienced with my machine.

I understand that this issue might be unique to my particular machine, so I'm unsure if this suggestion aligns with the broader scope of the project. Nevertheless, I thought it was worth mentioning for consideration.

Thank you for your attention to this matter.

AaronErhardt · 2024-04-09T20:39:20Z

Given this scenario, I'm wondering if it might be feasible to incorporate a feature alongside the "ignoring certain fans" functionality, allowing users to specify a minimum fan speed for certain fans.

That might be already possible, but I have to look into the code to confirm. I think instead of adding a new minimum fan speed parameter, we could just use the lowest data point in the fan curve as a minimum. I'll have a closer look tomorrow.

AaronErhardt · 2024-04-10T09:17:20Z

Ok, so if I understand the code correctly, the first point in the fan curve will be used as the value for all temperatures below it. So if the first point is { temp: 30°C, speed: 30% }, all temperatures below 30°C will also get 30% fan speed.

That should work for you, right?
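To illustrate the clamping behaviour described above, here is a minimal sketch in Python. It is not the tuxedo-rs implementation: the curve values, the `fan_speed` name, and the linear interpolation between points are all assumptions made for illustration. Only the "first point acts as a floor" rule comes from this thread.

```python
from bisect import bisect_right

# Hypothetical curve: (temperature in °C, fan speed in %), sorted by temperature.
CURVE = [(30, 30), (50, 45), (70, 80), (85, 100)]

def fan_speed(curve, temp):
    """Look up a fan speed for temp, clamping outside the curve's range."""
    temps = [t for t, _ in curve]
    if temp <= temps[0]:
        # Below the first point: hold its speed, so it acts as a minimum.
        return curve[0][1]
    if temp >= temps[-1]:
        # Above the last point: hold the last speed.
        return curve[-1][1]
    i = bisect_right(temps, temp)
    (t0, s0), (t1, s1) = curve[i - 1], curve[i]
    # Assumption: linear interpolation between neighbouring points.
    return round(s0 + (s1 - s0) * (temp - t0) / (t1 - t0))
```

With this curve, a bogus 20°C reading still yields 30% fan speed, which is the "minimum fan speed" effect asked for earlier in the thread.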

dijia1124 · 2024-04-10T14:36:19Z

I can confirm that the approach of setting the minimum fan speed via the first point in the curve works.

However, there are instances where the system erroneously reads the GPU temperature as 59 degrees, triggering the fan to operate at a high speed according to the fan curve.
If I were to set the first point to 59 degrees to account for this, it would result in inadequate cooling when the GPU temperature genuinely reaches 59 degrees.

enriquezmark36 · 2024-04-11T01:54:21Z

Hello,

Please excuse my sudden intrusion in this thread. I believe this may be relevant to your topic.

I've also observed this behavior on my laptop (also Clevo-based with an Nvidia dGPU), and I guess it's normal(?). The GPU fan (FAN2) temperature resets to 20°C every time the GPU is turned off. I think this is intended so that the fans also turn off with the GPU.

As for "system erroneously reads the GPU temperature", could it be just a misread? That is, it reads 59°C and then shows the right temperature on the next reading (or the one after)? If the temps are wrong when the GPU is turned off, I suspect they could also be wrong if the EC happens to probe at the brief moment when the GPU has just been powered on, where it might just be probing the CPU temp from the other side of the heatsink. On my machine, the GPU temp literally copies my CPU temp, up to 60°C, for a short moment after the GPU is turned on by nvidia-smi. Maybe using some history could prevent the sudden increase in fan speed.
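One way to "use some history" is a median over the last few readings, which drops a one-off spike. The sketch below is only an illustration: the `TempFilter` class and the window size of 5 are hypothetical, and a filter like this would only help with brief misreads, not with a wrong value that persists for many readings.

```python
from collections import deque
from statistics import median

class TempFilter:
    """Keep the last `window` readings and report their median."""

    def __init__(self, window=5):
        # deque(maxlen=...) discards the oldest reading automatically.
        self.history = deque(maxlen=window)

    def update(self, reading):
        """Record a new reading and return the filtered temperature."""
        self.history.append(reading)
        return median(self.history)
```

Feeding it 35, 36, 59, 36, 35 never returns 59, because a single outlier cannot become the median of the window.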

As for the high CPU load, no GPU load case (e.g., compiling a kernel): it is realistic that the CPU fan will be maxed out while the GPU fan isn't. Since Clevo chassis almost always share one heatsink between the CPU and the GPU, or at least connect the two, it makes sense to turn on the GPU fan to help with CPU temps as well. For that case, the older TCC just did the simplest thing: it used a single fan curve, looked up the higher of the two temperatures on it, and applied the result to both fans.
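The single-curve approach attributed to the older TCC above could look roughly like this. This is a sketch of the idea as described in this comment, not TCC's actual code; the curve points and both function names are made up for illustration.

```python
def step_curve(temp):
    """Hypothetical step curve: speed of the highest point at or below temp."""
    points = [(0, 20), (50, 40), (70, 70), (85, 100)]  # (°C, %)
    speed = points[0][1]
    for t, s in points:
        if temp >= t:
            speed = s
    return speed

def shared_speeds(cpu_temp, gpu_temp):
    """Drive both fans from the hotter of the two sensors."""
    speed = step_curve(max(cpu_temp, gpu_temp))
    return speed, speed  # (fan 1, fan 2)
```

A side effect is that a stuck 20°C GPU reading no longer matters while the CPU is hot, since the CPU temperature wins the `max`.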

tuxedo-rs could take this a step further by doing this only when one fan reaches 100% (or some threshold) of the highest point in the CPU or GPU fan curve, and by applying this so-called "fan compensation" gradually over a short period rather than instantly, to prevent sudden jumps in speed on the other (possibly idle and stopped) fan.
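A rough sketch of what such "fan compensation" might look like per control tick. Nothing here exists in tuxedo-rs; the `next_speed` function, the 100% threshold, and the 5-points-per-tick step are assumptions chosen only to show the shape of the idea.

```python
def next_speed(current, own_target, other_target, threshold=100, step=5):
    """Return this fan's speed for the next tick.

    current:      this fan's present speed in %
    own_target:   speed requested by this fan's own curve
    other_target: speed requested by the other fan's curve
    """
    target = own_target
    if other_target >= threshold:
        # The other fan is maxed out: help it, but never go below our own target.
        target = max(own_target, other_target)
    # Move toward the target by at most `step` per tick to avoid sudden jumps.
    delta = max(-step, min(step, target - current))
    return current + delta
```

Starting from 20% with the other fan at 100%, this fan would climb 5 points per tick and reach 100% after 16 ticks instead of jumping there at once.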

Thanks,

dijia1124 · 2024-04-11T02:50:49Z

"system erroneously reads the GPU temperature"

By "system erroneously reads the GPU temperature", I mean the value reported by "tailor_hwcaps". From what I observed, the wrong temperature is always either 20 or 59 degrees. And once it is wrong, it keeps showing the wrong value for a long time, until we run "nvidia-smi" to somehow "activate" the dGPU.

enriquezmark36 · 2024-04-11T05:37:10Z

I've only seen my machine output one wrong value, 20 degrees, though I don't log the temperatures. I think I saw the temp often hang at 59 degrees with the GPU off back when I was using TCC, but I haven't observed that with tuxedo-rs yet.

Anyway, it appears that the temperature cannot be read properly because the Nvidia GPU is turned off and the EC reports a wrong value while the GPU is off.

How tuxedo-rs could elegantly work around the kernel driver's fault is unfortunately beyond me for now. Sorry I can't help with that. I was thinking that tuxedo-rs could detect that the GPU is off by checking the Nvidia GPU's "power_state", or just ignore temperatures of 20 and 59 degrees. However, neither approach is elegant, especially once you consider Uniwill-based machines, which often have only one fan.

$ cat "/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/power_state"
D3cold
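A minimal sketch of the "power_state" check suggested above. The sysfs path is the one from this comment and will differ between machines; the function names are hypothetical, and treating only "D0" as "powered on" is an assumption about how a caller would want to use it.

```python
from pathlib import Path

# Path taken from the comment above; the PCI address is machine-specific.
GPU_POWER_STATE = Path(
    "/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/power_state"
)

def gpu_is_powered(path=GPU_POWER_STATE):
    """True only if the PCI device reports D0 (fully powered)."""
    try:
        return path.read_text().strip() == "D0"
    except OSError:
        # Attribute missing or unreadable: treat the GPU as not powered.
        return False

def usable_gpu_temp(reading, path=GPU_POWER_STATE):
    """Return the reading if the GPU is powered, else None.

    What to do with None (hold the last good value, fall back to the
    CPU temperature, ...) is left to the caller.
    """
    return reading if gpu_is_powered(path) else None
```

With the GPU in D3cold as in the output above, `usable_gpu_temp(20)` would return `None` rather than passing the placeholder 20 degrees on to the fan curve.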

dijia1124 · 2024-11-13T06:30:09Z

My temporary workaround for the past few months has been to disable the dGPU and set the fan to a low speed, so that the machine is neither too hot nor too noisy.

/etc/udev/rules.d/00-remove-nvidia.rules

# Remove NVIDIA USB xHCI Host Controller devices, if present
ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x0c0330", ATTR{power/control}="auto", ATTR{remove}="1"

# Remove NVIDIA USB Type-C UCSI devices, if present
ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x0c8000", ATTR{power/control}="auto", ATTR{remove}="1"

# Remove NVIDIA Audio devices, if present
ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x040300", ATTR{power/control}="auto", ATTR{remove}="1"

# Remove NVIDIA VGA/3D controller devices
ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x03[0-9]*", ATTR{power/control}="auto", ATTR{remove}="1"

dijia1124 changed the title ~~Fan 2 Unable to Properly Read Nvidia GPU Temperature on Clevo NP60SND (possibly Tuxedo Gemini 16 Gen 2)~~ Fan 2 Unable to Properly Read Nvidia GPU Temperature on Clevo NP60SND Apr 9, 2024
