CPU instability when compiling shaders. #228

Closed
vercingetorx opened this issue Jul 15, 2023 · 15 comments

Comments

@vercingetorx

vercingetorx commented Jul 15, 2023

I have never had an issue in the past, but as of today my PC crashed while compiling shaders. The logs are filled with:

Jul 14 18:33:46 [redacted] kernel: mce: [Hardware Error]: PROCESSOR 0:b0671 TIME 1689384826 SOCKET 0 APIC 38 microcode 113
Jul 14 18:33:46 [redacted] kernel: mce: [Hardware Error]: TSC 2a8e134418f ADDR 7f248265a908 
Jul 14 18:33:46 [redacted] kernel: mce: [Hardware Error]: CPU 14: Machine Check: 0 Bank 0: 8400010000060010
Jul 14 18:33:46 [redacted] kernel: mce: [Hardware Error]: Machine check events logged

These errors only show up when I am compiling shaders. I don't believe this is a CPU issue: not only is it a brand new 13900K, but I can also run Cinebench just fine without errors or instability.
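For readers who want to interpret the `Bank 0: 8400010000060010` value in the log, here is a minimal sketch that splits a machine-check status word into its architectural fields. The bit positions follow the IA32_MCi_STATUS layout in the Intel SDM; the function names are made up for illustration, and the corrected-error count field is only meaningful when the CPU advertises CMCI support, which this log excerpt does not show.

```python
# Sketch: decode the architectural fields of an IA32_MCi_STATUS value.
# Function names are illustrative; bit positions follow the Intel SDM layout.

def classify_mca_code(code: int) -> str:
    """Classify the low 16 bits (MCA error code) into a few broad families."""
    code &= ~0x1000  # bit 12 is the correction-report filtering flag in compound codes
    if code == 0x0005:
        return "internal parity error"
    if code & 0xFFF0 == 0x0010:
        return "TLB error"
    if code & 0xFF00 == 0x0100:
        return "cache hierarchy error"
    return "other"


def decode_mci_status(status: int) -> dict:
    """Return the main flags and codes packed into a 64-bit status word."""
    mca_code = status & 0xFFFF
    return {
        "valid": bool((status >> 63) & 1),            # VAL
        "overflow": bool((status >> 62) & 1),         # OVER
        "uncorrected": bool((status >> 61) & 1),      # UC
        "addr_valid": bool((status >> 58) & 1),       # ADDRV
        "context_corrupt": bool((status >> 57) & 1),  # PCC
        # Bits 52:38; only defined when the CPU reports CMCI support.
        "corrected_count": (status >> 38) & 0x7FFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_code": mca_code,
        "mca_class": classify_mca_code(mca_code),
    }


info = decode_mci_status(0x8400010000060010)
print(info)
```

For the value above, the UC bit is clear (the error was corrected), ADDRV is set (which is why the log carries an ADDR field), and the MCA code falls in the TLB error family.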

@vercingetorx vercingetorx changed the title CPU instability when compiling on Linux. CPU instability when compiling shaders. Jul 15, 2023
@kisak-valve kisak-valve transferred this issue from ValveSoftware/Proton Jul 15, 2023
@kisak-valve
Member

Hello @xioren, this reads like a hardware or microcode issue more than a userspace issue, but on the off chance it's not, I've transferred this issue report to the Fossilize issue tracker. I think that by "compiling shaders" you mean Steam's shader pre-caching sub-component rather than on-demand compilation by a game running with Proton.

@kakra
Contributor

kakra commented Jul 15, 2023

You should run sudo mcelog to get more details about this.

@vercingetorx
Author

vercingetorx commented Jul 16, 2023

@kakra Unfortunately, mcelog was deprecated on Debian in favor of rasdaemon, which currently has a bug with logging MCEs. I will post the output if and when I can. In the meantime, I also ran memtest and it completed without error. So again, despite my best efforts, I can only get these MCEs to show up when I am compiling shaders. Will update when I can with MCE logs.

@kakra
Contributor

kakra commented Jul 16, 2023

@xioren Maybe there is a hotspot on your CPU and fossilize is able to hit that. Maybe try lowering the power limit of your CPU, and if that helps, maybe re-apply thermal paste. But before going to that effort, I'd look into the MCE results if the system doesn't behave strangely in other scenarios. Some years ago, I found (always correctable) bit-flips in my CPU cache when it was overclocked. These were gone with a better CPU cooler.

@vercingetorx
Author

vercingetorx commented Sep 28, 2023

Just to update: I have verified that the CPU thermals remain consistent and within normal ranges during sustained 100% workloads across all cores. I have run Cinebench, memtest, and AI workloads without issue. I have also updated the motherboard BIOS firmware several times. Everything is rock solid (as it should be; this is a brand new high-end PC) and I am confident that I have ruled out hardware or thermal issues. Despite my best efforts, I cannot get an MCE to occur in any scenario other than compiling shaders in Steam. It may be that this action produces a specific combination of instructions that causes instability. One possibility is a problem with the Intel microcode itself, although I have no way of verifying or testing this. Unfortunately, there is currently no way to log MCEs on Debian, so I am stuck waiting for an update to rasdaemon (that may never come). I will update with any future developments.

@Alejandro9509

Alejandro9509 commented Oct 1, 2023

I am also having a problem with this error

segfault at 18 ip 0000564f8a0a4a1e sp 00007ffe35b1a620 error 4 in fossilize_replay[564f8a07d000+199000] likely on CPU 2 (core 2, socket 0)

and the system becomes very slow to use. When I close Steam, it works normally again.

I thought it was the processor, but I ran several stress benchmarks for 5 to 10 minutes and they didn't produce any errors.

@kakra
Contributor

kakra commented Oct 1, 2023

I am also having a problem with this error

This is probably a different issue. OP says the complete PC crashes. For you, just one process crashed, and it's a segfault, not an MCE. "Becomes slow" likely means the system is low on memory: some processes get no more memory, and thus they crash with a segfault. There are separate issues tracking the memory usage of fossilize; one may be caused by a buggy version of Mesa.
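To make the difference from an MCE concrete, here is a rough sketch that pulls apart the kernel's segfault line quoted above. The regular expression and field names are illustrative assumptions, not part of any tool; the error value is read as an x86 page-fault error code, where bit 0 means the page was present, bit 1 means a write, and bit 2 means user mode.

```python
import re

# Sketch: parse an x86 Linux "segfault at ..." kernel log line.
# The pattern and the returned field names are illustrative.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>[0-9a-f]+) in (?P<image>[^\[]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line: str):
    """Return the decoded fields of a segfault line, or None if it does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m["err"], 16)
    return {
        "fault_addr": int(m["addr"], 16),
        "image": m["image"],
        # Offset of the faulting instruction from the start of the mapping.
        "offset_in_mapping": int(m["ip"], 16) - int(m["base"], 16),
        "page_present": bool(err & 1),
        "write": bool(err & 2),
        "user_mode": bool(err & 4),
    }


line = ("segfault at 18 ip 0000564f8a0a4a1e sp 00007ffe35b1a620 error 4 "
        "in fossilize_replay[564f8a07d000+199000] likely on CPU 2 (core 2, socket 0)")
print(parse_segfault(line))
```

For this line the fault is a user-mode read of an unmapped page at address 0x18, which is consistent with dereferencing a near-null pointer (for example after a failed allocation). No hardware error is reported, unlike the machine check events in the original report.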

@vercingetorx
Author

vercingetorx commented Feb 2, 2024

Okay it looks like rasdaemon is finally fixed and MCEs are getting logged.

MCE events:
1 2024-02-01 16:38:21 -0800 error: Instruction TLB Level-0 Error, mcg mcgstatus=0, mci Corrected_error, mcgcap=0x00000c16, status=0x8400014000060010, addr=0x562a74d0c4c0, tsc=0x6e83fb93883, walltime=0x65bc397d, cpu=0x0000000e, cpuid=0x000b0671, apicid=0x00000038
2 2024-02-01 16:38:23 -0800 error: Instruction CACHE Level-0 Instruction-Fetch Error, mcg mcgstatus=0, mci Corrected_error, mcgcap=0x00000c16, status=0x8000004000020150, tsc=0x6e969325e9f, walltime=0x65bc397f, cpu=0x0000000e, cpuid=0x000b0671, apicid=0x00000038
3 2024-02-01 16:38:24 -0800 error: Internal parity error, mcg mcgstatus=0, mci Corrected_error, mcgcap=0x00000c16, status=0x8000004000050005, tsc=0x6e9eeea9f5b, walltime=0x65bc3980, cpu=0x0000000e, cpuid=0x000b0671, apicid=0x00000038
4 2024-02-01 16:38:25 -0800 error: Instruction TLB Level-0 Error, mcg mcgstatus=0, mci Corrected_error, mcgcap=0x00000c16, status=0x8400004000060010, addr=0x7f4184fa6750, tsc=0x6ea629f3371, walltime=0x65bc3981, cpu=0x0000000e, cpuid=0x000b0671, apicid=0x00000038
5 2024-02-01 16:38:25 -0800 error: Internal parity error, mcg mcgstatus=0, mci Corrected_error, mcgcap=0x00000c16, status=0x8000004000050005, tsc=0x6eaeba5d9c9, walltime=0x65bc3981, cpu=0x0000000e, cpuid=0x000b0671, apicid=0x00000038
6 2024-02-01 16:38:25 -0800 error: Internal parity error, mcg mcgstatus=0, mci Corrected_error, mcgcap=0x00000c16, status=0x8000004000050005, tsc=0x6eb017a529b, walltime=0x65bc3981, cpu=0x0000000c, cpuid=0x000b0671, apicid=0x00000030
7 2024-02-01 16:38:26 -0800 error: Internal parity error, mcg mcgstatus=0, mci Corrected_error, mcgcap=0x00000c16, status=0x8000004000050005, tsc=0x6eb23ab9085, walltime=0x65bc3982, cpu=0x0000000e, cpuid=0x000b0671, apicid=0x00000038
8 2024-02-01 16:38:26 -0800 error: Internal parity error, mcg mcgstatus=0, mci Corrected_error, mcgcap=0x00000c16, status=0x8000004000050005, tsc=0x6eb8d161d45, walltime=0x65bc3982, cpu=0x0000000e, cpuid=0x000b0671, apicid=0x00000038
9 2024-02-01 16:38:26 -0800 error: Internal parity error, mcg mcgstatus=0, mci Corrected_error, mcgcap=0x00000c16, status=0x8000004000050005, tsc=0x6eb8e75c6df, walltime=0x65bc3982, cpu=0x0000000e, cpuid=0x000b0671, apicid=0x00000038
10 2024-02-01 16:38:27 -0800 error: Internal parity error, mcg mcgstatus=0, mci Corrected_error, mcgcap=0x00000c16, status=0x8000004000050005, tsc=0x6ebee9a9a3b, walltime=0x65bc3983, cpu=0x0000000e, cpuid=0x000b0671, apicid=0x00000038
11 2024-02-01 16:38:27 -0800 error: Internal parity error, mcg mcgstatus=0, mci Corrected_error, mcgcap=0x00000c16, status=0x8000004000050005, tsc=0x6ebeec61a95, walltime=0x65bc3983, cpu=0x0000000e, cpuid=0x000b0671, apicid=0x00000038
12 2024-02-01 16:38:28 -0800 error: Internal parity error, mcg mcgstatus=0, mci Corrected_error, mcgcap=0x00000c16, status=0x8000004000050005, tsc=0x6ec8b0cb821, walltime=0x65bc3984, cpu=0x00000008, cpuid=0x000b0671, apicid=0x00000020
13 2024-02-01 16:38:28 -0800 error: Internal parity error, mcg mcgstatus=0, mci Corrected_error, mcgcap=0x00000c16, status=0x8000004000050005, tsc=0x6ed1c08fee1, walltime=0x65bc3984, cpu=0x0000000e, cpuid=0x000b0671, apicid=0x00000038
14 2024-02-01 16:38:29 -0800 error: Internal parity error, mcg mcgstatus=0, mci Corrected_error, mcgcap=0x00000c16, status=0x8000004000050005, tsc=0x6ed38e6d489, walltime=0x65bc3985, cpu=0x0000000e, cpuid=0x000b0671, apicid=0x00000038
15 2024-02-01 16:38:29 -0800 error: Internal parity error, mcg mcgstatus=0, mci Corrected_error, mcgcap=0x00000c16, status=0x8000004000050005, tsc=0x6ed68bb570f, walltime=0x65bc3985, cpu=0x0000000e, cpuid=0x000b0671, apicid=0x00000038
16 2024-02-01 16:38:29 -0800 error: Internal parity error, mcg mcgstatus=0, mci Corrected_error, mcgcap=0x00000c16, status=0x8000004000050005, tsc=0x6ed68dfc8e3, walltime=0x65bc3985, cpu=0x0000000e, cpuid=0x000b0671, apicid=0x00000038
17 2024-02-01 16:38:29 -0800 error: Instruction CACHE Level-0 Instruction-Fetch Error, mcg mcgstatus=0, mci Corrected_error, mcgcap=0x00000c16, status=0x8000004000020150, tsc=0x6eda083a0f3, walltime=0x65bc3985, cpu=0x0000000e, cpuid=0x000b0671, apicid=0x00000038
18 2024-02-01 16:38:33 -0800 error: Instruction TLB Level-0 Error, mcg mcgstatus=0, mci Corrected_error, mcgcap=0x00000c16, status=0x8400004000060010, addr=0x7f41852004e4, tsc=0x6f034805e97, walltime=0x65bc3989, cpu=0x0000000e, cpuid=0x000b0671, apicid=0x00000038
19 2024-02-01 16:38:33 -0800 error: Internal parity error, mcg mcgstatus=0, mci Corrected_error, mcgcap=0x00000c16, status=0x8000004000050005, tsc=0x6f0366acaf7, walltime=0x65bc3989, cpu=0x0000000e, cpuid=0x000b0671, apicid=0x00000038
20 2024-02-01 16:38:33 -0800 error: Internal parity error, mcg mcgstatus=0, mci Corrected_error, mcgcap=0x00000c16, status=0x8000004000050005, tsc=0x6f0383edc7f, walltime=0x65bc3989, cpu=0x0000000e, cpuid=0x000b0671, apicid=0x00000038
21 2024-02-01 16:38:33 -0800 error: Internal parity error, mcg mcgstatus=0, mci Corrected_error, mcgcap=0x00000c16, status=0x8000004000050005, tsc=0x6f038450693, walltime=0x65bc3989, cpu=0x0000000e, cpuid=0x000b0671, apicid=0x00000038
22 2024-02-01 16:38:33 -0800 error: Internal parity error, mcg mcgstatus=0, mci Corrected_error, mcgcap=0x00000c16, status=0x8000004000050005, tsc=0x6f0384a5153, walltime=0x65bc3989, cpu=0x0000000e, cpuid=0x000b0671, apicid=0x00000038
23 2024-02-01 16:38:33 -0800 error: Internal parity error, mcg mcgstatus=0, mci Corrected_error, mcgcap=0x00000c16, status=0x8000004000050005, tsc=0x6f0384bfce7, walltime=0x65bc3989, cpu=0x0000000e, cpuid=0x000b0671, apicid=0x00000038
24 2024-02-01 16:38:33 -0800 error: Internal parity error, mcg mcgstatus=0, mci Corrected_error, mcgcap=0x00000c16, status=0x8000004000050005, tsc=0x6f0385353cd, walltime=0x65bc3989, cpu=0x0000000e, cpuid=0x000b0671, apicid=0x00000038
25 2024-02-01 16:38:33 -0800 error: Internal parity error, mcg mcgstatus=0, mci Corrected_error, mcgcap=0x00000c16, status=0x8000004000050005, tsc=0x6f03865cd45, walltime=0x65bc3989, cpu=0x0000000e, cpuid=0x000b0671, apicid=0x00000038
26 2024-02-01 16:38:33 -0800 error: Internal parity error, mcg mcgstatus=0, mci Corrected_error, mcgcap=0x00000c16, status=0x8000004000050005, tsc=0x6f038676e0b, walltime=0x65bc3989, cpu=0x0000000e, cpuid=0x000b0671, apicid=0x00000038
27 2024-02-01 16:38:33 -0800 error: Instruction TLB Level-0 Error, mcg mcgstatus=0, mci Corrected_error, mcgcap=0x00000c16, status=0x8400008000060010, addr=0x7f4185202bd8, tsc=0x6f07b897cbf, walltime=0x65bc3989, cpu=0x0000000e, cpuid=0x000b0671, apicid=0x00000038
28 2024-02-01 16:38:35 -0800 error: Instruction CACHE Level-0 Instruction-Fetch Error, mcg mcgstatus=0, mci Corrected_error, mcgcap=0x00000c16, status=0x8000004000020150, tsc=0x6f1e10df52f, walltime=0x65bc398b, cpu=0x0000000e, cpuid=0x000b0671, apicid=0x00000038
29 2024-02-01 16:38:36 -0800 error: Internal parity error, mcg mcgstatus=0, mci Corrected_error, mcgcap=0x00000c16, status=0x8000004000050005, tsc=0x6f21ca92949, walltime=0x65bc398c, cpu=0x0000000e, cpuid=0x000b0671, apicid=0x00000038
30 2024-02-01 16:38:36 -0800 error: Instruction TLB Level-0 Error, mcg mcgstatus=0, mci Corrected_error, mcgcap=0x00000c16, status=0x8400010000060010, addr=0x7f4185922aa0, tsc=0x6f2b2539bcb, walltime=0x65bc398c, cpu=0x0000000e, cpuid=0x000b0671, apicid=0x00000038
31 2024-02-01 16:38:37 -0800 error: Internal parity error, mcg mcgstatus=0, mci Corrected_error, mcgcap=0x00000c16, status=0x8000004000050005, tsc=0x6f31dccdf47, walltime=0x65bc398d, cpu=0x0000000c, cpuid=0x000b0671, apicid=0x00000030
32 2024-02-01 16:38:39 -0800 error: Internal parity error, mcg mcgstatus=0, mci Corrected_error, mcgcap=0x00000c16, status=0x8000004000050005, tsc=0x6f45a4bf0e9, walltime=0x65bc398f, cpu=0x0000000e, cpuid=0x000b0671, apicid=0x00000038
33 2024-02-01 16:38:39 -0800 error: Instruction TLB Level-0 Error, mcg mcgstatus=0, mci Corrected_error, mcgcap=0x00000c16, status=0x8400004000060010, addr=0x7f4185931040, tsc=0x6f473675171, walltime=0x65bc398f, cpu=0x0000000e, cpuid=0x000b0671, apicid=0x00000038
34 2024-02-01 16:38:39 -0800 error: Instruction CACHE Level-0 Instruction-Fetch Error, mcg mcgstatus=0, mci Corrected_error, mcgcap=0x00000c16, status=0x8000004000020150, tsc=0x6f4bdabd69d, walltime=0x65bc398f, cpu=0x0000000e, cpuid=0x000b0671, apicid=0x00000038
35 2024-02-01 16:38:40 -0800 error: Internal parity error, mcg mcgstatus=0, mci Corrected_error, mcgcap=0x00000c16, status=0x8000004000050005, tsc=0x6f5318c7c1d, walltime=0x65bc3990, cpu=0x0000000e, cpuid=0x000b0671, apicid=0x00000038

All these while compiling shaders in Steam. They do not occur in any other context.
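As a reading aid for the list above, here is a small sketch that tallies rasdaemon event lines by CPU and error description. The regular expression is an assumption based only on the line format shown in this comment, and the three sample lines are copied from entries 1, 3 and 6.

```python
import re
from collections import Counter

# Sketch: count rasdaemon MCE lines per (cpu, description).
# The pattern is inferred from the lines posted above, not from rasdaemon itself.
LINE_RE = re.compile(
    r"error: (?P<desc>[^,]+),.*?status=(?P<status>0x[0-9a-f]+).*?\bcpu=(?P<cpu>0x[0-9a-f]+)"
)


def tally(lines):
    """Return a Counter keyed by (cpu number, error description)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            counts[(int(m["cpu"], 16), m["desc"])] += 1
    return counts


sample = [
    "1 2024-02-01 16:38:21 -0800 error: Instruction TLB Level-0 Error, mcg mcgstatus=0, mci Corrected_error, mcgcap=0x00000c16, status=0x8400014000060010, addr=0x562a74d0c4c0, tsc=0x6e83fb93883, walltime=0x65bc397d, cpu=0x0000000e, cpuid=0x000b0671, apicid=0x00000038",
    "3 2024-02-01 16:38:24 -0800 error: Internal parity error, mcg mcgstatus=0, mci Corrected_error, mcgcap=0x00000c16, status=0x8000004000050005, tsc=0x6e9eeea9f5b, walltime=0x65bc3980, cpu=0x0000000e, cpuid=0x000b0671, apicid=0x00000038",
    "6 2024-02-01 16:38:25 -0800 error: Internal parity error, mcg mcgstatus=0, mci Corrected_error, mcgcap=0x00000c16, status=0x8000004000050005, tsc=0x6eb017a529b, walltime=0x65bc3981, cpu=0x0000000c, cpuid=0x000b0671, apicid=0x00000030",
]
for (cpu, desc), n in sorted(tally(sample).items()):
    print(f"cpu {cpu}: {desc} x{n}")
```

In the full list, 32 of the 35 events carry cpu=0x0000000e (CPU 14), the same CPU named in the kernel log from the original report; the remaining three are on cpu=0x0000000c and cpu=0x00000008.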

@kakra
Contributor

kakra commented Feb 2, 2024

All these while compiling shaders in Steam. They do not occur in any other context.

Overclocked or undervolted? Try lowering the clocks a little, or, if undervolted, raising the voltage a little, or disable unlimited power limits (if your board says 4095W, it's unlimited; many boards come with that by default. Set PL1 and PL2 according to your processor specs instead).

BTW: All errors were corrected. But your CPU is either of low quality (not as in "bad and broken" but more like "cannot take so much overclocking") or under thermal stress.

@vercingetorx
Author

vercingetorx commented Feb 2, 2024

Yes, they were corrected, and coincidentally my PC did not crash; it often does while compiling, so potentially they are not always corrected. The CPU is an i9-13900K and I have not overclocked (or undervolted) anything. If thermal issues can cause these kinds of errors, then my next step is to try re-applying thermal paste and hope that helps. I suppose I will do that and update with results.

@kakra
Contributor

kakra commented Feb 2, 2024

XMP memory profiles may also cause this because they overclock the memory controller in the CPU. OTOH, maybe these errors are more or less bogus and it is normal to find and correct some bitflips. As long as the CPU does this reliably, there's no real problem. But given the number of errors in a short time period, I'd tend to say: that's not normal.

Here's an old post that suggests that some CPU platforms may spuriously generate these errors and they can be safely ignored: https://forums.centos.org/viewtopic.php?t=66117. Maybe you can find something similar about your CPU model? But even then, this feels like too many errors for spurious events.

From my own experience, I can only say that I could fix it by installing a properly sized cooler (which of course means I also applied fresh thermal paste).

Other things to check: PSU? Voltage stability? Airflow?

Also check if these errors cause CPU cores to become disabled (cat /proc/cpuinfo) - if this happens, your CPU is likely faulty or protects its own stability from another misbehaving system component.
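As an alternative to reading through /proc/cpuinfo by eye, the check can be sketched by comparing the kernel's "present" and "online" CPU lists in sysfs. This is a rough sketch: the function names are made up, and it assumes a Linux system that exposes /sys/devices/system/cpu/present and /sys/devices/system/cpu/online.

```python
from pathlib import Path

# Sketch: find CPUs that are present but not online.
# Function names are illustrative; the sysfs location is assumed to exist.


def parse_cpu_list(text: str) -> set[int]:
    """Parse a kernel CPU list such as '0-13,15-31' into a set of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def offline_cpus(sysfs: Path = Path("/sys/devices/system/cpu")) -> set[int]:
    """Return CPUs listed as present but missing from the online list."""
    present = parse_cpu_list((sysfs / "present").read_text())
    online = parse_cpu_list((sysfs / "online").read_text())
    return present - online


# Example with a made-up online list in which CPU 14 has gone offline:
print(sorted(set(range(32)) - parse_cpu_list("0-13,15-31")))
```

On a real machine, `offline_cpus()` returning an empty set would suggest no core has been taken offline; a CPU can also be offline for unrelated reasons, so a non-empty result is only a hint.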

If the errors pile up randomly in batches while fossilize is running, it's most likely a heat issue. You can try running stress-ng or other CPU burn-in tests to see if you can replicate the issue there. If you're not using the iGPU, try turning it off in your BIOS (I think disabling the multi-monitor IGD function does this), which will free some thermal capacity.

@kakra
Contributor

kakra commented Feb 2, 2024

and I have not overclocked (or undervolted) anything

BTW: Unlimited power limits are a type of overclocking... (PL1 and PL2, 4095 = unlimited). If the problem occurs during prolonged high multicore CPU loads, you may need to adjust the load line calibration so the CPU raises the core voltage more at high frequencies (which stabilizes the signals). But this needs proper cooling; alternatively, you can lower PL1 and PL2 (continuous and burst power consumption, respectively) and adjust TAU (the time limit for PL2). The 13900K should have both PL1 and PL2 set to 253 watts (so TAU doesn't matter). You can usually lower that by up to 60% without a huge performance impact. You can raise it for disproportionately more power consumption and a little more performance. That needs a cooler that, ideally, can transport 250W of heat away from the CPU (usually, something like 190W is more realistic).
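On Linux, the limits currently in effect can often be read without rebooting into the BIOS. The following is only a sketch under assumptions: it expects the intel_rapl powercap driver to expose the package zone at /sys/class/powercap/intel-rapl:0, and it assumes constraint 0 is the long-term limit (PL1) and constraint 1 the short-term limit (PL2), which is worth confirming via the constraint_0_name and constraint_1_name files. The 4095 W threshold is taken from the comment above; the function names are made up.

```python
from pathlib import Path

# Sketch: read package power limits from the Linux powercap sysfs interface.
# The zone path and the constraint numbering are assumptions (see lead-in).

UNLIMITED_W = 4095  # per the thread: a board showing 4095 W is effectively unlimited


def describe_limit(microwatts: int) -> str:
    """Format a power limit given in microwatts, flagging 'unlimited' values."""
    watts = microwatts / 1_000_000
    if watts >= UNLIMITED_W:
        return f"{watts:.0f} W (effectively unlimited)"
    return f"{watts:.0f} W"


def read_package_limits(zone: Path = Path("/sys/class/powercap/intel-rapl:0")) -> dict:
    """Read PL1, PL2 (microwatts) and the PL1 time window (microseconds)."""
    return {
        "PL1": int((zone / "constraint_0_power_limit_uw").read_text()),
        "PL2": int((zone / "constraint_1_power_limit_uw").read_text()),
        "tau_us": int((zone / "constraint_0_time_window_us").read_text()),
    }


# Formatting examples that need no hardware access:
print(describe_limit(253_000_000))
print(describe_limit(4_095_000_000))
```

Calling `read_package_limits()` on the affected machine and passing the PL1 and PL2 values through `describe_limit` would show whether the board has left the limits unlimited. Reading these files may require elevated permissions on some systems.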

If your system is water cooled, ensure that no air bubbles pile up in the water block (put the reservoir/radiator pipe connections above the CPU water block).

@vercingetorx
Author

I just came across this article that likely explains the issue. I have not been able to test the "fixes" yet but will update when I do.

Relevant quote from article:

Complaints about stability issues on the 13900K and 13700K aren't exactly widespread, but have become particularly concentrated on a handful of games, specifically for those with shader compilation.

@kakra
Contributor

kakra commented Feb 24, 2024

I just came across this article that likely explains the issue. I have not been able to test the "fixes" yet but will update when I do.

Yeah, this is most likely the issue here and explains a lot. Many "gaming" mainboards come with unlocked power limits by default, and this may be the biggest contributor to the problem. I'd even say try setting your power limits to the CPU defaults first before lowering the frequency multiplier, because lowering the power limit loses less performance while giving a nice reduction in power consumption, whereas lowering the multiplier will harm performance a lot more without much reduction in power usage.

Non-K CPUs should not be affected because PL1 and PL2 are not unlocked there - as far as I know. If you buy a K-CPU, you're on your own taming the beast - because it's unlocked... You need to adjust it to the limits your system can handle and the maximum stress each core can handle.

A BIOS usually allows you to adjust the maximum boost depending on how many cores are boosting at the same time (lower boost on multiple cores vs higher boost on a single core). If the problem happens on specific cores, you can also reduce the boost of those cores. This may be a way to go instead of reducing the overall multiplier. I'd still try PL1/PL2/TAU first...

If it works with 253W (or even lower) PL1/PL2, you can then slowly raise PL2 first (for the burst power limit) and play around with TAU (for the duration of the burst), then slowly raise PL1. The higher you go with the power limit, the smaller the performance gain, so there's not much sense in trying to adjust it for the maximum possible stable peak. Usually, you should not go much higher than 253W on a 13900K. It makes no real difference in normal gaming; you only see the difference in synthetic benchmarks.

Repeating from above and to put it in other words, this is my gut feeling: The processor is designed for the high peak frequencies, but it is not designed to do that at unlimited power consumption, and especially not for prolonged periods of time. And this is also where appropriate cooling for your settings comes into play.

@vercingetorx
Author

I changed the turbo power limit from auto (unlimited) to Intel POR. I was just able to compile the shaders for Halo: The Master Chief Collection with no MCEs. That was a title that gave errors every time while compiling. Hopefully this will hold true for all games. Anyway, I think I can finally close this. Thank you for your help and insight.
