CPU instability when compiling shaders. #228
Comments
Hello @xioren, this reads more like a hardware or microcode issue than a userspace issue, but on the off chance it's not, I've transferred this issue report to the Fossilize issue tracker. I think that by "compiling shaders" you mean Steam's shader pre-caching sub-component rather than on-demand compilation by a game running with Proton. |
You should run |
@kakra Unfortunately |
@xioren Maybe there is a hotspot on your CPU and fossilize is able to hit it. Maybe try lowering the power limit of your CPU, and if that helps, maybe re-apply thermal paste. But before going to that effort, I'd look into the MCE results if the system doesn't behave strangely in other scenarios. Some years ago, I found (always correctable) bit-flips in my CPU cache when it was overclocked. These were gone with a better CPU cooler. |
Just to update: I have verified that the CPU thermals remain consistent and within normal ranges during sustained 100% workloads across all cores. I have run Cinebench, memtest, and AI workloads without issue. I have also updated the motherboard BIOS firmware several times. Everything is rock solid (as it should be, since this is a brand new high-end PC) and I am confident that I have ruled out hardware or thermal issues. Despite my best efforts, I cannot get an MCE to occur in any scenario other than compiling shaders in Steam. It may be that this action produces a specific combination of instructions that causes instability. One possibility is a problem with the Intel microcode itself, although I have no way of verifying or testing this. Unfortunately, there is currently no way to log MCEs on Debian, so I am stuck waiting for an update to rasdaemon (that may never come). I will update with any future developments. |
I am also having a problem with this error: segfault at 18 ip 0000564f8a0a4a1e sp 00007ffe35b1a620 error 4 in fossilize_replay[564f8a07d000+199000] likely on CPU 2 (core 2, socket 0). The system becomes very slow to use; when I close Steam, it works normally again. I thought it was the processor, but I ran several stress benchmarks for 5 to 10 minutes each and they didn't report any errors. |
This is probably a different issue. OP says the complete PC crashes. For you, just one process crashed - and it's just a segfault, not an MCE. "Becomes slow" likely means the system is low on memory, some processes get no more memory, and thus they crash with a segfault. There are different issues tracking the memory usage of fossilize; one may be caused by a buggy version of Mesa. |
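To make the distinction between a segfault and an MCE more concrete, here is a minimal Python sketch that decodes the kernel's x86 segfault line quoted above. The function name `decode_segfault` and the regular expression are illustrative assumptions, not part of Fossilize or any kernel tool; the bit meanings follow the x86 page-fault error code (bit 0: page was present, bit 1: write access, bit 2: user mode).

```python
import re

# Pattern for the x86 "segfault at ..." line that the Linux kernel logs.
# Assumption: the line has the same shape as the one quoted in this thread.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>[0-9a-f]+) in (?P<name>[^\[\s]+)"
    r"\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Return a dict describing the fault, or None if the line does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m["err"], 16)
    return {
        "binary": m["name"],
        # Address the process tried to access.
        "fault_address": int(m["addr"], 16),
        # Offset of the faulting instruction inside the mapped binary.
        "code_offset": int(m["ip"], 16) - int(m["base"], 16),
        # x86 page-fault error code bits.
        "page_present": bool(err & 1),
        "write": bool(err & 2),
        "user_mode": bool(err & 4),
    }


if __name__ == "__main__":
    line = (
        "segfault at 18 ip 0000564f8a0a4a1e sp 00007ffe35b1a620 error 4 in "
        "fossilize_replay[564f8a07d000+199000] likely on CPU 2 (core 2, socket 0)"
    )
    print(decode_segfault(line))
```

For the line above, this reports a user-mode read of an unmapped page at address 0x18, which is a very low address and typical of an ordinary software fault such as dereferencing a null pointer. That is a different class of event from a machine check reported by the CPU.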
Okay it looks like rasdaemon is finally fixed and MCEs are getting logged.
All these while compiling shaders in Steam. They do not occur in any other context. |
Overclocked or undervolted? Try lowering clocks a little or raise undervolting a little, or disable unlimited power limits (if your board says 4095W, it's unlimited, many boards come with that by default, set PL1 and PL2 according to your processor specs instead). BTW: All errors were corrected. But your CPU is either of low quality (not like in "bad and broken" but more like "cannot take so much overclocking") or under thermal stress. |
Yes, they were corrected, and coincidentally my PC did not crash this time; it often does while compiling, so potentially the errors are not always corrected. The CPU is an i9-13900K and I have not overclocked (or undervolted) anything. If thermal issues can cause these kinds of errors, then my next step is to try re-applying thermal paste and hope that helps. I suppose I will do that and update with results. |
XMP memory profiles may also cause this because they overclock the memory controller in the CPU. OTOH, maybe these errors are more or less bogus and it is normal to find and correct some bitflips - as long as the CPU does this reliably, there's no real problem. But given the number of errors in a short time period, I'd tend to say: That's not normal. Here's an old post that suggests that some CPU platforms may spuriously generate these errors and they can be safely ignored: https://forums.centos.org/viewtopic.php?t=66117. Maybe you can find something similar about your CPU model? But even then, this feels like too many errors for spurious events. From my own experience, I can only tell that I could fix it by installing a properly sized cooler (which of course means I also applied fresh thermal paste). Other things to check: PSU? Voltage stability? Airflow? Also check if these errors cause CPU cores to become disabled ( If during fossilize the errors pile up randomly in batches, it's most likely a heat issue. You can try running stress-ng or other CPU burn-in tests to see if you can replicate the issue there. If you're not using the iGPU, try turning off the iGPU in your BIOS (I think disabling the multimonitor IGD function does this), which will free some thermal capacity. |
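To check whether error batches line up with heat, temperatures can be sampled while the shaders compile. The following is a minimal Python sketch that reads the Linux thermal zones from sysfs; it assumes the standard `/sys/class/thermal/thermal_zone*/type` and `temp` files (temperatures in millidegrees Celsius), and the helper name `read_thermal_zones` is hypothetical. Which zones exist and what they are called depends on the platform and the loaded drivers.

```python
from pathlib import Path


def read_thermal_zones(root="/sys/class/thermal"):
    """Return a list of (zone, type, degrees_celsius) for each readable zone."""
    readings = []
    for zone in sorted(Path(root).glob("thermal_zone*")):
        try:
            kind = (zone / "type").read_text().strip()
            # The kernel reports the temperature in millidegrees Celsius.
            milli_c = int((zone / "temp").read_text())
        except (OSError, ValueError):
            # Skip zones that are missing files or cannot be read right now.
            continue
        readings.append((zone.name, kind, milli_c / 1000))
    return readings


if __name__ == "__main__":
    for name, kind, temp_c in read_thermal_zones():
        print(f"{name} ({kind}): {temp_c:.1f} C")
```

Running this in a loop next to the MCE log and comparing timestamps gives a rough picture; a short spike on a single core can still be missed at coarse sampling intervals.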
BTW: Unlimited power limits are a type of overclocking... (PL1 and PL2, 4095 = unlimited). If the problem occurs during prolonged high multicore CPU loads, you may need to adjust the load line calibration so the CPU raises the core voltage more at high frequencies (this stabilizes the signals). But this needs proper cooling; you can instead lower PL1 and PL2 (continuous power consumption, burst power consumption) and adjust TAU (time limit for PL2). The 13900K should have both PL1 and PL2 set to 253 watts (so TAU doesn't matter). You can usually lower that by up to 60% without a huge performance impact. Raising it buys a little more performance for disproportionately more power consumption. That needs a cooler that, ideally, can transport 250W of heat away from the CPU (usually, something like 190W is more realistic). If your system is water cooled, ensure that no air bubbles pile up in the water block (put the reservoir/radiator pipe connections above the CPU water block). |
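On Linux, the limits that are currently in effect can usually be read without rebooting into the BIOS. Below is a minimal Python sketch that reads them from the Intel RAPL powercap interface. It assumes the `intel_rapl` driver is loaded and exposes the package zone at `/sys/class/powercap/intel-rapl:0`, where the `long_term` constraint corresponds to PL1 and `short_term` to PL2; the function name `read_rapl_limits` is hypothetical, and the values shown by the kernel may not match what the BIOS setup screen displays.

```python
from pathlib import Path


def read_rapl_limits(zone_dir="/sys/class/powercap/intel-rapl:0"):
    """Return {constraint_name: {"watts": float, "window_s": float or None}}."""
    zone = Path(zone_dir)
    limits = {}
    for name_file in sorted(zone.glob("constraint_*_name")):
        prefix = name_file.name[: -len("_name")]  # e.g. "constraint_0"
        name = name_file.read_text().strip()      # e.g. "long_term"
        # Power limits are reported in microwatts.
        watts = int((zone / f"{prefix}_power_limit_uw").read_text()) / 1_000_000
        # Time windows are reported in microseconds; not every constraint has one.
        window_file = zone / f"{prefix}_time_window_us"
        window_s = None
        if window_file.exists():
            window_s = int(window_file.read_text()) / 1_000_000
        limits[name] = {"watts": watts, "window_s": window_s}
    return limits


if __name__ == "__main__":
    for name, values in read_rapl_limits().items():
        print(name, values)
```

If `long_term` or `short_term` reports a value in the thousands of watts, the board is most likely running with the limits effectively disabled, as described above.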
I just came across this article that likely explains the issue. I have not been able to test the "fixes" yet but will update when I do. Relevant quote from article:
|
Yeah, this is most likely the issue here and explains a lot. Many "gaming" mainboards come with unlocked power limits by default, and this may be the biggest contributor to the problem. I'd even say try setting your power limits to the CPU defaults first before lowering the frequency multiplier, because lowering the power limit costs less performance while giving a nice reduction in power consumption, whereas lowering the multiplier will harm performance a lot more without much reduction in power usage. Non-K CPUs should not be affected because PL1 and PL2 are not unlocked there - as far as I know. If you buy a K-CPU, you're on your own taming the beast - because it's unlocked... You need to adjust it to the limits your system can handle and the maximum stress each core can handle. A BIOS usually allows you to adjust the maximum boost depending on how many cores are boosting at the same time (lower boost on multiple cores vs higher boost on a single core). If the problem happens on specific cores, you can also reduce the boost of those cores. This may be a way to go instead of reducing the overall multiplier. I'd still try PL1/PL2/TAU first... If it works with 253W (or even lower) PL1/PL2, you can then slowly raise PL2 first (the burst power limit) and play around with TAU (the duration of the burst), then slowly raise PL1. The higher you go with the power limit, the smaller the performance gain, so there's not much sense in trying to tune it for the maximum possible stable peak - usually, you should not go much higher than 253W on a 13900K. It makes no real difference in normal gaming; you only see the difference in synthetic benchmarks. Repeating from above and to put it in other words, this is my gut feeling: The processor is designed for the high peak frequencies, but it is not designed to do that at unlimited power consumption, and especially not for prolonged periods of time. And this is also where appropriate cooling for your settings comes into play. |
I changed the turbo power limit from auto (unlimited) to Intel POR. I was just able to compile the shaders for Halo: The Master Chief Collection with no MCEs. That was a title that gave errors every time while compiling. Hopefully this will hold true for all games. Anyway, I think I can finally close this. Thank you for your help and insight. |
I have never had an issue in the past, but as of today my PC crashed while compiling shaders. Upon checking, the logs are filled with:
These errors only show up when I am compiling shaders. I don't believe this is a CPU issue: not only is it a brand new 13900K, but I can also run Cinebench just fine without errors or instability.