rasdaemon does not log MCE #95
Comments
I am having a similar problem with similar hardware/software: rasdaemon v0.6.6 (from Debian, deb http://ftp.us.debian.org/debian bullseye) on kernel 6.2 (Proxmox 7.4, 6.2.11-1-pve); ASRock X570D4U-2L2T; and AMD Ryzen 5950X.
root@pve:~# dmesg -T | grep -i mce
[Wed May 10 07:29:55 2023] MCE: In-kernel MCE decoding enabled.
[Wed May 10 08:06:13 2023] mce: [Hardware Error]: Machine check events logged
[Wed May 10 08:37:21 2023] mce: [Hardware Error]: Machine check events logged
[Wed May 10 09:13:40 2023] mce: [Hardware Error]: Machine check events logged
[Wed May 10 09:44:48 2023] mce: [Hardware Error]: Machine check events logged
[Wed May 10 10:21:07 2023] mce: [Hardware Error]: Machine check events logged
[Wed May 10 10:52:15 2023] mce: [Hardware Error]: Machine check events logged
root@pve:~# ras-mc-ctl --summary
No Memory errors.
No PCIe AER errors.
No Extlog errors.
No devlink errors.
No disk errors.
No MCE errors.
root@pve:~# ras-mc-ctl --status
ras-mc-ctl: drivers are loaded.
root@pve:~# ras-mc-ctl --status
ras-mc-ctl: drivers are loaded.
root@pve:~# systemctl status rasdaemon.service
● rasdaemon.service - RAS daemon to log the RAS events
Loaded: loaded (/lib/systemd/system/rasdaemon.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2023-05-10 07:29:57 EDT; 3h 36min ago
Process: 2587 ExecStartPost=/usr/sbin/rasdaemon --enable (code=exited, status=0/SUCCESS)
Main PID: 2582 (rasdaemon)
Tasks: 1 (limit: 154393)
Memory: 15.2M
CPU: 33ms
CGroup: /system.slice/rasdaemon.service
└─2582 /usr/sbin/rasdaemon -f -r
May 10 07:29:57 pve rasdaemon[2582]: Enabled event ras:arm_event
May 10 07:29:57 pve rasdaemon[2582]: mce:mce_record event enabled
May 10 07:29:57 pve rasdaemon[2582]: Enabled event mce:mce_record
May 10 07:29:57 pve rasdaemon[2582]: ras:extlog_mem_event event enabled
May 10 07:29:57 pve rasdaemon[2582]: Enabled event ras:extlog_mem_event
May 10 07:29:57 pve rasdaemon[2582]: rasdaemon: Recording mc_event events
May 10 07:29:57 pve rasdaemon[2582]: rasdaemon: Recording aer_event events
May 10 07:29:57 pve rasdaemon[2582]: rasdaemon: Recording extlog_event events
May 10 07:29:57 pve rasdaemon[2582]: rasdaemon: Recording mce_record events
May 10 07:29:57 pve rasdaemon[2582]: rasdaemon: Recording arm_event events
root@pve:~# systemctl status ras-mc-ctl
● ras-mc-ctl.service - Initialize EDAC v3.0.0 Drivers For Machine Hardware
Loaded: loaded (/lib/systemd/system/ras-mc-ctl.service; enabled; vendor preset: enabled)
Active: active (exited) since Wed 2023-05-10 07:29:57 EDT; 3h 37min ago
Process: 2574 ExecStart=/usr/sbin/ras-mc-ctl --register-labels (code=exited, status=0/SUCCESS)
Main PID: 2574 (code=exited, status=0/SUCCESS)
CPU: 20ms
May 10 07:29:57 pve systemd[1]: Starting Initialize EDAC v3.0.0 Drivers For Machine Hardware...
May 10 07:29:57 pve ras-mc-ctl[2574]: ras-mc-ctl: Error: No dimm labels for ASRockRack model X570D4U-2L2T
May 10 07:29:57 pve systemd[1]: Finished Initialize EDAC v3.0.0 Drivers For Machine Hardware.
I receive the errors only while a VM is running (generally TrueNAS SCALE), and going by the dmesg timestamps above they alternate between two intervals: the second error follows the first after 31 minutes, 8 seconds; the third follows after 36 minutes, 19 seconds; the fourth after 31 minutes, 8 seconds; the fifth after 36 minutes, 19 seconds; and so on until I shut down the VM and/or the host. EDIT: @robinchrist, your problem appears to occur approximately every 5 minutes, 11 seconds. |
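For anyone who wants to check the spacing of their own events rather than eyeball it, here is a minimal sketch that prints the gap between consecutive "Machine check events logged" lines in `dmesg -T` output. The function name `mce_wallclock_gaps` is made up for this example, and it assumes each line starts with a `[Wed May 10 08:06:13 2023]`-style timestamp and that consecutive events are less than 24 hours apart.

```shell
#!/bin/sh
# Read `dmesg -T` output on stdin and print the gap between consecutive
# "Machine check events logged" lines as <minutes>m<seconds>s.
# Assumptions: the timestamp looks like "[Wed May 10 08:06:13 2023]" and
# two consecutive events are never more than 24 hours apart.
mce_wallclock_gaps() {
    grep -E 'mce: \[Hardware Error\]: Machine check events logged' |
    awk '{
        # Field 4 of "[Wed May 10 08:06:13 2023]" is the HH:MM:SS part.
        split($4, t, ":")
        now = t[1] * 3600 + t[2] * 60 + t[3]
        if (seen) {
            gap = now - prev
            if (gap < 0) gap += 86400   # the interval crossed midnight
            printf "%dm%02ds\n", int(gap / 60), gap % 60
        }
        prev = now
        seen = 1
    }'
}

# Usage (not run here): dmesg -T | mce_wallclock_gaps
```

Printing the gaps this way makes it easier to compare one machine against another, for example against the roughly 5 minutes 11 seconds interval mentioned above.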
@PastramiKing do you have any memory OC running? I did some experimental memory OC and the errors disappeared when I returned to stock, so I assume those were memory ECC errors. |
Same thing on Ubuntu Server 22.04. Syslog has ECC errors:
May 17 16:27:25 ecc kernel: [ 316.509297] mce: [Hardware Error]: Machine check events logged
But rasdaemon (v0.6.7) says "Hey, all fine, no errors."
sudo ras-mc-ctl --errors
No PCIe AER errors.
No Extlog errors.
No MCE errors.
sudo ras-mc-ctl --error-count |
I also get unrecorded MCE errors roughly every 5 minutes.
~# systemctl status rasdaemon
Jul 09 14:55:22 pverdrmain rasdaemon[3150285]: rasdaemon: Enabled event ras:extlog_mem_event
~# systemctl status ras-mc-ctl
Jul 09 14:55:23 pverdrmain systemd[1]: Starting Initialize EDAC v3.0.0 Drivers For Machine Hardware...
[Mon Jul 10 11:19:13 2023] mce: [Hardware Error]: Machine check events logged
~# ras-mc-ctl --errors
No PCIe AER errors.
No Extlog errors.
No devlink errors.
No disk errors.
No MCE errors.
Did anyone find out what is causing the errors and why rasdaemon 0.6.6 on Debian is broken? Is there a way to install rasdaemon 0.8 on Debian 11? |
@robinchrist I also have this board with these errors. |
Yes, 5 minutes 11 seconds. I have the same on an ASRock Rack X470D4U2-2T with a Ryzen 5 2600 using ECC memory. I also tried this on Debian 12 with the latest rasdaemon available there. Still the same, just as for @robinchrist. @mchehab Is there a more recent version I could try? Or perhaps some debugging options I could enable that would help? |
I think I might be on to something regarding the 5 minutes (and in our case 11 seconds) interval. I could be totally wrong though, just drawing attention to it so that more knowledgeable people can decide if it is relevant or not. |
On Debian 11, or rather Proxmox 7.4 based on Debian 11, the edac_mce_amd module is not loaded by default. I have got that module loaded now, but rasdaemon is still not recording MCE errors. |
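A quick way to check whether the module is actually loaded is to look it up in `/proc/modules`. This is only a sketch: the helper name `module_loaded` is invented here, and a driver that is compiled into the kernel (rather than built as a module) will not show up in that file at all, so a negative result is not conclusive.

```shell
#!/bin/sh
# module_loaded NAME [FILE]
# Succeeds if NAME is listed in /proc/modules (or in FILE, which exists
# only so the function can be tried against a saved copy).
# Note: drivers built into the kernel do not appear in /proc/modules.
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}

# Usage (not run here; loading a module needs root):
#   module_loaded edac_mce_amd || modprobe edac_mce_amd
```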
I wanted to report that I am also experiencing this constellation of symptoms: X570D4U-2L2T, Ryzen 9 5900X and 128GB of ECC RAM. Nothing is overclocked. Proxmox: latest, running kernel 6.2.16-8-pve. Rasdaemon is installed (and patched against the sqlite bug) but does not report any errors. My errors are slightly more intermittent, but once they start they also have a 5-minute cadence:
Eventually the machine becomes unstable and needs to reboot, but usually this takes a few days. |
Can you elaborate on what you mean by unstable? I suggest running memtest86 for 24 hours to see if it reports errors.
On 22 Aug 2023, at 06:05, DigiDr ***@***.***> wrote:
I wanted to report that I am also experiencing this constellation of symptoms: X570D4U-2L2T, Ryzen 9 5900X and 128GB of ECC RAM. Nothing is overclocked.
Proxmox: latest, running kernel 6.2.16-8-pve.
Rasdaemon is installed (and patched against the sqlite bug) but does not report any errors.
My errors are slightly more intermittent, but once they start they also have a 5-minute cadence:
[ 316.956399] mce: [Hardware Error]: Machine check events logged
[ 628.242421] mce: [Hardware Error]: Machine check events logged
[ 939.532553] mce: [Hardware Error]: Machine check events logged
[ 1250.822880] mce: [Hardware Error]: Machine check events logged
[ 1562.113834] mce: [Hardware Error]: Machine check events logged
[ 1873.404440] mce: [Hardware Error]: Machine check events logged
[ 2184.694951] mce: [Hardware Error]: Machine check events logged
[ 2495.986726] mce: [Hardware Error]: Machine check events logged
[ 2807.277286] mce: [Hardware Error]: Machine check events logged
[ 3118.567873] mce: [Hardware Error]: Machine check events logged
[ 3429.858463] mce: [Hardware Error]: Machine check events logged
[ 3741.149033] mce: [Hardware Error]: Machine check events logged
[ 4052.439293] mce: [Hardware Error]: Machine check events logged
[ 4363.729819] mce: [Hardware Error]: Machine check events logged
[ 4675.020201] mce: [Hardware Error]: Machine check events logged
[ 4986.310628] mce: [Hardware Error]: Machine check events logged
[ 5297.601098] mce: [Hardware Error]: Machine check events logged
Eventually the machine becomes unstable and needs to reboot, but usually this takes a few days.
|
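The boot-relative timestamps quoted above can be turned into intervals with a few lines of shell. A minimal sketch, assuming plain `dmesg` output with `[ seconds.microseconds]` prefixes; the function name `mce_gaps` is made up for this example.

```shell
#!/bin/sh
# Read kernel log text on stdin and print the gap, in seconds, between
# consecutive "Machine check events logged" entries.
# Works whether the entries are one per line or run together on one line.
mce_gaps() {
    grep -oE '\[ *[0-9]+\.[0-9]+\] mce: \[Hardware Error\]: Machine check events logged' |
    sed 's/^\[ *\([0-9.]*\)\].*/\1/' |
    awk 'NR > 1 { printf "%.1f\n", $1 - prev } { prev = $1 }'
}

# Usage (not run here): dmesg | mce_gaps
```

For the first two entries in the quoted log (316.956399 and 628.242421) this prints 311.3, that is, about 5 minutes 11 seconds.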
I do not believe checking memory is the next step. I have a faulty memory module that I use to trigger memory-related ECC messages, and on my setup they are reported in a separate category from MCE errors. I think this is CPU related, or perhaps a systemic issue with ASRock Rack motherboards. Anyway, I tried getting through to AMD for technical support, but that is rather difficult. I am not sure how to proceed now. Does anyone have an email address for AMD tech support? |
I don’t believe these are memory issues. My ECC ram would report these errors independently and the timing makes no sense. But we need rasdaemon to expose it in any case. As for unstable: VMs seem to hang after a few days, and this is the only link I can make to that behaviour - and the reason for investigating this. |
Given the cluster of reports with the same asrockrack boards, they should be the first line of inquiry. But we need rasdaemon to expose these errors properly. |
On Debian 12 the latest version of rasdaemon is 0.6.8, but as reported here that is still buggy on Debian with Ryzen CPUs and/or ASRock Rack motherboards. So I am trying to get rasdaemon 0.8.0.x installed on Debian 12 in an effort to shed some more light on these errors, but I am not experienced enough to pull it off. I found the official source code for 0.8.0, but it seems to come with compile and install instructions for Fedora only, and those slam into walls on Debian.
Notice the user mchehab:mchehab not being found. Looks like hard-coded user names in the source? Anyway, I am way out of my league here. Can anyone please point me in the right direction? |
Or I could give NixOS a try, installing it on a separate external USB drive. NixOS seems to be able to install any package at any version, including 0.8. Would that be worth the trouble? |
@githubDiversity I think I've managed it on Debian (Proxmox):
This left me on the latest version with all feature flags enabled. I'll let you know what I discover about these errors. --localstatedir=/var forces it to use the default location for Debian.
You can alternatively compile with some/other options from ./configure --help if you don't want the whole lot enabled. |
Alas, this still hasn't revealed the source of the mce event
|
Thank you @DigiDr for showing how to install from source on Debian. Such a bummer that it led nowhere. I also have an ASRock Rack X470D4U board with an AMD Ryzen 5 2600 Pro on it. On that setup I get no 5 min 11 seconds MCE errors. If the errors then disappear, I think I have confirmed it is the CPU and not the board. Or might that not be the correct conclusion? Anyway, here is a link to a thread with what I think is a related phenomenon. You might try loading the edac_mce_amd module and see if that enables rasdaemon 0.8 to make sense of the errors. One final thing I will try is to install Windows Server 2022 in the hope that it will uncover what is going on every 5 minutes and 11 seconds. After that I am at my wits' end and would hope @mchehab could pitch in. |
Dear all, Therefore I think the solution to your problems could be to change one or more of the obscure RAS/MCA related firmware configuration options. |
but if it shows up in the kernel logs (which it does, because otherwise we wouldn't know that rasdaemon doesn't report them), shouldn't rasdaemon be able to make sense of it? Or can it be that the machine just reports "there was some error" to the OS but no additional information about what exactly etc? Maybe some expert can jump in and help |
This is exactly what I am suspecting. |
I am fighting to get Windows installed. Damn, that is hard on bare metal these days ;( Anyway, can you guys please check the voltage level of the onboard battery? You can find that on the overview page of the IPMI interface. Mine is at 0.0V and I also noticed battery-low errors. Not sure if it is related though. |
OK, I am getting closer (giving up on installing Windows for the time being, it's too difficult, grrr). It turns out that tracing was not enabled by default on Debian 11 (Proxmox 7.3):
cat /sys/kernel/debug/tracing/events/mce/mce_record/enable
After I enabled it, I get this in the trace:
# tracer: nop
#
# entries-in-buffer/entries-written: 4/4   #P:12
#
#                                _-----=> irqs-off/BH-disabled
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth
#                              ||| / _-=> migrate-disable
#                              |||| /     delay
#           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
#              | |         |   |||||     |         |
systemctl status rasdaemon
Aug 31 20:48:20 rasdaemon[228553]: rasdaemon: Recording extlog_event events
But rasdaemon is still not recording those errors, let alone decoding them into a human-readable format. Isn't it a serious issue that, if tracing is not enabled by default, a lot of people might feel covered by rasdaemon while they are not? |
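To see at a glance which of the relevant tracepoints are switched on, the `enable` files can be read in a loop. This is a sketch, not something rasdaemon ships: `check_events` is a made-up helper, the event list is taken from the event names rasdaemon prints in the logs in this thread, and reading debugfs normally needs root. It is also an assumption on my part that the global tracing directory is the right place to look; if your rasdaemon uses its own tracing instance under `instances/`, pass that directory as the argument instead.

```shell
#!/bin/sh
# check_events [TRACING_DIR]
# Print "<subsystem>/<event> <state>" for the tracepoints rasdaemon
# reports enabling, where <state> is the content of the enable file
# (1 = enabled, 0 = disabled) or "missing" if the file cannot be read.
check_events() {
    root=${1:-/sys/kernel/debug/tracing}
    for ev in mce/mce_record ras/mc_event ras/aer_event ras/extlog_mem_event ras/arm_event; do
        f="$root/events/$ev/enable"
        if [ -r "$f" ]; then
            printf '%s %s\n' "$ev" "$(cat "$f")"
        else
            printf '%s missing\n' "$ev"
        fi
    done
}

# Usage (not run here, needs root): check_events
# Enabling one event by hand, as described above:
#   echo 1 > /sys/kernel/debug/tracing/events/mce/mce_record/enable
```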
Thanks to @DigiDr's compile instructions I am now running rasdaemon 0.8. Tracing is enabled and the trace is being populated, but rasdaemon is still not playing ball:
No PCIe AER errors.
No ARM processor errors.
No Extlog errors.
No devlink errors.
No disk errors.
No Memory failure errors.
No MCE errors.
I think now is the time to escalate to @mchehab. rasdaemon seems to have been seriously broken for several versions already. |
rasdaemon silently crashes with a segmentation fault after a few hours running in the foreground:
Segmentation fault
but the rasdaemon sqlite.db remains unchanged.
/usr/local/sbin/ras-mc-ctl --status
/usr/local/sbin/ras-mc-ctl --errors
No PCIe AER errors.
No ARM processor errors.
No Extlog errors.
No devlink errors.
No disk errors.
No Memory failure errors.
No MCE errors.
From the output of running rasdaemon in the foreground I've noticed that rasdaemon tries listening on, and opening files for, CPUs that do not exist. I have the Ryzen 5 2600 with 6 cores / 12 threads. Not sure why it tries to do things with CPUs up to 31. |
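One thing that might be worth comparing here is how many CPUs the kernel calls "possible" versus how many are actually online, since the first number can be larger than the second. Whether that is the number rasdaemon iterates over is only a guess on my part. A small sketch that counts the CPUs in a kernel range list such as `0-31` or `0-3,8-11`; the helper name `count_cpus` is made up.

```shell
#!/bin/sh
# count_cpus LIST
# Count the CPUs in a kernel-style range list, e.g. "0-11" or "0-3,8-11",
# as found in /sys/devices/system/cpu/possible and .../online.
count_cpus() {
    printf '%s\n' "$1" | tr ',' '\n' |
    awk -F- 'NF { n += (NF == 2 ? $2 - $1 + 1 : 1) } END { print n + 0 }'
}

# Usage (not run here):
#   count_cpus "$(cat /sys/devices/system/cpu/possible)"
#   count_cpus "$(cat /sys/devices/system/cpu/online)"
```

If the first number comes out as 32 and the second as 12, that would at least match the "CPUs up to 31" seen in the foreground output.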
I no longer think this is the best place to discuss our issues: rasdaemon, although not working as expected, is no longer the relevant part. How I came to that conclusion is described below. If you want to follow my progress as I pin down the exact cause, please DM me and I will create a thread on the ASRock Rack support forums. I tried running Fedora 38 KDE Plasma from a live USB for a while and installed rasdaemon 0.8.1: the same 5 minutes 11 seconds thing, and rasdaemon still seems broken. Nothing was recorded even after I enabled tracing, and the tracing info was the same as on Debian 12. I replaced the CPU with a Ryzen 5 2600 Pro, but still got the same 5 min 11 seconds MCE errors. So now I will start removing PCI-connected devices to see if that changes things, but I will not share that progress here as it is not related to rasdaemon. |
I am not well versed in this site. Anyway I have gone and opened a support ticket at asrockrack. Since these are server products the asrock forum is not the place to ask questions |
As per usual Asrock Rack tech support is interested to help out and we are now in the process of digging down. But I really think that as soon as linux systems uses rasdaemon then it should work out of the box or at least tell the admin why it will probably not work. |
That's interesting - what response did you get from tech support?
On 5 Sep 2023, at 19:13, githubDiversity ***@***.***> wrote:
As per usual Asrock Rack tech support is interested to help out and we are now in the process of digging down.
But I really think that as soon as linux systems uses rasdaemon then it should work out of the box or at least tell the admin why it will probably not work.
@mchehab I really think we need your involvement right now. this is no longer just an isolated case.
|
Like I stated earlier, I do not believe this is the correct place to discuss mobo-related issues. All I hope for is that the issue of rasdaemon seemingly having been broken for a long time gets the attention it needs, as many people run it. |
any news on the rasdaemon side? I mean this issue is still open without (as far as I can tell) a response from the maintainers. If that is correct then I am really worried about the state of open source operating systems. I mean most of them have this piece of software as the default. |
ASRock Netherlands asked me to jump through a certain number of hoops before they would relay our board (not rasdaemon) issue to the engineers in Taiwan. I jumped through all of them, and then ASRock Netherlands told me to simply send the board back to the reseller. Well, that is asking a lot after I have rendered 2 of my CPUs useless by upgrading the BIOS as they asked. I have opened a post on the ASRock forum. I am sure people there will say that this is the ASRock forum and not ASRock Rack, blah blah. That is why I have not yet mentioned the actual board type. https://forum.asrock.com/forum_posts.asp?TID=26816&PID=107896 |
What is the status here, if I may ask? Are your board issues resolved? Is rasdaemon by now able to report as expected? |
Different company, of course they will.
|
As of kernel 6.5 I am getting reports in dmesg WITHOUT any additional configuration of rasdaemon |
I had this issue bookmarked when I was fixing the ECC reporting on my consumer AM5 B650 motherboard, and after I ran:
, ras started reporting it. If the services are started and there is still no ECC reporting, then I'd expect it to be a mobo/BIOS issue. If you still have issues, then I'd probably just sell the mobo, get a SUPERMICRO one and move on. Hardware: ASUS TUF GAMING B650-PLUS, 7800X3D and 2xKSM48E40BD8KM-32HM. |
Yeah, Supermicro is starting to sound more palatable. But let's focus on rasdaemon. I think I have some new info. It seems that the watchdog timer on our boards is bonkers. |
Hi,
I'm using rasdaemon v0.6.8 (from Debian, https://packages.debian.org/de/bookworm/rasdaemon) on kernel 5.15 (Proxmox 7.4, 5.15.102-1-pve) and ASRock X570D4U-2L2T + AMD Ryzen 5950X.
I do get some MCEs in the kernel log:
but ras-mc-ctl doesn't report anything:
Everything seems to be running fine:
Any ideas on how this could be debugged?