rasdaemon does not log MCE #95

Open
robinchrist opened this issue Apr 24, 2023 · 37 comments

@robinchrist

Hi,

I'm using rasdaemon v0.6.8 (From Debian, https://packages.debian.org/de/bookworm/rasdaemon) on Kernel 5.15 (Proxmox 7.4, 5.15.102-1-pve) and ASRock X570D4U-2L2T + AMD Ryzen 5950X.

I do get some MCEs in the kernel log:

root@pve:~# dmesg | grep -i mce
[    0.644337] mce: [Hardware Error]: Machine check events logged
[    0.644338] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 17: dc2040000000011b
[    0.644342] mce: [Hardware Error]: TSC 0 ADDR a8eb3fc80 MISC d01202dd01000000 SYND 88e00040a800200 IPID 9600050f00 
[    0.644345] mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1682293811 SOCKET 0 APIC 0 microcode a201009
[    4.768515] MCE: In-kernel MCE decoding enabled.
[  310.396113] mce: [Hardware Error]: Machine check events logged
[  316.656894] mce: [Hardware Error]: Machine check events logged
[  627.947258] mce: [Hardware Error]: Machine check events logged
[  939.240972] mce: [Hardware Error]: Machine check events logged
[ 1250.534814] mce: [Hardware Error]: Machine check events logged
[ 1561.828702] mce: [Hardware Error]: Machine check events logged
[ 1873.122720] mce: [Hardware Error]: Machine check events logged

but ras-mc-ctl doesn't report anything:

root@pve:~# ras-mc-ctl --summary
No Memory errors.

No PCIe AER errors.

No Extlog errors.

No MCE errors.

Everything seems to be running fine:

root@pve:~# ras-mc-ctl --status
ras-mc-ctl: drivers are loaded.
root@pve:~# systemctl status rasdaemon.service 
● rasdaemon.service - RAS daemon to log the RAS events
     Loaded: loaded (/lib/systemd/system/rasdaemon.service; enabled; vendor preset: enabled)
     Active: active (running) since Mon 2023-04-24 01:50:17 CEST; 32min ago
    Process: 1013 ExecStartPost=/usr/sbin/rasdaemon --enable (code=exited, status=0/SUCCESS)
   Main PID: 1012 (rasdaemon)
      Tasks: 1 (limit: 154399)
     Memory: 15.3M
        CPU: 24ms
     CGroup: /system.slice/rasdaemon.service
             └─1012 /usr/sbin/rasdaemon -f -r

Apr 24 01:50:17 pve rasdaemon[1012]: Enabled event mce:mce_record
Apr 24 01:50:17 pve rasdaemon[1012]: rasdaemon: ras:extlog_mem_event event enabled
Apr 24 01:50:17 pve rasdaemon[1012]: rasdaemon: Enabled event ras:extlog_mem_event
Apr 24 01:50:17 pve rasdaemon[1012]: ras:extlog_mem_event event enabled
Apr 24 01:50:17 pve rasdaemon[1012]: Enabled event ras:extlog_mem_event
Apr 24 01:50:17 pve rasdaemon[1012]: rasdaemon: Listening to events for cpus 0 to 31
Apr 24 01:50:17 pve rasdaemon[1012]: rasdaemon: Recording mc_event events
Apr 24 01:50:17 pve rasdaemon[1012]: rasdaemon: Recording aer_event events
Apr 24 01:50:17 pve rasdaemon[1012]: rasdaemon: Recording extlog_event events
Apr 24 01:50:17 pve rasdaemon[1012]: rasdaemon: Recording mce_record events

root@pve:~# systemctl status ras
rasdaemon.service   ras-mc-ctl.service  
root@pve:~# systemctl status ras-mc-ctl.service 
● ras-mc-ctl.service - Initialize EDAC v3.0.0 Drivers For Machine Hardware
     Loaded: loaded (/lib/systemd/system/ras-mc-ctl.service; enabled; vendor preset: enabled)
     Active: active (exited) since Mon 2023-04-24 01:50:17 CEST; 33min ago
    Process: 1011 ExecStart=/usr/sbin/ras-mc-ctl --register-labels (code=exited, status=0/SUCCESS)
   Main PID: 1011 (code=exited, status=0/SUCCESS)
        CPU: 21ms

Apr 24 01:50:17 pve systemd[1]: Starting Initialize EDAC v3.0.0 Drivers For Machine Hardware...
Apr 24 01:50:17 pve ras-mc-ctl[1011]: ras-mc-ctl: Error: No dimm labels for ASRockRack model X570D4U-2L2T
Apr 24 01:50:17 pve systemd[1]: Finished Initialize EDAC v3.0.0 Drivers For Machine Hardware

Any ideas on how this could be debugged?

@PastramiKing

PastramiKing commented May 10, 2023

I am having a similar problem with similar hardware/software: rasdaemon v0.6.6 (From Debian, deb http://ftp.us.debian.org/debian bullseye) on Kernel 6.2 (Proxmox 7.4, 6.2.11-1-pve); ASRock X570D4U-2L2T; and AMD Ryzen 5950X.

root@pve:~# dmesg -T | grep -i mce
[Wed May 10 07:29:55 2023] MCE: In-kernel MCE decoding enabled.
[Wed May 10 08:06:13 2023] mce: [Hardware Error]: Machine check events logged
[Wed May 10 08:37:21 2023] mce: [Hardware Error]: Machine check events logged
[Wed May 10 09:13:40 2023] mce: [Hardware Error]: Machine check events logged
[Wed May 10 09:44:48 2023] mce: [Hardware Error]: Machine check events logged
[Wed May 10 10:21:07 2023] mce: [Hardware Error]: Machine check events logged
[Wed May 10 10:52:15 2023] mce: [Hardware Error]: Machine check events logged
root@pve:~# ras-mc-ctl --summary
No Memory errors.

No PCIe AER errors.

No Extlog errors.

No devlink errors.

No disk errors.

No MCE errors.
root@pve:~# ras-mc-ctl --status
ras-mc-ctl: drivers are loaded.
root@pve:~# ras-mc-ctl --status
ras-mc-ctl: drivers are loaded.
root@pve:~# systemctl status rasdaemon.service 
● rasdaemon.service - RAS daemon to log the RAS events
     Loaded: loaded (/lib/systemd/system/rasdaemon.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2023-05-10 07:29:57 EDT; 3h 36min ago
    Process: 2587 ExecStartPost=/usr/sbin/rasdaemon --enable (code=exited, status=0/SUCCESS)
   Main PID: 2582 (rasdaemon)
      Tasks: 1 (limit: 154393)
     Memory: 15.2M
        CPU: 33ms
     CGroup: /system.slice/rasdaemon.service
             └─2582 /usr/sbin/rasdaemon -f -r

May 10 07:29:57 pve rasdaemon[2582]: Enabled event ras:arm_event
May 10 07:29:57 pve rasdaemon[2582]: mce:mce_record event enabled
May 10 07:29:57 pve rasdaemon[2582]: Enabled event mce:mce_record
May 10 07:29:57 pve rasdaemon[2582]: ras:extlog_mem_event event enabled
May 10 07:29:57 pve rasdaemon[2582]: Enabled event ras:extlog_mem_event
May 10 07:29:57 pve rasdaemon[2582]: rasdaemon: Recording mc_event events
May 10 07:29:57 pve rasdaemon[2582]: rasdaemon: Recording aer_event events
May 10 07:29:57 pve rasdaemon[2582]: rasdaemon: Recording extlog_event events
May 10 07:29:57 pve rasdaemon[2582]: rasdaemon: Recording mce_record events
May 10 07:29:57 pve rasdaemon[2582]: rasdaemon: Recording arm_event events
root@pve:~# systemctl status ras-mc-ctl
● ras-mc-ctl.service - Initialize EDAC v3.0.0 Drivers For Machine Hardware
     Loaded: loaded (/lib/systemd/system/ras-mc-ctl.service; enabled; vendor preset: enabled)
     Active: active (exited) since Wed 2023-05-10 07:29:57 EDT; 3h 37min ago
    Process: 2574 ExecStart=/usr/sbin/ras-mc-ctl --register-labels (code=exited, status=0/SUCCESS)
   Main PID: 2574 (code=exited, status=0/SUCCESS)
        CPU: 20ms

May 10 07:29:57 pve systemd[1]: Starting Initialize EDAC v3.0.0 Drivers For Machine Hardware...
May 10 07:29:57 pve ras-mc-ctl[2574]: ras-mc-ctl: Error: No dimm labels for ASRockRack model X570D4U-2L2T
May 10 07:29:57 pve systemd[1]: Finished Initialize EDAC v3.0.0 Drivers For Machine Hardware.

I receive the errors only if I am running a VM (generally TrueNAS SCALE), and they occur as follows: the first error occurs after 31 minutes, 19 seconds; the second after 36 minutes, 8 seconds; the third after 31 minutes, 19 seconds; the fourth after 36 minutes, 8 seconds; and so on until I shut down the VM and/or the host.

EDIT: @robinchrist, your problem appears to occur approximately every 5 minutes, 11 seconds.
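
The cadence can be checked directly from the kernel timestamps. A minimal sketch, assuming the timestamps are copied by hand from the dmesg output in the first post; `gaps` is a hypothetical helper name and not part of rasdaemon:

```shell
#!/bin/sh
# gaps: read one timestamp (in seconds) per line on stdin and print the
# difference between each consecutive pair. Hypothetical helper.
gaps() {
    awk 'NR > 1 { printf "%.2f\n", $1 - prev } { prev = $1 }'
}

# Timestamps taken from the dmesg output quoted in the first post.
printf '%s\n' 316.656894 627.947258 939.240972 1250.534814 | gaps
```

Each gap comes out a little over 311 seconds, which matches the "5 minutes, 11 seconds" observation.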

@robinchrist
Author

@PastramiKing do you have any memory OC running?

I did some experimental memory OC and the errors disappeared when I returned to stock, so I assume those were memory ECC errors.

@Nuke79

Nuke79 commented May 17, 2023

Same thing on Ubuntu server 22.04.
Linux ecc 5.15.0-72-generic #79-Ubuntu SMP Wed Apr 19 08:22:18 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Gigabyte B550M DS3H + Ryzen 3 PRO 3200G + Samsung 3200 MHz 16 GB ECC (undervolted, with tightened timings, to provoke ECC errors).

Syslog has ECC errors:

May 17 16:27:25 ecc kernel: [ 316.509297] mce: [Hardware Error]: Machine check events logged
May 17 16:34:28 ecc kernel: [ 316.503731] mce: [Hardware Error]: Machine check events logged
May 17 16:42:36 ecc kernel: [ 316.502289] mce: [Hardware Error]: Machine check events logged
May 17 16:47:47 ecc kernel: [ 627.798510] mce: [Hardware Error]: Machine check events logged
May 17 16:52:58 ecc kernel: [ 939.094438] mce: [Hardware Error]: Machine check events logged
May 17 16:58:10 ecc kernel: [ 1250.390441] mce: [Hardware Error]: Machine check events logged
May 17 17:03:21 ecc kernel: [ 1561.686645] mce: [Hardware Error]: Machine check events logged
May 17 17:13:44 ecc kernel: [ 2184.278482] mce: [Hardware Error]: Machine check events logged

But rasdaemon (v0.6.7) says "Hey, all fine, no errors."

sudo ras-mc-ctl --errors
No Memory errors.

No PCIe AER errors.

No Extlog errors.

No MCE errors.

sudo ras-mc-ctl --error-count
Label CE UE
DIMM_B2 0 0

@githubDiversity

I also get unrecorded MCE errors roughly every 5 minutes.
Running Proxmox 7.4-15.

~# systemctl status rasdaemon
● rasdaemon.service - RAS daemon to log the RAS events
Loaded: loaded (/lib/systemd/system/rasdaemon.service; enabled; vendor preset: enabled)
Active: active (running) since Sun 2023-07-09 14:55:22 CEST; 20h ago
Main PID: 3150285 (rasdaemon)
Tasks: 1 (limit: 38336)
Memory: 592.0K
CPU: 7ms
CGroup: /system.slice/rasdaemon.service
└─3150285 /usr/sbin/rasdaemon -f -r

Jul 09 14:55:22 pverdrmain rasdaemon[3150285]: rasdaemon: Enabled event ras:extlog_mem_event
Jul 09 14:55:22 pverdrmain rasdaemon[3150285]: Enabled event mce:mce_record
Jul 09 14:55:22 pverdrmain rasdaemon[3150285]: rasdaemon: Listening to events for cpus 0 to 11
Jul 09 14:55:22 pverdrmain rasdaemon[3150285]: ras:extlog_mem_event event enabled
Jul 09 14:55:22 pverdrmain rasdaemon[3150285]: Enabled event ras:extlog_mem_event
Jul 09 14:55:22 pverdrmain rasdaemon[3150285]: rasdaemon: Recording mc_event events
Jul 09 14:55:23 pverdrmain rasdaemon[3150285]: rasdaemon: Recording aer_event events
Jul 09 14:55:23 pverdrmain rasdaemon[3150285]: rasdaemon: Recording extlog_event events
Jul 09 14:55:24 pverdrmain rasdaemon[3150285]: rasdaemon: Recording mce_record events
Jul 09 14:55:24 pverdrmain rasdaemon[3150285]: rasdaemon: Recording arm_event events

~# systemctl status ras-mc-ctl
● ras-mc-ctl.service - Initialize EDAC v3.0.0 Drivers For Machine Hardware
Loaded: loaded (/lib/systemd/system/ras-mc-ctl.service; enabled; vendor preset: enabled)
Active: active (exited) since Sun 2023-07-09 14:55:23 CEST; 20h ago
Main PID: 3150354 (code=exited, status=0/SUCCESS)
Tasks: 0 (limit: 38336)
Memory: 0B
CPU: 0
CGroup: /system.slice/ras-mc-ctl.service

Jul 09 14:55:23 pverdrmain systemd[1]: Starting Initialize EDAC v3.0.0 Drivers For Machine Hardware...
Jul 09 14:55:23 pverdrmain ras-mc-ctl[3150354]: ras-mc-ctl: Error: No dimm labels for ASRockRack model X470D4U2-2T
Jul 09 14:55:23 pverdrmain systemd[1]: Finished Initialize EDAC v3.0.0 Drivers For Machine Hardware.

[Mon Jul 10 11:19:13 2023] mce: [Hardware Error]: Machine check events logged
[Mon Jul 10 11:24:24 2023] mce: [Hardware Error]: Machine check events logged
[Mon Jul 10 11:29:36 2023] mce: [Hardware Error]: Machine check events logged
[Mon Jul 10 11:34:47 2023] mce: [Hardware Error]: Machine check events logged
[Mon Jul 10 11:39:58 2023] mce: [Hardware Error]: Machine check events logged
[Mon Jul 10 11:45:09 2023] mce: [Hardware Error]: Machine check events logged

~# ras-mc-ctl --errors
No Memory errors.

No PCIe AER errors.

No Extlog errors.

No devlink errors.

No disk errors.

No MCE errors.

Did anyone find out what is causing the errors and why rasdaemon 0.6.6 on Debian is broken?

Is there a way to install rasdaemon 0.8 on Debian 11?
I tried, but I ran into package dependency hell and backed down.

@voltagex

@robinchrist I also have this board with these errors.

@githubDiversity

Yes, 5 minutes 11 seconds. I see the same on an ASRock Rack X470D4U2-2T with a Ryzen 5 2600 using ECC memory.

I also tried this on Debian 12 with the latest rasdaemon available there. Still the same, just as @robinchrist reported.

@mchehab Is there a more recent version I could try? Or are there debugging options I could enable that would help?

@githubDiversity

I think I might be on to something regarding the 5 minutes (and in our case 11 seconds) interval.
https://www.kernel.org/doc/Documentation/x86/x86_64/machinecheck
--excerpt--
check_interval
How often to poll for corrected machine check errors, in seconds
(Note output is hexadecimal). Default 5 minutes.
--end excerpt--

I could be totally wrong though, just drawing attention to it so that more knowledgeable people can decide if it is relevant or not.
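
To compare the configured polling interval with the observed cadence, the value can be read from sysfs and converted from hexadecimal. A rough sketch, assuming the sysfs path described in the kernel document linked above and that `machinecheck0` is representative of all CPUs; `hex_to_seconds` is a hypothetical helper name:

```shell
#!/bin/sh
# hex_to_seconds: convert a hexadecimal value (with or without a 0x prefix)
# to decimal. Hypothetical helper.
hex_to_seconds() {
    printf '%d\n' "0x${1#0x}"
}

# Path assumed from the kernel machinecheck documentation.
f=/sys/devices/system/machinecheck/machinecheck0/check_interval
if [ -r "$f" ]; then
    hex_to_seconds "$(cat "$f")"
fi

# 0x12c is 300 seconds, the documented 5-minute default.
hex_to_seconds 12c
```

If the file reads `12c`, the poll interval is still the 300-second default. That is close to, but not the same as, the 311 seconds seen in the logs.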

@githubDiversity

On Debian 11, or rather Proxmox 7.4 (which is based on Debian 11), the edac_mce_amd module is not loaded by default,
and that module seems to be needed for MCE errors to be decoded on AMD CPUs.

I have that module loaded now, but rasdaemon is still not recording MCE errors.
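
For anyone who wants to check the same thing, here is a small sketch that looks for the module in `/proc/modules`. `module_in_list` is a hypothetical helper name. A driver compiled into the kernel would not appear in that file, so "not listed" does not necessarily mean the decoder is absent:

```shell
#!/bin/sh
# module_in_list: succeed if the named module appears in /proc/modules-style
# input on stdin (module name in the first column). Hypothetical helper.
module_in_list() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}

if [ -r /proc/modules ]; then
    if module_in_list edac_mce_amd < /proc/modules; then
        echo "edac_mce_amd is loaded"
    else
        echo "edac_mce_amd is not listed; as root, try: modprobe edac_mce_amd"
    fi
fi
```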

@DigiDr

DigiDr commented Aug 21, 2023

I wanted to report that I am also experiencing this constellation of symptoms: X570D4U-2L2T, Ryzen 9 5900X and 128 GB of ECC RAM. Nothing is overclocked.

Proxmox: latest, running kernel 6.2.16-8-pve.

Rasdaemon is installed (and patched against the sqlite bug) but does not report any errors.

My errors are slightly more intermittent, but once they start they also have a 5-minute cadence:

[ 316.956399] mce: [Hardware Error]: Machine check events logged
[ 628.242421] mce: [Hardware Error]: Machine check events logged
[ 939.532553] mce: [Hardware Error]: Machine check events logged
[ 1250.822880] mce: [Hardware Error]: Machine check events logged
[ 1562.113834] mce: [Hardware Error]: Machine check events logged
[ 1873.404440] mce: [Hardware Error]: Machine check events logged
[ 2184.694951] mce: [Hardware Error]: Machine check events logged
[ 2495.986726] mce: [Hardware Error]: Machine check events logged
[ 2807.277286] mce: [Hardware Error]: Machine check events logged
[ 3118.567873] mce: [Hardware Error]: Machine check events logged
[ 3429.858463] mce: [Hardware Error]: Machine check events logged
[ 3741.149033] mce: [Hardware Error]: Machine check events logged
[ 4052.439293] mce: [Hardware Error]: Machine check events logged
[ 4363.729819] mce: [Hardware Error]: Machine check events logged
[ 4675.020201] mce: [Hardware Error]: Machine check events logged
[ 4986.310628] mce: [Hardware Error]: Machine check events logged
[ 5297.601098] mce: [Hardware Error]: Machine check events logged

Eventually the machine becomes unstable and needs to reboot, but usually this takes a few days.

@voltagex

voltagex commented Aug 21, 2023 via email

@githubDiversity

I do not believe checking memory is the next step.

I have a faulty memory module that I use to trigger memory-related ECC messages, and on my setup those are reported in a separate category from MCE errors.

I think this is CPU related or perhaps a systemic issue with ASRock Rack motherboards.

Anyway, I tried getting through to AMD for technical support, but that is rather difficult.
The AMD community website was also unable to help with my specific inquiry about these MCE errors that recur every 5 minutes 11 seconds without providing any details.

I am not sure how to proceed now. Does anyone have an email address for AMD tech support?

@DigiDr

DigiDr commented Aug 22, 2023

Can you elaborate on what you mean by unstable? I suggest running memtest86 for 24 hours and see if it reports errors.

I don’t believe these are memory issues. My ECC ram would report these errors independently and the timing makes no sense. But we need rasdaemon to expose it in any case. As for unstable: VMs seem to hang after a few days, and this is the only link I can make to that behaviour - and the reason for investigating this.

@DigiDr

DigiDr commented Aug 22, 2023

I do not believe checking memory is the next step.

I have a faulty memory module that I use to trigger memory-related ECC messages, and on my setup those are reported in a separate category from MCE errors.

I think this is CPU related or perhaps a systemic issue with ASRock Rack motherboards.

Anyway, I tried getting through to AMD for technical support, but that is rather difficult. The AMD community website was also unable to help with my specific inquiry about these MCE errors that recur every 5 minutes 11 seconds without providing any details.

I am not sure how to proceed now. Does anyone have an email address for AMD tech support?

Given the cluster of reports with the same ASRock Rack boards, ASRock Rack should be the first line of inquiry. But we need rasdaemon to expose these errors properly.

@githubDiversity

On Debian 12 the latest packaged version of rasdaemon is 0.6.8, but as reported here it still fails to record these errors on Debian with Ryzen CPUs and/or ASRock Rack motherboards.

So I am trying to get rasdaemon 0.8.0.x installed on Debian 12 in an effort to shed some more light on these errors, but I am not experienced enough to pull it off.

Here I found the official source code for 0.8.0:
http://www.infradead.org/~mchehab/rasdaemon/

But it seems there are compile and install instructions only for Fedora, and those run into walls on Debian.
Converting the src.rpm to a .deb with the alien package results in the following error when installing the resulting rasdaemon 0.8 .deb:

apt install ./rasdaemon_0.8.0-2_amd64.deb
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Note, selecting 'rasdaemon' instead of './rasdaemon_0.8.0-2_amd64.deb'
The following packages were automatically installed and are no longer required:
  g++-10 libdbd-sqlite3-perl libdbi-perl libjim0.79 libopts25 libstdc++-10-dev libtiff5 libwebp6 pve-kernel-5.13 pve-kernel-5.13.19-6-pve pve-kernel-5.15.108-1-pve python3-distro-info telnet unattended-upgrades
Use 'apt autoremove' to remove them.
The following packages will be upgraded:
  rasdaemon
1 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
Need to get 0 B/403 kB of archives.
After this operation, 41.0 kB of additional disk space will be used.
Get:1 /root/rasdaemon/rasdaemon_0.8.0-2_amd64.deb rasdaemon amd64 0.8.0-2 [403 kB]
Reading changelogs... Done
(Reading database ... 99505 files and directories currently installed.)
Preparing to unpack .../rasdaemon_0.8.0-2_amd64.deb ...
Unpacking rasdaemon (0.8.0-2) over (0.6.8-1.1) ...
Setting up rasdaemon (0.8.0-2) ...
chown: invalid user: ‘mchehab:mchehab’
chown: invalid user: ‘mchehab:mchehab’
dpkg: error processing package rasdaemon (--configure):
 installed rasdaemon package post-installation script subprocess returned error exit status 1
Processing triggers for man-db (2.11.2-2) ...
Errors were encountered while processing:
 rasdaemon
N: Download is performed unsandboxed as root as file '/root/rasdaemon/rasdaemon_0.8.0-2_amd64.deb' couldn't be accessed by user '_apt'. - pkgAcquire::Run (13: Permission denied)
E: Sub-process /usr/bin/dpkg returned an error code (1)

Notice that the user mchehab:mchehab is not found.

Looks like hard-coded user names in the source?

Anyway, I am way out of my league here. Can anyone please point me in the right direction?

@githubDiversity

Or I could give NixOS a try, installing it on a separate external USB drive. NixOS seems to be able to install any package at any version, including 0.8.

Would that be worth the trouble?

@DigiDr

DigiDr commented Aug 25, 2023

@githubDiversity I think I've managed it on Debian (Proxmox):

rm -r /var/lib/rasdaemon/ras-mc_event.db (we need the install to recreate this later with the right tables)

apt-get install make gcc autoconf automake libtool libevent-dev tar libsqlite3-dev libdbd-sqlite3-perl  libtraceevent-dev libtraceevent pkg-config


cd ~/
wget https://www.infradead.org/~mchehab/rasdaemon/rasdaemon-0.8.0.tar.bz2
tar -xvf rasdaemon-0.8.0.tar.bz2
cd rasdaemon-0.8.0

autoreconf -vfi
./configure  --enable-all --localstatedir=/var
make
make install

This left me on the latest version with all feature flags enabled. I'll let you know what I discover about these errors.

--localstatedir=/var forces it to use the default location for Debian.

compile time options summary
============================

    Sqlite3             : yes
    AER                 : yes
    MCE                 : yes
    EXTLOG              : yes
    CPER non-standard   : yes
    ABRT report         : yes
    HISI Kunpeng errors : yes
    ARM events          : yes
    DEVLINK             : yes
    Disk I/O errors     : yes
    Memory Failure      : yes
    Memory CE PFA       : yes
    AMP RAS errors      : yes
    CPU fault isolation : yes

Alternatively, you can compile with a different set of options from ./configure --help if you don't want the whole lot enabled.

@DigiDr

DigiDr commented Aug 25, 2023

Alas, this still hasn't revealed the source of the MCE events:

No Memory errors.

No PCIe AER errors.

No ARM processor errors.

No Extlog errors.

No devlink errors.

No disk errors.

No Memory failure errors.

No MCE errors.

@githubDiversity

githubDiversity commented Aug 26, 2023

Thank you @DigiDr for showing how to install from source on Debian.

Such a bummer that it led nowhere.

I also have an ASRock Rack X470D4U board with an AMD Ryzen 5 2600 Pro on it. On that setup I get no MCE errors at the 5 minute 11 second interval.
Once I am back from holiday I will swap the Ryzen 5 2600 with the Ryzen 5 2600 Pro on this X470D4U2-2T board.

If the errors then disappear, I think I will have confirmed it is the CPU and not the board. Or might that not be the correct conclusion?

Anyway, here is a link to a thread with what I think is a related phenomenon.
https://forum.level1techs.com/t/mce-corrected-errors/175366
It also mentions Red Hat telling customers with AMD CPUs to take additional steps such as loading the edac_mce_amd module I mentioned a few posts earlier.

You might try loading the edac_mce_amd module to see whether that enables rasdaemon 0.8 to make sense of the errors.
I tried that with rasdaemon 0.6.6 earlier, but it did not change anything. I am not even sure whether this edac_mce_amd module is relevant.

One final thing I will try is installing Windows Server 2022, in the hope that it will uncover what is going on every 5 minutes and 11 seconds.

After that I am at my wits' end and would hope @mchehab could pitch in.

@TiborGY

TiborGY commented Aug 26, 2023

Dear all,
I think the problem is going to be related to some settings in the BIOS/UEFI. AMD firmware has a lot of barely documented options related to MCE handling. For example, on some desktop AM4 boards, with some firmware versions, one needs to set Platform First Error Handling to disabled, otherwise ECC errors will not show up in the kernel logs. But beyond that there are a lot more options with no real documentation, like MCA error thresholding, and many more.

Therefore I think the solution to your problems could be to change one or more of the obscure RAS/MCA related firmware configuration options.

@robinchrist
Author

otherwise ECC errors will not show up in the kernel logs. But beyond that there are a lot more options with no real documentation, like MCA error thresholding, and many more.

But if it shows up in the kernel logs (which it does, because otherwise we wouldn't know that rasdaemon doesn't report them), shouldn't rasdaemon be able to make sense of it?

Or can it be that the machine just reports "there was some error" to the OS but no additional information about what exactly etc?

Maybe some expert can jump in and help.

@TiborGY

TiborGY commented Aug 26, 2023

Or can it be that the machine just reports "there was some error" to the OS but no additional information about what exactly etc?

This is exactly what I am suspecting.

@githubDiversity

I am fighting to get Windows installed. Damn, that is hard on bare metal these days ;( Anyway, can you guys please check the voltage level of the onboard battery? You can find it on the overview page of the IPMI interface.

Mine is at 0.0 V and I also noticed battery low errors. Not sure if it is related, though.

@githubDiversity

githubDiversity commented Aug 31, 2023

OK, I am getting closer (giving up on installing Windows for the time being; it's too difficult, grrr).

So it turns out that tracing was not enabled by default on Debian 11 (Proxmox 7.3).
Upgrading to Debian 12 (Proxmox 8.4) does not change anything.

cat /sys/kernel/debug/tracing/events/mce/mce_record/enable
0
should be
1
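
A minimal sketch of how the tracepoint could be switched on from a root shell. The path is the one shown above; `enable_tracepoint` is a hypothetical helper name, and debugfs is assumed to be mounted at /sys/kernel/debug:

```shell
#!/bin/sh
# enable_tracepoint: write 1 to a tracepoint's "enable" file unless it already
# reads 1, then print the resulting value. Hypothetical helper; needs root
# when pointed at the real tracing path.
enable_tracepoint() {
    if [ "$(cat "$1")" != "1" ]; then
        echo 1 > "$1"
    fi
    cat "$1"
}

ev=/sys/kernel/debug/tracing/events/mce/mce_record/enable
if [ -w "$ev" ]; then
    enable_tracepoint "$ev"
fi
```

The setting is not persistent, so it would need to be reapplied after a reboot.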

After I enabled it, I get this in the trace:

cat /sys/kernel/debug/tracing/trace
# tracer: nop
#
# entries-in-buffer/entries-written: 4/4   #P:12
#
#                                _-----=> irqs-off/BH-disabled
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth
#                              ||| / _-=> migrate-disable
#                              |||| /     delay
#           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
#              | |         |   |||||     |         |
 kworker/0:2-89      [000] .....  6853.046146: mce_record: CPU: 0, MCGc/s: 117/0, MC15: dc2040000000011b, IPID: 0000009600050f00, ADDR/MISC/SYND: 00000003c0b42f00/d01b0fff01000000/000002630a400a02, RIP: 00:<0000000000000000>, TSC: 0, PROCESSOR: 2:800f82, TIME: 1693507232, SOCKET: 0, APIC: 0
 kworker/0:2-89      [000] .....  7164.338881: mce_record: CPU: 0, MCGc/s: 117/0, MC15: dc2040000000011b, IPID: 0000009600050f00, ADDR/MISC/SYND: 00000003c0b42f00/d01b0fff01000000/000002630a400a02, RIP: 00:<0000000000000000>, TSC: 0, PROCESSOR: 2:800f82, TIME: 1693507544, SOCKET: 0, APIC: 0
 kworker/0:2-89      [000] .....  7475.635616: mce_record: CPU: 0, MCGc/s: 117/0, MC15: dc2040000000011b, IPID: 0000009600050f00, ADDR/MISC/SYND: 00000003c0b42f00/d01b0fff01000000/000002630a400a02, RIP: 00:<0000000000000000>, TSC: 0, PROCESSOR: 2:800f82, TIME: 1693507855, SOCKET: 0, APIC: 0
 kworker/0:2-89      [000] .....  7786.924353: mce_record: CPU: 0, MCGc/s: 117/0, MC15: dc2040000000011b, IPID: 0000009600050f00, ADDR/MISC/SYND: 00000003c0b42f00/d01b0fff01000000/000002630a400a02, RIP: 00:<0000000000000000>, TSC: 0, PROCESSOR: 2:800f82, TIME: 1693508166, SOCKET: 0, APIC: 0

systemctl status rasdaemon
● rasdaemon.service - RAS daemon to log the RAS events
Loaded: loaded (/lib/systemd/system/rasdaemon.service; enabled; preset: enabled)
Active: active (running) since Thu 2023-08-31 20:48:20 CEST; 8min ago
Main PID: 228553 (rasdaemon)
Tasks: 26 (limit: 38328)
Memory: 6.4M
CPU: 144ms
CGroup: /system.slice/rasdaemon.service
└─228553 /usr/sbin/rasdaemon -f -r

Aug 31 20:48:20 rasdaemon[228553]: rasdaemon: Recording extlog_event events
Aug 31 20:48:20 rasdaemon[228553]: rasdaemon: Recording extlog_event events
Aug 31 20:48:20 rasdaemon[228553]: rasdaemon: Recording mce_record events
Aug 31 20:48:20 rasdaemon[228553]: rasdaemon: read
Aug 31 20:48:20 rasdaemon[228553]: rasdaemon: Recording mce_record events
Aug 31 20:48:20 rasdaemon[228553]: rasdaemon: read
Aug 31 20:48:20 rasdaemon[228553]: rasdaemon: Recording mce_record events
Aug 31 20:48:20 rasdaemon[228553]: rasdaemon: read
Aug 31 20:48:20 rasdaemon[228553]: rasdaemon: Recording mce_record events
Aug 31 20:50:55 rasdaemon[228553]: rasdaemon: mce_record store: 0x7fe03c022c88

But still rasdaemon is not recording those errors, let alone decoding them into human readable format.

Isn't it a serious issue that, if tracing is not enabled by default, a lot of people might feel covered by rasdaemon while they are not?
I wonder if the Proxmox team is aware of this, or perhaps this is something that is better configured if one uses the enterprise subscription.
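
Until rasdaemon decodes these records, the status word from the trace can be picked apart by hand. This is a rough sketch only. The bit positions are assumptions based on my reading of the architectural x86 MCi_STATUS layout and the `MCI_STATUS_*` definitions in the Linux kernel, and CECC/UECC (bits 46/45) are AMD-specific. `decode_mci_status` is a hypothetical helper, not a rasdaemon tool:

```shell
#!/bin/sh
# decode_mci_status: print selected flag bits and the low 16-bit error code
# of an MCi_STATUS value given as exactly 16 hex digits (no 0x prefix).
# Bit positions are assumptions; check them against the vendor documentation.
decode_mci_status() {
    status=$1
    # Split into 32-bit halves so the arithmetic stays within the signed
    # 64-bit range in any POSIX shell.
    hi=$(( 0x$(printf '%s' "$status" | cut -c1-8) ))
    lo=$(( 0x$(printf '%s' "$status" | cut -c9-16) ))
    flags=""
    # name:bit pairs, with bit numbers relative to the full 64-bit register.
    for pair in VAL:63 OVER:62 UC:61 EN:60 MISCV:59 ADDRV:58 PCC:57 \
                SYNDV:53 CECC:46 UECC:45 DEFERRED:44 POISON:43; do
        name=${pair%%:*}
        bit=${pair##*:}
        if [ $(( (hi >> (bit - 32)) & 1 )) -eq 1 ]; then
            flags="$flags $name"
        fi
    done
    printf 'flags:%s\n' "$flags"
    printf 'error_code: 0x%04x\n' $(( lo & 0xffff ))
}

# Status value from the mce_record lines in the trace above.
decode_mci_status dc2040000000011b
```

For the value in the trace this prints `flags: VAL OVER EN MISCV ADDRV SYNDV CECC` and `error_code: 0x011b`. If the bit layout assumed here is right, that means UC is clear and CECC is set, which would point to a corrected ECC error rather than an uncorrected one.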

@githubDiversity

Thanks to @DigiDr's compile instructions I am now running rasdaemon 0.8.

Tracing is enabled and the trace is being populated.

But rasdaemon is still not playing ball:
/usr/local/sbin/ras-mc-ctl --errors
No Memory errors.

No PCIe AER errors.

No ARM processor errors.

No Extlog errors.

No devlink errors.

No disk errors.

No Memory failure errors.

No MCE errors.

I think now is the time to escalate to @mchehab. rasdaemon seems to have been seriously broken for several versions already.

@githubDiversity

githubDiversity commented Sep 1, 2023

rasdaemon silently crashes with a segmentation fault after a few hours running in the foreground.

`
rasdaemon -r -f
rasdaemon: Improper PAGE_CE_ACTION, set to default soft
rasdaemon: Page offline choice on Corrected Errors is soft
rasdaemon: Improper PAGE_CE_THRESHOLD, set to default 50.
rasdaemon: Improper PAGE_CE_REFRESH_CYCLE, set to default 24h.
rasdaemon: Threshold of memory Corrected Errors is 50 / 24h
rasdaemon: ras:mc_event event enabled
rasdaemon: Enabled event ras:mc_event
rasdaemon: ras:aer_event event enabled
rasdaemon: Enabled event ras:aer_event
rasdaemon: ras:non_standard_event event enabled
rasdaemon: Enabled event ras:non_standard_event
rasdaemon: ras:arm_event event enabled
rasdaemon: Enabled event ras:arm_event
rasdaemon: [open_sys_file]:open file: /sys/devices/system/cpu/cpu0/online failed
rasdaemon: [open_sys_file]:open file: /sys/devices/system/cpu/cpu12/online failed
rasdaemon: [open_sys_file]:open file: /sys/devices/system/cpu/cpu13/online failed
rasdaemon: [open_sys_file]:open file: /sys/devices/system/cpu/cpu14/online failed
rasdaemon: [open_sys_file]:open file: /sys/devices/system/cpu/cpu15/online failed
rasdaemon: [open_sys_file]:open file: /sys/devices/system/cpu/cpu16/online failed
rasdaemon: [open_sys_file]:open file: /sys/devices/system/cpu/cpu17/online failed
rasdaemon: [open_sys_file]:open file: /sys/devices/system/cpu/cpu18/online failed
rasdaemon: [open_sys_file]:open file: /sys/devices/system/cpu/cpu19/online failed
rasdaemon: [open_sys_file]:open file: /sys/devices/system/cpu/cpu20/online failed
rasdaemon: [open_sys_file]:open file: /sys/devices/system/cpu/cpu21/online failed
rasdaemon: [open_sys_file]:open file: /sys/devices/system/cpu/cpu22/online failed
rasdaemon: [open_sys_file]:open file: /sys/devices/system/cpu/cpu23/online failed
rasdaemon: [open_sys_file]:open file: /sys/devices/system/cpu/cpu24/online failed
rasdaemon: [open_sys_file]:open file: /sys/devices/system/cpu/cpu25/online failed
rasdaemon: [open_sys_file]:open file: /sys/devices/system/cpu/cpu26/online failed
rasdaemon: [open_sys_file]:open file: /sys/devices/system/cpu/cpu27/online failed
rasdaemon: [open_sys_file]:open file: /sys/devices/system/cpu/cpu28/online failed
rasdaemon: [open_sys_file]:open file: /sys/devices/system/cpu/cpu29/online failed
rasdaemon: [open_sys_file]:open file: /sys/devices/system/cpu/cpu30/online failed
rasdaemon: [open_sys_file]:open file: /sys/devices/system/cpu/cpu31/online failed
rasdaemon: Cpu fault isolation is disabled
rasdaemon: mce:mce_record event enabled
rasdaemon: Enabled event mce:mce_record
rasdaemon: ras:extlog_mem_event event enabled
rasdaemon: Enabled event ras:extlog_mem_event
rasdaemon: net:net_dev_xmit_timeout event enabled
rasdaemon: Enabled event net:net_dev_xmit_timeout
rasdaemon: devlink:devlink_health_report event enabled
rasdaemon: Enabled event devlink:devlink_health_report
rasdaemon: block:block_rq_error event enabled
rasdaemon: Enabled event block:block_rq_error
rasdaemon: ras:memory_failure_event event enabled
rasdaemon: Enabled event ras:memory_failure_event
rasdaemon: Listening to events for cpus 0 to 31
Calling ras_mc_event_opendb()
rasdaemon: Recording mc_event events
rasdaemon: Recording aer_event events
rasdaemon: Recording extlog_event events
rasdaemon: Recording mce_record events
rasdaemon: Recording non_standard_event events
rasdaemon: Recording arm_event events
rasdaemon: Recording devlink_event events
rasdaemon: Recording disk_errors events
rasdaemon: Recording memory_failure_event events
rasdaemon: Error on CPU 12
rasdaemon: Error on CPU 13
rasdaemon: Error on CPU 14
rasdaemon: Error on CPU 15
rasdaemon: Error on CPU 16
rasdaemon: Error on CPU 17
rasdaemon: Error on CPU 18
rasdaemon: Error on CPU 19
rasdaemon: Error on CPU 20
rasdaemon: Error on CPU 21
rasdaemon: Error on CPU 22
rasdaemon: Error on CPU 23
rasdaemon: Error on CPU 24
rasdaemon: Error on CPU 25
rasdaemon: Error on CPU 26
rasdaemon: Error on CPU 27
rasdaemon: Error on CPU 28
rasdaemon: Error on CPU 29
rasdaemon: Error on CPU 30
rasdaemon: Error on CPU 31
rasdaemon: Old kernel detected. Stop listening and fall back to pthread way.
Calling ras_mc_event_closedb()
rasdaemon: Listening to events on cpu 0
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 1
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 2
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 3
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 4
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 5
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 6
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 7
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 8
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 9
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 10
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 11
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 12
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 14
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 13
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 15
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 16
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 17
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 18
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 19
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 20
rasdaemon: Listening to events on cpu 23
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 21
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 22
Calling ras_mc_event_opendb()
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 24
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 25
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 26
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 28
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 27
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 29
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 30
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 31
Calling ras_mc_event_opendb()
rasdaemon: Recording mc_event events
rasdaemon: Recording mc_event events
rasdaemon: Recording mc_event events
rasdaemon: Recording aer_event events
rasdaemon: Recording mc_event events
rasdaemon: Recording mc_event events
rasdaemon: Recording mc_event events
rasdaemon: Recording mc_event events
rasdaemon: Recording aer_event events
rasdaemon: Recording extlog_event events
rasdaemon: Recording mc_event events
rasdaemon: Recording aer_event events
rasdaemon: Recording mc_event events
rasdaemon: Recording aer_event events
rasdaemon: Recording mc_event events
rasdaemon: Recording aer_event events
rasdaemon: Recording mc_event events
rasdaemon: Recording mc_event events
rasdaemon: Recording mc_event events
rasdaemon: Recording mc_event events
rasdaemon: Recording mc_event events
rasdaemon: Recording mc_event events
rasdaemon: Recording aer_event events
rasdaemon: Recording aer_event events
rasdaemon: Recording aer_event events
rasdaemon: Recording aer_event events
rasdaemon: Recording aer_event events
rasdaemon: Recording extlog_event events
rasdaemon: Recording extlog_event events
rasdaemon: Recording aer_event events
rasdaemon: Recording aer_event events
rasdaemon: Recording aer_event events
rasdaemon: Recording extlog_event events
rasdaemon: Recording extlog_event events
rasdaemon: Recording aer_event events
rasdaemon: Recording aer_event events
rasdaemon: Recording extlog_event events
rasdaemon: Recording extlog_event events
rasdaemon: Recording extlog_event events
rasdaemon: Recording extlog_event events
rasdaemon: Recording aer_event events
rasdaemon: Recording extlog_event events
rasdaemon: Recording extlog_event events
rasdaemon: Recording mce_record events
rasdaemon: Recording extlog_event events
rasdaemon: Recording extlog_event events
rasdaemon: Recording extlog_event events
rasdaemon: Recording extlog_event events
rasdaemon: Recording extlog_event events
rasdaemon: Recording non_standard_event events
rasdaemon: Recording mce_record events
rasdaemon: Recording mce_record events
rasdaemon: Recording mce_record events
rasdaemon: Recording mce_record events
rasdaemon: Recording arm_event events
rasdaemon: Recording mce_record events
rasdaemon: rasdaemon: Recording mce_record events
Recording mc_event events
rasdaemon: Recording non_standard_event events
rasdaemon: Recording mce_record events
rasdaemon: Recording mce_record events
rasdaemon: Recording mce_record events
rasdaemon: Recording mce_record events
rasdaemon: Recording mce_record events
rasdaemon: Recording devlink_event events
rasdaemon: Recording mce_record events
rasdaemon: Recording non_standard_event events
rasdaemon: Recording non_standard_event events
rasdaemon: Recording non_standard_event events
rasdaemon: Recording non_standard_event events
rasdaemon: Recording non_standard_event events
rasdaemon: Recording non_standard_event events
rasdaemon: Recording non_standard_event events
rasdaemon: Recording non_standard_event events
rasdaemon: Recording arm_event events
rasdaemon: Recording mce_record events
rasdaemon: Recording non_standard_event events
rasdaemon: Recording non_standard_event events
rasdaemon: Recording mce_record events
rasdaemon: Recording non_standard_event events
rasdaemon: Recording disk_errors events
rasdaemon: Recording mce_record events
rasdaemon: Recording arm_event events
rasdaemon: Recording devlink_event events
rasdaemon: Recording arm_event events
rasdaemon: Recording arm_event events
rasdaemon: rasdaemon: Recording arm_event events
rasdaemon: Recording non_standard_event events
rasdaemon: rasdaemon: Recording arm_event events
rasdaemon: Recording arm_event events
rasdaemon: rasdaemon: Recording arm_event events
Recording arm_event events
Recording non_standard_event events
Recording mc_event events
rasdaemon: Recording aer_event events
rasdaemon: Recording arm_event events
rasdaemon: Recording arm_event events
rasdaemon: Recording devlink_event events
rasdaemon: Recording memory_failure_event events
rasdaemon: Recording arm_event events
rasdaemon: Recording devlink_event events
rasdaemon: Recording non_standard_event events
rasdaemon: Recording devlink_event events
rasdaemon: Recording devlink_event events
rasdaemon: Recording devlink_event events
rasdaemon: Recording devlink_event events
rasdaemon: rasdaemon: rasdaemon: Recording disk_errors events
Recording devlink_event events
Recording arm_event events
rasdaemon: Recording devlink_event events
rasdaemon: Recording devlink_event events
rasdaemon: Recording devlink_event events
rasdaemon: Recording arm_event events
rasdaemon: Recording arm_event events
rasdaemon: Recording disk_errors events
rasdaemon: Recording disk_errors events
rasdaemon: Recording extlog_event events
rasdaemon: Recording devlink_event events
rasdaemon: Recording mc_event events
rasdaemon: rasdaemon: Recording disk_errors events
Recording memory_failure_event events
rasdaemon: Recording disk_errors events
rasdaemon: Recording disk_errors events
rasdaemon: Recording disk_errors events
rasdaemon: Recording aer_event events
rasdaemon: Recording memory_failure_event events
rasdaemon: Recording devlink_event events
rasdaemon: Recording disk_errors events
rasdaemon: Recording disk_errors events
rasdaemon: Recording devlink_event events
rasdaemon: Recording memory_failure_event events
rasdaemon: read
Calling ras_mc_event_closedb()
rasdaemon: Recording devlink_event events
rasdaemon: Recording memory_failure_event events
rasdaemon: Recording memory_failure_event events
rasdaemon: read
Calling ras_mc_event_closedb()
rasdaemon: Recording disk_errors events
rasdaemon: Recording memory_failure_event events
rasdaemon: Recording mc_event events
rasdaemon: Recording disk_errors events
rasdaemon: Recording memory_failure_event events
rasdaemon: Recording memory_failure_event events
rasdaemon: Recording aer_event events
rasdaemon: Recording disk_errors events
rasdaemon: Recording disk_errors events
rasdaemon: Recording memory_failure_event events
rasdaemon: Recording disk_errors events
rasdaemon: Recording memory_failure_event events
rasdaemon: read
Calling ras_mc_event_closedb()
rasdaemon: Recording memory_failure_event events
rasdaemon: read
rasdaemon: Recording mc_event events
Calling ras_mc_event_closedb()
rasdaemon: Recording aer_event events
rasdaemon: Recording disk_errors events
rasdaemon: Recording memory_failure_event events
rasdaemon: Recording memory_failure_event events
rasdaemon: read
Calling ras_mc_event_closedb()
rasdaemon: read
Calling ras_mc_event_closedb()
rasdaemon: Recording extlog_event events
rasdaemon: Recording memory_failure_event events
rasdaemon: read
Calling ras_mc_event_closedb()
rasdaemon: Recording memory_failure_event events
rasdaemon: Recording mc_event events
rasdaemon: Recording mc_event events
rasdaemon: Recording extlog_event events
rasdaemon: Recording extlog_event events
rasdaemon: Recording mce_record events
rasdaemon: Recording aer_event events
rasdaemon: Recording mc_event events
rasdaemon: Recording mc_event events
rasdaemon: Recording mce_record events
rasdaemon: Recording mc_event events
rasdaemon: Recording aer_event events
rasdaemon: Recording aer_event events
rasdaemon: Recording mc_event events
rasdaemon: Recording extlog_event events
rasdaemon: Recording non_standard_event events
rasdaemon: Recording aer_event events
rasdaemon: rasdaemon: Recording mc_event events
Recording extlog_event events
rasdaemon: Recording aer_event events
rasdaemon: rasdaemon: Recording extlog_event events
rasdaemon: Recording mce_record events
Recording arm_event events
rasdaemon: Recording extlog_event events
rasdaemon: Recording non_standard_event events
rasdaemon: Recording devlink_event events
rasdaemon: Recording mce_record events
rasdaemon: Recording arm_event events
rasdaemon: Recording disk_errors events
rasdaemon: Recording non_standard_event events
rasdaemon: rasdaemon: Recording memory_failure_event events
rasdaemon: read
Calling ras_mc_event_closedb()
rasdaemon: Recording aer_event events
rasdaemon: Recording mce_record events
rasdaemon: Recording extlog_event events
Recording mce_record events
rasdaemon: Recording devlink_event events
rasdaemon: Recording arm_event events
rasdaemon: Recording non_standard_event events
rasdaemon: Recording non_standard_event events
rasdaemon: Recording disk_errors events
rasdaemon: rasdaemon: Recording aer_event events
Recording devlink_event events
rasdaemon: Recording memory_failure_event events
rasdaemon: read
rasdaemon: rasdaemon: rasdaemon: Recording disk_errors events
Recording arm_event events
rasdaemon: Recording extlog_event events
Recording mce_record events
rasdaemon: Recording arm_event events
Calling ras_mc_event_closedb()
rasdaemon: Recording memory_failure_event events
rasdaemon: read
Calling ras_mc_event_closedb()
rasdaemon: Recording devlink_event events
rasdaemon: Recording non_standard_event events
rasdaemon: Recording devlink_event events
rasdaemon: Recording mce_record events
rasdaemon: Recording disk_errors events
rasdaemon: Recording arm_event events
rasdaemon: Recording non_standard_event events
rasdaemon: Recording memory_failure_event events
rasdaemon: Recording devlink_event events
rasdaemon: Recording arm_event events
rasdaemon: Recording disk_errors events
rasdaemon: Recording devlink_event events
rasdaemon: Recording memory_failure_event events
rasdaemon: read
Calling ras_mc_event_closedb()
rasdaemon: Recording disk_errors events
rasdaemon: Recording disk_errors events
rasdaemon: Recording memory_failure_event events
rasdaemon: read
Calling ras_mc_event_closedb()
rasdaemon: Recording memory_failure_event events
rasdaemon: read
Calling ras_mc_event_closedb()
rasdaemon: mce_record store: 0x7f3794022ca8

Segmentation fault

`

But the rasdaemon SQLite database remains unchanged:
/var/lib/rasdaemon# ls -l
total 5
-rw-r--r-- 1 root root 40960 Aug 31 21:37 ras-mc_event.db

/usr/local/sbin/ras-mc-ctl --status
ras-mc-ctl: drivers are loaded.

/usr/local/sbin/ras-mc-ctl --errors
No Memory errors.

No PCIe AER errors.

No ARM processor errors.

No Extlog errors.

No devlink errors.

No disk errors.

No Memory failure errors.

No MCE errors.
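
To double-check that the database really holds no records, independent of what ras-mc-ctl prints, one could count the rows in every table directly. This is only a minimal sketch: it assumes the database lives at /var/lib/rasdaemon/ras-mc_event.db as in the listing above, and it deliberately reads the table names from the file itself instead of assuming rasdaemon's schema. The function name `count_rows` is made up for illustration.

```python
import sqlite3


def count_rows(db_path):
    """Return {table_name: row_count} for every table in the SQLite file."""
    # Open read-only so a running rasdaemon is not disturbed.
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        tables = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
        # Table names come from sqlite_master itself, so quoting them is enough.
        return {t: conn.execute(f'SELECT COUNT(*) FROM "{t}"').fetchone()[0]
                for t in tables}
    finally:
        conn.close()
```

Calling `count_rows("/var/lib/rasdaemon/ras-mc_event.db")` as root should return a dictionary of per-table counts. If every count is zero while dmesg keeps showing machine check events, the events are not reaching the database at all.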

From the output of running rasdaemon in the foreground, I noticed that rasdaemon tried to open files for, and listen on, CPUs that do not exist.

I have a Ryzen 5 2600 with 6 cores / 12 threads. I am not sure why it tried to do things with CPUs up to 31.

`
ls /sys/devices/system/cpu/
cpu0 cpu10 cpu2 cpu4 cpu6 cpu8 cpufreq hotplug kernel_max modalias online power smt vulnerabilities
cpu1 cpu11 cpu3 cpu5 cpu7 cpu9 cpuidle isolated microcode offline possible present uevent

`
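
The listing above contains the `possible`, `present` and `online` files. The kernel can report more possible CPUs than are actually present, which might explain the range 0 to 31, although nobody in this thread has confirmed which of these values rasdaemon uses. A minimal sketch for comparing the three sets (the helper names `parse_cpu_list` and `read_cpu_sets` are made up for illustration):

```python
from pathlib import Path

CPU_SYSFS = Path("/sys/devices/system/cpu")


def parse_cpu_list(text):
    """Expand a kernel CPU list such as "0-3,8,10-11" into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # empty file or trailing comma
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def read_cpu_sets(root=CPU_SYSFS):
    """Read the possible/present/online CPU lists, skipping files that are missing."""
    result = {}
    for name in ("possible", "present", "online"):
        path = Path(root) / name
        if path.exists():
            result[name] = parse_cpu_list(path.read_text())
    return result
```

If `read_cpu_sets()["possible"]` turns out to be longer than `read_cpu_sets()["present"]`, the "open file ... failed" and "Error on CPU" lines for cpu12 to cpu31 would at least be consistent with rasdaemon iterating over possible CPUs rather than present ones.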

@githubDiversity

I no longer think this is the best place to discuss our issues, as rasdaemon, although not working as expected, no longer seems to be related to them.

How I came to that conclusion is described below.

If you want to follow my progress as I pin down the exact cause, please DM me and I will create a thread on the ASRock Rack support forums.


I tried running Fedora 38 KDE Plasma from a live USB for a while.

I installed rasdaemon 0.8.1.

I see the same MCE every 5 minutes 11 seconds, but rasdaemon still seems broken: nothing is recorded, even after I enabled tracing.
I think rasdaemon should check on startup whether tracing is enabled.

Same tracing info as on Debian 12.
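
As a rough illustration of such a startup check, the sketch below reads the per-event `enable` file that the kernel exposes under tracefs. The two mount points are the usual ones but are an assumption about the system at hand, and the function name `event_enabled` is made up. A rasdaemon build that uses its own private tracing instance would need that instance directory passed in `roots` instead, which is also an assumption I have not verified for the versions discussed here.

```python
from pathlib import Path

# Common tracefs mount points; which one exists depends on the kernel and distro.
TRACEFS_ROOTS = ("/sys/kernel/tracing", "/sys/kernel/debug/tracing")


def event_enabled(event, roots=TRACEFS_ROOTS):
    """Return True/False for "system:name" events, or None if the state is unreadable."""
    system, name = event.split(":", 1)
    for root in roots:
        enable = Path(root) / "events" / system / name / "enable"
        try:
            return enable.read_text().strip().startswith("1")
        except OSError:
            continue  # not mounted here, event missing, or no permission
    return None
```

Run as root, `event_enabled("mce:mce_record")` gives a quick yes/no for the event that the foreground log reports as enabled; a result of `None` means the check itself could not be performed.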

I replaced the CPU with a Ryzen 5 2600 Pro but still get the same MCE errors every 5 min 11 seconds.
I did manage to destroy my ability to remote view via IPMI though. Yeee, no good deed goes unpunished ;(

So now I will start removing PCI-connected devices to see if that changes things, but I will not share that progress here as it is not related to rasdaemon.

@githubDiversity

I am not well versed in this site.
Can I send or receive private messages here?

Anyway, I have opened a support ticket at ASRock Rack. Since these are server products, the ASRock forum is not the place to ask questions.

@githubDiversity

As usual, ASRock Rack tech support is interested in helping out, and we are now in the process of digging down.

But I really think that if Linux systems ship rasdaemon, it should work out of the box or at least tell the admin why it will probably not work.
@mchehab I really think we need your involvement right now. This is no longer just an isolated case.

@voltagex

voltagex commented Sep 5, 2023 via email

@githubDiversity

Like I stated earlier, I do not believe this is the correct place to discuss mobo-related issues.

All I hope for is that the issue of rasdaemon seemingly having been broken for a long time gets the attention it needs, as many people run it.

@githubDiversity

Any news on the rasdaemon side? This issue is still open without (as far as I can tell) a response from the maintainers.

If that is correct, then I am really worried about the state of open source operating systems. Most of them ship this piece of software by default.

@githubDiversity

ASRock Netherlands asked me to jump through a number of hoops before they would relay our board (not rasdaemon) issue to the engineers in Taiwan.

I jumped through all of them, and then ASRock Netherlands told me to simply send the board back to the reseller.

Well, that is asking a lot after I have rendered 2 of my CPUs useless by upgrading the BIOS as they asked.

I have opened a post on the ASRock forum. I am sure people there will say that this is the ASRock forum and not ASRock Rack, blah blah.

That is why I have not yet posted the actual board type.

https://forum.asrock.com/forum_posts.asp?TID=26816&PID=107896&#107896

@githubDiversity

What is the status here, if I may ask? Are your board issues resolved?

Is rasdaemon by now able to report as expected?

@voltagex

> ASRock Netherlands asked me to jump through a number of hoops before they would relay our board (not rasdaemon) issue to the engineers in Taiwan.
>
> I jumped through all of them, and then ASRock Netherlands told me to simply send the board back to the reseller.
>
> Well, that is asking a lot after I have rendered 2 of my CPUs useless by upgrading the BIOS as they asked.
>
> I have opened a post on the ASRock forum. I am sure people there will say that this is the ASRock forum and not ASRock Rack, blah blah.

Different company, of course they will.

> That is why I have not yet posted the actual board type.
>
> https://forum.asrock.com/forum_posts.asp?TID=26816&PID=107896&#107896

@voltagex

> What is the status here, if I may ask? Are your board issues resolved?
>
> Is rasdaemon by now able to report as expected?

As of kernel 6.5 I am getting reports in dmesg WITHOUT any additional configuration of rasdaemon.

@ecclex

ecclex commented Oct 18, 2023

I had this issue bookmarked when I was fixing the ECC reporting on my consumer AM5 B650 motherboard. After I ran:

sudo systemctl start rasdaemon.service
sudo systemctl enable rasdaemon.service
sudo systemctl start ras-mc-ctl.service
sudo systemctl enable ras-mc-ctl.service

ras started reporting it.
The only issue I see is that while the ECC reporting happens, sudo ras-mc-ctl --error-count isn't showing anything, but sudo ras-mc-ctl --summary and sudo ras-mc-ctl --errors do.

If there is still no ECC reporting after the services are started, then I'd expect it to be a mobo/BIOS issue. If you still have issues, I'd probably just sell the mobo, get a Supermicro one and move on.

Hardware: ASUS TUF GAMING B650-PLUS, 7800X3D and 2xKSM48E40BD8KM-32HM.

@githubDiversity

githubDiversity commented Oct 26, 2023

Yeah, Supermicro is starting to sound more palatable. But let's focus on rasdaemon.

I think I have some new info. It seems that the watchdog timer on our boards is bonkers.
That would explain the precise cyclic nature.
I have had numerous cases where, while shutting down, the watchdog did not shut down or respond.
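
For anyone who wants to verify the cycle on their own board, here is a minimal sketch that measures the gaps between the "Machine check events logged" lines in dmesg. The sample lines are copied from the dmesg output in the first post of this issue; the function name `mce_intervals` and the regular expression are my own and assume the default dmesg timestamp format.

```python
import re

# Matches the "[  310.396113]" timestamp that dmesg prints by default.
STAMP = re.compile(r"^\[\s*(\d+\.\d+)\]")


def mce_intervals(dmesg_text, needle="Machine check events logged"):
    """Return the gaps in seconds between consecutive matching dmesg lines."""
    stamps = []
    for line in dmesg_text.splitlines():
        match = STAMP.match(line)
        if match and needle in line:
            stamps.append(float(match.group(1)))
    return [round(b - a, 3) for a, b in zip(stamps, stamps[1:])]


# Sample taken from the dmesg output in the first post of this issue.
SAMPLE = """\
[  316.656894] mce: [Hardware Error]: Machine check events logged
[  627.947258] mce: [Hardware Error]: Machine check events logged
[  939.240972] mce: [Hardware Error]: Machine check events logged
[ 1250.534814] mce: [Hardware Error]: Machine check events logged
[ 1561.828702] mce: [Hardware Error]: Machine check events logged
[ 1873.122720] mce: [Hardware Error]: Machine check events logged
"""

print(mce_intervals(SAMPLE))
```

For the sample, every gap comes out at about 311.29 seconds, which is the 5 minutes 11 seconds mentioned earlier. A period that regular points to a timer of some kind rather than random hardware faults, which fits the watchdog suspicion but does not prove it.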
