mergerfs pool mount fails. Best way to monitor and self-recovery via script? #1098

TheLinuxGuy · 2022-11-20T20:25:54Z


I'm running a tiered cache setup (a ZFS cache pool in front of a second mergerfs pool of slow disks). The primary pool is exported over NFS to Linux hosts that copy a lot of data via rsync. I have found that sometimes FUSE mount fails, causing NFS to fail as well; it doesn't recover from this failure.

On the host running the mergerfs pools, this is what I see:

root@nas:/home/gfm# ls -lah /mnt/slow-storage/ | wc -l
13
root@nas:/home/gfm# ls -lah /mnt/cached
ls: cannot access '/mnt/cached': Input/output error

Here's /etc/fstab (edit: as I write this post, I just realized I put "msplfs" on the wrong mount; however, I would still like an answer to my question about monitoring and self-healing):

# mergerfs - stitch all slow disks.
/mnt/disk* /mnt/slow-storage fuse.mergerfs defaults,nonempty,allow_other,use_ino,category.create=msplfs,cache.files=off,moveonenospc=true,dropcacheonclose=true,minfreespace=300G,fsname=mergerfs 0 0
# mergerfs - fast nvme cache w/ NFS settings.
/cache:/mnt/slow-storage /mnt/cached fuse.mergerfs defaults,nonempty,allow_other,use_ino,noforget,inodecalc=path-hash,security_capability=false,cache.files=partial,category.create=lfs,moveonenospc=true,dropcacheonclose=true,minfreespace=4G,fsname=mfs-cache 0 0

The ZFS pool and mount point are fine.

root@nas:/home/gfm# ls -lah /cache/ | wc -l
11

I recover from this by remounting and restarting NFS:

root@nas:/home/gfm# umount -f -l /mnt/cached/
root@nas:/home/gfm# mount /mnt/cached/
root@nas:/home/gfm# ls -lah /mnt/cached | wc -l
13
systemctl restart nfs-kernel-server

Info

root@nas:/home/gfm# uname -a
Linux nas 5.15.0-53-generic #59-Ubuntu SMP Mon Oct 17 18:53:30 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
root@nas:/home/gfm# mergerfs -v
mergerfs version: 2.33.5

This tells me the /mnt/cached mergerfs cache pool failed. I am not yet sure how best to troubleshoot the root cause of the mergerfs crash, but I would like to know the recommended way to set up a monitoring and self-healing script, as it seems that mergerfs is not catching its own failure.
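
A minimal sketch of such a monitoring and self-healing check, assuming it runs as root from cron or a systemd timer. The function names, the `PROBE_TIMEOUT` and `NFS_UNIT` variables, and the `mount-watchdog` log tag are hypothetical; the recovery steps simply repeat the manual commands above. It only treats the symptom and says nothing about why the mount fails.

```shell
# Hypothetical watchdog sketch: probe a mountpoint and remount it if the probe fails.
# PROBE_TIMEOUT and NFS_UNIT are illustrative names, not mergerfs settings.
PROBE_TIMEOUT="${PROBE_TIMEOUT:-10}"
NFS_UNIT="${NFS_UNIT:-nfs-kernel-server}"

# Succeeds only if the directory can be listed within the timeout.
# A dead FUSE mount typically fails here with "Input/output error";
# a hung one may hit the timeout instead. A path that is cleanly
# unmounted still lists fine, so this probe does not detect that case.
mount_is_healthy() {
    timeout "$PROBE_TIMEOUT" ls -A -- "$1" >/dev/null 2>&1
}

# Repeats the manual recovery: lazy forced unmount, remount from fstab,
# then restart the NFS server so it picks up the new mount.
recover_mount() {
    umount -f -l "$1"
    mount "$1" && systemctl restart "$NFS_UNIT"
}

# Probe first; only touch the mount when the probe fails.
check_and_recover() {
    if mount_is_healthy "$1"; then
        return 0
    fi
    logger -t mount-watchdog "$1 failed health probe, remounting"
    recover_mount "$1"
}
```

It could be called as `check_and_recover /mnt/cached` every few minutes. Probing before acting keeps the script from disturbing a mount that is still working.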

Answered by trapexit

Nov 23, 2022

If it's crashing then strace is pretty useless. Need a stack trace from gdb.

gdb path/to/mergerfs

run -f -o options branches mountpoint

when it crashes

thread apply all bt
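
The same steps can be scripted non-interactively. This is a sketch, assuming gdb is installed and that the options, branches, and mountpoint are copied from the existing fstab entry; `run_under_gdb`, `GDB`, and `MERGERFS_BIN` are hypothetical names.

```shell
# Hypothetical wrapper around the gdb session described above.
# GDB and MERGERFS_BIN are illustrative variables, not mergerfs settings.
GDB="${GDB:-gdb}"
MERGERFS_BIN="${MERGERFS_BIN:-mergerfs}"

# $1 = mount options, $2 = branches, $3 = mountpoint
# -batch makes gdb exit after its commands finish; the two -ex commands
# start the program and, once it stops (for example on a crash),
# print a backtrace for every thread.
run_under_gdb() {
    "$GDB" -batch -ex run -ex "thread apply all bt" \
        --args "$MERGERFS_BIN" -f -o "$1" "$2" "$3"
}
```

Redirecting the output to a file, for example `run_under_gdb "$opts" /cache:/mnt/slow-storage /mnt/cached > bt.txt 2>&1`, would keep the trace for a bug report.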

trapexit · 2022-11-20T21:06:01Z

Maintainer

I have found that sometimes FUSE mount fails

Why?

It shouldn't fail and shouldn't need to be repaired. There is nothing special about mergerfs vs any other filesystem.

2 replies

TheLinuxGuy Nov 20, 2022
Author

Why?

It's possible that using msplfs in the /mnt/cached mount settings caused it. I reverted the setting to 'lfs', which is what it was before and what it should have been; I got it wrong when I started experimenting with category.create=msplfs.

It shouldn't fail and shouldn't need to be repaired. There is nothing special about mergerfs vs any other filesystem.

Where would you look for FUSE errors if any were triggered? I tried grepping syslog for 'fuse', but there is only a single entry, from boot.

trapexit Nov 20, 2022
Maintainer

If you're saying msplfs is causing crashes then let me look into that.

trapexit · 2022-11-20T21:19:23Z

Maintainer

There are no "errors" to be reported. Things mount or they don't. If there are config errors they get reported like any other. If there is a bug and it crashes there wouldn't be something logged except maybe in kernel messages depending on the system setup.
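
One way to check those kernel messages is sketched below. The pattern list is an assumption about what tends to accompany a crashed or hung userspace process and is not exhaustive; `filter_kernel_hints` is a hypothetical helper name.

```shell
# Hypothetical filter: keep log lines that commonly accompany a crashed
# or hung userspace process. Reads from stdin, writes matches to stdout.
filter_kernel_hints() {
    grep -E -i 'mergerfs|fuse|segfault|general protection|blocked for more than|out of memory'
}

# Example use on a live system:
#   journalctl -k --no-pager | filter_kernel_hints
#   dmesg | filter_kernel_hints
```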

1 reply

TheLinuxGuy Nov 20, 2022
Author

If there is a bug and it crashes there wouldn't be something logged except maybe in kernel messages depending on the system setup.

This has been my observation: the mergerfs cache pool mounts successfully and works for a while, then the local filesystem reports "Input/output error". If you were seeing this on your system, what steps would you take to detect and debug it?

I do see something odd in syslog:

root@nas:/home/gfm# tail -n 100 /var/log/syslog
Nov 20 17:58:32 nas pcp-pmie[2510]: High aggregate context switch rate 41175ctxsw/s@nas
Nov 20 17:59:01 nas CRON[359853]: (root) CMD (/home/gfm/unraid-mover.sh)
Nov 20 18:00:15 nas systemd[1]: Started Timeline of Snapper Snapshots.
Nov 20 18:00:15 nas dbus-daemon[1322]: [system] Activating via systemd: service name='org.opensuse.Snapper' unit='snapperd.service' requested by ':1.125' (uid=0 pid=366063 comm="/usr/lib/snapper/systemd-helper --timeline " label="unconfined")
Nov 20 18:00:15 nas systemd[1]: Starting DBus interface for snapper...
Nov 20 18:00:15 nas dbus-daemon[1322]: [system] Successfully activated service 'org.opensuse.Snapper'
Nov 20 18:00:15 nas systemd[1]: Started DBus interface for snapper.
Nov 20 18:00:15 nas systemd[1]: snapper-timeline.service: Deactivated successfully.
Nov 20 18:01:15 nas systemd[1]: snapperd.service: Deactivated successfully.
Nov 20 18:05:01 nas CRON[387324]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Nov 20 18:05:01 nas rpc.mountd[1601]: authenticated unmount request from 192.168.1.254:1009 for /mnt/cached/movies (/mnt/cached)
Nov 20 18:05:01 nas rpc.mountd[1601]: authenticated unmount request from 192.168.1.254:642 for /mnt/cached/tv (/mnt/cached)
Nov 20 18:05:01 nas rpc.mountd[1601]: authenticated unmount request from 192.168.1.254:682 for /mnt/cached/movies (/mnt/cached)
Nov 20 18:05:01 nas rpc.mountd[1601]: authenticated unmount request from 192.168.1.254:736 for /mnt/cached/tv (/mnt/cached)
Nov 20 18:06:59 nas systemd[1]: Starting Refresh fwupd metadata and update motd...
Nov 20 18:07:00 nas dbus-daemon[1322]: [system] Activating via systemd: service name='org.freedesktop.fwupd' unit='fwupd.service' requested by ':1.134' (uid=118 pid=400447 comm="/usr/bin/fwupdmgr refresh " label="unconfined")
Nov 20 18:07:00 nas systemd[1]: Starting Firmware update daemon...
Nov 20 18:07:01 nas kernel: [ 4965.205778] Lockdown: fwupd: /dev/mem,kmem,port is restricted; see man kernel_lockdown.7
Nov 20 18:07:01 nas dbus-daemon[1322]: [system] Activating via systemd: service name='org.freedesktop.UPower' unit='upower.service' requested by ':1.135' (uid=0 pid=400521 comm="/usr/libexec/fwupd/fwupd " label="unconfined")
Nov 20 18:07:01 nas systemd[1]: Starting Daemon for power management...
Nov 20 18:07:01 nas dbus-daemon[1322]: [system] Successfully activated service 'org.freedesktop.UPower'
Nov 20 18:07:01 nas systemd[1]: Started Daemon for power management.
Nov 20 18:07:01 nas dbus-daemon[1322]: [system] Successfully activated service 'org.freedesktop.fwupd'
Nov 20 18:07:01 nas systemd[1]: Started Firmware update daemon.
Nov 20 18:07:01 nas fwupdmgr[400447]: Updating lvfs
Nov 20 18:07:01 nas fwupdmgr[400447]: Downloading…: 0%
Nov 20 18:07:02 nas fwupdmgr[400447]: Idle…: 100%
Nov 20 18:07:02 nas fwupdmgr[400447]: message repeated 2 times: [ Idle…: 100%]
Nov 20 18:07:02 nas fwupdmgr[400447]: Successfully downloaded new metadata: 1 local device supported
Nov 20 18:07:02 nas systemd[1]: fwupd-refresh.service: Deactivated successfully.
Nov 20 18:07:02 nas systemd[1]: Finished Refresh fwupd metadata and update motd.
Nov 20 18:07:11 nas CRON[359852]: (CRON) info (No MTA installed, discarding output)
Nov 20 18:08:32 nas pcp-pmie[2510]: High aggregate context switch rate 42111ctxsw/s@nas
Nov 20 18:14:32 nas pcp-pmie[2510]: Severe demand for real memory 8.8pgsout/s@nas
Nov 20 18:15:02 nas CRON[414497]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Nov 20 18:15:02 nas CRON[414496]: (root) CMD (/home/gfm/unraid-mover.sh)
Nov 20 18:16:32 nas pcp-pmie[2510]: High 1-minute load average 12.1load@nas
Nov 20 18:17:02 nas CRON[424169]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Nov 20 18:18:32 nas pcp-pmie[2510]: High aggregate context switch rate 46288ctxsw/s@nas
Nov 20 18:24:32 nas pcp-pmie[2510]: Severe demand for real memory 14.0pgsout/s@nas
Nov 20 18:25:01 nas CRON[465713]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Nov 20 18:25:01 nas systemd[1]: Starting Check pmlogger instances are running...
Nov 20 18:25:01 nas systemd[1]: Started Check pmlogger instances are running.
Nov 20 18:25:01 nas systemd[1]: pmlogger_check.service: Deactivated successfully.
Nov 20 18:25:18 nas systemd[1]: Starting Check and migrate non-primary pmlogger farm instances...
Nov 20 18:25:18 nas systemd[1]: Started Check and migrate non-primary pmlogger farm instances.
Nov 20 18:25:18 nas systemd[1]: pmlogger_farm_check.service: Deactivated successfully.
Nov 20 18:25:44 nas CRON[414495]: (CRON) info (No MTA installed, discarding output)
Nov 20 18:28:15 nas systemd[1]: Starting Check PMIE instances are running...
Nov 20 18:28:15 nas systemd[1]: Starting Check and migrate non-primary pmie farm instances...
Nov 20 18:28:15 nas systemd[1]: Started Check and migrate non-primary pmie farm instances.
Nov 20 18:28:15 nas systemd[1]: Started Check PMIE instances are running.
Nov 20 18:28:15 nas systemd[1]: pmie_farm_check.service: Deactivated successfully.
Nov 20 18:28:15 nas systemd[1]: pmie_check.service: Deactivated successfully.
Nov 20 18:28:32 nas pcp-pmie[2510]: High aggregate context switch rate 44328ctxsw/s@nas
Nov 20 18:29:35 nas hd-idle[1328]: sdb spindown
Nov 20 18:29:35 nas hd-idle[1328]: sde spindown
Nov 20 18:30:01 nas CRON[470058]: (root) CMD (/home/gfm/unraid-mover.sh)
Nov 20 18:34:32 nas pcp-pmie[2510]: Severe demand for real memory 6.6pgsout/s@nas
Nov 20 18:34:36 nas hd-idle[1328]: sdb spinup
Nov 20 18:35:01 nas CRON[470297]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Nov 20 18:35:36 nas hd-idle[1328]: sde spinup
Nov 20 18:36:09 nas systemd[1]: mnt-cached.mount: Deactivated successfully.
Nov 20 18:36:09 nas systemd[1]: mnt-cached.mount: Unit process 540 (mergerfs) remains running after unit stopped.
Nov 20 18:36:09 nas systemd[1]: mnt-cached.mount: Consumed 6min 1.120s CPU time.
Nov 20 18:36:31 nas rpc.mountd[1601]: v4.0 client detached: (null) from (null)
Nov 20 18:36:31 nas rpc.mountd[1601]: v4.2 client detached: 0xc35c2cf4637a9fc0 from "192.168.1.2:870"
Nov 20 18:36:46 nas systemd[1]: Condition check resulted in Kernel Module supporting RPCSEC_GSS being skipped.
Nov 20 18:36:46 nas systemd[1]: zfs-share.service: Deactivated successfully.
Nov 20 18:36:46 nas systemd[1]: Stopped ZFS file system shares.
Nov 20 18:36:46 nas systemd[1]: Stopping ZFS file system shares...
Nov 20 18:36:46 nas systemd[1]: Stopping NFS server and services...
Nov 20 18:36:46 nas kernel: [ 6750.760929] nfsd: last server has exited, flushing export cache
Nov 20 18:36:46 nas systemd[1]: nfs-server.service: Deactivated successfully.
Nov 20 18:36:46 nas systemd[1]: Stopped NFS server and services.
Nov 20 18:36:46 nas systemd[1]: Stopping NFSv4 ID-name mapping service...
Nov 20 18:36:46 nas rpc.mountd[1601]: Caught signal 15, un-registering and exiting.
Nov 20 18:36:46 nas systemd[1]: Stopping NFS Mount Daemon...
Nov 20 18:36:46 nas rpc.idmapd[1260]: exiting on signal 15
Nov 20 18:36:46 nas systemd[1]: Condition check resulted in RPC security service for NFS client and server being skipped.
Nov 20 18:36:46 nas systemd[1]: Condition check resulted in RPC security service for NFS server being skipped.
Nov 20 18:36:46 nas systemd[1]: nfs-idmapd.service: Main process exited, code=exited, status=1/FAILURE
Nov 20 18:36:46 nas systemd[1]: nfs-idmapd.service: Failed with result 'exit-code'.
Nov 20 18:36:46 nas systemd[1]: Stopped NFSv4 ID-name mapping service.
Nov 20 18:36:46 nas systemd[1]: nfs-mountd.service: Deactivated successfully.
Nov 20 18:36:46 nas systemd[1]: Stopped NFS Mount Daemon.
Nov 20 18:36:46 nas systemd[1]: Starting NFSv4 ID-name mapping service...
Nov 20 18:36:46 nas systemd[1]: Starting NFS Mount Daemon...
Nov 20 18:36:46 nas rpc.idmapd[470621]: Setting log level to 0
Nov 20 18:36:46 nas systemd[1]: Started NFSv4 ID-name mapping service.
Nov 20 18:36:46 nas rpc.mountd[470622]: Version 2.6.1 starting
Nov 20 18:36:46 nas systemd[1]: Started NFS Mount Daemon.
Nov 20 18:36:46 nas systemd[1]: Starting NFS server and services...
Nov 20 18:36:46 nas kernel: [ 6750.934885] NFSD: Using nfsdcld client tracking operations.
Nov 20 18:36:46 nas kernel: [ 6750.934888] NFSD: no clients to reclaim, skipping NFSv4 grace period (net f0000000)
Nov 20 18:36:46 nas systemd[1]: Finished NFS server and services.
Nov 20 18:36:46 nas systemd[1]: Starting ZFS file system shares...
Nov 20 18:36:46 nas systemd[1]: Finished ZFS file system shares.
Nov 20 18:37:02 nas rpc.mountd[470622]: authenticated mount request from 192.168.1.2:800 for /mnt/cached (/mnt/cached)
Nov 20 18:38:32 nas pcp-pmie[2510]: High aggregate context switch rate 41942ctxsw/s@nas

Seems like it's triggered here:

Nov 20 18:35:36 nas hd-idle[1328]: sde spinup
Nov 20 18:36:09 nas systemd[1]: mnt-cached.mount: Deactivated successfully.
Nov 20 18:36:09 nas systemd[1]: mnt-cached.mount: Unit process 540 (mergerfs) remains running after unit stopped.
Nov 20 18:36:09 nas systemd[1]: mnt-cached.mount: Consumed 6min 1.120s CPU time.
Nov 20 18:36:31 nas rpc.mountd[1601]: v4.0 client detached: (null) from (null)
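
To pull every such event out of a longer log, a small filter along these lines might help. It is a sketch that assumes the syslog format shown above, where systemd logs as `systemd[1]`; `mount_unit_events` is a hypothetical helper name.

```shell
# Print every line where systemd reports on the given mount unit,
# together with a few preceding lines of context.
# $1 = unit name (e.g. mnt-cached.mount), $2 = log file,
# $3 = number of context lines before each match (default 2)
mount_unit_events() {
    grep -F -B "${3:-2}" -- "systemd[1]: $1:" "$2"
}

# Example:
#   mount_unit_events mnt-cached.mount /var/log/syslog 3
```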

TheLinuxGuy · 2022-11-21T01:11:00Z

Author

The odd thing is that it keeps crashing.

root@nas:/home/gfm# systemctl status mnt-cached.mount
● mnt-cached.mount - /mnt/cached
     Loaded: loaded (/etc/fstab; generated)
     Active: active (mounted) since Sun 2022-11-20 18:50:19 EST; 1h 17min ago
      Where: /mnt/cached
       What: mfs-cache
       Docs: man:fstab(5)
             man:systemd-fstab-generator(8)
      Tasks: 6 (limit: 9362)
     Memory: 27.4M
        CPU: 4min 32.478s
     CGroup: /system.slice/mnt-cached.mount
             └─562 mergerfs /cache:/mnt/slow-storage /mnt/cached -o rw,nonempty,allow_other,use_ino,no>

Nov 20 18:50:19 nas systemd[1]: Mounting /mnt/cached...
Nov 20 18:50:19 nas systemd[1]: Mounted /mnt/cached.
root@nas:/home/gfm# ls /mnt/cached
ls: cannot access '/mnt/cached': Input/output error

systemctl restart mnt-cached.mount recovers it, apparently without requiring an NFS service restart.

Edit 11/21: some kernel errors show up in the logs. But could these be caused by the FUSE mount no longer working?

root@nas:/home/gfm# grep 'Nov 21 01:29:21' -A 100 /var/log/syslog
Nov 21 01:29:21 nas systemd[1]: mnt-cached.mount: Deactivated successfully.
Nov 21 01:29:21 nas systemd[1]: mnt-cached.mount: Unit process 529 (mergerfs) remains running after unit stopped.
Nov 21 01:29:21 nas systemd[1]: mnt-cached.mount: Consumed 8min 52.512s CPU time.
Nov 21 01:29:21 nas kernel: [15619.679654] ------------[ cut here ]------------
Nov 21 01:29:21 nas kernel: [15619.679655] nfsd: non-standard errno: -107
Nov 21 01:29:21 nas kernel: [15619.679686] WARNING: CPU: 0 PID: 52573 at fs/nfsd/nfsproc.c:886 nfserrno+0x89/0xb0 [nfsd]
Nov 21 01:29:21 nas kernel: [15619.679711] Modules linked in: rpcsec_gss_krb5 wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel nfnetlink_acct cfg80211 xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo nft_counter xt_addrtype nft_compat nf_tables nfnetlink br_netfilter bridge stp llc overlay binfmt_misc xfs nls_iso8859_1 zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) input_leds serio_raw joydev snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer snd soundcore mac_hid qemu_fw_cfg dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua sch_fq_codel nfsd ramoops pstore_blk msr reed_solomon pstore_zone auth_rpcgss nfs_acl lockd efi_pstore grace sunrpc ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq
Nov 21 01:29:21 nas kernel: [15619.679735]  async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic bochs drm_vram_helper usbhid drm_ttm_helper hid ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops i2c_i801 virtio_net cec net_failover rc_core ahci aesni_intel crypto_simd cryptd libahci i2c_smbus psmouse mpt3sas nvme raid_class lpc_ich scsi_transport_sas nvme_core failover virtio_scsi drm
Nov 21 01:29:21 nas kernel: [15619.679753] CPU: 0 PID: 52573 Comm: nfsd Tainted: P           O      5.15.0-53-generic #59-Ubuntu
Nov 21 01:29:21 nas kernel: [15619.679755] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Nov 21 01:29:21 nas kernel: [15619.679756] RIP: 0010:nfserrno+0x89/0xb0 [nfsd]
Nov 21 01:29:21 nas kernel: [15619.679763] Code: 01 0f 87 14 a5 03 00 83 e3 01 b8 00 00 00 05 75 d7 44 89 e6 48 c7 c7 64 4a 89 c0 89 45 e4 c6 05 ac da 05 00 01 e8 ce a2 ea d1 <0f> 0b 8b 45 e4 eb b7 48 c7 c7 00 58 8a c0 e8 44 90 83 d1 eb 91 4c
Nov 21 01:29:21 nas kernel: [15619.679764] RSP: 0018:ffffb2230bc3bc20 EFLAGS: 00010282
Nov 21 01:29:21 nas kernel: [15619.679765] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000027
Nov 21 01:29:21 nas kernel: [15619.679765] RDX: ffff9d62b7c20588 RSI: 0000000000000001 RDI: ffff9d62b7c20580
Nov 21 01:29:21 nas kernel: [15619.679766] RBP: ffffb2230bc3bc40 R08: 0000000000000003 R09: 61646e6174732d6e
Nov 21 01:29:21 nas kernel: [15619.679766] R10: 7272652064726164 R11: 3730312d203a6f6e R12: 00000000ffffff95
Nov 21 01:29:21 nas kernel: [15619.679767] R13: 0000000000000022 R14: ffff9d6150ebeb00 R15: ffff9d6142588030
Nov 21 01:29:21 nas kernel: [15619.679767] FS:  0000000000000000(0000) GS:ffff9d62b7c00000(0000) knlGS:0000000000000000
Nov 21 01:29:21 nas kernel: [15619.679768] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 21 01:29:21 nas kernel: [15619.679769] CR2: 00007facb651a0c0 CR3: 000000010e660000 CR4: 00000000000006f0
Nov 21 01:29:21 nas kernel: [15619.679771] Call Trace:
Nov 21 01:29:21 nas kernel: [15619.679771]  <TASK>
Nov 21 01:29:21 nas kernel: [15619.679773]  nfsd_finish_read+0x32/0x80 [nfsd]
Nov 21 01:29:21 nas kernel: [15619.679780]  nfsd_splice_read+0x95/0x100 [nfsd]
Nov 21 01:29:21 nas kernel: [15619.679788]  nfsd4_encode_splice_read+0x62/0x180 [nfsd]
Nov 21 01:29:21 nas kernel: [15619.679803]  nfsd4_encode_read+0x21b/0x270 [nfsd]
Nov 21 01:29:21 nas kernel: [15619.679810]  nfsd4_encode_operation+0xc2/0x2c0 [nfsd]
Nov 21 01:29:21 nas kernel: [15619.679817]  nfsd4_proc_compound+0x31a/0x750 [nfsd]
Nov 21 01:29:21 nas kernel: [15619.679824]  ? nfsd_cache_lookup+0x3b7/0x4a0 [nfsd]
Nov 21 01:29:21 nas kernel: [15619.679831]  nfsd_dispatch+0x163/0x260 [nfsd]
Nov 21 01:29:21 nas kernel: [15619.679838]  svc_process_common+0x3da/0x720 [sunrpc]
Nov 21 01:29:21 nas kernel: [15619.679858]  ? nfsd_svc+0x190/0x190 [nfsd]
Nov 21 01:29:21 nas kernel: [15619.679865]  svc_process+0xbc/0x100 [sunrpc]
Nov 21 01:29:21 nas kernel: [15619.679882]  nfsd+0xed/0x150 [nfsd]
Nov 21 01:29:21 nas kernel: [15619.679889]  ? nfsd_shutdown_threads+0x90/0x90 [nfsd]
Nov 21 01:29:21 nas kernel: [15619.679895]  kthread+0x12a/0x150
Nov 21 01:29:21 nas kernel: [15619.679898]  ? set_kthread_struct+0x50/0x50
Nov 21 01:29:21 nas kernel: [15619.679899]  ret_from_fork+0x22/0x30
Nov 21 01:29:21 nas kernel: [15619.679901]  </TASK>
Nov 21 01:29:21 nas kernel: [15619.679901] ---[ end trace a5abc8e072ded414 ]---
Nov 21 01:29:27 nas systemd[1]: Condition check resulted in Kernel Module supporting RPCSEC_GSS being skipped.
Nov 21 01:29:27 nas systemd[1]: Stopping NFS server and services...
Nov 21 01:29:27 nas kernel: [15625.867277] nfsd: last server has exited, flushing export cache
Nov 21 01:29:27 nas systemd[1]: nfs-server.service: Deactivated successfully.
Nov 21 01:29:27 nas systemd[1]: Stopped NFS server and services.
Nov 21 01:29:27 nas systemd[1]: Stopping NFSv4 ID-name mapping service...
Nov 21 01:29:27 nas systemd[1]: Stopping NFS Mount Daemon...
Nov 21 01:29:27 nas systemd[1]: Condition check resulted in RPC security service for NFS client and server being skipped.
Nov 21 01:29:27 nas rpc.mountd[52485]: Caught signal 15, un-registering and exiting.
Nov 21 01:29:27 nas rpc.idmapd[52484]: exiting on signal 15
Nov 21 01:29:27 nas systemd[1]: nfs-mountd.service: Deactivated successfully.
Nov 21 01:29:27 nas systemd[1]: Stopped NFS Mount Daemon.
Nov 21 01:29:27 nas systemd[1]: nfs-idmapd.service: Main process exited, code=exited, status=1/FAILURE
Nov 21 01:29:27 nas systemd[1]: nfs-idmapd.service: Failed with result 'exit-code'.
Nov 21 01:29:27 nas systemd[1]: Stopped NFSv4 ID-name mapping service.
Nov 21 01:29:27 nas systemd[1]: Stopped target Local File Systems.
Nov 21 01:29:27 nas systemd[1]: Stopping Local File Systems...
Nov 21 01:29:27 nas systemd[1]: Mounting /mnt/cached...
Nov 21 01:29:27 nas systemd[1]: Condition check resulted in File System Check on Root Device being skipped.
Nov 21 01:29:27 nas systemd[1]: Mounted /mnt/cached.
Nov 21 01:29:27 nas systemd[1]: Reached target Local File Systems.
Nov 21 01:29:27 nas systemd[1]: Starting NFSv4 ID-name mapping service...
Nov 21 01:29:27 nas systemd[1]: Starting NFS Mount Daemon...
Nov 21 01:29:27 nas systemd[1]: Condition check resulted in RPC security service for NFS server being skipped.
Nov 21 01:29:27 nas rpc.idmapd[824798]: Setting log level to 0
Nov 21 01:29:27 nas rpc.mountd[824799]: Version 2.6.1 starting
Nov 21 01:29:27 nas systemd[1]: Started NFS Mount Daemon.
Nov 21 01:29:27 nas systemd[1]: Started NFSv4 ID-name mapping service.
Nov 21 01:29:27 nas systemd[1]: Starting NFS server and services...
Nov 21 01:29:27 nas systemd[1]: Finished NFS server and services.
Nov 21 01:29:27 nas kernel: [15626.059263] NFSD: Using nfsdcld client tracking operations.
Nov 21 01:29:27 nas kernel: [15626.059265] NFSD: starting 90-second grace period (net f0000000)
Nov 21 01:29:30 nas rpc.mountd[824799]: v4.2 client attached: 0xeb9f98a7637b1ac7 from "192.168.1.254:740"
Nov 21 01:29:30 nas kernel: [15628.862756] NFSD: all clients done reclaiming, ending NFSv4 grace period (net f0000000)
Nov 21 01:30:01 nas CRON[824894]: (root) CMD (/home/gfm/unraid-mover.sh)
Nov 21 01:30:03 nas systemd[1]: session-12.scope: Deactivated successfully.
Nov 21 01:30:26 nas CRON[824892]: (CRON) info (No MTA installed, discarding output)
Nov 21 01:31:20 nas pcp-pmie[2701]: High aggregate context switch rate 26519ctxsw/s@nas
Nov 21 01:33:34 nas hd-idle[1316]: sdb spindown
Nov 21 01:34:34 nas hd-idle[1316]: sde spindown
Nov 21 01:35:01 nas CRON[825246]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Nov 21 01:41:20 nas pcp-pmie[2701]: High aggregate context switch rate 23527ctxsw/s@nas
Nov 21 01:45:01 nas CRON[825518]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Nov 21 01:45:01 nas CRON[825519]: (root) CMD (/home/gfm/unraid-mover.sh)
Nov 21 01:45:02 nas CRON[825517]: (CRON) info (No MTA installed, discarding output)
Nov 21 01:45:34 nas hd-idle[1316]: sdb spinup

0 replies

TheLinuxGuy · 2022-11-21T19:29:03Z

Author

Just to rule out an issue with ZFS 2.1.3 (stable on Ubuntu 22.04), I have upgraded it to ZFS 2.1.6. I also upgraded NFS from 2.6.1-1ubuntu1.1 to 2.6.1-1ubuntu1.2.

It's odd that, of my two mergerfs pools, the one that keeps crashing is the one that combines ZFS with the slow-disk mergerfs pool, while the latter never crashes on its own.

But I am not sure this is the root cause: if the primary branch, which is ZFS (/cache), were to fail, I think mergerfs would just send requests to the secondary branch (/mnt/slow-storage), which is the other mergerfs pool and has been stable.

The system hangs continue: INFO: task mergerfs:535 blocked for more than 241 seconds.
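
Hung-task reports like the ones in the dmesg output below can be summarized with a couple of small filters. This is a sketch that assumes the kernel's usual "blocked for more than N seconds" wording and the `[module]` suffix on stack frames; `hung_tasks` and `frames_by_module` are hypothetical helper names.

```shell
# Print "task pid seconds" for each hung-task report read from stdin.
# Task names may themselves contain colons (e.g. kworker/u8:7), so the
# pid is taken as the digits right before " blocked".
hung_tasks() {
    sed -n 's/.*INFO: task \(.*\):\([0-9][0-9]*\) blocked for more than \([0-9][0-9]*\) seconds.*/\1 \2 \3/p'
}

# Count stack-trace lines per kernel module, using the trailing
# "[module]" tag, to see where the blocked tasks are waiting.
frames_by_module() {
    grep -o '\[[a-z0-9_]*]$' | sort | uniq -c | sort -rn
}

# Example:
#   dmesg | hung_tasks
#   dmesg | frames_by_module
```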

dmesg

[  758.884601] NFSD: Using nfsdcld client tracking operations.
[  758.884602] NFSD: starting 90-second grace period (net f0000000)
[  760.367615] NFSD: all clients done reclaiming, ending NFSv4 grace period (net f0000000)
[12204.530990] INFO: task mergerfs:535 blocked for more than 120 seconds.
[12204.531075]       Tainted: P           O      5.15.0-53-generic #59-Ubuntu
[12204.531135] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[12204.531199] task:mergerfs        state:D stack:    0 pid:  535 ppid:     1 flags:0x00000002
[12204.531209] Call Trace:
[12204.531214]  <TASK>
[12204.531221]  __schedule+0x24e/0x590
[12204.531237]  ? qgroup_reserve+0xdd/0x2a0 [btrfs]
[12204.531398]  schedule+0x69/0x110
[12204.531405]  wait_current_trans+0xda/0x140 [btrfs]
[12204.531512]  ? wait_woken+0x70/0x70
[12204.531520]  start_transaction+0x4c5/0x5b0 [btrfs]
[12204.531621]  btrfs_start_transaction_fallback_global_rsv+0x1b/0x30 [btrfs]
[12204.531720]  btrfs_unlink+0x38/0x110 [btrfs]
[12204.531822]  vfs_unlink+0x126/0x290
[12204.531830]  do_unlinkat+0x19b/0x2c0
[12204.531837]  __x64_sys_unlink+0x42/0x70
[12204.531844]  ? syscall_exit_to_user_mode+0x27/0x50
[12204.531852]  do_syscall_64+0x5c/0xc0
[12204.531857]  ? exit_to_user_mode_prepare+0x37/0xb0
[12204.531865]  ? syscall_exit_to_user_mode+0x27/0x50
[12204.531871]  ? __x64_sys_newlstat+0x16/0x20
[12204.531879]  ? do_syscall_64+0x69/0xc0
[12204.531884]  ? syscall_exit_to_user_mode+0x27/0x50
[12204.531890]  ? __x64_sys_fstatfs+0x15/0x20
[12204.531896]  ? do_syscall_64+0x69/0xc0
[12204.531901]  ? do_syscall_64+0x69/0xc0
[12204.531905]  ? do_syscall_64+0x69/0xc0
[12204.531909]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[12204.531917] RIP: 0033:0x7f078239925b
[12204.531928] RSP: 002b:00007f0781780308 EFLAGS: 00000246 ORIG_RAX: 0000000000000057
[12204.531935] RAX: ffffffffffffffda RBX: 00007f07817803a0 RCX: 00007f078239925b
[12204.531939] RDX: 0000000000000067 RSI: 00007f07780542d0 RDI: 00007f077001d470
[12204.531942] RBP: 00007f07780542d0 R08: 000000000000005d R09: 0000000000000000
[12204.531945] R10: 00007f0778002c40 R11: 0000000000000246 R12: 000000000000000a
[12204.531948] R13: 00007f07700c1390 R14: 00007f0781780340 R15: 00007f0781780390
[12204.531956]  </TASK>
[12204.531959] INFO: task mergerfs:536 blocked for more than 120 seconds.
[12204.532015]       Tainted: P           O      5.15.0-53-generic #59-Ubuntu
[12204.532073] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[12204.532136] task:mergerfs        state:D stack:    0 pid:  536 ppid:     1 flags:0x00000002
[12204.532143] Call Trace:
[12204.532145]  <TASK>
[12204.532148]  __schedule+0x24e/0x590
[12204.532154]  ? qgroup_reserve+0xdd/0x2a0 [btrfs]
[12204.532298]  schedule+0x69/0x110
[12204.532304]  wait_current_trans+0xda/0x140 [btrfs]
[12204.532519]  ? wait_woken+0x70/0x70
[12204.532530]  start_transaction+0x4c5/0x5b0 [btrfs]
[12204.532604]  btrfs_start_transaction_fallback_global_rsv+0x1b/0x30 [btrfs]
[12204.532667]  btrfs_unlink+0x38/0x110 [btrfs]
[12204.532733]  vfs_unlink+0x126/0x290
[12204.532740]  elfcorehdr_read+0x40/0x40
[12204.532746]  __x64_sys_unlink+0x42/0x70
[12204.532751]  do_syscall_64+0x5c/0xc0
[12204.532759]  ? syscall_exit_to_user_mode+0x27/0x50
[12204.532765]  ? __x64_sys_statfs+0x16/0x20
[12204.532771]  ? do_syscall_64+0x69/0xc0
[12204.532774]  ? irqentry_exit+0x1d/0x30
[12204.532779]  ? exc_page_fault+0x89/0x170
[12204.532784]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[12204.532792] RIP: 0033:0x7f078239925b
[12204.532803] RSP: 002b:00007f0780e7d308 EFLAGS: 00000246 ORIG_RAX: 0000000000000057
[12204.532809] RAX: ffffffffffffffda RBX: 00007f0780e7d3a0 RCX: 00007f078239925b
[12204.532812] RDX: 0000000000000092 RSI: 00007f07780a2306 RDI: 00007f0778048430
[12204.532814] RBP: 00007f07780a2340 R08: 0000000000000000 R09: 0000000000000000
[12204.532817] R10: 00007f0778002c40 R11: 0000000000000246 R12: 000000000000000a
[12204.532819] R13: 00007f0778017240 R14: 00007f0780e7d340 R15: 00007f0780e7d390
[12204.532825]  </TASK>
[12204.532830] INFO: task mergerfs:537 blocked for more than 120 seconds.
[12204.532878]       Tainted: P           O      5.15.0-53-generic #59-Ubuntu
[12204.532911] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[12204.532947] task:mergerfs        state:D stack:    0 pid:  537 ppid:     1 flags:0x00004002
[12204.532955] Call Trace:
[12204.532958]  <TASK>
[12204.532960]  __schedule+0x24e/0x590
[12204.532967]  schedule+0x69/0x110
[12204.532971]  wait_current_trans+0xda/0x140 [btrfs]
[12204.533035]  ? wait_woken+0x70/0x70
[12204.533041]  start_transaction+0x375/0x5b0 [btrfs]
[12204.533101]  btrfs_join_transaction+0x1d/0x30 [btrfs]
[12204.533160]  btrfs_dirty_inode+0x5b/0xf0 [btrfs]
[12204.533220]  btrfs_update_time+0x88/0xf0 [btrfs]
[12204.533277]  touch_atime+0xc7/0x150
[12204.533284]  filemap_read+0x399/0x3e0
[12204.533289]  ? wait_woken+0x70/0x70
[12204.533294]  ? fuse_dev_read+0x59/0x90
[12204.533301]  ? vfs_read+0x1/0x1a0
[12204.533306]  btrfs_file_read_iter+0x57/0x70 [btrfs]
[12204.533372]  new_sync_read+0x10d/0x190
[12204.533376]  vfs_read+0x103/0x1a0
[12204.533379]  __x64_sys_pread64+0x96/0xc0
[12204.533383]  do_syscall_64+0x5c/0xc0
[12204.533387]  ? syscall_exit_to_user_mode+0x27/0x50
[12204.533392]  ? __x64_sys_writev+0x1c/0x30
[12204.533396]  ? do_syscall_64+0x69/0xc0
[12204.533399]  ? do_syscall_64+0x69/0xc0
[12204.533402]  ? syscall_exit_to_user_mode+0x27/0x50
[12204.533407]  ? __x64_sys_writev+0x1c/0x30
[12204.533410]  ? do_syscall_64+0x69/0xc0
[12204.533413]  ? do_syscall_64+0x69/0xc0
[12204.533416]  ? common_interrupt+0x55/0xa0
[12204.533420]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[12204.533426] RIP: 0033:0x7f078239593f
[12204.533430] RSP: 002b:00007f078057a140 EFLAGS: 00000202 ORIG_RAX: 0000000000000011
[12204.533433] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007f078239593f
[12204.533436] RDX: 0000000000001000 RSI: 00007f07740d5000 RDI: 0000000000000006
[12204.533438] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[12204.533440] R10: 000000007e7ff000 R11: 0000000000000202 R12: 00007f078057a328
[12204.533442] R13: 0000000000000000 R14: 00007f0774002ce8 R15: 0000000000000000
[12204.533446]  </TASK>
[12204.533515] INFO: task kworker/u8:7:31340 blocked for more than 120 seconds.
[12204.533550]       Tainted: P           O      5.15.0-53-generic #59-Ubuntu
[12204.533582] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[12204.533616] task:kworker/u8:7    state:D stack:    0 pid:31340 ppid:     2 flags:0x00004000
[12204.533623] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
[12204.533714] Call Trace:
[12204.533716]  <TASK>
[12204.533717]  __schedule+0x24e/0x590
[12204.533723]  schedule+0x69/0x110
[12204.533727]  wait_current_trans+0xda/0x140 [btrfs]
[12204.533788]  ? wait_woken+0x70/0x70
[12204.533793]  start_transaction+0x375/0x5b0 [btrfs]
[12204.533852]  btrfs_join_transaction+0x1d/0x30 [btrfs]
[12204.533908]  btrfs_finish_ordered_io.isra.0+0x3c7/0x9e0 [btrfs]
[12204.533969]  ? psi_task_switch+0x1eb/0x220
[12204.533974]  finish_ordered_fn+0x15/0x20 [btrfs]
[12204.534030]  btrfs_work_helper+0xd4/0x190 [btrfs]
[12204.534102]  process_one_work+0x22b/0x3d0
[12204.534107]  worker_thread+0x53/0x420
[12204.534110]  ? process_one_work+0x3d0/0x3d0
[12204.534113]  kthread+0x12a/0x150
[12204.534118]  ? set_kthread_struct+0x50/0x50
[12204.534124]  ret_from_fork+0x22/0x30
[12204.534132]  </TASK>
[12204.534134] INFO: task kworker/u8:6:51038 blocked for more than 120 seconds.
[12204.534167]       Tainted: P           O      5.15.0-53-generic #59-Ubuntu
[12204.534198] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[12204.534234] task:kworker/u8:6    state:D stack:    0 pid:51038 ppid:     2 flags:0x00004000
[12204.534238] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
[12204.534307] Call Trace:
[12204.534308]  <TASK>
[12204.534310]  __schedule+0x24e/0x590
[12204.534315]  schedule+0x69/0x110
[12204.534319]  wait_current_trans+0xda/0x140 [btrfs]
[12204.534378]  ? wait_woken+0x70/0x70
[12204.534382]  start_transaction+0x375/0x5b0 [btrfs]
[12204.534442]  btrfs_join_transaction+0x1d/0x30 [btrfs]
[12204.534499]  btrfs_finish_ordered_io.isra.0+0x3c7/0x9e0 [btrfs]
[12204.534559]  ? psi_task_switch+0xc6/0x220
[12204.534563]  finish_ordered_fn+0x15/0x20 [btrfs]
[12204.534619]  btrfs_work_helper+0xd4/0x190 [btrfs]
[12204.534687]  process_one_work+0x22b/0x3d0
[12204.534690]  worker_thread+0x53/0x420
[12204.534693]  ? process_one_work+0x3d0/0x3d0
[12204.534696]  kthread+0x12a/0x150
[12204.534701]  ? set_kthread_struct+0x50/0x50
[12204.534706]  ret_from_fork+0x22/0x30
[12204.534713]  </TASK>
[12204.534714] INFO: task kworker/u8:4:82512 blocked for more than 120 seconds.
[12204.534746]       Tainted: P           O      5.15.0-53-generic #59-Ubuntu
[12204.534777] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[12204.534914] task:kworker/u8:4    state:D stack:    0 pid:82512 ppid:     2 flags:0x00004000
[12204.534920] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
[12204.534985] Call Trace:
[12204.534987]  <TASK>
[12204.534989]  __schedule+0x24e/0x590
[12204.534994]  schedule+0x69/0x110
[12204.534997]  wait_current_trans+0xda/0x140 [btrfs]
[12204.535058]  ? wait_woken+0x70/0x70
[12204.535063]  start_transaction+0x375/0x5b0 [btrfs]
[12204.535121]  btrfs_join_transaction+0x1d/0x30 [btrfs]
[12204.535178]  btrfs_finish_ordered_io.isra.0+0x3c7/0x9e0 [btrfs]
[12204.535237]  ? psi_task_switch+0xc6/0x220
[12204.535242]  finish_ordered_fn+0x15/0x20 [btrfs]
[12204.535298]  btrfs_work_helper+0xd4/0x190 [btrfs]
[12204.535365]  process_one_work+0x22b/0x3d0
[12204.535368]  worker_thread+0x53/0x420
[12204.535371]  ? process_one_work+0x3d0/0x3d0
[12204.535374]  kthread+0x12a/0x150
[12204.535379]  ? set_kthread_struct+0x50/0x50
[12204.535384]  ret_from_fork+0x22/0x30
[12204.535391]  </TASK>
[12204.535396] INFO: task kworker/u8:2:103201 blocked for more than 120 seconds.
[12204.535430]       Tainted: P           O      5.15.0-53-generic #59-Ubuntu
[12204.535462] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[12204.535495] task:kworker/u8:2    state:D stack:    0 pid:103201 ppid:     2 flags:0x00004000
[12204.535500] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
[12204.535568] Call Trace:
[12204.535570]  <TASK>
[12204.535571]  __schedule+0x24e/0x590
[12204.535576]  schedule+0x69/0x110
[12204.535580]  wait_current_trans+0xda/0x140 [btrfs]
[12204.535645]  ? wait_woken+0x70/0x70
[12204.535650]  start_transaction+0x375/0x5b0 [btrfs]
[12204.535712]  btrfs_join_transaction+0x1d/0x30 [btrfs]
[12204.535772]  btrfs_finish_ordered_io.isra.0+0x3c7/0x9e0 [btrfs]
[12204.535835]  ? psi_task_switch+0xc6/0x220
[12204.535840]  finish_ordered_fn+0x15/0x20 [btrfs]
[12204.535900]  btrfs_work_helper+0xd4/0x190 [btrfs]
[12204.535966]  process_one_work+0x22b/0x3d0
[12204.535969]  worker_thread+0x53/0x420
[12204.535972]  ? process_one_work+0x3d0/0x3d0
[12204.535975]  kthread+0x12a/0x150
[12204.535980]  ? set_kthread_struct+0x50/0x50
[12204.535985]  ret_from_fork+0x22/0x30
[12204.535992]  </TASK>
[12204.535994] INFO: task kworker/u8:3:106678 blocked for more than 120 seconds.
[12204.536027]       Tainted: P           O      5.15.0-53-generic #59-Ubuntu
[12204.536058] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[12204.536092] task:kworker/u8:3    state:D stack:    0 pid:106678 ppid:     2 flags:0x00004000
[12204.536096] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
[12204.536160] Call Trace:
[12204.536163]  <TASK>
[12204.536165]  __schedule+0x24e/0x590
[12204.536171]  schedule+0x69/0x110
[12204.536175]  wait_current_trans+0xda/0x140 [btrfs]
[12204.536236]  ? wait_woken+0x70/0x70
[12204.536241]  start_transaction+0x375/0x5b0 [btrfs]
[12204.536320]  btrfs_join_transaction+0x1d/0x30 [btrfs]
[12204.536376]  btrfs_finish_ordered_io.isra.0+0x3c7/0x9e0 [btrfs]
[12204.536431]  ? psi_task_switch+0xc6/0x220
[12204.536436]  finish_ordered_fn+0x15/0x20 [btrfs]
[12204.536489]  btrfs_work_helper+0xd4/0x190 [btrfs]
[12204.536548]  process_one_work+0x22b/0x3d0
[12204.536552]  worker_thread+0x53/0x420
[12204.536554]  ? process_one_work+0x3d0/0x3d0
[12204.536557]  kthread+0x12a/0x150
[12204.536562]  ? set_kthread_struct+0x50/0x50
[12204.536566]  ret_from_fork+0x22/0x30
[12204.536573]  </TASK>
[12204.536575] INFO: task kworker/u8:5:107216 blocked for more than 120 seconds.
[12204.536607]       Tainted: P           O      5.15.0-53-generic #59-Ubuntu
[12204.536637] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[12204.536670] task:kworker/u8:5    state:D stack:    0 pid:107216 ppid:     2 flags:0x00004000
[12204.536674] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
[12204.536730] Call Trace:
[12204.536731]  <TASK>
[12204.536733]  __schedule+0x24e/0x590
[12204.536737]  schedule+0x69/0x110
[12204.536741]  wait_current_trans+0xda/0x140 [btrfs]
[12204.536797]  ? wait_woken+0x70/0x70
[12204.536801]  start_transaction+0x375/0x5b0 [btrfs]
[12204.536855]  btrfs_join_transaction+0x1d/0x30 [btrfs]
[12204.536909]  btrfs_finish_ordered_io.isra.0+0x3c7/0x9e0 [btrfs]
[12204.536965]  ? psi_task_switch+0xc6/0x220
[12204.536969]  finish_ordered_fn+0x15/0x20 [btrfs]
[12204.537023]  btrfs_work_helper+0xd4/0x190 [btrfs]
[12204.537081]  process_one_work+0x22b/0x3d0
[12204.537085]  worker_thread+0x53/0x420
[12204.537087]  ? process_one_work+0x3d0/0x3d0
[12204.537090]  kthread+0x12a/0x150
[12204.537095]  ? set_kthread_struct+0x50/0x50
[12204.537100]  ret_from_fork+0x22/0x30
[12204.537106]  </TASK>
[12325.362136] INFO: task mergerfs:535 blocked for more than 241 seconds.
[12325.362186]       Tainted: P           O      5.15.0-53-generic #59-Ubuntu
[12325.362217] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[12325.362248] task:mergerfs        state:D stack:    0 pid:  535 ppid:     1 flags:0x00000002
[12325.362256] Call Trace:
[12325.362259]  <TASK>
[12325.362264]  __schedule+0x24e/0x590
[12325.362274]  ? qgroup_reserve+0xdd/0x2a0 [btrfs]
[12325.362360]  schedule+0x69/0x110
[12325.362364]  wait_current_trans+0xda/0x140 [btrfs]
[12325.362412]  ? wait_woken+0x70/0x70
[12325.362417]  start_transaction+0x4c5/0x5b0 [btrfs]
[12325.362462]  btrfs_start_transaction_fallback_global_rsv+0x1b/0x30 [btrfs]
[12325.362503]  btrfs_unlink+0x38/0x110 [btrfs]
[12325.362547]  vfs_unlink+0x126/0x290
[12325.362551]  do_unlinkat+0x19b/0x2c0
[12325.362556]  __x64_sys_unlink+0x42/0x70
[12325.362559]  ? syscall_exit_to_user_mode+0x27/0x50
[12325.362564]  do_syscall_64+0x5c/0xc0
[12325.362567]  ? exit_to_user_mode_prepare+0x37/0xb0
[12325.362571]  ? syscall_exit_to_user_mode+0x27/0x50
[12325.362574]  ? __x64_sys_newlstat+0x16/0x20
[12325.362579]  ? do_syscall_64+0x69/0xc0
[12325.362581]  ? syscall_exit_to_user_mode+0x27/0x50
[12325.362585]  ? __x64_sys_fstatfs+0x15/0x20
[12325.362589]  ? do_syscall_64+0x69/0xc0
[12325.362591]  ? do_syscall_64+0x69/0xc0
[12325.362594]  ? do_syscall_64+0x69/0xc0
[12325.362596]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[12325.362601] RIP: 0033:0x7f078239925b
[12325.362612] RSP: 002b:00007f0781780308 EFLAGS: 00000246 ORIG_RAX: 0000000000000057
[12325.362617] RAX: ffffffffffffffda RBX: 00007f07817803a0 RCX: 00007f078239925b
[12325.362619] RDX: 0000000000000067 RSI: 00007f07780542d0 RDI: 00007f077001d470
[12325.362621] RBP: 00007f07780542d0 R08: 000000000000005d R09: 0000000000000000
[12325.362622] R10: 00007f0778002c40 R11: 0000000000000246 R12: 000000000000000a
[12325.362624] R13: 00007f07700c1390 R14: 00007f0781780340 R15: 00007f0781780390
[12325.362629]  </TASK>

stack

root@nas:/home/gfm# dpkg -l | grep fuse
rc  fuse                                       2.9.9-3                                 amd64        Filesystem in Userspace
ii  fuse3                                      3.10.5-1build1                          amd64        Filesystem in Userspace (3.x version)
ii  libfuse3-3:amd64                           3.10.5-1build1                          amd64        Filesystem in Userspace (library) (3.x version)
root@nas:/home/gfm#

These crashes seem to occur more frequently when the NFS server is configured to operate in NFSv4-only mode. Here is the setup I used to force v4.2: https://github.com/TheLinuxGuy/free-unraid/blob/main/miscelaneous_sysadmin.md#nfs

2 replies

trapexit Nov 23, 2022
Maintainer

INFO: task mergerfs:535 blocked for more than 241 seconds.

This isn't mergerfs' fault. As the message says mergerfs itself is blocked. Meaning the filesystem/device is holding it up. Or the kernel. Likely means bad hardware.

And mergerfs doesn't use system libfuse so none of those libraries matter.
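
For readers following along: the "state:D" in those traces is uninterruptible sleep, meaning the task is waiting inside the kernel (usually on I/O) and cannot be signalled. A minimal sketch for spotting such tasks on a live system, assuming procps ps is installed; the function names filter_dstate and list_dstate are made up for illustration:

```shell
# filter_dstate: read "ID STAT COMMAND" rows on stdin and print only the rows
# whose state field starts with D (uninterruptible sleep).
# The name is illustrative and not part of any tool.
filter_dstate() {
    awk '$2 ~ /^D/ { print }'
}

# list_dstate: feed the filter with one row per thread from procps ps
# (assumption: procps ps, which provides -L and the tid/stat/comm columns).
# The trailing "=" suppresses the column headers.
list_dstate() {
    ps -eL -o tid= -o stat= -o comm= | filter_dstate
}
```

Calling list_dstate while the mount is hung should show whether mergerfs or nfsd threads are the ones stuck; an empty result means nothing is in uninterruptible sleep at that moment.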

TheLinuxGuy Nov 23, 2022
Author

INFO: task mergerfs:535 blocked for more than 241 seconds.

This isn't mergerfs' fault. As the message says mergerfs itself is blocked. Meaning the filesystem/device is holding it up. Or the kernel. Likely means bad hardware.

And mergerfs doesn't use system libfuse so none of those libraries matter.

Thanks for help clarifying what that meant.

I have a theory: ZFS + (non-ZFS-native) NFS makes the ZFS mountpoint unresponsive or stuck, perhaps in the middle of mergerfs writes or file moves. This leads to the nfsd kernel messages, and the filesystem doesn't recover unless it is force-unmounted and the mount unit is restarted via systemctl.

I have upgraded all the software (kernel, NFS, ZFS), so I don't think this instability has been fixed anywhere yet, and I think it's a bug somewhere outside mergerfs.

To see if I can get my system's NFS mounts to be stable - I destroyed my ZFS pool and converted it to mdadm XFS RAID1:

mdadm --create --verbose /dev/md0 --bitmap=none --level=mirror --raid-devices=2 /dev/nvme0n1 /dev/sdb

I'll start running the heavy I/O workloads via NFS from multiple clients and see if I can make the crash behavior occur - if things become stable then hooray for me. The bug is likely in NFS or the kernel... There must be a reason why ZFS has its own built-in NFS server functionality - I think I have been using the standalone/generic NFS server in my configuration, since the ZFS mounts are not set up as ZFS-managed NFS exports (that wouldn't be compatible with mergerfs).
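
On the original question of monitoring and self-recovery, the manual steps from the first post (force unmount, remount, restart NFS) can be wrapped in a small watchdog. This is only a sketch under assumptions: the variable names MOUNTPOINT, NFS_UNIT and RUN and all function names are invented here, and the health probe is a plain ls with a timeout.

```shell
# All names here (MOUNTPOINT, NFS_UNIT, RUN, the functions) are illustrative
# assumptions; the commands mirror the manual recovery shown in this thread.
MOUNTPOINT=${MOUNTPOINT:-/mnt/cached}
NFS_UNIT=${NFS_UNIT:-nfs-kernel-server}
RUN=${RUN-echo}   # default is a dry run that only prints; start with RUN= to execute

# mount_is_healthy DIR: succeed if DIR can be listed within 10 seconds.
mount_is_healthy() {
    timeout 10 ls "$1" >/dev/null 2>&1
}

# recover_mount DIR: lazy force-unmount, remount from /etc/fstab, restart NFS.
recover_mount() {
    $RUN umount -f -l "$1"
    $RUN mount "$1" && $RUN systemctl restart "$NFS_UNIT"
}

# check_and_recover DIR: one watchdog pass, meant for cron or a systemd timer.
check_and_recover() {
    if mount_is_healthy "$1"; then
        return 0
    fi
    echo "mount $1 is not responding, attempting recovery" >&2
    recover_mount "$1"
}
```

A single pass would be check_and_recover "$MOUNTPOINT". Two caveats: an unmounted but existing directory still lists fine, so this probe only catches the "Input/output error" case shown earlier, and if the probe itself ends up in uninterruptible sleep, timeout may not be able to reap it, so the watchdog can hang along with the mount.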

TheLinuxGuy · 2022-11-22T17:42:58Z

TheLinuxGuy
Nov 22, 2022
Author

I'm going to try strace -fvTtt -s 256 -p PID -o /tmp/mergerfs.strace.txt and see where that gets me.

0 replies

trapexit · 2022-11-23T02:36:16Z

trapexit
Nov 23, 2022
Maintainer

If it's crashing then strace is pretty useless. Need a stack trace from gdb.

gdb path/to/mergerfs

run -f -o options branches mountpoint

when it crashes

thread apply all bt
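
If sitting at an interactive prompt is inconvenient, the same steps can be run non-interactively. A minimal sketch, assuming gdb is installed; run_under_gdb and the GDB override variable are hypothetical names, while -batch, -ex and --args are standard gdb options:

```shell
# run_under_gdb PROGRAM [ARGS...]: start PROGRAM under gdb in batch mode.
# gdb executes "run"; when the program stops (for example on SIGSEGV) it
# executes "thread apply all bt" and then exits.
# GDB is an illustrative override so the debugger binary can be swapped.
run_under_gdb() {
    ${GDB:-gdb} -batch -ex run -ex 'thread apply all bt' --args "$@"
}
```

Usage would follow the command above, for example run_under_gdb path/to/mergerfs -f -o options branches mountpoint, with the output redirected to a file. If the program exits normally, the backtrace step simply reports that there is no stack.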

7 replies

TheLinuxGuy Nov 26, 2022
Author

If it crashes gdb will catch it at the failure and you just need to run

Got it - sorry, I have never used gdb to debug before.

When the /mnt/cached mount stops listing files, the mergerfs PID continues to run, as if it's hung. For me to be able to run thread apply all bt, the "(gdb)" command prompt needs to be visible in the session, I presume? I did a quick run of the example command and noticed that "(gdb)" disappears shortly after the "run" command, once the process is spawned by gdb.

In cases where the mergerfs process doesn't die, would thread apply all bt still work? Maybe I missed a trick somewhere to make the "(gdb)" command prompt appear again.

trapexit Nov 26, 2022
Maintainer

You're overthinking this. I've given you exactly what is needed and when.

A debugger is an interactive thing. If you run the app then the interactive part stops because it is running. Like I mentioned, when it crashes gdb will catch that and bring you back to its prompt, where you will input the command I mentioned to get the trace. There is no reason for you to break execution till that point because it's running normally.

trapexit Nov 26, 2022
Maintainer

If it's not dying then we're barking up the wrong tree and I'm not really sure what is going on here without more details or something reproducible.

TheLinuxGuy Nov 26, 2022
Author

If it's not dying then we're barking up the wrong tree and I'm not really sure what is going on here without more details or something reproducible.

Ok maybe I am missing something here; can you help me understand why this may not be a mergerfs binary issue? The mount that becomes inaccessible is created on the system by the mergerfs process.

The mergerfs PID for that mount is still reported by ps aux when the mount no longer works, and ls commands on the fuse.mergerfs mount path report access errors. The issue is very similar to this other one: #1004 (comment)

trapexit Nov 26, 2022
Maintainer

#1004 is mergerfs segfaulting (crashing). It is segfaulting because the kernel was asking a question it isn't supposed to ask so mergerfs had not been designed to respond.

If the process isn't exiting then it's not "crashing" in the normal sense and I don't know what's going on.

Unless you can describe to me a minimal example of the problem the best I can do is make vague suggestions based on common things.

TheLinuxGuy · 2022-11-23T09:01:48Z

TheLinuxGuy
Nov 23, 2022
Author

Even with XFS + mdadm the crashes / instability on NFS continue. Not sure if this says anything new or different from the past one but here it goes.

# grep 'Nov 23 03:55:48' -A 400 /var/log/syslog
Nov 23 03:55:48 nas pcp-pmie[2680]: Severe demand for real memory 8.5pgsout/s@nas
Nov 23 03:56:06 nas kernel: [ 5921.537775] INFO: task mergerfs:566 blocked for more than 120 seconds.
Nov 23 03:56:06 nas kernel: [ 5921.537833]       Tainted: P           O      5.15.0-53-generic #59-Ubuntu
Nov 23 03:56:06 nas kernel: [ 5921.537840] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 23 03:56:06 nas kernel: [ 5921.537847] task:mergerfs        state:D stack:    0 pid:  566 ppid:     1 flags:0x00004002
Nov 23 03:56:06 nas kernel: [ 5921.537850] Call Trace:
Nov 23 03:56:06 nas kernel: [ 5921.537850]  <TASK>
Nov 23 03:56:06 nas kernel: [ 5921.537853]  __schedule+0x24e/0x590
Nov 23 03:56:06 nas kernel: [ 5921.537856]  schedule+0x69/0x110
Nov 23 03:56:06 nas kernel: [ 5921.537857]  btrfs_start_ordered_extent+0xe2/0x120 [btrfs]
Nov 23 03:56:06 nas kernel: [ 5921.537880]  ? wait_woken+0x70/0x70
Nov 23 03:56:06 nas kernel: [ 5921.537882]  lock_and_cleanup_extent_if_need+0x198/0x1b0 [btrfs]
Nov 23 03:56:06 nas kernel: [ 5921.537897]  btrfs_buffered_write+0x29f/0x820 [btrfs]
Nov 23 03:56:06 nas kernel: [ 5921.537910]  ? pre_handler_kretprobe+0xaa/0x180
Nov 23 03:56:06 nas kernel: [ 5921.537912]  btrfs_file_write_iter+0x76/0x130 [btrfs]
Nov 23 03:56:06 nas kernel: [ 5921.537925]  new_sync_write+0x114/0x1a0
Nov 23 03:56:06 nas kernel: [ 5921.537927]  vfs_write+0x1d5/0x270
Nov 23 03:56:06 nas kernel: [ 5921.537927]  elfcorehdr_read+0x40/0x40
Nov 23 03:56:06 nas kernel: [ 5921.537928]  do_syscall_64+0x5c/0xc0
Nov 23 03:56:06 nas kernel: [ 5921.537930]  ? do_syscall_64+0x69/0xc0
Nov 23 03:56:06 nas kernel: [ 5921.537931]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
Nov 23 03:56:06 nas kernel: [ 5921.537932] RIP: 0033:0x7fba874759ef
Nov 23 03:56:06 nas kernel: [ 5921.537935] RSP: 002b:00007fba8565a230 EFLAGS: 00000202 ORIG_RAX: 0000000000000012
Nov 23 03:56:06 nas kernel: [ 5921.537936] RAX: ffffffffffffffda RBX: 0000000000000060 RCX: 00007fba874759ef
Nov 23 03:56:06 nas kernel: [ 5921.537937] RDX: 0000000000000060 RSI: 00007fba85664060 RDI: 000000000000000b
Nov 23 03:56:06 nas kernel: [ 5921.537937] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000006
Nov 23 03:56:06 nas kernel: [ 5921.537938] R10: 000000008f8fffa0 R11: 0000000000000202 R12: 00007fba8565a548
Nov 23 03:56:06 nas kernel: [ 5921.537938] R13: 0000000000000000 R14: 00007fba8565a368 R15: 0000000000000000
Nov 23 03:56:06 nas kernel: [ 5921.537939]  </TASK>
Nov 23 03:56:06 nas kernel: [ 5921.537963] INFO: task nfsd:11223 blocked for more than 120 seconds.
Nov 23 03:56:06 nas kernel: [ 5921.537970]       Tainted: P           O      5.15.0-53-generic #59-Ubuntu
Nov 23 03:56:06 nas kernel: [ 5921.537977] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 23 03:56:06 nas kernel: [ 5921.537984] task:nfsd            state:D stack:    0 pid:11223 ppid:     2 flags:0x00004000
Nov 23 03:56:06 nas kernel: [ 5921.537985] Call Trace:
Nov 23 03:56:06 nas kernel: [ 5921.537986]  <TASK>
Nov 23 03:56:06 nas kernel: [ 5921.537986]  __schedule+0x24e/0x590
Nov 23 03:56:06 nas kernel: [ 5921.537988]  schedule+0x69/0x110
Nov 23 03:56:06 nas kernel: [ 5921.537988]  rwsem_down_write_slowpath+0x23b/0x3f0
Nov 23 03:56:06 nas kernel: [ 5921.537991]  ? exp_get_by_name.part.0+0xa1/0x120 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538001]  down_write+0x47/0x60
Nov 23 03:56:06 nas kernel: [ 5921.538002]  fuse_cache_write_iter+0x96/0x320
Nov 23 03:56:06 nas kernel: [ 5921.538004]  ? find_acceptable_alias+0x2e/0x110
Nov 23 03:56:06 nas kernel: [ 5921.538005]  fuse_file_write_iter+0x61/0x160
Nov 23 03:56:06 nas kernel: [ 5921.538007]  ? aa_file_perm+0x102/0x250
Nov 23 03:56:06 nas kernel: [ 5921.538008]  do_iter_readv_writev+0x14a/0x1b0
Nov 23 03:56:06 nas kernel: [ 5921.538009]  do_iter_write+0x8c/0x160
Nov 23 03:56:06 nas kernel: [ 5921.538010]  vfs_iter_write+0x19/0x30
Nov 23 03:56:06 nas kernel: [ 5921.538011]  nfsd_vfs_write+0x2d5/0x610 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538018]  nfsd4_write+0x130/0x1b0 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538027]  nfsd4_proc_compound+0x43c/0x750 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538034]  ? nfsd_cache_lookup+0x3b7/0x4a0 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538042]  nfsd_dispatch+0x163/0x260 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538049]  svc_process_common+0x3da/0x720 [sunrpc]
Nov 23 03:56:06 nas kernel: [ 5921.538066]  ? nfsd_svc+0x190/0x190 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538072]  svc_process+0xbc/0x100 [sunrpc]
Nov 23 03:56:06 nas kernel: [ 5921.538083]  nfsd+0xed/0x150 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538090]  ? nfsd_shutdown_threads+0x90/0x90 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538096]  kthread+0x12a/0x150
Nov 23 03:56:06 nas kernel: [ 5921.538098]  ? set_kthread_struct+0x50/0x50
Nov 23 03:56:06 nas kernel: [ 5921.538099]  ret_from_fork+0x22/0x30
Nov 23 03:56:06 nas kernel: [ 5921.538101]  </TASK>
Nov 23 03:56:06 nas kernel: [ 5921.538101] INFO: task nfsd:11224 blocked for more than 120 seconds.
Nov 23 03:56:06 nas kernel: [ 5921.538108]       Tainted: P           O      5.15.0-53-generic #59-Ubuntu
Nov 23 03:56:06 nas kernel: [ 5921.538114] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 23 03:56:06 nas kernel: [ 5921.538122] task:nfsd            state:D stack:    0 pid:11224 ppid:     2 flags:0x00004000
Nov 23 03:56:06 nas kernel: [ 5921.538123] Call Trace:
Nov 23 03:56:06 nas kernel: [ 5921.538123]  <TASK>
Nov 23 03:56:06 nas kernel: [ 5921.538124]  __schedule+0x24e/0x590
Nov 23 03:56:06 nas kernel: [ 5921.538125]  schedule+0x69/0x110
Nov 23 03:56:06 nas kernel: [ 5921.538126]  rwsem_down_write_slowpath+0x23b/0x3f0
Nov 23 03:56:06 nas kernel: [ 5921.538127]  ? exp_get_by_name.part.0+0xa1/0x120 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538134]  down_write+0x47/0x60
Nov 23 03:56:06 nas kernel: [ 5921.538135]  fuse_cache_write_iter+0x96/0x320
Nov 23 03:56:06 nas kernel: [ 5921.538136]  ? find_acceptable_alias+0x2e/0x110
Nov 23 03:56:06 nas kernel: [ 5921.538137]  fuse_file_write_iter+0x61/0x160
Nov 23 03:56:06 nas kernel: [ 5921.538138]  ? aa_file_perm+0x102/0x250
Nov 23 03:56:06 nas kernel: [ 5921.538140]  do_iter_readv_writev+0x14a/0x1b0
Nov 23 03:56:06 nas kernel: [ 5921.538140]  do_iter_write+0x8c/0x160
Nov 23 03:56:06 nas kernel: [ 5921.538141]  vfs_iter_write+0x19/0x30
Nov 23 03:56:06 nas kernel: [ 5921.538142]  nfsd_vfs_write+0x2d5/0x610 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538149]  nfsd4_write+0x130/0x1b0 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538156]  nfsd4_proc_compound+0x43c/0x750 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538162]  ? nfsd_cache_lookup+0x3b7/0x4a0 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538169]  nfsd_dispatch+0x163/0x260 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538176]  svc_process_common+0x3da/0x720 [sunrpc]
Nov 23 03:56:06 nas kernel: [ 5921.538187]  ? nfsd_svc+0x190/0x190 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538194]  svc_process+0xbc/0x100 [sunrpc]
Nov 23 03:56:06 nas kernel: [ 5921.538204]  nfsd+0xed/0x150 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538210]  ? nfsd_shutdown_threads+0x90/0x90 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538217]  kthread+0x12a/0x150
Nov 23 03:56:06 nas kernel: [ 5921.538218]  ? set_kthread_struct+0x50/0x50
Nov 23 03:56:06 nas kernel: [ 5921.538219]  ret_from_fork+0x22/0x30
Nov 23 03:56:06 nas kernel: [ 5921.538220]  </TASK>
Nov 23 03:56:06 nas kernel: [ 5921.538221] INFO: task nfsd:11225 blocked for more than 120 seconds.
Nov 23 03:56:06 nas kernel: [ 5921.538227]       Tainted: P           O      5.15.0-53-generic #59-Ubuntu
Nov 23 03:56:06 nas kernel: [ 5921.538234] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 23 03:56:06 nas kernel: [ 5921.538241] task:nfsd            state:D stack:    0 pid:11225 ppid:     2 flags:0x00004000
Nov 23 03:56:06 nas kernel: [ 5921.538242] Call Trace:
Nov 23 03:56:06 nas kernel: [ 5921.538242]  <TASK>
Nov 23 03:56:06 nas kernel: [ 5921.538242]  __schedule+0x24e/0x590
Nov 23 03:56:06 nas kernel: [ 5921.538243]  schedule+0x69/0x110
Nov 23 03:56:06 nas kernel: [ 5921.538244]  rwsem_down_write_slowpath+0x23b/0x3f0
Nov 23 03:56:06 nas kernel: [ 5921.538245]  ? exp_get_by_name.part.0+0xa1/0x120 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538252]  down_write+0x47/0x60
Nov 23 03:56:06 nas kernel: [ 5921.538253]  fuse_cache_write_iter+0x96/0x320
Nov 23 03:56:06 nas kernel: [ 5921.538254]  ? find_acceptable_alias+0x2e/0x110
Nov 23 03:56:06 nas kernel: [ 5921.538255]  fuse_file_write_iter+0x61/0x160
Nov 23 03:56:06 nas kernel: [ 5921.538256]  ? aa_file_perm+0x102/0x250
Nov 23 03:56:06 nas kernel: [ 5921.538258]  do_iter_readv_writev+0x14a/0x1b0
Nov 23 03:56:06 nas kernel: [ 5921.538258]  do_iter_write+0x8c/0x160
Nov 23 03:56:06 nas kernel: [ 5921.538259]  vfs_iter_write+0x19/0x30
Nov 23 03:56:06 nas kernel: [ 5921.538260]  nfsd_vfs_write+0x2d5/0x610 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538267]  nfsd4_write+0x130/0x1b0 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538274]  nfsd4_proc_compound+0x43c/0x750 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538281]  ? nfsd_cache_lookup+0x3b7/0x4a0 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538288]  nfsd_dispatch+0x163/0x260 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538294]  svc_process_common+0x3da/0x720 [sunrpc]
Nov 23 03:56:06 nas kernel: [ 5921.538304]  ? nfsd_svc+0x190/0x190 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538311]  svc_process+0xbc/0x100 [sunrpc]
Nov 23 03:56:06 nas kernel: [ 5921.538321]  nfsd+0xed/0x150 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538327]  ? nfsd_shutdown_threads+0x90/0x90 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538333]  kthread+0x12a/0x150
Nov 23 03:56:06 nas kernel: [ 5921.538334]  ? set_kthread_struct+0x50/0x50
Nov 23 03:56:06 nas kernel: [ 5921.538335]  ret_from_fork+0x22/0x30
Nov 23 03:56:06 nas kernel: [ 5921.538337]  </TASK>
Nov 23 03:56:06 nas kernel: [ 5921.538337] INFO: task nfsd:11226 blocked for more than 120 seconds.
Nov 23 03:56:06 nas kernel: [ 5921.538343]       Tainted: P           O      5.15.0-53-generic #59-Ubuntu
Nov 23 03:56:06 nas kernel: [ 5921.538350] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 23 03:56:06 nas kernel: [ 5921.538357] task:nfsd            state:D stack:    0 pid:11226 ppid:     2 flags:0x00004000
Nov 23 03:56:06 nas kernel: [ 5921.538358] Call Trace:
Nov 23 03:56:06 nas kernel: [ 5921.538358]  <TASK>
Nov 23 03:56:06 nas kernel: [ 5921.538359]  __schedule+0x24e/0x590
Nov 23 03:56:06 nas kernel: [ 5921.538360]  schedule+0x69/0x110
Nov 23 03:56:06 nas kernel: [ 5921.538361]  rwsem_down_write_slowpath+0x23b/0x3f0
Nov 23 03:56:06 nas kernel: [ 5921.538362]  ? exp_get_by_name.part.0+0xa1/0x120 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538369]  down_write+0x47/0x60
Nov 23 03:56:06 nas kernel: [ 5921.538370]  fuse_cache_write_iter+0x96/0x320
Nov 23 03:56:06 nas kernel: [ 5921.538371]  ? find_acceptable_alias+0x2e/0x110
Nov 23 03:56:06 nas kernel: [ 5921.538372]  fuse_file_write_iter+0x61/0x160
Nov 23 03:56:06 nas kernel: [ 5921.538373]  ? aa_file_perm+0x102/0x250
Nov 23 03:56:06 nas kernel: [ 5921.538374]  do_iter_readv_writev+0x14a/0x1b0
Nov 23 03:56:06 nas kernel: [ 5921.538375]  do_iter_write+0x8c/0x160
Nov 23 03:56:06 nas kernel: [ 5921.538376]  vfs_iter_write+0x19/0x30
Nov 23 03:56:06 nas kernel: [ 5921.538376]  nfsd_vfs_write+0x2d5/0x610 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538383]  nfsd4_write+0x130/0x1b0 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538390]  nfsd4_proc_compound+0x43c/0x750 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538396]  ? nfsd_cache_lookup+0x3b7/0x4a0 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538403]  nfsd_dispatch+0x163/0x260 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538409]  svc_process_common+0x3da/0x720 [sunrpc]
Nov 23 03:56:06 nas kernel: [ 5921.538419]  ? nfsd_svc+0x190/0x190 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538425]  svc_process+0xbc/0x100 [sunrpc]
Nov 23 03:56:06 nas kernel: [ 5921.538436]  nfsd+0xed/0x150 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538442]  ? nfsd_shutdown_threads+0x90/0x90 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538449]  kthread+0x12a/0x150
Nov 23 03:56:06 nas kernel: [ 5921.538450]  ? set_kthread_struct+0x50/0x50
Nov 23 03:56:06 nas kernel: [ 5921.538451]  ret_from_fork+0x22/0x30
Nov 23 03:56:06 nas kernel: [ 5921.538452]  </TASK>
Nov 23 03:56:06 nas kernel: [ 5921.538452] INFO: task nfsd:11227 blocked for more than 120 seconds.
Nov 23 03:56:06 nas kernel: [ 5921.538459]       Tainted: P           O      5.15.0-53-generic #59-Ubuntu
Nov 23 03:56:06 nas kernel: [ 5921.538465] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 23 03:56:06 nas kernel: [ 5921.538473] task:nfsd            state:D stack:    0 pid:11227 ppid:     2 flags:0x00004000
Nov 23 03:56:06 nas kernel: [ 5921.538473] Call Trace:
Nov 23 03:56:06 nas kernel: [ 5921.538474]  <TASK>
Nov 23 03:56:06 nas kernel: [ 5921.538474]  __schedule+0x24e/0x590
Nov 23 03:56:06 nas kernel: [ 5921.538475]  schedule+0x69/0x110
Nov 23 03:56:06 nas kernel: [ 5921.538476]  rwsem_down_write_slowpath+0x23b/0x3f0
Nov 23 03:56:06 nas kernel: [ 5921.538477]  ? exp_get_by_name.part.0+0xa1/0x120 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538483]  down_write+0x47/0x60
Nov 23 03:56:06 nas kernel: [ 5921.538484]  fuse_cache_write_iter+0x96/0x320
Nov 23 03:56:06 nas kernel: [ 5921.538486]  ? find_acceptable_alias+0x2e/0x110
Nov 23 03:56:06 nas kernel: [ 5921.538486]  fuse_file_write_iter+0x61/0x160
Nov 23 03:56:06 nas kernel: [ 5921.538488]  ? aa_file_perm+0x102/0x250
Nov 23 03:56:06 nas kernel: [ 5921.538489]  do_iter_readv_writev+0x14a/0x1b0
Nov 23 03:56:06 nas kernel: [ 5921.538489]  do_iter_write+0x8c/0x160
Nov 23 03:56:06 nas kernel: [ 5921.538490]  vfs_iter_write+0x19/0x30
Nov 23 03:56:06 nas kernel: [ 5921.538491]  nfsd_vfs_write+0x2d5/0x610 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538497]  nfsd4_write+0x130/0x1b0 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538504]  nfsd4_proc_compound+0x43c/0x750 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538510]  ? nfsd_cache_lookup+0x3b7/0x4a0 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538517]  nfsd_dispatch+0x163/0x260 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538524]  svc_process_common+0x3da/0x720 [sunrpc]
Nov 23 03:56:06 nas kernel: [ 5921.538535]  ? nfsd_svc+0x190/0x190 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538541]  svc_process+0xbc/0x100 [sunrpc]
Nov 23 03:56:06 nas kernel: [ 5921.538551]  nfsd+0xed/0x150 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538557]  ? nfsd_shutdown_threads+0x90/0x90 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538564]  kthread+0x12a/0x150
Nov 23 03:56:06 nas kernel: [ 5921.538565]  ? set_kthread_struct+0x50/0x50
Nov 23 03:56:06 nas kernel: [ 5921.538566]  ret_from_fork+0x22/0x30
Nov 23 03:56:06 nas kernel: [ 5921.538567]  </TASK>
Nov 23 03:56:06 nas kernel: [ 5921.538568] INFO: task nfsd:11229 blocked for more than 120 seconds.
Nov 23 03:56:06 nas kernel: [ 5921.538574]       Tainted: P           O      5.15.0-53-generic #59-Ubuntu
Nov 23 03:56:06 nas kernel: [ 5921.538581] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 23 03:56:06 nas kernel: [ 5921.538589] task:nfsd            state:D stack:    0 pid:11229 ppid:     2 flags:0x00004000
Nov 23 03:56:06 nas kernel: [ 5921.538589] Call Trace:
Nov 23 03:56:06 nas kernel: [ 5921.538590]  <TASK>
Nov 23 03:56:06 nas kernel: [ 5921.538590]  __schedule+0x24e/0x590
Nov 23 03:56:06 nas kernel: [ 5921.538591]  schedule+0x69/0x110
Nov 23 03:56:06 nas kernel: [ 5921.538592]  rwsem_down_write_slowpath+0x23b/0x3f0
Nov 23 03:56:06 nas kernel: [ 5921.538593]  ? exp_get_by_name.part.0+0xa1/0x120 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538600]  down_write+0x47/0x60
Nov 23 03:56:06 nas kernel: [ 5921.538601]  fuse_cache_write_iter+0x96/0x320
Nov 23 03:56:06 nas kernel: [ 5921.538602]  ? find_acceptable_alias+0x2e/0x110
Nov 23 03:56:06 nas kernel: [ 5921.538603]  fuse_file_write_iter+0x61/0x160
Nov 23 03:56:06 nas kernel: [ 5921.538604]  ? aa_file_perm+0x102/0x250
Nov 23 03:56:06 nas kernel: [ 5921.538605]  do_iter_readv_writev+0x14a/0x1b0
Nov 23 03:56:06 nas kernel: [ 5921.538606]  do_iter_write+0x8c/0x160
Nov 23 03:56:06 nas kernel: [ 5921.538607]  vfs_iter_write+0x19/0x30
Nov 23 03:56:06 nas kernel: [ 5921.538607]  nfsd_vfs_write+0x2d5/0x610 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538614]  nfsd4_write+0x130/0x1b0 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538621]  nfsd4_proc_compound+0x43c/0x750 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538627]  ? nfsd_cache_lookup+0x3b7/0x4a0 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538633]  nfsd_dispatch+0x163/0x260 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538640]  svc_process_common+0x3da/0x720 [sunrpc]
Nov 23 03:56:06 nas kernel: [ 5921.538650]  ? nfsd_svc+0x190/0x190 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538656]  svc_process+0xbc/0x100 [sunrpc]
Nov 23 03:56:06 nas kernel: [ 5921.538665]  nfsd+0xed/0x150 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538671]  ? nfsd_shutdown_threads+0x90/0x90 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538677]  kthread+0x12a/0x150
Nov 23 03:56:06 nas kernel: [ 5921.538678]  ? set_kthread_struct+0x50/0x50
Nov 23 03:56:06 nas kernel: [ 5921.538680]  ret_from_fork+0x22/0x30
Nov 23 03:56:06 nas kernel: [ 5921.538681]  </TASK>
Nov 23 03:56:06 nas kernel: [ 5921.538681] INFO: task nfsd:11230 blocked for more than 120 seconds.
Nov 23 03:56:06 nas kernel: [ 5921.538688]       Tainted: P           O      5.15.0-53-generic #59-Ubuntu
Nov 23 03:56:06 nas kernel: [ 5921.538694] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 23 03:56:06 nas kernel: [ 5921.538701] task:nfsd            state:D stack:    0 pid:11230 ppid:     2 flags:0x00004000
Nov 23 03:56:06 nas kernel: [ 5921.538702] Call Trace:
Nov 23 03:56:06 nas kernel: [ 5921.538702]  <TASK>
Nov 23 03:56:06 nas kernel: [ 5921.538703]  __schedule+0x24e/0x590
Nov 23 03:56:06 nas kernel: [ 5921.538704]  schedule+0x69/0x110
Nov 23 03:56:06 nas kernel: [ 5921.538704]  rwsem_down_write_slowpath+0x23b/0x3f0
Nov 23 03:56:06 nas kernel: [ 5921.538706]  ? exp_get_by_name.part.0+0xa1/0x120 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538712]  down_write+0x47/0x60
Nov 23 03:56:06 nas kernel: [ 5921.538713]  fuse_cache_write_iter+0x96/0x320
Nov 23 03:56:06 nas kernel: [ 5921.538714]  ? find_acceptable_alias+0x2e/0x110
Nov 23 03:56:06 nas kernel: [ 5921.538715]  fuse_file_write_iter+0x61/0x160
Nov 23 03:56:06 nas kernel: [ 5921.538716]  ? aa_file_perm+0x102/0x250
Nov 23 03:56:06 nas kernel: [ 5921.538718]  do_iter_readv_writev+0x14a/0x1b0
Nov 23 03:56:06 nas kernel: [ 5921.538718]  do_iter_write+0x8c/0x160
Nov 23 03:56:06 nas kernel: [ 5921.538719]  vfs_iter_write+0x19/0x30
Nov 23 03:56:06 nas kernel: [ 5921.538720]  nfsd_vfs_write+0x2d5/0x610 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538726]  nfsd4_write+0x130/0x1b0 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538733]  nfsd4_proc_compound+0x43c/0x750 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538739]  ? nfsd_cache_lookup+0x3b7/0x4a0 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538746]  nfsd_dispatch+0x163/0x260 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538752]  svc_process_common+0x3da/0x720 [sunrpc]
Nov 23 03:56:06 nas kernel: [ 5921.538762]  ? nfsd_svc+0x190/0x190 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538768]  svc_process+0xbc/0x100 [sunrpc]
Nov 23 03:56:06 nas kernel: [ 5921.538777]  nfsd+0xed/0x150 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538783]  ? nfsd_shutdown_threads+0x90/0x90 [nfsd]
Nov 23 03:56:06 nas kernel: [ 5921.538790]  kthread+0x12a/0x150
Nov 23 03:56:06 nas kernel: [ 5921.538791]  ? set_kthread_struct+0x50/0x50
Nov 23 03:56:06 nas kernel: [ 5921.538792]  ret_from_fork+0x22/0x30
Nov 23 03:56:06 nas kernel: [ 5921.538793]  </TASK>
Nov 23 03:57:19 nas systemd[1]: Started Session 24 of User gfm.
Nov 23 03:57:19 nas systemd[1]: Started Session 25 of User gfm.
Nov 23 03:57:49 nas rpc.mountd[11180]: v4.2 client attached: 0x31ae39c6637dcb10 from "192.168.1.254:682"

2 replies

TheLinuxGuy Nov 23, 2022
Author

I keep seeing "btrfs" in these stack traces. I have already swapped ZFS for XFS on the NVMe cache and the crashes continue, so I am going to try XFS on my "slow disks" branch as well, then restart my tests.

trapexit Nov 23, 2022
Maintainer

The best thing to do is to strip everything down as far as possible, then build back up until it breaks.

Stress test just the filesystem alone, then with NFS, then with mergerfs, and so on.
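
One way to apply the layer-by-layer approach is to run the same small write/verify loop against each layer in turn. The sketch below is only an illustration, not something posted in this thread: the function name `stress_dir`, its arguments, and the default file count and size are all assumptions. It also only checks that data reads back intact, and a read may be served from the page cache, so treat it as a starting point rather than a thorough stress test.

```shell
#!/bin/sh
# stress_dir DIR [COUNT] [SIZE_KB]   (hypothetical helper, not part of mergerfs)
# Writes COUNT files of SIZE_KB KiB of random data into a scratch
# subdirectory of DIR, reads each one back and compares checksums.
# Returns non-zero on the first I/O error or mismatch.
stress_dir() {
    dir=$1
    count=${2:-20}
    size_kb=${3:-256}
    work="$dir/stress.$$"

    mkdir -p "$work" || return 1

    i=0
    while [ "$i" -lt "$count" ]; do
        f="$work/file.$i"
        # Checksum the data while it is being written ...
        want=$(head -c "$((size_kb * 1024))" /dev/urandom | tee "$f" | cksum) || {
            rm -rf "$work"
            return 1
        }
        # ... and again after reading it back through the filesystem.
        got=$(cksum < "$f") || {
            rm -rf "$work"
            return 1
        }
        if [ "$want" != "$got" ]; then
            echo "mismatch on $f" >&2
            rm -rf "$work"
            return 1
        fi
        i=$((i + 1))
    done

    rm -rf "$work"
    echo "ok: $count files verified in $dir"
}
```

Run it first against a single branch directory (for example one of the `/mnt/disk*` mounts), then against the mergerfs mount point such as `/mnt/cached`, and finally from an NFS client against the exported path. The first layer at which it hangs or fails is the one to look at more closely.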
