
Need a First Aid utility #90

Open
probonopd opened this issue Jan 6, 2021 · 20 comments

@probonopd
Member

probonopd commented Jan 6, 2021

[screenshot: boot loader error message]

From one day to the next, the bootloader greeted me with

ZFS i/o error - all block copies unavailable
LUA ERROR: memory allocation error: block too big.

can't load 'kernel'

This teaches me a couple of things:

  • To increase the robustness of the system, we need a truly read-only (r/o) base system
  • We need a Backup utility even though we have snapshots and BEs and whatnot. Data can still be lost if the entire pool or the drive fails
  • We need a Disk First Aid utility (and it needs to run from the Live system). If helloSystem is going to be used by "mere mortals", then this will be needed no matter how easy the Backup utility is...
@probonopd
Member Author

probonopd commented Jan 6, 2021

Maybe make an assistant that guides the user through the process and along the way explains in plain English what is being run, why, and what it means. Ask for confirmation before doing anything invasive. Maybe have a details section that shows and explains the commands being run, linking/citing their man pages. This way the user learns about the zfs tools while the job is being performed (unlike on the Mac, where commands are run but not really explained).

The low-hanging fruit (much better than having nothing!) would be an assistant that just tells the user the steps but has them type in the commands themselves. Once this has been tried and tested, a later version could offer to execute the commands for them.
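As a rough sketch of what that low-hanging fruit could look like (everything here is a placeholder, not an actual helloSystem tool): a script that explains each step in plain English, cites the man page, and asks before running anything.

#!/bin/sh
# Sketch of a "guided steps" assistant: explain, cite the man page,
# ask for confirmation, then run. The step below is just an example.

run_step() {
    # $1 = plain-English explanation, $2 = man page, $3 = command
    echo ""
    echo "$1"
    echo "See: man $2"
    echo "Command: $3"
    printf "Run this command now? [y/N] "
    read -r answer
    case "$answer" in
        y|Y) eval "$3" ;;
        *)   echo "Skipped." ;;
    esac
}

run_step "List pools that could be imported, without importing anything." \
         "zpool-import" \
         "sudo zpool import -N -F -n"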

Is the following workflow sane? Are there additional/better steps?

Phase 1: Recovery (non-invasive)

  • Boot into the Live system
  • Import the zfs pool but do not allow it to auto-mount any file systems: zpool import -o altroot=/tmp/altroot -N -a. This fails on an installed system because there is already a pool with the same name. Hence use sudo zpool import -N -F -n, which should show the numeric identifier for the pool:
% sudo zpool import -N -F -n     
   pool: zroot
     id: 7436974874527219340
  state: ONLINE
 action: The pool can be imported using its name or numeric identifier.
 config:

        zroot             ONLINE
          gpt/nbsdrootfs  ONLINE
  • Try to import the pool read-only but do not mount: sudo zpool import -o altroot=/mnt -o readonly=on -N -F 7436974874527219340 temporary -f
  • Try to import the pool read-only and mount to /mnt: sudo zpool import -o altroot=/mnt -o readonly=on 7436974874527219340 temporary -f
% sudo zpool import -o altroot=/mnt -o readonly=on 7436974874527219340 temporary 
cannot mount 'temporary/var/mail': Unknown error: 122
  • Check the pool for errors:
% sudo zpool status temporary              
  pool: temporary
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: none requested
config:

        NAME              STATE     READ WRITE CKSUM
        temporary         ONLINE       0     0     4
          gpt/nbsdrootfs  ONLINE       0     0    16

errors: 115 data errors, use '-v' for a list
  • 115 errors! This does not sound good!
  • Display the affected files: sudo zpool status -v temporary

At this point we may be able to copy data from e.g., /mnt/usr/home.

Using tar

tar is always available, while rsync may not be.

sudo su
cd /mnt/usr/home/user/
tar -C /mnt/usr/home/user/ --one-file-system  -cf - * | tar -C /home/user/RECOVERED -xvf -

Getting

x Downloads/FuryBSD-12.1-XFCE-2020042001.isotar: (null)
: Truncated tar archive
tar: Error exit delayed from previous errors.

What can we do so that the operation does not stop if some files cannot be read? We want to copy off as much data as possible.
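One possible workaround (a sketch, not verified against this exact failure): copy file by file, so an unreadable file costs only that one file instead of aborting the whole archive. Paths match the tar example above; file names containing newlines are not handled by this simple loop.

#!/bin/sh
# Copy one file at a time so that a read error does not abort the run.
SRC=/mnt/usr/home/user
DST=/home/user/RECOVERED
cd "$SRC" || exit 1
find . -type f | while IFS= read -r f; do
    mkdir -p "$DST/$(dirname "$f")"
    cp -p "$f" "$DST/$f" || echo "SKIPPED: $f" >> "$DST/skipped.txt"
done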

Using rsync

rsync could do differential syncs, but we do not need this here in most cases.

sudo pkg install rsync (it is currently not included in helloSystem by default)

sudo rsync -axAHX /mnt/usr/home/user/ /home/user/RECOVERED

  • The option -H takes care of hard links.
  • The option -x tells rsync not to cross file system boundaries, so there is no need to exclude mount points.

This nicely prints out errors but continues.

rsync: [sender] read errors mapping "/mnt/usr/home/user/.cache/chromium/Default/Code Cache/js/91380c1625ac9c69_0": Input/output error (5)
rsync: [sender] read errors mapping "/mnt/usr/home/user/.cache/chromium/Default/Code Cache/js/91c6a9ede3ddfff5_0": Input/output error (5)
(...)
ERROR: .cache/chromium/Default/Code Cache/js/ce55c1609f739f75_0 failed verification -- update discarded.
ERROR: .cache/chromium/Default/Code Cache/js/cf25fc65ff26f560_0 failed verification -- update discarded.

Since rsync verifies what it copies, it should probably be preferred over tar/cp?

Since rsync discards defective files, it may be advisable to add some additional process that tries to at least partially recover the files that could not be copied with rsync. How?
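One candidate (an untested idea; it is unclear whether ZFS will let dd read past a checksum error at all, since ZFS may refuse to return known-bad blocks): dd with conv=noerror,sync, which keeps going after read errors and pads unreadable blocks with zeros. The file name below is just an example.

# Try to salvage the readable blocks of a single damaged file.
# conv=noerror continues after read errors; sync pads failed blocks
# with zeros so file offsets stay intact.
dd if="/mnt/usr/home/user/damaged.file" \
   of="/home/user/RECOVERED/damaged.file.partial" \
   bs=64k conv=noerror,sync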

Phase 2: Repair (Invasive)

  • Do not proceed further before the user has confirmed that essential data has been backed up or copied in the step before.
  • Export the pool: sudo zpool export temporary. This may fail if the pool was not imported before.
  • Repeat the import with -F. CAUTION:

Recovery mode for a non-importable pool. Attempt to return
the pool to an importable state by discarding the last few
transactions. Not all damaged pools can be recovered by using
this option. If successful, the data from the discarded
transactions is irretrievably lost. This option is ignored if
the pool is importable or already imported.

sudo zpool import -F -o altroot=/mnt -o readonly=on 7436974874527219340 temporary
cannot mount 'temporary/var/mail': Unknown error: 122

Despite this error, everything but var/mail was mounted:

mount | grep ^temporary                                                          
temporary/usr/home on /mnt/usr/home (zfs, local, noatime, read-only, nfsv4acls)
temporary/var/log on /mnt/var/log (zfs, local, noatime, noexec, nosuid, read-only, nfsv4acls)
temporary/usr/src on /mnt/usr/src (zfs, local, noatime, noexec, nosuid, read-only, nfsv4acls)
temporary/Applications on /mnt/Applications (zfs, local, noatime, read-only, nfsv4acls)
temporary/var/audit on /mnt/var/audit (zfs, local, noatime, noexec, nosuid, read-only, nfsv4acls)
temporary/usr/obj on /mnt/usr/obj (zfs, local, noatime, read-only, nfsv4acls)
temporary/var/crash on /mnt/var/crash (zfs, local, noatime, noexec, nosuid, read-only, nfsv4acls)
temporary/usr/ports on /mnt/usr/ports (zfs, local, noatime, nosuid, read-only, nfsv4acls)
temporary/var/tmp on /mnt/var/tmp (zfs, local, noatime, nosuid, read-only, nfsv4acls)
temporary/usr/ports/packages on /mnt/usr/ports/packages (zfs, local, noatime, noexec, nosuid, read-only, nfsv4acls)
temporary/usr/ports/distfiles on /mnt/usr/ports/distfiles (zfs, local, noatime, noexec, nosuid, read-only, nfsv4acls)
  • Scrub using sudo zpool scrub temporary. Note that this command returns within a few seconds, but the actual scrub is running in the background. During this time, the system performance may be degraded.
 zpool scrub [-s | -p] pool	...

Begins a scrub or resumes a paused scrub. The scrub examines all
data in the specified pools to verify that it checksums correctly.
For replicated (mirror or raidz) devices, ZFS automatically repairs
any damage discovered during the scrub. The zpool status command
reports the progress of the scrub and summarizes the results of the
scrub upon completion.

  • During the scrub we can run sudo zpool status temporary 5 to get status information every 5 seconds. It prints out a progress percentage.

When it is done, it says:

  pool: temporary
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 428K in 0 days 00:25:44 with 1012 errors on Wed Jan  6 05:52:19 2021
config:

        NAME              STATE     READ WRITE CKSUM
        temporary         ONLINE       0     0 1,03K
          gpt/nbsdrootfs  ONLINE       0     0 2,86K

At this point we still have data errors:

zpool status temporary                                       
  pool: temporary
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 428K in 0 days 00:25:44 with 1012 errors on Wed Jan  6 06:52:19 2021
config:

        NAME              STATE     READ WRITE CKSUM
        temporary         ONLINE       0     0 1,03K
          gpt/nbsdrootfs  ONLINE       0     0 2,86K

errors: 1121 data errors, use '-v' for a list

Note: At this point I gave up, since I could not find a way to fix the errors.

Phase 3: Diagnose

  • Run a short or extended SMART self-test on the device using smartctl, as sketched below
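For reference, the invocations might look like this (the device name is an example; see smartctl(8)):

sudo smartctl -t short /dev/ada0   # short self-test, typically a few minutes
sudo smartctl -t long /dev/ada0    # extended self-test, can take hours
sudo smartctl -a /dev/ada0         # show all SMART data, including test results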

@grahamperrin
Contributor

Generally, worth noting: https://www.freebsd.org/cgi/man.cgi?query=zpool-import(8) option -X

Keyword: extreme

https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A/ "… the only choice to repair the data is to restore the pool from backup …" is not always true. Sometimes all that's required is a scrub.

@probonopd I should recommend separate diagnosis of what happened in your case. Happy to discuss in Matrix.

@grahamperrin
Contributor

openzfs/zfs#7912

@probonopd
Member Author

probonopd commented Jan 9, 2021

Generally, worth noting: https://www.freebsd.org/cgi/man.cgi?query=zpool-import(8) option -X

No change in my case:

% sudo zpool import -F -X -o altroot=/mnt -o readonly=on 7436974874527219340 temporary
cannot mount 'temporary/var/mail': Unknown error: 122

Is there any way to at least understand what is wrong with that var/mail dataset?


I only see boot-related issues in zrepl snapshots, so I still wonder why the system became unbootable.

% sudo zpool status -v temporary | grep boot
        temporary/ROOT/First@zrepl_20201229_093233_000:/boot/loader.conf
        temporary/ROOT/First@zrepl_20201227_202021_000:/usr/local/furybsd/cdroot/boot/logo-orbbw.4th
        temporary/ROOT/First@zrepl_20201227_202021_000:/usr/local/furybsd/cdroot/boot/modules/amdgpu_fiji_smc_bin.ko
        temporary/ROOT/First@zrepl_20201227_202021_000:/usr/local/furybsd/cdroot/boot/modules/amdgpu_vegam_uvd_bin.ko
        temporary/ROOT/First@zrepl_20201227_202021_000:/usr/local/furybsd/cdroot/boot/modules/amdgpu_fiji_mec_bin.ko
        temporary/ROOT/First@zrepl_20201227_202021_000:/usr/local/furybsd/cdroot/boot/modules/i915_bxt_guc_ver8_7_bin.ko

@grahamperrin
Contributor

grahamperrin commented Jan 10, 2021

https://matrix.to/#/!EKNFbsWSwXpDOGLRex:matrix.org/$TnriyWcxIJPVlQKL2g2aso83jOPkcuzwYd_Nw9CQj1M?via=matrix.org&via=t2bot.io:

ssd, sata

From https://matrix.to/#/!EKNFbsWSwXpDOGLRex:matrix.org/$uREj8RzAu8mi_FnnY42AOZvrqOGIIu-IgXPSxz-47PA?via=matrix.org&via=t2bot.io:

Extended SMART self-test completed w/o errors. SSD lifetime hours: ~2.000

maybe we should put that extended test into the First Aid utility (using smartctl)

@grahamperrin
Contributor

#90 (comment)

… We want to copy off as much data as possible. …

There's that wish, but there's a parallel need to know how much of what's to be copied was subject to a prior error. For this, I should treat output from zpool status -v as definitive …

… subsequent use of rsync(1) might copy (from good media) data that is corrupt.
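For example, a First Aid utility could preserve that definitive list alongside the recovered data, so one can later check whether a given recovered file was on it (a sketch; the pool name and paths are taken from earlier in this thread):

# Keep the authoritative list of affected files with the recovered data.
sudo zpool status -v temporary > /home/user/RECOVERED/zpool-status-v.txt

# Later: was a given recovered file on the affected list?
grep 'usr/home/user/some/file' /home/user/RECOVERED/zpool-status-v.txt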

… I could not find a way to fix the errors.

Given this, after a scrub:

   action: Restore the file in question if possible.  Otherwise restore the
           entire pool from backup.

– those are the most appropriate actions.

(For a ZFS pool to self-heal, to 'fix' itself, typically requires more than one device in the pool.)


It'll be good for someone with ZFS expertise to have a glance at this case.

@probonopd
Member Author

My line of thinking is:

  • Get as much data off the drive as possible before doing any kind of repair. Even half-damaged files are better than no files
  • Then try to repair

@grahamperrin
Contributor

ZFS

I hesitate before referring to Wikipedia but these words from https://en.wikipedia.org/wiki/ZFS#Data_recovery are relevant:

… If the pool was compromised because of poor hardware, inadequate design or redundancy, or unfortunate mishap, to the point that ZFS was unable to mount the pool, …

Essentially: there's inadequate redundancy.

Re: #90 (comment) if the S.M.A.R.T. status is to be trusted then I wonder whether there was a problem with your SATA connection at the time(s) of problem(s) occurring.

#90 (comment)

… still wonder why the system became unbootable. …

zdb(8) is your friend; however, it is:

… neither a fsck(8) nor an fsdb(8) utility. …
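A read-only sketch of what that inspection could look like (the dataset name is taken from earlier in this thread; interpreting the output requires expertise):

sudo zdb -d temporary/var/mail   # summarize the dataset's objects
sudo zdb -h temporary            # show the pool's history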

Cases can be very diverse. Generalised repair or recovery of data from a compromised ZFS pool is (I think) out of scope for a helloSystem First Aid utility. If you attempt this without openzfs/zfs#7912 I foresee frustration and/or disappointment – if not for you, then (eventually) for some other user of the OS.

Other file systems

A front end to fsck(8) is a good idea but this, I think, falls more under the umbrella of #61

Data recovery

Get as much data off the drive as possible before doing any kind of repair. Even half-damaged files are better than no files

Be aware of things such as this:

– however, if you envision any such thing within helloSystem, then you should be prepared for end users to expect or demand support from the helloSystem community, when it's more appropriate to seek support elsewhere. A world of pain.

@probonopd
Member Author

Not sure whether ddrescue and fsck can help in the case of zfs... but yes, I think those tools should be in the First Aid utility for other filesystems.

@grahamperrin
Contributor

So, Storage/Disk Utility should not have check or repair capabilities?

@probonopd
Member Author

I was thinking of separate utilities because implementing First Aid seems much easier to me than Disk Utility, but once we have Disk Utility we could put the functionality in there for sure.

@grahamperrin
Contributor

Re: #90 (comment) – we have the result of the self-test, but I forgot to ask the obvious. Thanks to a hint from idwer in #openzfs:

smartctl -H device

@probonopd
Member Author

% sudo smartctl -H /dev/da0
smartctl 7.1 2019-12-30 r5022 [FreeBSD 12.1-RELEASE amd64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

@grahamperrin
Contributor

@probonopd
Member Author

I think recoverdisk is not usable for SSDs, but we should offer that route for optical media and spinning mechanical drives in the First Aid utility.
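For example (a sketch; the device and destination are placeholders, see recoverdisk(1)):

# Image a failing optical disc or spinning drive. recoverdisk reads in
# large blocks and retries failed ranges in progressively smaller ones.
sudo recoverdisk /dev/cd0 /home/user/RECOVERED/cd0.iso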

@grahamperrin
Contributor

@grahamperrin
Contributor

grahamperrin commented Feb 16, 2021

Re: helloSystem/Utilities#33 (comment) and maybe overlapping with #61

Consider enhancing helloSystem's custom installer for FreeBSD to:

  1. include creation of an additional partition or boot environment
  2. create and populate a reasonably simple, 'vanilla' boot environment that allows root (without requiring a password for many things) – analogous to Apple's Recovery OS (familiarly recoveryOS)
  3. create, populate and activate the boot environment that will be intended for everyday use of helloSystem …

For (1) and (2) the essence should be for the (multi-purpose) recovery system to be never spoilt by an end user. So a suitably sized partition may be preferable to a recovery boot environment within the same pool as helloSystem boot environments.

recoveryOS and diagnostics environments on Mac computers - Apple Support
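If the boot-environment route were chosen for (2) despite that caveat, the vanilla snapshot could be little more than a bectl(8) call at the end of installation (a sketch; the name "recovery" is an assumption):

# From the freshly installed system, clone the still-pristine boot
# environment into one named "recovery" before everyday use begins.
sudo bectl create recovery
bectl list   # verify it appears alongside the default boot environment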

@grahamperrin
Contributor

APFS

Parallel to (not necessarily requiring) helloSystem/ISO#170

Apple File System Reference

https://developer.apple.com/support/downloads/Apple-File-System-Reference.pdf

More

In no particular order …

Decoding the APFS file system

Kurt H. Hansen and Fergus Toolan, Norwegian Police University College, 2017

https://doi.org/10.1016/j.diin.2017.07.003 includes a diagram of the APFS structure.

[image: a lower-resolution version of that diagram]

afro (APFS file recovery)

https://github.com/cugu/afro#readme

linux-apfs/apfsprogs

https://github.com/linux-apfs/apfsprogs#readme

Drat (formerly "APFS Tools"/apfs-tools)

https://github.com/jivanpal/drat#readme

I should probably watch https://docs.macsysadmin.se/2018/video/Day4Session2.mp4 (one hour).

Hetman Software blog

  1. APFS file system overview: why it is better than HFS+ (2020-09-14, updated 2021-02-16)
  2. Top Data Recovery Tools For APFS Drives (2021-09-17)
  3. APFS: Data Recovery Algorithm and File System Structure (2020-09-23, updated 2021-03-04) – the image at https://hetmanrecovery.com/pic/blog/a173/apfs.webp is almost identical to the one pictured above; I find the 2017 version slightly easier to take in
  4. Recovering Data after Mac OS Update, Reinstall or Disk Format (2020-09-29)

Miscellaneous

https://forums.freebsd.org/threads/mounting-apfs-partition.69094/ (title unavailable at the time of writing; forum upgrade in progress)

A ZFS developer’s analysis of the good and bad in Apple’s new APFS file system | Ars Technica (2016-06-26)

Big Sur finally uses APFS snapshots for Time Machine backups! | MacRumors Forums (2020-06-27)

@probonopd
Member Author

probonopd commented Mar 5, 2021

Thanks for your thorough research @grahamperrin, but I think our focus should be on recovery and repair for native filesystems before we even think about "alien" ones. Repairing defective APFS filesystems is probably best left to macOS.

@grahamperrin
Contributor

Thanks. I added APFS because it is the primary file/storage system for Apple users, more so because helloSystem is designed to appeal to users of Apple hardware.
