Physical rewrite #17565

Open
wants to merge 2 commits into
base: master

Conversation

amotin
Member

@amotin amotin commented Jul 24, 2025

Motivation and Context

The existing zfs rewrite implementation, for simplicity, updates the logical birth times of all rewritten blocks. This makes them look modified from the perspective of replication, snapshot diffs, etc., even though the actual user data remains the same. While some people found this useful for recovering corrupted remote backups, for the majority, replicating large amounts of logically unchanged blocks can be a huge waste of time and resources.

Description

This PR implements a new variation of rewrite, called "physical rewrite", controlled by the new -P argument to the zfs rewrite subcommand. When possible, it keeps logical birth times unchanged. This makes it possible to distinguish blocks that were merely relocated within a pool from blocks that were actually modified by users. While the former may occupy additional disk space due to snapshots, block cloning, etc., and should be accounted as such, they should be ignored by replication and similar consumers.

Previously, block pointers had physical birth times greater than logical birth times only as a result of the device removal remap process. But in that case space usage accounting was still based on the block's logical birth time. Since physical rewrites require the reallocated space to be accounted based on physical birth times, this PR introduces a new "R"/"rewrite" flag in the block pointer structure to differentiate the two cases. When set, it means the block's space accounting should use the physical birth time instead of the traditional logical birth time. Since read-only pool imports do not really care about space accounting, the new per-dataset pool feature "physical_rewrite" gating this is declared as read-compatible. The feature is activated on first use and deactivated when the last of the affected datasets is deleted.
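As a rough illustration of the accounting rule described above, here is a minimal Python model. It is not the actual OpenZFS code: the `BlockPointer` class, its field names, and `accounting_birth()` are hypothetical names chosen for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockPointer:
    """Toy stand-in for a block pointer (hypothetical, not the real structure)."""
    logical_birth: int      # TXG in which the user data last changed
    physical_birth: int     # TXG in which the block was last allocated
    rewrite: bool = False   # models the "R"/"rewrite" flag


def accounting_birth(bp: BlockPointer) -> int:
    """Return the TXG that space accounting attributes the allocation to."""
    # With the rewrite flag set, the space was really reallocated at the
    # physical birth time; otherwise (including device removal remap)
    # accounting stays with the logical birth time.
    return bp.physical_birth if bp.rewrite else bp.logical_birth


# Same birth times, different meaning depending on the flag.
remapped = BlockPointer(logical_birth=100, physical_birth=250)
rewritten = BlockPointer(logical_birth=100, physical_birth=250, rewrite=True)
print(accounting_birth(remapped), accounting_birth(rewritten))
```

The two example pointers carry identical birth times; only the flag tells the accounting code which of the two TXGs the allocation belongs to.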

There are two exceptions in which the logical birth time may still be modified in connection with a physical rewrite:

  • A dedup hit that produces a different block pointer due to a change of checksum algorithm or number of copies. Since the physical birth time must come from the DDT record, we cannot put the current TXG there for space accounting, and have to update the logical birth time instead, as is done for logical rewrite. Aside from this quite rare case, attempts to physically rewrite dedup'ed blocks should be a NOP, but a physical rewrite can be used to enable/disable dedup.
  • Device removal after a physical rewrite. Since pointer remapping after device removal must set physical birth times to the removal time, it has to remove the rewrite flag and copy the blocks' physical birth times into their logical birth times to maintain correct space accounting.
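The second exception can be sketched as follows. This is only a toy model under stated assumptions: `BlockPointer` and `remap_after_removal()` are hypothetical names, and the sketch assumes the block lived on the removed device.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class BlockPointer:
    """Toy block pointer (hypothetical field names)."""
    logical_birth: int
    physical_birth: int
    rewrite: bool = False


def remap_after_removal(bp: BlockPointer, removal_txg: int) -> BlockPointer:
    """Model pointer remapping for a block that lived on a removed device."""
    if bp.rewrite:
        # The accounting TXG was the physical birth time. Fold it into the
        # logical birth time and drop the flag so accounting stays correct.
        bp = replace(bp, logical_birth=bp.physical_birth, rewrite=False)
    # Remapping always stamps the physical birth time with the removal TXG.
    return replace(bp, physical_birth=removal_txg)


before = BlockPointer(logical_birth=100, physical_birth=250, rewrite=True)
after = remap_after_removal(before, removal_txg=400)
print(after)
```

In this model the price of the remap is visible: the block's logical birth time moves from 100 to 250, so it looks logically modified even though the user data did not change.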

Now that block pointers can carry different logical and physical birth times, the traversal code gets a new TRAVERSE_LOGICAL flag, which selects between traversing only logical changes (replication, diff, etc.) and traversing physical changes (scrub/resilver, dataset destroy, etc.).
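A simplified way to picture the effect of TRAVERSE_LOGICAL is a filter that picks which birth time to compare against the starting TXG. The sketch below is an assumption-laden model, not the real traversal code: `TraverseFlags`, `BlockPointer`, and `traverse()` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Flag


class TraverseFlags(Flag):
    NONE = 0
    LOGICAL = 1   # models TRAVERSE_LOGICAL


@dataclass(frozen=True)
class BlockPointer:
    name: str
    logical_birth: int
    physical_birth: int


def traverse(blocks, min_txg, flags=TraverseFlags.NONE):
    """Yield blocks born after min_txg, by logical or physical birth time."""
    for bp in blocks:
        if TraverseFlags.LOGICAL in flags:
            birth = bp.logical_birth    # replication, diff, etc.
        else:
            birth = bp.physical_birth   # scrub/resilver, destroy, etc.
        if birth > min_txg:
            yield bp


blocks = [
    BlockPointer("untouched", logical_birth=100, physical_birth=100),
    BlockPointer("physically rewritten", logical_birth=100, physical_birth=250),
    BlockPointer("modified", logical_birth=260, physical_birth=260),
]
physical = [bp.name for bp in traverse(blocks, min_txg=200)]
logical = [bp.name for bp in traverse(blocks, 200, TraverseFlags.LOGICAL)]
print(physical, logical)
```

In this model an incremental send from TXG 200 would pick up only the modified block, while a physical traversal would also visit the relocated one.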

How Has This Been Tested?

Several successful CI runs. Manual testing with zfs rewrite and zfs rewrite -P vs zfs send -i.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@amotin amotin added the Status: Code Review Needed Ready for review and testing label Jul 24, 2025
@amotin amotin mentioned this pull request Jul 24, 2025
13 tasks
@amotin amotin force-pushed the physical_rewrite branch from 4093181 to 6f42692 Compare July 25, 2025 01:22
@gamanakis
Contributor

@amotin, thank you for this! on a first pass it looks good to me.

During regular block writes ZFS sets both logical and physical
birth times equal to the current TXG.  During dedup and block
cloning logical birth time is still set to the current TXG, but
physical may be copied from the original block that was used.
This represents the fact that logically user data has changed,
but physically it is the same old block.

But block rewrite introduces a new situation, where a block is not
changed logically, but is stored in a different place in the pool.
From the ARC, scrub and some other perspectives this is a new block,
but for user applications or incremental replication, for example,
it is not.  A somewhat similar thing happens during the remap phase
of device removal, but in that case the blocks' space is still
accounted as allocated at their logical birth times.

This patch introduces a new "rewrite" flag in the block pointer
structure, which differentiates physical rewrite (when the block
is actually reallocated at the physical birth time) from the
device removal case (when the logical birth time is used).

The new functionality is not used at this point, and the only
expected change is that the error log is now kept in terms of
physical birth times, rather than logical, since if a block with
a logged error was somehow rewritten, then the previous error does
not matter any more.

This change also introduces a new TRAVERSE_LOGICAL flag to the
traverse code, allowing zfs send, redact and diff to work in the
context of logical birth times, ignoring physical-only rewrites.
This also changes nothing at this point, since no such writes
exist yet, but they will come in a following patch.

Signed-off-by:	Alexander Motin <[email protected]>
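
The birth-time cases described in the commit message above can be summarized in a small model. This is a sketch only: `Births` and the three helper functions are hypothetical names, and the clone/dedup case shows just the variant where the physical birth time is copied from the original block.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Births:
    logical: int    # TXG of the last logical change
    physical: int   # TXG of the last allocation


def regular_write(txg: int) -> Births:
    # A regular write sets both birth times to the current TXG.
    return Births(logical=txg, physical=txg)


def clone_or_dedup(txg: int, original: Births) -> Births:
    # Logically new data that reuses an existing physical block.
    return Births(logical=txg, physical=original.physical)


def physical_rewrite(txg: int, original: Births) -> Births:
    # Logically unchanged data stored in a newly allocated block.
    return Births(logical=original.logical, physical=txg)


old = regular_write(100)
cloned = clone_or_dedup(200, old)
moved = physical_rewrite(300, old)
print(old, cloned, moved)
```

Cloning and dedup give `logical > physical`, while a physical rewrite gives `physical > logical`, which until now could only result from device removal remap.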
@amotin amotin force-pushed the physical_rewrite branch from 6f42692 to 3b724aa Compare July 30, 2025 16:58
Based on the previous commit, this implements the `zfs rewrite -P`
flag, making ZFS keep blocks' logical birth times while rewriting
files.  It should exclude the rewritten blocks from incremental
sends, snapshot diffs, etc.  At the same time, snapshot space usage
will reflect the additional space used by newly allocated blocks.

Since this begins to use the new "rewrite" flag in block pointers,
this commit introduces a new read-compatible per-dataset feature,
physical_rewrite.  It must be enabled for the command not to fail.
It is activated on first use and deactivated on deletion of the
last affected dataset.

Signed-off-by:  Alexander Motin <[email protected]>
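
The feature lifecycle described in the commit message above (enabled, activated on first use, deactivated with the last affected dataset) can be modeled roughly as below. The `PhysicalRewriteFeature` class and its methods are hypothetical and only mirror the described behavior, not the actual feature-flag code.

```python
class PhysicalRewriteFeature:
    """Toy model of a per-dataset feature's enabled/active state."""

    def __init__(self, enabled: bool):
        self.enabled = enabled
        self._datasets = set()   # datasets that contain rewritten blocks

    @property
    def active(self) -> bool:
        # Active as long as at least one dataset uses the feature.
        return bool(self._datasets)

    def on_physical_rewrite(self, dataset: str) -> None:
        if not self.enabled:
            # Models the command failing when the feature is not enabled.
            raise RuntimeError("feature physical_rewrite is not enabled")
        self._datasets.add(dataset)   # first use activates the feature

    def on_dataset_destroy(self, dataset: str) -> None:
        self._datasets.discard(dataset)


feature = PhysicalRewriteFeature(enabled=True)
feature.on_physical_rewrite("pool/a")
feature.on_physical_rewrite("pool/b")
feature.on_dataset_destroy("pool/a")
still_active = feature.active
feature.on_dataset_destroy("pool/b")
print(still_active, feature.active)
```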
@amotin amotin force-pushed the physical_rewrite branch from 3b724aa to c9382ac Compare July 30, 2025 17:22
@amotin
Member Author

amotin commented Jul 30, 2025

Just a rebase and conflict resolution.
