Rework, simplify some tiering logic for mutable options #13256

pdillinger · 2024-12-27T22:37:33Z

Summary: The primary goal of this change was to support full dynamic mutability of options `preclude_last_level_data_seconds` and `preserve_internal_time_seconds`, which was challenging because of subtle design holes referenced from #13124.

The fix is, in a sense, "doubling down" on the idea of write-time-based tiering, by simplifying the output level decision with a single sequence number threshold. This approach has some advantages:

* Allows option mutability in presence of long snapshots (or UDT)
* Simpler to believe correct because there's no special treatment for range tombstones, and output level assignment does not affect sequence number assignment to the entries (which takes some care to avoid circular dependency; see CompactionIterator stuff below).
* Avoids extra key comparisons, in `WithinPenultimateLevelOutputRange()`, in relevant compactions (more CPU efficient, though untested).

There are two big pieces/changes to enable this simplification to a single `penultimate_after_seqno_` threshold:

* Allow range tombstones to be sent to either output level, based on sequence number.
* Use sequence numbers instead of range checks to avoid data in the last level from moving to penultimate level outside of the permissible range on that level (due to compaction selecting wider range in the later input level, which is the normal output level). With this change, data can only move "back up the LSM" when entire sorted runs are selected for compaction.
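To make the single-threshold idea concrete, here is a minimal sketch of placing entries purely by sequence number. The types `Entry`, `OutputLevel`, `SplitOutputs` and the helpers `ChooseOutputLevel` and `SplitBySeqno` are hypothetical illustrations, not the RocksDB classes touched by this PR. The strict `>` comparison is an assumption inferred from the name `penultimate_after_seqno_`.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins; not RocksDB's actual types.
using SequenceNumber = uint64_t;

enum class OutputLevel { kPenultimate, kLast };

struct Entry {
  std::string user_key;
  SequenceNumber seqno;
};

struct SplitOutputs {
  std::vector<Entry> penultimate;
  std::vector<Entry> last;
};

// Assumption: entries written strictly after the threshold count as recent
// and go to the penultimate level; everything else goes to the last level.
// No key-range comparison is involved in the decision.
inline OutputLevel ChooseOutputLevel(SequenceNumber seqno,
                                     SequenceNumber penultimate_after_seqno) {
  return seqno > penultimate_after_seqno ? OutputLevel::kPenultimate
                                         : OutputLevel::kLast;
}

// Routes each entry of a compaction's output stream to one of two outputs.
inline SplitOutputs SplitBySeqno(const std::vector<Entry>& entries,
                                 SequenceNumber penultimate_after_seqno) {
  SplitOutputs out;
  for (const Entry& e : entries) {
    if (ChooseOutputLevel(e.seqno, penultimate_after_seqno) ==
        OutputLevel::kPenultimate) {
      out.penultimate.push_back(e);
    } else {
      out.last.push_back(e);
    }
  }
  return out;
}
```

Because the decision depends only on the sequence number, the same rule can apply to point entries and range tombstones alike in this sketch.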

Possible disadvantages:

* Extra CPU to iterate over range tombstones in relevant compactions *twice* instead of once. However, workloads with lots of range tombstones relative to other entries should be rare.
* Data might not migrate back up the LSM tree on option changes as aggressively or consistently. This should be a rare concern, however, especially for universal compaction where selecting full sorted runs is normal compaction.
* This approach is arguably "further away from" a design that allows for other kinds of output level placement decisions, such as range-based input data hotness. However, properly handling range tombstones with such policies will likely require flexible placement into outputs, as this change introduces.
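The "twice instead of once" cost in the first item can be pictured with a small sketch: each output level makes its own pass over the full set of range tombstones and keeps only those on its side of the threshold. `RangeTombstone` and `TombstonesForOutput` are hypothetical names for illustration, and the strict `>` comparison is an assumption; the actual code in this PR may be organized differently.

```cpp
#include <cstdint>
#include <string>
#include <vector>

using SequenceNumber = uint64_t;

// Hypothetical, simplified range tombstone covering [start_key, end_key).
struct RangeTombstone {
  std::string start_key;
  std::string end_key;
  SequenceNumber seqno;
};

// One full pass over the tombstones for one output level. Calling this once
// per output level (penultimate, then last) is the second iteration that the
// disadvantage refers to.
inline std::vector<RangeTombstone> TombstonesForOutput(
    const std::vector<RangeTombstone>& all,
    SequenceNumber penultimate_after_seqno, bool for_penultimate) {
  std::vector<RangeTombstone> selected;
  for (const RangeTombstone& t : all) {
    const bool is_recent = t.seqno > penultimate_after_seqno;
    if (is_recent == for_penultimate) {
      selected.push_back(t);
    }
  }
  return selected;
}
```

In this sketch every tombstone lands in exactly one of the two outputs, so the added cost is the second scan rather than any duplicated data.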

Additional details:

* For good code abstraction, separate CompactionIterator from the concern of where to place compaction outputs. CompactionIterator is supposed to provide a stream of entries, including the "best" sequence number we can assign to those entries. If it's safe and proper to zero out a sequence number, the placement of entries to outputs should deal with that safely rather than having complex inter-dependency between sequence number assignment and placement. To achieve this, we migrate all the compaction output placement logic that was in CompactionIterator to CompactionJob and similar. This unfortunately renders obsolete some unit tests (PerKeyPlacementCompIteratorTest) that depended on the bad abstraction, but tiered_compaction_test has pretty good coverage overall, catching many issues during this development.
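One way to picture this separation is sketched below. It is an assumption-laden illustration, not the PR's code: `BestSeqno`, `GoesToPenultimate`, and the `preserve_seqno_after` parameter are hypothetical names. The iterator side decides the sequence number without knowing the output level, and the placement side looks only at the emitted sequence number.

```cpp
#include <cstdint>

using SequenceNumber = uint64_t;

// Iterator side (hypothetical): choose the "best" seqno for an entry.
// Assumption: a seqno may be zeroed only when every snapshot already sees
// the entry and the seqno is not above the threshold placement relies on.
// The output level is not an input, so there is no circular dependency.
inline SequenceNumber BestSeqno(SequenceNumber seqno,
                                SequenceNumber earliest_snapshot,
                                SequenceNumber preserve_seqno_after) {
  const bool visible_to_all_snapshots = seqno <= earliest_snapshot;
  const bool needed_for_placement = seqno > preserve_seqno_after;
  return (visible_to_all_snapshots && !needed_for_placement) ? 0 : seqno;
}

// Job side (hypothetical): placement depends only on the emitted seqno.
// A zeroed seqno is never above the threshold, so it goes to the last level.
inline bool GoesToPenultimate(SequenceNumber emitted_seqno,
                              SequenceNumber penultimate_after_seqno) {
  return emitted_seqno > penultimate_after_seqno;
}
```

With both functions driven by the same threshold, an entry recent enough for the penultimate level keeps its sequence number, and an entry whose sequence number was zeroed is handled safely by the placement rule.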

Intended follow-up:

* See FIXME items in tiered_compaction_test
* More testing / validation / support for tiering + UDT
* Consider generalizing this work to split results at other levels as appropriate based on stats (auto-tuning essentially). Allowing only the last level to be cold is limiting.

Test Plan: tests were added in previous changes (#13244 #13124), and updated here to reflect correct operation (with some known problems for leveled compaction)

facebook-github-bot · 2024-12-27T23:29:18Z

@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

pdillinger requested a review from cbi42 December 27, 2024 22:37

facebook-github-bot added the CLA Signed label Dec 27, 2024

pdillinger added 2 commits December 27, 2024 15:25

Fix release build

0040e25

Add to SetOptions in crash test

822f3ae
