Rework, simplify some tiering logic for mutable options
Summary: The primary goal of this change was to support full dynamic
mutability of the options `preclude_last_level_data_seconds` and
`preserve_internal_time_seconds`, which was challenging because of
subtle design holes referenced in facebook#13124.

The fix is, in a sense, "doubling down" on the idea of write-time-based
tiering, by simplifying the output level decision with a single sequence
number threshold. This approach has some advantages:
* Allows option mutability in the presence of long-lived snapshots (or UDT)
* Easier to believe correct, because there is no special treatment for
  range tombstones, and output level assignment does not affect sequence
  number assignment to the entries (which takes some care to avoid a
  circular dependency; see the CompactionIterator changes below).
* Avoids the extra key comparisons in `WithinPenultimateLevelOutputRange()`
  in relevant compactions (more CPU-efficient, though unmeasured).
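The single-threshold approach described above can be sketched as a minimal model (illustrative only, not RocksDB's actual code; `DecideOutput` and the parameter names are hypothetical stand-ins for the `penultimate_after_seqno_` threshold):

```cpp
#include <cassert>
#include <cstdint>

using SequenceNumber = uint64_t;

enum class OutputLevel { kPenultimate, kLast };

// Entries newer than the threshold go to the penultimate ("hot")
// output level; everything at or below it goes to the last ("cold")
// level. No key-range comparison is involved in the decision.
OutputLevel DecideOutput(SequenceNumber entry_seqno,
                         SequenceNumber penultimate_after_seqno) {
  return entry_seqno > penultimate_after_seqno ? OutputLevel::kPenultimate
                                               : OutputLevel::kLast;
}
```

Because the decision depends only on a sequence number, it remains valid when the option controlling the threshold is changed dynamically.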

There are two big pieces/changes to enable this simplification to a
single `penultimate_after_seqno_` threshold:
* Allow range tombstones to be sent to either output level, based on
  sequence number.
* Use sequence numbers instead of range checks to prevent data in the last
  level from moving to the penultimate level outside of the permissible
  range on that level (due to the compaction selecting a wider range in the
  later input level, which is the normal output level). With this
  change, data can only move "back up the LSM" when entire sorted runs
  are selected for compaction.
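The second piece can be sketched as follows (a simplified model of the logic in `Compaction::PopulatePenultimateLevelOutputRange`; `FileMeta` and the function name are illustrative, not RocksDB's API):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

using SequenceNumber = uint64_t;

struct FileMeta {
  SequenceNumber largest_seqno;
};

// When the compaction does not cover the full penultimate-level output
// range, everything already in the last level must stay there, so the
// threshold is the max largest_seqno across last-level input files.
// With full-range coverage, nothing needs to be pinned (threshold 0).
SequenceNumber KeepInLastLevelThroughSeqno(
    const std::vector<FileMeta>& last_level_inputs, bool full_range) {
  if (full_range) {
    return 0;
  }
  SequenceNumber max_seqno = 0;
  for (const auto& f : last_level_inputs) {
    max_seqno = std::max(max_seqno, f.largest_seqno);
  }
  return max_seqno;
}
```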

Possible disadvantages:
* Extra CPU to iterate over range tombstones in relevant compactions
  *twice* instead of once. However, workloads with many range tombstones
  relative to other entries should be rare.
* Data might not migrate back up the LSM tree on option changes as
  aggressively or consistently. This should be a rare concern, however,
  especially for universal compaction, where selecting full sorted runs
  is the normal mode of compaction.
* This approach is arguably "further away from" a design that allows for
  other kinds of output level placement decisions, such as range-based
  input data hotness. However, properly handling range tombstones with
  such policies will likely require the flexible placement into outputs
  that this change introduces.

Additional details:
* For good code abstraction, separate CompactionIterator from the
  concern of where to place compaction outputs. CompactionIterator is
  supposed to provide a stream of entries, including the "best" sequence
  number we can assign to those entries. If it's safe and proper to zero
  out a sequence number, the placement of entries to outputs should deal
  with that safely rather than having complex inter-dependency between
  sequence number assignment and placement. To achieve this, we migrate
  all the compaction output placement logic that was in CompactionIterator
  to CompactionJob and similar. This unfortunately renders obsolete some
  unit tests (PerKeyPlacementCompIteratorTest) that depended on the bad
  abstraction, but tiered_compaction_test has pretty good coverage
  overall and caught many issues during this development.
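The decoupling above can be sketched as a two-stage pipeline (hypothetical names; a model of the idea, not the actual RocksDB interfaces): the iterator stage emits entries with the best sequence number it can assign, possibly zeroed, and a separate placement stage routes them by threshold. A zeroed seqno simply compares as "old" and lands in the last level, with no feedback from placement into seqno assignment.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

using SequenceNumber = uint64_t;

struct Entry {
  std::string key;
  SequenceNumber seqno;  // 0 if safely zeroed by the iterator stage
};

// Placement consumes the entry stream independently of how sequence
// numbers were assigned: newer-than-threshold entries go to the
// penultimate level (first), the rest to the last level (second).
std::pair<std::vector<Entry>, std::vector<Entry>> PlaceOutputs(
    const std::vector<Entry>& stream,
    SequenceNumber penultimate_after_seqno) {
  std::vector<Entry> penultimate, last;
  for (const auto& e : stream) {
    (e.seqno > penultimate_after_seqno ? penultimate : last).push_back(e);
  }
  return {penultimate, last};
}
```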

Intended follow-up:
* See FIXME items in tiered_compaction_test
* More testing / validation / support for tiering + UDT
* Consider generalizing this work to split results at other levels as
  appropriate based on stats (auto-tuning essentially). Allowing only the
  last level to be cold is limiting.

Test Plan: tests were added in previous changes (facebook#13244, facebook#13124)
and updated here to reflect correct operation (with some known problems
remaining for leveled compaction).
pdillinger committed Dec 27, 2024
1 parent 02b4197 commit 7f7c0ca
Showing 16 changed files with 266 additions and 369 deletions.
50 changes: 38 additions & 12 deletions db/compaction/compaction.cc
@@ -410,6 +410,7 @@ Compaction::Compaction(

void Compaction::PopulatePenultimateLevelOutputRange() {
if (!SupportsPerKeyPlacement()) {
assert(keep_in_last_level_through_seqno_ == kMaxSequenceNumber);
return;
}

@@ -452,6 +453,27 @@ void Compaction::PopulatePenultimateLevelOutputRange() {
GetBoundaryInternalKeys(input_vstorage_, inputs_,
&penultimate_level_smallest_,
&penultimate_level_largest_, exclude_level);

if (penultimate_output_range_type_ !=
PenultimateOutputRangeType::kFullRange) {
// If not full range in penultimate level, must keep everything already
// in the last level there, because moving it back up might cause
// overlap/placement issues that are difficult to resolve properly in the
// presence of range deletes
SequenceNumber max_last_level_seqno = 0;
for (const auto& input_lvl : inputs_) {
if (input_lvl.level == output_level_) {
for (const auto& file : input_lvl.files) {
max_last_level_seqno =
std::max(max_last_level_seqno, file->fd.largest_seqno);
}
}
}

keep_in_last_level_through_seqno_ = max_last_level_seqno;
} else {
keep_in_last_level_through_seqno_ = 0;
}
}

Compaction::~Compaction() {
@@ -494,22 +516,26 @@ bool Compaction::OverlapPenultimateLevelOutputRange(
}

// key includes timestamp if user-defined timestamp is enabled.
bool Compaction::WithinPenultimateLevelOutputRange(
const ParsedInternalKey& ikey) const {
if (!SupportsPerKeyPlacement()) {
return false;
}
void Compaction::TEST_AssertWithinPenultimateLevelOutputRange(
const Slice& user_key, bool expect_failure) const {
#ifndef NDEBUG
assert(SupportsPerKeyPlacement());

if (penultimate_level_smallest_.size() == 0 ||
penultimate_level_largest_.size() == 0) {
return false;
}
assert(penultimate_level_smallest_.size() > 0);
assert(penultimate_level_largest_.size() > 0);

const InternalKeyComparator* icmp = input_vstorage_->InternalComparator();
auto* cmp = input_vstorage_->user_comparator();

// op_type of a key can change during compaction, e.g. Merge -> Put.
return icmp->CompareKeySeq(ikey, penultimate_level_smallest_.Encode()) >= 0 &&
icmp->CompareKeySeq(ikey, penultimate_level_largest_.Encode()) <= 0;
if (!(cmp->Compare(user_key, penultimate_level_smallest_.user_key()) >= 0)) {
assert(expect_failure);
} else if (!(cmp->Compare(user_key, penultimate_level_largest_.user_key()) <=
0)) {
assert(expect_failure);
} else {
assert(!expect_failure);
}
#endif
}

bool Compaction::InputCompressionMatchesOutput() const {
21 changes: 15 additions & 6 deletions db/compaction/compaction.h
@@ -388,12 +388,12 @@ class Compaction {
bool OverlapPenultimateLevelOutputRange(const Slice& smallest_key,
const Slice& largest_key) const;

// Return true if the key is within penultimate level output range for
// per_key_placement feature, which is safe to place the key to the
// penultimate level. different compaction strategy has different rules.
// If per_key_placement is not supported, always return false.
// key includes timestamp if user-defined timestamp is enabled.
bool WithinPenultimateLevelOutputRange(const ParsedInternalKey& ikey) const;
// For testing purposes, check that a key is within penultimate level
// output range for per_key_placement feature, which is safe to place the key
// to the penultimate level. Different compaction strategies have different
// rules. `user_key` includes timestamp if user-defined timestamp is enabled.
void TEST_AssertWithinPenultimateLevelOutputRange(
const Slice& user_key, bool expect_failure = false) const;

CompactionReason compaction_reason() const { return compaction_reason_; }

@@ -456,6 +456,13 @@
const ImmutableOptions& immutable_options, const int start_level,
const int output_level);

// If some data cannot be safely migrated "up" the LSM tree due to a change
// in the preclude_last_level_data_seconds setting, this indicates a sequence
// number for the newest data that must be kept in the last level.
SequenceNumber GetKeepInLastLevelThroughSeqno() const {
return keep_in_last_level_through_seqno_;
}

// mark (or clear) all files that are being compacted
void MarkFilesBeingCompacted(bool being_compacted) const;

@@ -605,6 +612,8 @@ class Compaction {
// Blob garbage collection age cutoff.
double blob_garbage_collection_age_cutoff_;

SequenceNumber keep_in_last_level_through_seqno_ = kMaxSequenceNumber;

// only set when per_key_placement feature is enabled, -1 (kInvalidLevel)
// means not supported.
const int penultimate_level_;
90 changes: 6 additions & 84 deletions db/compaction/compaction_iterator.cc
@@ -38,8 +38,7 @@ CompactionIterator::CompactionIterator(
const std::atomic<bool>* shutting_down,
const std::shared_ptr<Logger> info_log,
const std::string* full_history_ts_low,
const SequenceNumber preserve_time_min_seqno,
const SequenceNumber preclude_last_level_min_seqno)
std::optional<SequenceNumber> preserve_seqno_min)
: CompactionIterator(
input, cmp, merge_helper, last_sequence, snapshots, earliest_snapshot,
earliest_write_conflict_snapshot, job_snapshot, snapshot_checker, env,
@@ -48,8 +47,7 @@ CompactionIterator::CompactionIterator(
manual_compaction_canceled,
compaction ? std::make_unique<RealCompaction>(compaction) : nullptr,
must_count_input_entries, compaction_filter, shutting_down, info_log,
full_history_ts_low, preserve_time_min_seqno,
preclude_last_level_min_seqno) {}
full_history_ts_low, preserve_seqno_min) {}

CompactionIterator::CompactionIterator(
InternalIterator* input, const Comparator* cmp, MergeHelper* merge_helper,
@@ -67,8 +65,7 @@ CompactionIterator::CompactionIterator(
const std::atomic<bool>* shutting_down,
const std::shared_ptr<Logger> info_log,
const std::string* full_history_ts_low,
const SequenceNumber preserve_time_min_seqno,
const SequenceNumber preclude_last_level_min_seqno)
std::optional<SequenceNumber> preserve_seqno_min)
: input_(input, cmp, must_count_input_entries),
cmp_(cmp),
merge_helper_(merge_helper),
@@ -109,10 +106,9 @@ CompactionIterator::CompactionIterator(
current_key_committed_(false),
cmp_with_history_ts_low_(0),
level_(compaction_ == nullptr ? 0 : compaction_->level()),
preserve_time_min_seqno_(preserve_time_min_seqno),
preclude_last_level_min_seqno_(preclude_last_level_min_seqno) {
preserve_seqno_after_(preserve_seqno_min.value_or(earliest_snapshot)) {
assert(snapshots_ != nullptr);
assert(preserve_time_min_seqno_ <= preclude_last_level_min_seqno_);
assert(preserve_seqno_after_ <= earliest_snapshot_);

if (compaction_ != nullptr) {
level_ptrs_ = std::vector<size_t>(compaction_->number_levels(), 0);
@@ -1017,7 +1013,6 @@ void CompactionIterator::NextFromInput() {
} else {
if (ikey_.sequence != 0) {
iter_stats_.num_timed_put_swap_preferred_seqno++;
saved_seq_for_penul_check_ = ikey_.sequence;
ikey_.sequence = preferred_seqno;
}
ikey_.type = kTypeValue;
@@ -1258,71 +1253,6 @@ void CompactionIterator::GarbageCollectBlobIfNeeded() {
}
}

void CompactionIterator::DecideOutputLevel() {
assert(compaction_->SupportsPerKeyPlacement());
output_to_penultimate_level_ = false;
// if the key is newer than the cutoff sequence or within the earliest
// snapshot, it should output to the penultimate level.
if (ikey_.sequence > preclude_last_level_min_seqno_ ||
ikey_.sequence > earliest_snapshot_) {
output_to_penultimate_level_ = true;
}

#ifndef NDEBUG
// Could be overridden by unittest
PerKeyPlacementContext context(level_, ikey_.user_key, value_, ikey_.sequence,
output_to_penultimate_level_);
TEST_SYNC_POINT_CALLBACK("CompactionIterator::PrepareOutput.context",
&context);
if (ikey_.sequence > earliest_snapshot_) {
output_to_penultimate_level_ = true;
}
#endif // NDEBUG

// saved_seq_for_penul_check_ is populated in `NextFromInput` when the
// entry's sequence number is non zero and validity context for output this
// entry is kSwapPreferredSeqno for use in `DecideOutputLevel`. It should be
// cleared out here unconditionally. Otherwise, it may end up getting consumed
// incorrectly by a different entry.
SequenceNumber seq_for_range_check =
(saved_seq_for_penul_check_.has_value() &&
saved_seq_for_penul_check_.value() != kMaxSequenceNumber)
? saved_seq_for_penul_check_.value()
: ikey_.sequence;
saved_seq_for_penul_check_ = std::nullopt;
ParsedInternalKey ikey_for_range_check = ikey_;
if (seq_for_range_check != ikey_.sequence) {
ikey_for_range_check.sequence = seq_for_range_check;
}
if (output_to_penultimate_level_) {
// If it's decided to output to the penultimate level, but unsafe to do so,
// still output to the last level. For example, moving the data from a lower
// level to a higher level outside of the higher-level input key range is
// considered unsafe, because the key may conflict with higher-level SSTs
// not from this compaction.
// TODO: add statistic for declined output_to_penultimate_level
bool safe_to_penultimate_level =
compaction_->WithinPenultimateLevelOutputRange(ikey_for_range_check);
if (!safe_to_penultimate_level) {
output_to_penultimate_level_ = false;
// It could happen when disable/enable `last_level_temperature` while
// holding a snapshot. When `last_level_temperature` is not set
// (==kUnknown), the data newer than any snapshot is pushed to the last
// level, but when the per_key_placement feature is enabled on the fly,
// the data later than the snapshot has to be moved to the penultimate
// level, which may or may not be safe. So the user needs to make sure all
// snapshot is released before enabling `last_level_temperature` feature
// We will migrate the feature to `last_level_temperature` and maybe make
// it not dynamically changeable.
if (seq_for_range_check > earliest_snapshot_) {
status_ = Status::Corruption(
"Unsafe to store Seq later than snapshot in the last level if "
"per_key_placement is enabled");
}
}
}
}

void CompactionIterator::PrepareOutput() {
if (Valid()) {
if (LIKELY(!is_range_del_)) {
@@ -1331,13 +1261,6 @@ void CompactionIterator::PrepareOutput() {
} else if (ikey_.type == kTypeBlobIndex) {
GarbageCollectBlobIfNeeded();
}

// For range del sentinel, we don't use it to cut files for bottommost
// compaction. So it should not make a difference which output level we
// decide.
if (compaction_ != nullptr && compaction_->SupportsPerKeyPlacement()) {
DecideOutputLevel();
}
}

// Zeroing out the sequence number leads to better compression.
@@ -1355,8 +1278,7 @@ void CompactionIterator::PrepareOutput() {
!compaction_->allow_ingest_behind() && bottommost_level_ &&
DefinitelyInSnapshot(ikey_.sequence, earliest_snapshot_) &&
ikey_.type != kTypeMerge && current_key_committed_ &&
!output_to_penultimate_level_ &&
ikey_.sequence < preserve_time_min_seqno_ && !is_range_del_) {
ikey_.sequence <= preserve_seqno_after_ && !is_range_del_) {
if (ikey_.type == kTypeDeletion ||
(ikey_.type == kTypeSingleDeletion && timestamp_size_ == 0)) {
ROCKS_LOG_FATAL(