[CORE-8161] storage: add min.cleanable.dirty.ratio, schedule compaction by dirty_ratio #24991

Open

wants to merge 11 commits into base: dev

Conversation

@WillemKauf (Contributor) commented Jan 31, 2025

Based on PR #24649.

WIP, tests to be added.

Instead of unconditionally compacting all logs during a round of housekeeping, users may now optionally schedule log compaction in the log_manager using the cluster/topic property min_cleanable_dirty_ratio/min.cleanable.dirty.ratio.

As mentioned in the above PR,

The dirty ratio of a log is defined as the ratio between the number of bytes in "dirty" segments and the total number of bytes in closed segments.
Dirty segments are closed segments which have not yet been cleanly compacted; i.e., duplicates for keys in this segment could be found in the prefix of the log up to this segment.

By setting min.cleanable.dirty.ratio on a per-topic basis, users can avoid unnecessary read/write amplification during compaction as the log grows in size.

A housekeeping scan will still be performed every log_compaction_interval_ms, and the log's dirty_ratio will be tested against min.cleanable.dirty.ratio to determine its eligibility for compaction. Additionally, logs are now compacted in descending order according to their dirty ratio, offering a better "bang for buck" heuristic for compaction scheduling.
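
To make the eligibility check and ordering heuristic concrete, here is a minimal, self-contained sketch. The names (`log_stats`, `schedule_compaction`) are hypothetical and are not the actual `log_manager`/`disk_log_impl` code from this PR; an unset threshold is modeled as `std::nullopt`, meaning every log stays eligible. For example, a log with 1.5 GiB of dirty segments out of 3 GiB of closed segments has a dirty ratio of 0.5.

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <optional>
#include <string>
#include <vector>

struct log_stats {
    std::string ntp;
    uint64_t dirty_segment_bytes{0};
    uint64_t closed_segment_bytes{0};

    double dirty_ratio() const {
        // Ratio of dirty bytes to total closed bytes; 0 if nothing is closed.
        return closed_segment_bytes == 0
                 ? 0.0
                 : static_cast<double>(dirty_segment_bytes)
                     / static_cast<double>(closed_segment_bytes);
    }
};

// Returns the logs eligible for compaction this housekeeping round,
// ordered by descending dirty ratio (dirtiest first).
std::vector<log_stats> schedule_compaction(
  std::vector<log_stats> logs,
  std::optional<double> min_cleanable_dirty_ratio) {
    if (min_cleanable_dirty_ratio.has_value()) {
        // Skip logs that have not accumulated enough dirty data yet.
        std::erase_if(logs, [&](const log_stats& l) {
            return l.dirty_ratio() < *min_cleanable_dirty_ratio;
        });
    }
    std::sort(logs.begin(), logs.end(), [](const auto& a, const auto& b) {
        return a.dirty_ratio() > b.dirty_ratio();
    });
    return logs;
}

int main() {
    std::vector<log_stats> logs{
      {"topic/0", 1500, 3000}, // ratio 0.50
      {"topic/1", 300, 3000},  // ratio 0.10 -> skipped with threshold 0.25
      {"topic/2", 2700, 3000}, // ratio 0.90 -> compacted first
    };
    for (const auto& l : schedule_compaction(logs, 0.25)) {
        std::cout << l.ntp << " dirty_ratio=" << l.dirty_ratio() << '\n';
    }
}
```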

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.3.x
  • v24.2.x
  • v24.1.x

Release Notes

Improvements

  • Allow for optional scheduling of compaction via min_cleanable_dirty_ratio

Adds the book-keeping variables `_dirty/closed_segment_bytes` to
`disk_log_impl`, as well as some getter/setter functions.

These functions will be used throughout `disk_log_impl` where required
(segment rolling, compaction, segment eviction) to track the bytes
contained in dirty and closed segments.

Uses the added functions `update_dirty/closed_segment_bytes()` in the
required locations within `disk_log_impl` to keep the dirty ratio
book-keeping up to date.

Bytes can be either removed or added by rolling new segments,
compaction, and retention enforcement.
@WillemKauf requested a review from a team as a code owner on January 31, 2025 17:48
@WillemKauf force-pushed the min_cleanable_dirty_ratio branch 2 times, most recently from db600d4 to 78b2928 on January 31, 2025 21:02
@vbotbuildovich (Collaborator) commented Feb 1, 2025

Retry command for Build#61464

please wait until all jobs are finished before running the slash command


/ci-repeat 1
tests/rptest/tests/describe_topics_test.py::DescribeTopicsTest.test_describe_topics_with_documentation_and_types
tests/rptest/tests/partition_force_reconfiguration_test.py::NodeWiseRecoveryTest.test_node_wise_recovery@{"dead_node_count":2}

@vbotbuildovich (Collaborator) commented Feb 1, 2025

CI test results

test results on build#61464

| test_id | test_kind | job_url | test_status | passed |
| --- | --- | --- | --- | --- |
| gtest_raft_rpunit.gtest_raft_rpunit | unit | https://buildkite.com/redpanda/redpanda/builds/61464#0194be30-4fe3-4201-a506-11eca43a36e9 | FLAKY | 1/2 |
| kafka_server_rpfixture.kafka_server_rpfixture | unit | https://buildkite.com/redpanda/redpanda/builds/61464#0194be30-4fe3-4201-a506-11eca43a36e9 | FAIL | 0/2 |
| rptest.tests.cloud_storage_scrubber_test.CloudStorageScrubberTest.test_scrubber.cloud_storage_type=CloudStorageType.S3 | ducktape | https://buildkite.com/redpanda/redpanda/builds/61464#0194be87-f8fb-4fbd-9207-0641084924fe | FLAKY | 1/2 |
| rptest.tests.describe_topics_test.DescribeTopicsTest.test_describe_topics_with_documentation_and_types | ducktape | https://buildkite.com/redpanda/redpanda/builds/61464#0194be87-f8f9-4197-b75d-9e2b0e669087 | FAIL | 0/20 |
| rptest.tests.e2e_shadow_indexing_test.EndToEndThrottlingTest.test_throttling.cloud_storage_type=CloudStorageType.S3 | ducktape | https://buildkite.com/redpanda/redpanda/builds/61464#0194be87-f8fb-4fbd-9207-0641084924fe | FLAKY | 1/2 |
| rptest.tests.partition_force_reconfiguration_test.NodeWiseRecoveryTest.test_node_wise_recovery.dead_node_count=2 | ducktape | https://buildkite.com/redpanda/redpanda/builds/61464#0194be87-f8fb-4fbd-9207-0641084924fe | FAIL | 0/20 |
| rptest.tests.retention_policy_test.RetentionPolicyTest.test_changing_topic_retention_with_restart.cloud_storage_type=CloudStorageType.ABS | ducktape | https://buildkite.com/redpanda/redpanda/builds/61464#0194be87-f8f9-4197-b75d-9e2b0e669087 | FLAKY | 1/3 |
| test_compat_rpunit.test_compat_rpunit | unit | https://buildkite.com/redpanda/redpanda/builds/61464#0194be2d-b364-4273-8107-604c1a24978e | FAIL | 0/2 |

test results on build#61480

| test_id | test_kind | job_url | test_status | passed |
| --- | --- | --- | --- | --- |
| gtest_raft_rpunit.gtest_raft_rpunit | unit | https://buildkite.com/redpanda/redpanda/builds/61480#0194c30c-66c7-425f-882b-a62f163db808 | FLAKY | 1/2 |
| gtest_raft_rpunit.gtest_raft_rpunit | unit | https://buildkite.com/redpanda/redpanda/builds/61480#0194c30c-66c8-4db9-a2da-7dc5f0262638 | FLAKY | 1/2 |
| rptest.tests.e2e_shadow_indexing_test.EndToEndThrottlingTest.test_throttling.cloud_storage_type=CloudStorageType.S3 | ducktape | https://buildkite.com/redpanda/redpanda/builds/61480#0194c368-0af7-4d62-86e0-9ef129bfeb88 | FLAKY | 1/2 |
| rptest.tests.offset_for_leader_epoch_archival_test.OffsetForLeaderEpochArchivalTest.test_querying_archive | ducktape | https://buildkite.com/redpanda/redpanda/builds/61480#0194c368-0af6-4ff1-92d6-ad77df6dc1f7 | FLAKY | 1/5 |
| rptest.tests.partition_force_reconfiguration_test.NodeWiseRecoveryTest.test_node_wise_recovery.dead_node_count=2 | ducktape | https://buildkite.com/redpanda/redpanda/builds/61480#0194c368-0af7-4d62-86e0-9ef129bfeb88 | FAIL | 0/20 |

@WillemKauf force-pushed the min_cleanable_dirty_ratio branch from 78b2928 to 7a6aab3 on February 1, 2025 19:43
@WillemKauf changed the title from "storage: schedule compaction by dirty_ratio" to "storage: add min.cleanable.dirty.ratio, schedule compaction by dirty_ratio" on Feb 1, 2025
@WillemKauf changed the title from "storage: add min.cleanable.dirty.ratio, schedule compaction by dirty_ratio" to "[CORE-8161] storage: add min.cleanable.dirty.ratio, schedule compaction by dirty_ratio" on Feb 1, 2025
@vbotbuildovich (Collaborator)

Retry command for Build#61480

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/partition_force_reconfiguration_test.py::NodeWiseRecoveryTest.test_node_wise_recovery@{"dead_node_count":2}

src/v/storage/probe.cc (resolved)
Comment on lines +327 to +330
std::optional<double>
metadata_cache::get_default_min_cleanable_dirty_ratio() const {
return config::shard_local_cfg().min_cleanable_dirty_ratio();
}
Member

i'm wondering if this is something that we want to expose as a topic config right now, or whether we should keep it internal for a while?

Contributor Author

Is there a reason you would like to keep this config internal for the time being?

I could see an argument for releasing all of min.cleanable.dirty.ratio/min.compaction.lag.ms/max.compaction.lag.ms as configs all at once, but I also think there is use for min.cleanable.dirty.ratio as a standalone property for now.

Is this your reasoning as well?

Member

> Is there a reason you would like to keep this config internal for the time being?

just that no one is asking for it

Contributor Author

There have been some internal discussions around the benefits of min.cleanable.dirty.ratio reducing read/write amplification, but fair point.

Member

got it, makes sense. presumably you could bail out of the loop early based on some condition. for example, the while loop already does this in the while condition. maybe there isn't a natural way to do that, i'm not sure.

Contributor Author

Yeah, we can just check again for the existence of the ntp in the ntp_by_compaction_heuristic map, shift the list and continue without compacting so we eventually hit the "front" of the list again. Sub-optimal but least intrusive change.

@@ -260,29 +293,28 @@ log_manager::housekeeping_scan(model::timestamp collection_threshold) {
co_await compaction_map->initialize(compaction_mem_bytes);
_compaction_hash_key_map = std::move(compaction_map);
}
while (!_logs_list.empty()
&& is_not_set(_logs_list.front().flags, bflags::compacted)) {
Member

i don't see us checking/changing the compacted flags now. is that changed?

alternatively, did you consider just sorting _logs_list? you could leave this while loop unchanged, and before it, iterate over the btree, use get() to access the intrusive hook, and then in O(1) move the entries to the head of the list.
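
For illustration, here is a sketch of that suggestion using a boost::intrusive::list. The types (`housekeeping_meta`, `logs_list_t`) and the ranking source are hypothetical stand-ins, not the actual `_logs_list` code: entries already ranked elsewhere (e.g. by dirty ratio) are each spliced to the head of the list in O(1), without adding or removing any entries.

```cpp
#include <boost/intrusive/list.hpp>
#include <iostream>
#include <vector>

namespace bi = boost::intrusive;

struct housekeeping_meta : bi::list_base_hook<> {
    explicit housekeeping_meta(int ratio_pct)
      : dirty_ratio_pct(ratio_pct) {}
    int dirty_ratio_pct; // stand-in for the real dirty ratio
};

using logs_list_t = bi::list<housekeeping_meta>;

int main() {
    std::vector<housekeeping_meta> metas;
    metas.emplace_back(10);
    metas.emplace_back(90);
    metas.emplace_back(50);

    logs_list_t logs_list;
    for (auto& m : metas) {
        logs_list.push_back(m);
    }

    // Suppose some external structure (the "btree" in this discussion)
    // yields entries in the desired compaction order: dirtiest first.
    // Moving each one to the head in *reverse* of that order leaves the
    // dirtiest entry at the front, each move being an O(1) splice.
    std::vector<housekeeping_meta*> desired_order{
      &metas[1], &metas[2], &metas[0]}; // 90, 50, 10
    for (auto it = desired_order.rbegin(); it != desired_order.rend(); ++it) {
        logs_list.splice(
          logs_list.begin(), logs_list, logs_list.iterator_to(**it));
    }

    for (const auto& m : logs_list) {
        std::cout << m.dirty_ratio_pct << '\n'; // prints 90, 50, 10
    }
}
```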

Contributor Author

This is a good idea, thanks. Wasn't sure of the best way to manage both the housekeeping meta and the sorted ntps. This sounds like a good solution!

Member

your changes look ok too, but we've had a bunch of concurrency bugs related to that intrusive list in the past. i think it'd be nice to just sort it how we want, and otherwise leave it untouched. wdyt?

Member

i was under the impression that it would be fine to compact any partition in that loop; we were only interested in which order compaction occurred

Contributor Author

With this PR, the changes (as they stand) do two things:

  1. The housekeeping scan no longer indiscriminately compacts every partition, but instead only compacts partitions with a dirty_ratio above min.cleanable.dirty.ratio (if it is set at the cluster or topic level).
  2. The ordering of compaction for partitions is decided using the dirty_ratio as a heuristic.

So, we aren't looking to compact every partition in that loop.

Contributor Author

Sorry, it looks like my original comment got deleted accidentally; for context, I wrote:

> I agree with your proposed changes; I'm just wondering how the while loop would remain unchanged when we are no longer looking to compact every log in _logs_list (i.e., the logic of is_not_set() for the front entry of _logs_list returning false no longer applies).

Member

yeh i think we could bail out of the loop (or skip items)?

, min_cleanable_dirty_ratio(
*this,
"min_cleanable_dirty_ratio",
"The minimum ratio between between the number of bytes in \"dirty\" "

Suggested change
"The minimum ratio between between the number of bytes in \"dirty\" "
"The minimum ratio between the number of bytes in \"dirty\" "

"min_cleanable_dirty_ratio",
"The minimum ratio between between the number of bytes in \"dirty\" "
"segments and the total number of bytes in closed segments that must be "
"reached before a partition's log is considered eligible for compaction, "
@asimms41 Feb 7, 2025

Suggested change
"reached before a partition's log is considered eligible for compaction, "
"reached before a partition's log is eligible for compaction "

config_type="DOUBLE",
value="-1",
doc_string=
"The minimum ratio between between the number of bytes in \"dirty\" "

Suggested change
"The minimum ratio between between the number of bytes in \"dirty\" "
"The minimum ratio between the number of bytes in \"dirty\" "

doc_string=
"The minimum ratio between between the number of bytes in \"dirty\" "
"segments and the total number of bytes in closed segments that must be "
"reached before a partition's log is considered eligible for compaction, "

Suggested change
"reached before a partition's log is considered eligible for compaction, "
"reached before a partition's log is eligible for compaction, "
Suggested change
"reached before a partition's log is considered eligible for compaction, "
"reached before a partition's log is considered eligible for compaction "
