
Support writes during ingestion #400

Open: wants to merge 2 commits into 8.10.tikv from pure-no-pause-write

Conversation


@hhwyt commented Jan 7, 2025

This PR adds an option allow_write to IngestExternalFileOptions, enabling writes to the database during the ingestion process.

More details at tikv/tikv#18096.
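
For illustration, a minimal caller-side sketch of the new option (the SST path is a placeholder, and this assumes a RocksDB build carrying this PR):

```cpp
#include <cassert>
#include <string>
#include <vector>

#include "rocksdb/db.h"
#include "rocksdb/options.h"

// Minimal sketch: ingest one SST into the default column family while
// foreground writes stay enabled. The path is a placeholder; the caller
// must guarantee no concurrent write overlaps the ingested key range.
void IngestWithoutPausingWrites(rocksdb::DB* db) {
  rocksdb::IngestExternalFileOptions opts;
  opts.allow_write = true;  // the option added by this PR
  rocksdb::Status s = db->IngestExternalFile({"/path/to/file.sst"}, opts);
  assert(s.ok());
}
```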

@hhwyt changed the title from "Support ingest without pausing writes" to "Support writes during ingestion" on Jan 7, 2025
@hhwyt force-pushed the pure-no-pause-write branch from 7340dac to 82ad5dc on January 7, 2025 at 14:29
@hhwyt (Author) commented Jan 7, 2025:

/cc @glorv @LykxSassinator @Connor1996 @v01dstar PTAL, thx~

@@ -2095,6 +2095,9 @@ struct IngestExternalFileOptions {
//
// ingest_behind takes precedence over fail_if_not_bottommost_level.
bool fail_if_not_bottommost_level = false;
// Set to TRUE if user wants to allow writes to the DB during ingestion.
// User must ensure no writes overlap with the ingested data.
bool allow_write = false;
Member:

allow_write is a little confusing; how about allow_foreground_write?

@hhwyt (Author), Jan 8, 2025:

This naming style aligns well with RocksDB’s conventions. In RocksDB, similar names, such as unordered_write and WaitForPendingWrites, typically use “write” to refer to foreground writes.

if (two_write_queues_) {
nonmem_write_thread_.ExitUnbatched(&nonmem_w);
}
write_thread_.ExitUnbatched(&w);
Member:

Why not just skip L5856-5869 when allowing writes?


+1

@hhwyt (Author):

Even with allow_write = true, writes to the DB must be temporarily stopped to wait for pending writes. This is because allow_write = true only requires users to ensure that no concurrent writes overlap with the ingested data; it does not require ensuring that there are no overlapping unordered_writes still pending before ingestion.
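
To make that contract concrete, here is a hedged sketch (the key prefixes and SST path are hypothetical): writes already queued when ingestion starts are still drained internally via WaitForPendingWrites, while writes issued afterwards proceed concurrently as long as they do not overlap the ingested range.

```cpp
#include <string>
#include <thread>
#include <vector>

#include "rocksdb/db.h"
#include "rocksdb/options.h"

// Hedged sketch of the contract: the writer thread only touches keys under
// the "user-" prefix, while the ingested SST (hypothetical path) is assumed
// to hold keys under a disjoint "region-" prefix, satisfying the
// no-overlap requirement. Writes queued before the ingestion call are still
// waited on internally; only later writes run concurrently with ingestion.
void WriteWhileIngesting(rocksdb::DB* db) {
  std::thread writer([db] {
    for (int i = 0; i < 1000; i++) {
      db->Put(rocksdb::WriteOptions(), "user-" + std::to_string(i), "v");
    }
  });

  rocksdb::IngestExternalFileOptions opts;
  opts.allow_write = true;
  db->IngestExternalFile({"/path/to/region-42.sst"}, opts);

  writer.join();
}
```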

@hhwyt (Author):

@v01dstar @Connor1996 PTAL again, thx~

@hhwyt (Author), Jan 8, 2025:

Or we could skip lines L5856-5869 and treat unfinished unordered_writes as concurrent writes; users would then need to ensure all unordered_writes have finished before ingestion. This approach might better align with the original intent of allow_write, as it offers better performance when users can guarantee there are no concurrent writes. TiKV seems to be able to guarantee this. @v01dstar @Connor1996 What do you think?

@v01dstar, Jan 8, 2025:

I think, in theory, we can ignore those preceding queued-but-unfinished foreground writes.

However, I don't have absolute confidence, since this change makes many assumptions. As a side note, we probably need to test it against the Jepsen test suite.

@hhwyt (Author), Jan 8, 2025:

Yes, assuming all unordered_writes have finished is somewhat risky and introduces additional testing work. Meanwhile, waiting for unordered_writes to finish before ingestion should have minimal impact on performance (since WaitForPendingWrites is fast) while providing higher safety.

Therefore, the waiting-for-unfinished-unordered_writes approach is more pragmatic.

Signed-off-by: hhwyt <[email protected]>
@hhwyt force-pushed the pure-no-pause-write branch from 6a859b7 to b695598 on January 8, 2025 at 03:01
bool allow_write = args[0].options.allow_write;
for (const auto& arg : args) {
if (arg.options.allow_write != allow_write) {
return Status::InvalidArgument(
@glorv, Jan 8, 2025:

Better to fall back to the default behavior if any of these options is false. Returning an error here will cause TiKV to panic in some code paths.

@hhwyt (Author):

I don't think that's necessary. First, from RocksDB's perspective, the checks should be as strict as possible. Second, TiKV does not currently use the IngestExternalFiles interface directly; it only uses IngestExternalFile, which calls IngestExternalFiles with a single arg, i.e., only one allow_write. Therefore, this would not affect TiKV.
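
For reference, the single-arg forwarding path looks roughly like this (a paraphrased sketch of the upstream RocksDB code, not a verbatim copy): the single-file entry point wraps its arguments into one IngestExternalFileArg, so the consistency check above can only ever see one allow_write value.

```cpp
// Paraphrased sketch of DBImpl::IngestExternalFile in upstream RocksDB:
// the single-file API builds exactly one IngestExternalFileArg and
// delegates to the multi-arg IngestExternalFiles.
Status DBImpl::IngestExternalFile(
    ColumnFamilyHandle* column_family,
    const std::vector<std::string>& external_files,
    const IngestExternalFileOptions& ingestion_options) {
  IngestExternalFileArg arg;
  arg.column_family = column_family;
  arg.external_files = external_files;
  arg.options = ingestion_options;
  return IngestExternalFiles({arg});
}
```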

@v01dstar left a comment:

lgtm

@ti-chi-bot added the needs-1-more-lgtm label (indicates a PR needs 1 more LGTM) on Jan 9, 2025
ti-chi-bot commented Jan 9, 2025:

[LGTM Timeline notifier]

Timeline:

  • 2025-01-09 03:56:01.542220825 +0000 UTC m=+412304.831052532: ☑️ agreed by v01dstar.

ti-chi-bot commented Jan 9, 2025:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: v01dstar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot added the approved label on Jan 9, 2025
@Connor1996 (Member) left a comment:

It occurs to me that Titan GC and compaction filter GC may write back some KVs. Consider this case: there was a peer on this store before, then it was moved to other stores, and later it was moved back to this store again. Titan GC would be okay, as it won't write back any overlapping key after the peer is destroyed, while compaction filter GC may perform some deletions at the same time. Please consider the compaction filter GC case as well.

@hhwyt (Author) commented Jan 9, 2025:

> Titan GC would be okay as it won't write back any overlapping key after the peer is destroyed

@Connor1996 How does Titan GC avoid writing back any overlapping keys after the peer is destroyed?

@Connor1996 (Member) commented:

> Titan GC would be okay as it won't write back any overlapping key after the peer is destroyed
>
> @Connor1996 How does Titan GC avoid writing back any overlapping keys after the peer is destroyed?

All keys in that range are deleted when the peer is destroyed, and Titan GC finds that those keys are deleted and simply skips them when generating the new blob file.
