Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(meta): decouple barrier collect and sync in global barrier manager #19475

Draft
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

wenym1
Copy link
Contributor

@wenym1 wenym1 commented Nov 20, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Previously in global barrier manager, barriers are handled in 2 phases. In phase 1, a barrier is injected to CNs, and waits for the BarrierCompleteResponse from the injected CNs. When the response is received, the barrier has been collected from all actors in the CN, and the epoch of the barrier has been synced and the SST files are included in the response. In phase 2, the SST files will be committed to hummock manager, and the barrier command will commit the change to fragment and catalog manager if there is any.

In partial checkpoint, barriers are injected to multiple partial graphs, and then multiple partial graphs can be synced and then committed together. Therefore, in this PR, we will decouple the barrier collection and sync in global barrier manager. In general, a barrier will have 3 phase, collect, complete, and commit. A barrier will be handled in the following steps:

  1. The barrier is injected to CNs in multiple partial graphs and enters the collect phase to wait for the BarrierCollectResponse. The barrier injection and collection is independent in different partial graphs.
  2. For all the partial graphs and epochs that have collected the BarrierCollectResponse from all CNs, the global barrier worker will generate a CompleteBarrierTask that includes them together. These partial graphs and epochs will enter the complete phase. BarrierCompleteRequest will be sent to CNs and wait for the responses. When CN receives the request, multiple partial graphs will be synced together.
  3. When the BarrierCompleteResponse is received from all CNs, the partial graphs enters the commit phase and will be committed together to hummock, fragment manager and catalog manager.

Note:

  • Non-checkpoint barriers only have collect phase, because it has nothing to do in the complete and commit phase.
  • Basically, this PR divides the previous inject-collect barrier into the two collect and complete phase. The previous single CompleteBarrierResponse proto message will also be divided. The previous create_mview_progress is moved to the CollectBarrierResponse.

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

Base automatically changed from yiming/decouple-barrier-collect-and-sync to main November 20, 2024 08:32
impl<C: GlobalBarrierWorkerContext> GlobalBarrierWorker<C> {
/// We need to make sure there are no changes when doing recovery
pub async fn clear_on_err(&mut self, err: &MetaError) {
// TODO: move this method to `complete_task.rs` and mark some structs and fields as private before merge
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO

@graphite-app graphite-app bot requested a review from a team November 21, 2024 12:36
@wenym1 wenym1 force-pushed the yiming/global-barrier-decouple-collect-sync branch from f311539 to bb4e0f0 Compare November 21, 2024 17:10
@stdrc
Copy link
Member

stdrc commented Nov 22, 2024

Is this ready for review? Please add some description.

@wenym1
Copy link
Contributor Author

wenym1 commented Nov 22, 2024

Is this ready for review? Please add some description.

Not yet. Still running some tests. Will ping reviewers when it's ready for review

@wenym1 wenym1 force-pushed the yiming/global-barrier-decouple-collect-sync branch from 596b11c to 76a77b4 Compare November 22, 2024 07:42
Copy link
Contributor Author

wenym1 commented Nov 22, 2024

This stack of pull requests is managed by Graphite. Learn more about stacking.

@wenym1 wenym1 changed the title feat(meta): collect and sync barrier in two requests feat(meta): decouple barrier collect and sync in global barrier manager Nov 22, 2024
@wenym1
Copy link
Contributor Author

wenym1 commented Nov 25, 2024

Temporarily mark this PR as draft.

Database isolation will be implemented in favor of #19556 and its subsequent PRs.

@wenym1 wenym1 marked this pull request as draft November 25, 2024 08:04
@wenym1 wenym1 force-pushed the yiming/global-barrier-decouple-collect-sync branch from 119a250 to 112f834 Compare November 27, 2024 10:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants