[CELEBORN-1452] Master follower node metadata is out of sync after installing snapshot #2547
@AngersZhuuuu @waitinfuture @SzyWilliam @pan3793 Can you help review?
The UT shows how to reproduce the issue.
From ArithmeticStateMachine, in reinitialize():
The latest snapshot file cannot be found in this UT.
@@ -114,6 +98,7 @@ public void initialize(RaftServer server, RaftGroupId id, RaftStorage raftStorag
public void reinitialize() throws IOException {
  LOG.info("Reinitializing state machine.");
  getLifeCycle().compareAndTransition(PAUSED, STARTING);
  storage.refreshLatestSnapshot();
Remove this
final SingleFileSnapshotInfo s = latestSnapshot.get();
if (s != null) {
  return s;
}
It seems we can directly remove these three lines.
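To illustrate why an early return on a cached snapshot can leave a follower out of sync, here is a minimal sketch. It is not the Ratis or Celeborn code: the class `SnapshotCacheSketch`, its methods, and the in-memory list standing in for snapshot files on disk are all hypothetical, and the assumption is that an installed snapshot reaches disk without updating the cached reference.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical, simplified stand-in for a snapshot storage; not the Ratis or Celeborn class.
final class SnapshotCacheSketch {
  // Simulates snapshot files on disk, identified only by their log index.
  private final List<Long> snapshotIndexesOnDisk = new ArrayList<>();
  // Cached "latest snapshot", analogous to the latestSnapshot field in the snippet above.
  private final AtomicReference<Long> latestSnapshot = new AtomicReference<>();

  // Assumption: a snapshot installed from the leader lands on disk without touching the cache.
  void installFromLeader(long index) {
    snapshotIndexesOnDisk.add(index);
  }

  // Behaviour with the early return: a cached value is returned whenever one exists.
  Long getLatestWithEarlyReturn() {
    Long cached = latestSnapshot.get();
    if (cached != null) {
      return cached;
    }
    return refreshLatest();
  }

  // Behaviour without the early return: always rescan and update the cache.
  Long refreshLatest() {
    Long newest = null;
    for (Long index : snapshotIndexesOnDisk) {
      if (newest == null || index > newest) {
        newest = index;
      }
    }
    latestSnapshot.set(newest);
    return newest;
  }

  public static void main(String[] args) {
    SnapshotCacheSketch storage = new SnapshotCacheSketch();
    storage.installFromLeader(100L);
    System.out.println(storage.getLatestWithEarlyReturn()); // 100, cache is now populated
    storage.installFromLeader(200L);
    System.out.println(storage.getLatestWithEarlyReturn()); // 100, stale cached value
    System.out.println(storage.refreshLatest());            // 200, after rescanning
  }
}
```

In this sketch the second read still reports index 100 even though a snapshot at index 200 has been installed, which is the kind of staleness a reload during reinitialize() would observe if the cached value is returned first.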
Alluxio fixed the same issue in Alluxio/alluxio#12181.
LGTM. Please check the failed GA test.
It seems to be a flaky test that is not caused by this PR.
LGTM, thanks!
Overall LGTM. I left a few minor comments on this pull request. BTW, could the copied class be removed after apache/ratis#1111 is released?
Of course, it can be removed after apache/ratis#1111 is released.
LGTM.
The compile failure is caused by CELEBORN-1337.
@leixm, please rebase onto the latest main branch.
All done.
…stalling snapshot

### What changes were proposed in this pull request?
Fix Master follower node metadata is out of sync after installing snapshot

### Why are the changes needed?
Follower node metadata is out of sync, when a master-slave switchover occurs, there are major risks to the stability of the cluster.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
UT.

Closes #2547 from leixm/issue_1452.

Authored-by: Xianming Lei <[email protected]>
Signed-off-by: SteNicholas <[email protected]>
(cherry picked from commit cb30e91)
Signed-off-by: SteNicholas <[email protected]>
Thanks. Merged to main (v0.6.0), branch-0.5 (0.5.1), and branch-0.4 (0.4.2).
…er installing snapshot

### What changes were proposed in this pull request?
backport #2547 to `branch-0.4`
Fix Master follower node metadata is out of sync after installing snapshot

### Why are the changes needed?
Follower node metadata is out of sync, when a master-slave switchover occurs, there are major risks to the stability of the cluster.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
UT.

Closes #2563 from cfmcgrady/CELEBORN-1452-branch-0.4.

Lead-authored-by: Xianming Lei <[email protected]>
Co-authored-by: Fu Chen <[email protected]>
Signed-off-by: Shuang <[email protected]>
### What changes were proposed in this pull request?
Bump Ratis version from 3.0.1 to 3.1.0. Also remove `CelebornStateMachineStorage` now that apache/ratis#1111 has been released.

### Why are the changes needed?
Bump Ratis version from 3.0.1 to 3.1.0. Ratis has released v3.1.0; see the release notes for [3.1.0](https://ratis.apache.org/post/3.1.0.html). The 3.1.0 version is a minor release with multiple improvements and bugfixes, including [[RATIS-2111] Reinitialize should load the latest snapshot](https://issues.apache.org/jira/browse/RATIS-2111). See the [changes between 3.0.1 and 3.1.0](apache/ratis@ratis-3.0.1...ratis-3.1.0) releases.

Follow up #2547.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
`MasterStateMachineSuiteJ#testInstallSnapshot`

Closes #2610 from SteNicholas/CELEBORN-1499.

Authored-by: SteNicholas <[email protected]>
Signed-off-by: Shuang <[email protected]>
### What changes were proposed in this pull request?
Fix the Master follower node metadata being out of sync after installing a snapshot.

### Why are the changes needed?
Follower node metadata is out of sync after a snapshot is installed. When a master-slave switchover occurs, this poses major risks to the stability of the cluster.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
UT.