Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CELEBORN-1452] Master follower node metadata is out of sync after installing snapshot #2547

Closed
wants to merge 1 commit into from

Conversation

leixm
Copy link
Contributor

@leixm leixm commented Jun 7, 2024

What changes were proposed in this pull request?

Fix Master follower node metadata is out of sync after installing snapshot

Why are the changes needed?

Follower node metadata is out of sync, when a master-slave switchover occurs, there are major risks to the stability of the cluster.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

UT.

@leixm
Copy link
Contributor Author

leixm commented Jun 7, 2024

follower log
24/05/23 03:08:02,421 INFO [1@group-47BEDE733167-StateMachineUpdater] StateMachine: Reinitializing state machine.
24/05/23 03:08:02,421 INFO [1@group-47BEDE733167-StateMachineUpdater] StateMachine: obsolete snapshot provided: (t:28, i:1185003348)

24/05/23 03:06:53,439 WARN [master-state-machine-executor] AbstractMetaManager: Celeborn cluster estimated partition size changed from 129.2 MiB to 110.5 MiB
24/05/23 03:08:01,425 INFO [1@group-47BEDE733167-StateMachineUpdater] StateMachine: Taking a snapshot to file /mnt/ssd/0/celeborn-ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/tmp/raft_snapshot_1716404881416_5912151569428133017.dat.
24/05/23 03:08:01,428 INFO [1@group-47BEDE733167-StateMachineUpdater] StateMachine: Saving digest 184db806c56d5780f92c98fd7d3fb35a for snapshot file /mnt/ssd/0/celeborn-ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/sm/snapshot.28_1185003348.
24/05/23 03:08:01,428 INFO [1@group-47BEDE733167-StateMachineUpdater] StateMachine: Renaming a snapshot file /mnt/ssd/0/celeborn-ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/tmp/raft_snapshot_1716404881416_5912151569428133017.dat to /mnt/ssd/0/celeborn-ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/sm/snapshot.28_1185003348.
24/05/23 03:08:02,421 INFO [1@group-47BEDE733167-StateMachineUpdater] StateMachine: Reinitializing state machine.
24/05/23 03:08:02,421 INFO [1@group-47BEDE733167-StateMachineUpdater] StateMachine: obsolete snapshot provided: (t:28, i:1185003348)

leader log
24/05/23 03:08:02,117 INFO [2@group-47BEDE733167-StateMachineUpdater] StateMachine: Taking a snapshot to file /mnt/ssd/0/celeborn-ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/tmp/raft_snapshot_1716404882108_1183932657463440669.dat.
24/05/23 03:08:02,120 INFO [2@group-47BEDE733167-StateMachineUpdater] StateMachine: Saving digest 232b855f8ca673b7a2c38a8ec90c6306 for snapshot file /mnt/ssd/0/celeborn-ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/sm/snapshot.28_1185003493.
24/05/23 03:08:02,121 INFO [2@group-47BEDE733167-StateMachineUpdater] StateMachine: Renaming a snapshot file /mnt/ssd/0/celeborn-ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/tmp/raft_snapshot_1716404882108_1183932657463440669.dat to /mnt/ssd/0/celeborn-ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/sm/snapshot.28_1185003493.

After reinitializing, follower out of sync.
meta

There are some variables find from follower jvm heap

a. StateMachineUpdater.appliedIndex = 1185003348
b. SegmentedRaftLog.committedIndex = 1188743857
c. SegmentedRaftLog startIndex = 1185003494, endIndex = 1188743857
d. StateMachineUpdater.snapshotIndex = 1185003348
e. ServerState.latestedInstalledSnapshot=1185003493

It can be seen that for follower heap, 145 logs from 1185003349 ~ 1185003493 have been lost.

  1. When will reinitializing be triggered?
    It is triggered after leader sends a snapshot to the follower and finish installing.

  2. When does the leader send a snapshot to the followers in the running state?
    When the follower's nextIndex < the leader's startIndex, that is, the log required by the follower no longer exists in the leader.
    2.1 Why does the log not exist in the leader?
    There is a certain synchronization delay between the leader and the folower. When the leader takes the snapshot first, it will purge its own segment log after taking it. At this time, the log with index <= snapshot index can be purged, because there is already a snapshot. At this time, the above judgment will be Take effect and then send the snapshot

org.apache.ratis.server.leader.LogAppender#shouldInstallSnapshot

final long followerNextIndex = getFollower().getNextIndex();
if (followerNextIndex < getRaftLog().getNextIndex()) {
    final long logStartIndex = getRaftLog().getStartIndex();
    if (followerNextIndex < logStartIndex || (logStartIndex == RaftLog.INVALID_LOG_INDEX && snapshot != null)) {
        return snapshot;
    }
}
  1. Why is the snapshot sent by the leader to the follower not load correctly?
    When the follower receives the snapshot, it will write to snapshotDir, but it will not refresh the latestSnapshot, so it cannot see this snapshot when reinitialize. snapshot.getTermIndex().compareTo(getLastAppliedTermIndex()) <= 0 must be true

  2. Why is the final metadata of the follower out of sync?
    After the leader sends the snapshot and receives the success status, it will update the follower's nextIndex to the sent snapshot last index + 1. The leader thinks that the follower has correctly installed snapsohot, but in fact it has not, which ultimately causes the follower to miss a section of the log. If the follower cannot find the log, it will lose synchronization forever.

org.apache.ratis.server.impl.StateMachineUpdater#applyLog

final LogEntryProto next = raftLog.get(nextIndex);
if (next != null) {xxx}
else {
LOG.debug("{}: logEntry {} is null. There may be snapshot to load. state:{}",
this, nextIndex, state);
break;
}
  1. How do we solve this problem?
    We need to modify the SimpleStateMachineStorage currently used by celeborn. When the snapshot install is completed, we should select the latest snapshot from the local disk directory and update the latestSnapshot variable.

@leixm
Copy link
Contributor Author

leixm commented Jun 7, 2024

@AngersZhuuuu @waitinfuture @SzyWilliam @pan3793 Can you help review?

@leixm
Copy link
Contributor Author

leixm commented Jun 7, 2024

UT shows how to reproduce

@AngersZhuuuu
Copy link
Contributor

From ArithmeticStateMachine when reinitialize()
https://github.com/apache/ratis/blob/dac27e4fd3df042fcf50ed549404e4d82708ed2b/ratis-examples/src/main/java/org/apache/ratis/examples/arithmetic/ArithmeticStateMachine.java#L82-L84
Will reset the lastAppliedIndex, I wonder if we can do this too? then the change will be simple

@leixm
Copy link
Contributor Author

leixm commented Jun 12, 2024

@@ -114,6 +98,7 @@ public void initialize(RaftServer server, RaftGroupId id, RaftStorage raftStorag
public void reinitialize() throws IOException {
LOG.info("Reinitializing state machine.");
getLifeCycle().compareAndTransition(PAUSED, STARTING);
storage.refreshLatestSnapshot();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this

final SingleFileSnapshotInfo s = latestSnapshot.get();
if (s != null) {
return s;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems we can directly remove this three line

@AngersZhuuuu
Copy link
Contributor

Alluxio fix the same issue in Alluxio/alluxio#12181

Copy link
Contributor

@AngersZhuuuu AngersZhuuuu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM and pls check failed GA test

@leixm
Copy link
Contributor Author

leixm commented Jun 12, 2024

Seems flaky test which not caused by this PR.

Copy link
Contributor

@RexXiong RexXiong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

Copy link
Member

@SteNicholas SteNicholas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM. I left the minor comments for this pull request. BTW, could the copied class be removed after apache/ratis#1111 released?

@leixm
Copy link
Contributor Author

leixm commented Jun 13, 2024

Overall LGTM. I left the minor comments for this pull request. BTW, could the copied class be removed after apache/ratis#1111 released?

Of course it can be removed after apache/ratis#1111 released.

Copy link
Member

@SteNicholas SteNicholas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@leixm
Copy link
Contributor Author

leixm commented Jun 13, 2024

Compile failed caused by CELEBORN-1337

@SteNicholas
Copy link
Member

@leixm, please rebase the latest main branch.

@leixm
Copy link
Contributor Author

leixm commented Jun 13, 2024

All done.

SteNicholas pushed a commit that referenced this pull request Jun 13, 2024
…stalling snapshot

### What changes were proposed in this pull request?
Fix Master follower node metadata is out of sync after installing snapshot

### Why are the changes needed?
Follower node metadata is out of sync, when a master-slave switchover occurs, there are major risks to the stability of the cluster.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
UT.

Closes #2547 from leixm/issue_1452.

Authored-by: Xianming Lei <[email protected]>
Signed-off-by: SteNicholas <[email protected]>
(cherry picked from commit cb30e91)
Signed-off-by: SteNicholas <[email protected]>
@SteNicholas
Copy link
Member

SteNicholas commented Jun 13, 2024

Thanks. Merged to main(v0.6.0) and branch-0.5(0.5.1)/branch-0.4(0.4.2).

cfmcgrady pushed a commit to cfmcgrady/incubator-celeborn that referenced this pull request Jun 13, 2024
…stalling snapshot

### What changes were proposed in this pull request?
Fix Master follower node metadata is out of sync after installing snapshot

### Why are the changes needed?
Follower node metadata is out of sync, when a master-slave switchover occurs, there are major risks to the stability of the cluster.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
UT.

Closes apache#2547 from leixm/issue_1452.

Authored-by: Xianming Lei <[email protected]>
Signed-off-by: SteNicholas <[email protected]>
RexXiong pushed a commit that referenced this pull request Jun 13, 2024
…er installing snapshot

### What changes were proposed in this pull request?

backport #2547 to `branch-0.4`

Fix Master follower node metadata is out of sync after installing snapshot

### Why are the changes needed?
Follower node metadata is out of sync, when a master-slave switchover occurs, there are major risks to the stability of the cluster.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
UT.

Closes #2563 from cfmcgrady/CELEBORN-1452-branch-0.4.

Lead-authored-by: Xianming Lei <[email protected]>
Co-authored-by: Fu Chen <[email protected]>
Signed-off-by: Shuang <[email protected]>
zhaohehuhu pushed a commit to zhaohehuhu/celeborn that referenced this pull request Jun 18, 2024
…stalling snapshot

### What changes were proposed in this pull request?
Fix Master follower node metadata is out of sync after installing snapshot

### Why are the changes needed?
Follower node metadata is out of sync, when a master-slave switchover occurs, there are major risks to the stability of the cluster.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
UT.

Closes apache#2547 from leixm/issue_1452.

Authored-by: Xianming Lei <[email protected]>
Signed-off-by: SteNicholas <[email protected]>
(cherry picked from commit cb30e91)
RexXiong pushed a commit that referenced this pull request Jul 11, 2024
### What changes were proposed in this pull request?

Bump Ratis version from 3.0.1 to 3.1.0. Meanwhile, remove `CelebornStateMachineStorage` with the release of apache/ratis#1111.

### Why are the changes needed?

Bump Ratis version from 3.0.1 to 3.1.0. Ratis has released v3.1.0, of which release note refers to [3.1.0](https://ratis.apache.org/post/3.1.0.html). The 3.1.0 version is a minor release with multiple improvements and bugfixes including [[RATIS-2111] Reinitialize should load the latest snapshot](https://issues.apache.org/jira/browse/RATIS-2111). See the [changes between 3.0.1 and 3.1.0](apache/ratis@ratis-3.0.1...ratis-3.1.0) releases.

Follow up #2547.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`MasterStateMachineSuiteJ#testInstallSnapshot`

Closes #2610 from SteNicholas/CELEBORN-1499.

Authored-by: SteNicholas <[email protected]>
Signed-off-by: Shuang <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants