Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RATIS-2172. RaftServer may lose FollowerState #1166

Merged
merged 10 commits into from
Oct 11, 2024

Conversation

133tosakarin
Copy link
Contributor

What changes were proposed in this pull request?

Move server.isRunning () check from LeaderElection to FollowerState to ensure that the server is ready when it becomes a candidate

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/RATIS-2172

@133tosakarin
Copy link
Contributor Author

@szetszwo
Please review this PR

Copy link
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@133tosakarin , thanks a lot for working on this!

The code change looks good.

For the test, we already have a CodeInjectionForTesting mechanism. Please use it instead.

+++ b/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java
@@ -161,6 +161,7 @@ class RaftServerImpl implements RaftServer.Division,
   static final String APPEND_TRANSACTION = CLASS_NAME + ".appendTransaction";
   static final String LOG_SYNC = APPEND_ENTRIES + ".logComplete";
   static final String START_LEADER_ELECTION = CLASS_NAME + ".startLeaderElection";
+  static final String START_COMPLETE = CLASS_NAME + ".startComplete";
 
   class Info implements DivisionInfo {
     @Override
@@ -401,6 +402,7 @@ class RaftServerImpl implements RaftServer.Division,
 
     jmxAdapter.registerMBean();
     state.start();
+    CodeInjectionForTesting.execute(START_COMPLETE, getId(), null, role);
     startComplete.compareAndSet(false, true);
     return true;
   }

Copy link
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@133tosakarin , thanks for the update! Please see the comments inlined.

@@ -1921,4 +1923,5 @@ void onGroupLeaderElected() {
boolean isRunning() {
return startComplete.get() && lifeCycle.getCurrentState() == State.RUNNING;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please revert this whitespace change.

Comment on lines 349 to 350
startServers(servers.values());

startServers(servers.values());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please revert this whitespace change.

Comment on lines 405 to 406
CodeInjectionForTesting.execute(START_COMPLETE, getId(), null, role);
startComplete.compareAndSet(false, true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a log message:

    if (startComplete.compareAndSet(false, true)) {
      LOG.info("{}: Successfully started.", getMemberId());
    }

Comment on lines 122 to 134
@Test
public void testWaitServerReady() throws Exception {
LOG.info("Running testWaitServerReady");
int []arr = {1, 2, 3, 4, 5};
for (int i : arr) {
final MiniRaftCluster cluster = newCluster(1);
CodeInjectionForTesting.put(RaftServerImpl.START_COMPLETE, new SleepCode(1000));
cluster.start();
// Leader will be elected if the server is ready
Assertions.assertTrue(RaftTestUtil.waitForLeader(cluster).getId() != null);
cluster.shutdown();
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For testing the bug, I suggest the following change to check the timestamps:

  @Test
  public void testWaitServerReady() throws Exception {
    final int sleepMs = 1000 + ThreadLocalRandom.current().nextInt(1000);
    LOG.info("Running testWaitServerReady, sleep = {}ms", sleepMs);
    CodeInjectionForTesting.put(RaftServerImpl.START_COMPLETE, new SleepCode(sleepMs));
    final MiniRaftCluster cluster = newCluster(1);
    final Timestamp startTime = Timestamp.currentTime();
    cluster.start();
    LOG.info("Cluster started at {}ms", startTime.elapsedTimeMs());
    final RaftGroupId groupId = cluster.getGroupId();
    final RaftServerImpl server = (RaftServerImpl) cluster.getServers().iterator().next().getDivision(groupId);
    final boolean isRunning = server.isRunning();
    LOG.info("{} isRunning at {}ms? {}", server.getId(), startTime.elapsedTimeMs(), isRunning);

    // Leader will be elected if the server is ready
    Assertions.assertNotNull(waitForLeader(cluster), "No leader is elected.");
    final long elapsedMs = startTime.elapsedTimeMs();
    // allow a small difference to tolerate system timer inaccuracy
    Assertions.assertTrue(elapsedMs > sleepMs - 10, () -> "elapseMs = " + elapsedMs + " but sleepMs = " + sleepMs);
    cluster.shutdown();
    CodeInjectionForTesting.remove(RaftServerImpl.START_COMPLETE);
  }

With the new log message, we will see a lot of "first-election.timeout" logs and then "Successfully started". It seems good enough.

2024-10-11 09:12:52,420 [s0-impl-thread1] INFO  util.CodeInjectionForTesting (LeaderElectionTests.java:execute(112)) - s0: Simulate RaftServer startup blocking, sleep 1686ms
2024-10-11 09:12:52,420 [s0-impl-thread1] INFO  util.CodeInjectionForTesting (LeaderElectionTests.java:execute(112)) - s0: Simulate RaftServer startup blocking, sleep 1686ms
2024-10-11 09:12:52,571 [s0@group-C119E77DBEA6-FollowerState] INFO  server.RaftServerConfigKeys (ConfUtils.java:logFallback(53)) - raft.server.rpc.first-election.timeout.min = 150ms (fallback to raft.server.rpc.timeout.min)
2024-10-11 09:12:52,571 [s0@group-C119E77DBEA6-FollowerState] INFO  server.RaftServerConfigKeys (ConfUtils.java:logFallback(53)) - raft.server.rpc.first-election.timeout.max = 300ms (fallback to raft.server.rpc.timeout.max)
2024-10-11 09:12:52,758 [s0@group-C119E77DBEA6-FollowerState] INFO  server.RaftServerConfigKeys (ConfUtils.java:logFallback(53)) - raft.server.rpc.first-election.timeout.min = 150ms (fallback to raft.server.rpc.timeout.min)
2024-10-11 09:12:52,758 [s0@group-C119E77DBEA6-FollowerState] INFO  server.RaftServerConfigKeys (ConfUtils.java:logFallback(53)) - raft.server.rpc.first-election.timeout.max = 300ms (fallback to raft.server.rpc.timeout.max)
...
2024-10-11 09:12:53,925 [s0@group-C119E77DBEA6-FollowerState] INFO  server.RaftServerConfigKeys (ConfUtils.java:logFallback(53)) - raft.server.rpc.first-election.timeout.min = 150ms (fallback to raft.server.rpc.timeout.min)
2024-10-11 09:12:53,925 [s0@group-C119E77DBEA6-FollowerState] INFO  server.RaftServerConfigKeys (ConfUtils.java:logFallback(53)) - raft.server.rpc.first-election.timeout.max = 300ms (fallback to raft.server.rpc.timeout.max)
2024-10-11 09:12:54,107 [s0-impl-thread1] INFO  server.RaftServer$Division (RaftServerImpl.java:start(407)) - s0@group-C119E77DBEA6: Successfully started.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the advice provided

}

@Test
public void testAddServerForWaitReady() throws IOException, InterruptedException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test seems testing similar thing as the first test. Let's remove it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to test to ensure that new members join and execute normally

int []arr = {1, 2, 3, 4, 5};
for (int i : arr) {
final MiniRaftCluster cluster = newCluster(1);
CodeInjectionForTesting.put(RaftServerImpl.START_COMPLETE, new SleepCode(1000));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let add a CodeInjectionForTesting.remove method. Otherwise, it will affect other tests.

+++ b/ratis-common/src/main/java/org/apache/ratis/util/CodeInjectionForTesting.java
@@ -55,6 +55,12 @@ public final class CodeInjectionForTesting {
     INJECTION_POINTS.put(injectionPoint, code);
   }
 
+  /** Remove an injection point. */
+  public static void remove(String injectionPoint) {
+    final Code removed = INJECTION_POINTS.remove(injectionPoint);
+    LOG.debug("remove({}): {}", injectionPoint, removed);
+  }
+
   /** Execute the injected code, if there is any. */
   public static boolean execute(String injectionPoint, Object localId,

Copy link
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@133tosakarin , thanks for the update! Please see the comments inlined.

@@ -347,7 +347,6 @@ public void start() throws IOException {

initServers();
startServers(servers.values());

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There still a whitespace change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@133tosakarin , please take a look the file diff, there are two whitespace changes now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Um, I'm sorry, I haven't had time to modify it yet

Assertions.assertTrue(reply.isSuccess());
}
// add 3 new servers and wait longer time
CodeInjectionForTesting.put(RaftServerImpl.START_COMPLETE, new SleepCode(2000));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I fine to keep this test. Then, we need to add CodeInjectionForTesting.remove at the end.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

Copy link
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 the change looks good.

@szetszwo szetszwo merged commit 62ae6d9 into apache:master Oct 11, 2024
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants