Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RATIS-2147. Md5 mismatch when snapshot install #1142

Merged
merged 8 commits into from
Sep 14, 2024

Conversation

133tosakarin
Copy link
Contributor

Copy link
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@133tosakarin , thanks a lot for working on this! Please remove the unused code instead of commenting them out. See also the comment inlined and below

diff --git a/ratis-server/src/main/java/org/apache/ratis/server/storage/SnapshotManager.java b/ratis-server/src/main/java/org/apache/ratis/server/storage/SnapshotManager.java
index 794604d66b..88a7555dcb 100644
--- a/ratis-server/src/main/java/org/apache/ratis/server/storage/SnapshotManager.java
+++ b/ratis-server/src/main/java/org/apache/ratis/server/storage/SnapshotManager.java
@@ -27,7 +27,6 @@ import org.apache.ratis.statemachine.SnapshotInfo;
 import org.apache.ratis.statemachine.StateMachine;
 import org.apache.ratis.statemachine.StateMachineStorage;
 import org.apache.ratis.util.FileUtils;
-import org.apache.ratis.util.JavaUtils;
 import org.apache.ratis.util.MD5FileUtil;
 import org.apache.ratis.util.MemoizedSupplier;
 import org.apache.ratis.util.Preconditions;
@@ -63,7 +62,6 @@ public class SnapshotManager {
   private final Supplier<File> snapshotDir;
   private final Supplier<File> snapshotTmpDir;
   private final Function<FileChunkProto, String> getRelativePath;
-  private final Supplier<MessageDigest> digester = JavaUtils.memoize(MD5Hash::getDigester);
 
   SnapshotManager(RaftPeerId selfId, Supplier<RaftStorageDirectory> dir, StateMachineStorage smStorage) {
     this.selfId = selfId;
@@ -88,7 +86,6 @@ public class SnapshotManager {
       }
       // create the temp snapshot file and put padding inside
       out = FileUtils.newFileChannel(tmpSnapshotFile, StandardOpenOption.WRITE, StandardOpenOption.CREATE);
-      digester.get().reset();
     } else {
       if (!exists) {
         throw new FileNotFoundException("Chunk offset is non-zero but file is not found: " + tmpSnapshotFile
@@ -114,7 +111,7 @@ public class SnapshotManager {
 
     // TODO: Make sure that subsequent requests for the same installSnapshot are coming in order,
     // and are not lost when whole request cycle is done. Check requestId and requestIndex here
-
+    final MessageDigest digester = MD5Hash.getDigester();
     for (FileChunkProto chunk : snapshotChunkRequest.getFileChunksList()) {
       SnapshotInfo pi = stateMachine.getLatestSnapshot();
       if (pi != null && pi.getTermIndex().getIndex() >= lastIncludedIndex) {
@@ -128,7 +125,7 @@ public class SnapshotManager {
 
       try (FileChannel out = open(chunk, tmpSnapshotFile)) {
         final ByteBuffer data = chunk.getData().asReadOnlyByteBuffer();
-        digester.get().update(data.duplicate());
+        digester.update(data.duplicate());
 
         int written = 0;
         for(; data.remaining() > 0; ) {
@@ -144,7 +141,7 @@ public class SnapshotManager {
             new MD5Hash(chunk.getFileDigest().toByteArray());
         // calculate the checksum of the snapshot file and compare it with the
         // file digest in the request
-        final MD5Hash digest = new MD5Hash(digester.get().digest());
+        final MD5Hash digest = new MD5Hash(digester.digest());
         if (!digest.equals(expectedDigest)) {
           LOG.warn("The snapshot md5 digest {} does not match expected {}",
               digest, expectedDigest);

Comment on lines 118 to 119
MessageDigest digester = newMd5Digest();
digester.reset();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use MD5Hash.getDigester().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since I am not sure if it is a problem with JavaUtils.memoize, but I can confirm that it is a problem with digest, my idea is to use a new digest each time, which is more secure and reliable, and also avoids the situation where the digest calculation buffer is destroyed and the final result is incorrect

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MD5Hash.getDigester() is safe since it is a Threadlocal and reset the digest every time, i.e. the same thread will always use the same digest object.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, ThreadLocal variables should be safe. However, we have stably reproduced this issue in multiple different testing environments, and we are also identifying whether there is a problem with the JDK version we are using. During this process, we have made an important discovery that even if we receive an empty file, Corrupt will still appear.

image

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. For an empty file, we probably need to handle it differently.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well! I am testing in our environment to see if there are still problems.

Before that, I will convert this PR to draft

Copy link
Contributor

@ivandika3 ivandika3 Sep 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just taking a quick look. I think simply removing the MemoizedSupplier (just calling MD5Hash#getDigester every time) should technically work. From what I see, MemoizedSupplier only call the MessageDigest#reset at the first invocation. The subsequent MemoizedSupplier invocation will just return the MessageDigest instance without calling the MessageDigest#reset which might cause the mismatch issue.

Ozone encountered some thread local MD5 issue (apache/ozone#6435), might be related.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just taking a quick look. I think simply removing the MemoizedSupplier (just calling MD5Hash#getDigester every time) should technically work. From what I see, MemoizedSupplier only call the MessageDigest#reset at the first invocation. The subsequent MemoizedSupplier invocation will just return the MessageDigest instance without calling the MessageDigest#reset which might cause the mismatch issue.

Ozone encountered some thread local MD5 issue (apache/ozone#6435), might be related.

Thank you for your comment!

I also think that whether to remove MemoizedSupplier or not is irrelevant to the problem. My idea is to use a new digester for each file to solve this problem well.

I have pushed the code to this PR, you can take a look

Copy link
Contributor

@ivandika3 ivandika3 Sep 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, instantiating a new digester should also be fine, but there might be more memory overhead and short lived objects compared to the ThreadLocal implementation (which might also be fine if the operation is infrequent).

If the issue can be resolved by simply removing the MemoizedSupplier, I think we can consider this instead of instantiating a new digester every time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just taking a quick look. I think simply removing the MemoizedSupplier (just calling MD5Hash#getDigester every time) should technically work. From what I see, MemoizedSupplier only call the MessageDigest#reset at the first invocation. The subsequent MemoizedSupplier invocation will just return the MessageDigest instance without calling the MessageDigest#reset which might cause the mismatch issue.

Ozone encountered some thread local MD5 issue (apache/ozone#6435), might be related.

Just taking a quick look. I think simply removing the MemoizedSupplier (just calling MD5Hash#getDigester every time) should technically work. From what I see, MemoizedSupplier only call the MessageDigest#reset at the first invocation. The subsequent MemoizedSupplier invocation will just return the MessageDigest instance without calling the MessageDigest#reset which might cause the mismatch issue.

image In fact, digester.get ().reset () will be executed every time the first chunk is received.

In addition, I have done JMH testing, and it takes 0.5s for new 1000000 digests, but in fact, we do not send so many files so frequently, and this overhead is not a problem

@szetszwo szetszwo changed the title RATIS-2147.MD5 mismatch when accept snapshot RATIS-2147. MD5 mismatch when accept snapshot Sep 3, 2024
@133tosakarin 133tosakarin changed the title RATIS-2147. MD5 mismatch when accept snapshot [WIP]RATIS-2147. MD5 mismatch when accept snapshot Sep 4, 2024
@133tosakarin 133tosakarin marked this pull request as draft September 4, 2024 09:56
@133tosakarin 133tosakarin changed the title [WIP]RATIS-2147. MD5 mismatch when accept snapshot Demo md5 mismatch when snapshot install Sep 4, 2024
@OneSizeFitsQuorum
Copy link
Contributor

We spent a lot of time troubleshooting this issue, and what we were able to confirm in the end is that there seems to be some problem with digester.reset, which causes the final MD5 calculation to be incorrect. We searched for related issues on the OpenJDK official website but couldn't find an exact cause.

However, we found that instead of using reset, creating a new digester each time a new file appears should solve the issue (our testing process can verify this). Therefore, I recommend making this change directly.

What's your opinion? @szetszwo

@133tosakarin
Copy link
Contributor Author

@szetszwo What do you think of this way?

@133tosakarin 133tosakarin marked this pull request as ready for review September 10, 2024 03:32
@133tosakarin 133tosakarin changed the title Demo md5 mismatch when snapshot install RATIS-2147. Md5 mismatch when snapshot install Sep 10, 2024
@133tosakarin
Copy link
Contributor Author

private String randomString(int length) {
StringBuilder sb = new StringBuilder();
for (int i = 0; i < length; i++) {
sb.append((char) (Math.random() * 26 + 'a'));
}
return sb.toString();
}

@Benchmark
public void md5Test(Blackhole blackhole) throws NoSuchAlgorithmException {
    int length = 1000;
    md5 = MessageDigest.getInstance("MD5");
    for (int i = 0; i < 1000; ++i) {
        md5.reset();
        md5.update(randomString(length).getBytes());
        blackhole.consume(md5.digest());
        length += 100;
    }
}

@Benchmark
public void md5Test2(Blackhole blackhole) throws NoSuchAlgorithmException {
    int length = 10000;
    for (int i = 0; i < 1000; ++i) {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        md5.reset();
        md5.update(randomString(length).getBytes());
        blackhole.consume(md5.digest());
        length += 100;
    }

}

@Benchmark
public void md5DoNothingTest() throws NoSuchAlgorithmException {
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    for (int i = 0; i < 1000000; ++i) {
        md5.reset();
    }
}

@Benchmark
public void md5DoNothingTest2() throws NoSuchAlgorithmException {
    for (int i = 0; i < 1000000; ++i) {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        md5.reset();
    }
}

@133tosakarin
Copy link
Contributor Author

133tosakarin commented Sep 10, 2024

image

This is the performance difference between using a new digester and using a single digester

For each received file, the difference between using a new md5 object and using a member variable is almost negligible

Comment on lines 177 to 178
digester = MessageDigest.getInstance("MD5");
digester.reset();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think calling MessageDigest#reset here is unnecessary since a newly instantiated MessageDigest should already be "reset".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your reminder

Copy link
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@133tosakarin , thanks for working on this!

Could we covert the digester to a local variable?

diff --git a/ratis-server/src/main/java/org/apache/ratis/server/storage/SnapshotManager.java b/ratis-server/src/main/java/org/apache/ratis/server/storage/SnapshotManager.java
index 794604d66b..6f36757189 100644
--- a/ratis-server/src/main/java/org/apache/ratis/server/storage/SnapshotManager.java
+++ b/ratis-server/src/main/java/org/apache/ratis/server/storage/SnapshotManager.java
@@ -27,7 +27,6 @@ import org.apache.ratis.statemachine.SnapshotInfo;
 import org.apache.ratis.statemachine.StateMachine;
 import org.apache.ratis.statemachine.StateMachineStorage;
 import org.apache.ratis.util.FileUtils;
-import org.apache.ratis.util.JavaUtils;
 import org.apache.ratis.util.MD5FileUtil;
 import org.apache.ratis.util.MemoizedSupplier;
 import org.apache.ratis.util.Preconditions;
@@ -63,7 +62,6 @@ public class SnapshotManager {
   private final Supplier<File> snapshotDir;
   private final Supplier<File> snapshotTmpDir;
   private final Function<FileChunkProto, String> getRelativePath;
-  private final Supplier<MessageDigest> digester = JavaUtils.memoize(MD5Hash::getDigester);
 
   SnapshotManager(RaftPeerId selfId, Supplier<RaftStorageDirectory> dir, StateMachineStorage smStorage) {
     this.selfId = selfId;
@@ -88,7 +86,6 @@ public class SnapshotManager {
       }
       // create the temp snapshot file and put padding inside
       out = FileUtils.newFileChannel(tmpSnapshotFile, StandardOpenOption.WRITE, StandardOpenOption.CREATE);
-      digester.get().reset();
     } else {
       if (!exists) {
         throw new FileNotFoundException("Chunk offset is non-zero but file is not found: " + tmpSnapshotFile
@@ -112,9 +109,9 @@ public class SnapshotManager {
     LOG.info("Installing snapshot:{}, to tmp dir:{}",
         ServerStringUtils.toInstallSnapshotRequestString(request), tmpDir);
 
+    final MessageDigest digester = MD5Hash.getDigester();
     // TODO: Make sure that subsequent requests for the same installSnapshot are coming in order,
     // and are not lost when whole request cycle is done. Check requestId and requestIndex here
-
     for (FileChunkProto chunk : snapshotChunkRequest.getFileChunksList()) {
       SnapshotInfo pi = stateMachine.getLatestSnapshot();
       if (pi != null && pi.getTermIndex().getIndex() >= lastIncludedIndex) {
@@ -128,7 +125,7 @@ public class SnapshotManager {
 
       try (FileChannel out = open(chunk, tmpSnapshotFile)) {
         final ByteBuffer data = chunk.getData().asReadOnlyByteBuffer();
-        digester.get().update(data.duplicate());
+        digester.update(data.duplicate());
 
         int written = 0;
         for(; data.remaining() > 0; ) {
@@ -144,7 +141,7 @@ public class SnapshotManager {
             new MD5Hash(chunk.getFileDigest().toByteArray());
         // calculate the checksum of the snapshot file and compare it with the
         // file digest in the request
-        final MD5Hash digest = new MD5Hash(digester.get().digest());
+        final MD5Hash digest = new MD5Hash(digester.digest());
         if (!digest.equals(expectedDigest)) {
           LOG.warn("The snapshot md5 digest {} does not match expected {}",
               digest, expectedDigest);

@@ -63,7 +64,7 @@ public class SnapshotManager {
private final Supplier<File> snapshotDir;
private final Supplier<File> snapshotTmpDir;
private final Function<FileChunkProto, String> getRelativePath;
private final Supplier<MessageDigest> digester = JavaUtils.memoize(MD5Hash::getDigester);
private MessageDigest digester;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's move it to a local variable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I explained it in the code snippet you provided, please take a look.

@szetszwo
Copy link
Contributor

szetszwo commented Sep 12, 2024

For each received file, the difference between using a new md5 object and using a member variable is almost negligible

Thanks for running the benchmarks! Let's use a new digester. How about moving the new MD5 method to MD5Hash?

diff --git a/ratis-common/src/main/java/org/apache/ratis/io/MD5Hash.java b/ratis-common/src/main/java/org/apache/ratis/io/MD5Hash.java
index e60bef9652..8aacd9a65d 100644
--- a/ratis-common/src/main/java/org/apache/ratis/io/MD5Hash.java
+++ b/ratis-common/src/main/java/org/apache/ratis/io/MD5Hash.java
@@ -29,14 +29,15 @@ import java.util.Arrays;
 public class MD5Hash {
   public static final int MD5_LEN = 16;
 
-  private static final ThreadLocal<MessageDigest> DIGESTER_FACTORY =
-      ThreadLocal.withInitial(() -> {
+  private static final ThreadLocal<MessageDigest> DIGESTER_FACTORY = ThreadLocal.withInitial(MD5Hash::newDigester);
+
+  public static MessageDigest newDigester() {
     try {
       return MessageDigest.getInstance("MD5");
     } catch (NoSuchAlgorithmException e) {
-      throw new RuntimeException(e);
+      throw new IllegalStateException("Failed to create MessageDigest for MD5", e);
     }
-  });
+  }
 
   private byte[] digest;
 
diff --git a/ratis-server/src/main/java/org/apache/ratis/server/storage/SnapshotManager.java b/ratis-server/src/main/java/org/apache/ratis/server/storage/SnapshotManager.java
index 794604d66b..5b5158aebd 100644
--- a/ratis-server/src/main/java/org/apache/ratis/server/storage/SnapshotManager.java
+++ b/ratis-server/src/main/java/org/apache/ratis/server/storage/SnapshotManager.java
@@ -112,9 +109,9 @@ public class SnapshotManager {
     LOG.info("Installing snapshot:{}, to tmp dir:{}",
         ServerStringUtils.toInstallSnapshotRequestString(request), tmpDir);
 
+    final MessageDigest digester = MD5Hash.newDigester();
     // TODO: Make sure that subsequent requests for the same installSnapshot are coming in order,
     // and are not lost when whole request cycle is done. Check requestId and requestIndex here
-
     for (FileChunkProto chunk : snapshotChunkRequest.getFileChunksList()) {
       SnapshotInfo pi = stateMachine.getLatestSnapshot();
       if (pi != null && pi.getTermIndex().getIndex() >= lastIncludedIndex) {

@133tosakarin
Copy link
Contributor Author

For each received file, the difference between using a new md5 object and using a member variable is almost negligible

Thanks for running the benchmarks! Let's use a new digester. How about moving the new MD5 method to MD5Hash?

diff --git a/ratis-common/src/main/java/org/apache/ratis/io/MD5Hash.java b/ratis-common/src/main/java/org/apache/ratis/io/MD5Hash.java
index e60bef9652..8aacd9a65d 100644
--- a/ratis-common/src/main/java/org/apache/ratis/io/MD5Hash.java
+++ b/ratis-common/src/main/java/org/apache/ratis/io/MD5Hash.java
@@ -29,14 +29,15 @@ import java.util.Arrays;
 public class MD5Hash {
   public static final int MD5_LEN = 16;
 
-  private static final ThreadLocal<MessageDigest> DIGESTER_FACTORY =
-      ThreadLocal.withInitial(() -> {
+  private static final ThreadLocal<MessageDigest> DIGESTER_FACTORY = ThreadLocal.withInitial(MD5Hash::newDigester);
+
+  public static MessageDigest newDigester() {
     try {
       return MessageDigest.getInstance("MD5");
     } catch (NoSuchAlgorithmException e) {
-      throw new RuntimeException(e);
+      throw new IllegalStateException("Failed to create MessageDigest for MD5", e);
     }
-  });
+  }
 
   private byte[] digest;
 
diff --git a/ratis-server/src/main/java/org/apache/ratis/server/storage/SnapshotManager.java b/ratis-server/src/main/java/org/apache/ratis/server/storage/SnapshotManager.java
index 794604d66b..5b5158aebd 100644
--- a/ratis-server/src/main/java/org/apache/ratis/server/storage/SnapshotManager.java
+++ b/ratis-server/src/main/java/org/apache/ratis/server/storage/SnapshotManager.java
@@ -112,9 +109,9 @@ public class SnapshotManager {
     LOG.info("Installing snapshot:{}, to tmp dir:{}",
         ServerStringUtils.toInstallSnapshotRequestString(request), tmpDir);
 
+    final MessageDigest digester = MD5Hash.newDigester();
     // TODO: Make sure that subsequent requests for the same installSnapshot are coming in order,
     // and are not lost when whole request cycle is done. Check requestId and requestIndex here
-
     for (FileChunkProto chunk : snapshotChunkRequest.getFileChunksList()) {
       SnapshotInfo pi = stateMachine.getLatestSnapshot();
       if (pi != null && pi.getTermIndex().getIndex() >= lastIncludedIndex) {

Good idea, but there are some problems in the provided code, we should construct digester when receiving the first chunk and calculate md5 after receiving a complete file.

Therefore, we add a MessageDigest member variable in SnapshotManager and assign it to the first chunk every time

Copy link
Contributor

@OneSizeFitsQuorum OneSizeFitsQuorum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@133tosakarin , thanks for the update! The change looks good. Could you fix the checkstyle issues?

Copy link
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 the change looks good.

@OneSizeFitsQuorum OneSizeFitsQuorum merged commit 0c9e299 into apache:master Sep 14, 2024
11 of 12 checks passed
@OneSizeFitsQuorum
Copy link
Contributor

Maybe we need a atomicReference for this digester? @szetszwo

@szetszwo
Copy link
Contributor

@OneSizeFitsQuorum , That's a good point! We should have an AtomicReference.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants