Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RATIS-2132. Revert RATIS-2099 due to its performance regression #1126

Closed
wants to merge 1 commit into from

Conversation

duongkame
Copy link
Contributor

What changes were proposed in this pull request?

See RATIS-2132

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/RATIS-2132

Please replace this section with the link to the Apache JIRA)

How was this patch tested?

CI.

@duongkame
Copy link
Contributor Author

cc. @szetszwo @smengcl

@duongkame duongkame changed the title RATIS-2132. Revert RATIS-2099 due to the performance regression RATIS-2132. Revert RATIS-2099 due to its performance regression Jul 30, 2024
@szetszwo
Copy link
Contributor

@duongkame , instead of reverting this, how about replacing the guava Cache with a WeakHashMap?

diff --git a/ratis-server-api/src/main/java/org/apache/ratis/server/protocol/TermIndex.java b/ratis-server-api/src/main/java/org/apache/ratis/server/protocol/TermIndex.java
index a8aa670613..e3b8475fe0 100644
--- a/ratis-server-api/src/main/java/org/apache/ratis/server/protocol/TermIndex.java
+++ b/ratis-server-api/src/main/java/org/apache/ratis/server/protocol/TermIndex.java
@@ -19,23 +19,23 @@ package org.apache.ratis.server.protocol;
 
 import org.apache.ratis.proto.RaftProtos.LogEntryProto;
 import org.apache.ratis.proto.RaftProtos.TermIndexProto;
-import org.apache.ratis.thirdparty.com.google.common.cache.Cache;
-import org.apache.ratis.thirdparty.com.google.common.cache.CacheBuilder;
 
+import java.lang.ref.WeakReference;
 import java.util.Comparator;
 import java.util.Optional;
-import java.util.concurrent.ExecutionException;
-import java.util.concurrent.TimeUnit;
+import java.util.WeakHashMap;
 
 /** The term and the log index defined in the Raft consensus algorithm. */
 public interface TermIndex extends Comparable<TermIndex> {
   class Util {
     /** An LRU Cache for {@link TermIndex} instances */
-    private static final Cache<TermIndex, TermIndex> CACHE = CacheBuilder.newBuilder()
-          .maximumSize(1 << 16)
-          .expireAfterAccess(1, TimeUnit.MINUTES)
-          .build();
+    private static final WeakHashMap<TermIndex, WeakReference<TermIndex>> CACHE = new WeakHashMap<>();
+
+    private static synchronized TermIndex putIfAbsent(TermIndex termIndex) {
+      return CACHE.computeIfAbsent(termIndex, WeakReference::new).get();
+    }
   }
+
   TermIndex[] EMPTY_ARRAY = {};
 
   /** @return the term. */
@@ -109,10 +109,6 @@ public interface TermIndex extends Comparable<TermIndex> {
         return String.format("(t:%s, i:%s)", longToString(term), longToString(index));
       }
     };
-    try {
-      return Util.CACHE.get(key, () -> key);
-    } catch (ExecutionException e) {
-      throw new IllegalStateException("Failed to valueOf(" + term + ", " + index + "), key=" + key, e);
-    }
+    return Util.putIfAbsent(key);
   }
 }
\ No newline at end of file

@duongkame
Copy link
Contributor Author

@duongkame , instead of reverting this, how about replacing the guava Cache with a WeakHashMap?

diff --git a/ratis-server-api/src/main/java/org/apache/ratis/server/protocol/TermIndex.java b/ratis-server-api/src/main/java/org/apache/ratis/server/protocol/TermIndex.java
index a8aa670613..e3b8475fe0 100644
--- a/ratis-server-api/src/main/java/org/apache/ratis/server/protocol/TermIndex.java
+++ b/ratis-server-api/src/main/java/org/apache/ratis/server/protocol/TermIndex.java
@@ -19,23 +19,23 @@ package org.apache.ratis.server.protocol;
 
 import org.apache.ratis.proto.RaftProtos.LogEntryProto;
 import org.apache.ratis.proto.RaftProtos.TermIndexProto;
-import org.apache.ratis.thirdparty.com.google.common.cache.Cache;
-import org.apache.ratis.thirdparty.com.google.common.cache.CacheBuilder;
 
+import java.lang.ref.WeakReference;
 import java.util.Comparator;
 import java.util.Optional;
-import java.util.concurrent.ExecutionException;
-import java.util.concurrent.TimeUnit;
+import java.util.WeakHashMap;
 
 /** The term and the log index defined in the Raft consensus algorithm. */
 public interface TermIndex extends Comparable<TermIndex> {
   class Util {
     /** An LRU Cache for {@link TermIndex} instances */
-    private static final Cache<TermIndex, TermIndex> CACHE = CacheBuilder.newBuilder()
-          .maximumSize(1 << 16)
-          .expireAfterAccess(1, TimeUnit.MINUTES)
-          .build();
+    private static final WeakHashMap<TermIndex, WeakReference<TermIndex>> CACHE = new WeakHashMap<>();
+
+    private static synchronized TermIndex putIfAbsent(TermIndex termIndex) {
+      return CACHE.computeIfAbsent(termIndex, WeakReference::new).get();
+    }
   }
+
   TermIndex[] EMPTY_ARRAY = {};
 
   /** @return the term. */
@@ -109,10 +109,6 @@ public interface TermIndex extends Comparable<TermIndex> {
         return String.format("(t:%s, i:%s)", longToString(term), longToString(index));
       }
     };
-    try {
-      return Util.CACHE.get(key, () -> key);
-    } catch (ExecutionException e) {
-      throw new IllegalStateException("Failed to valueOf(" + term + ", " + index + "), key=" + key, e);
-    }
+    return Util.putIfAbsent(key);
   }
 }
\ No newline at end of file

I don't think we need any cache here. @szetszwo . You already create the TermIndex instance and throw it away to favor the one in the cache. How does it help?

@szetszwo
Copy link
Contributor

As reported by @jojochuang in RATIS-2099, there were a lot of TermIndex$1 objects,

It seems there are multiple objects created for the same value of TermIndex. If we have a cache, we make sure there is only one object for one value.

You already create the TermIndex instance ...

Since the instance is short live, it will be gc'ed as young gen instead of promoting to old gen.

@symious
Copy link
Contributor

symious commented Jul 31, 2024

I think the problem at the first place is there are too many annoymous classes not being GCed, not sure if it will be better if we use normal class instead of annoymous class.

@szetszwo
Copy link
Contributor

... it will be better if we use normal class instead of annoymous class.

It won't help. They are the same in the sense of gc. Anonymous classes just don't have a name.

@duongkame
Copy link
Contributor Author

As reported by @jojochuang in RATIS-2099, there were a lot of TermIndex$1 objects, It seems there are multiple objects created for the same value of TermIndex. If we have a cache, we make sure there is only one object for one value.

You already create the TermIndex instance ...

Since the instance is short live, it will be gc'ed as young gen instead of promoting to old gen.

We need to identify the source of them. I believe the TermIndex instances we cache are in the RaftLog cache. And in that cache, TermIndex (term, index) instances are unique. How does deduplication/interning help to reduce the number of cached TermIndex instances?

@szetszwo
Copy link
Contributor

... TermIndex (term, index) instances are unique. How does deduplication/interning help to reduce the number of cached TermIndex instances?

TermIndex objects are created by the valueOf(..) methods. In the code, we definitely are calling the valueOf(..) methods multiple times for the same term and index. As you mentioned, some of the TermIndex objects are stored in cache so they are not gc'ed. We may have stored the same TermIndex value multiple times. Otherwise, we won't get 2.4m TermIndex objects in #1126 (comment)

@jojochuang , would you be able to reproduce it? It would be great if we can test the deduplication.

@symious
Copy link
Contributor

symious commented Aug 1, 2024

I tried to analyze the a local heap, but failed to reproduce the result mentioned by @jojochuang .

Class Name                                                                         | Objects | Shallow Heap | Retained Heap
----------------------------------------------------------------------------------------------------------------------------
org.apache.logging.log4j.core.async.RingBufferLogEvent                             | 262,144 |   27,262,976 | >= 41,943,032
org.apache.logging.log4j.core.async.AsyncLoggerConfigDisruptor$Log4jEventWrapper   | 262,144 |    6,291,456 |  >= 6,291,456
org.apache.logging.log4j.core.time.MutableInstant                                  | 262,144 |    6,291,456 |  >= 6,291,512
org.apache.logging.log4j.util.SortedArrayStringMap                                 | 262,143 |    8,388,576 |  >= 8,389,096
byte[]                                                                             | 208,852 |   24,340,320 | >= 24,340,320
org.apache.ratis.proto.RaftProtos$CommitInfoProto                                  | 180,004 |    8,640,192 |  >= 8,641,432
java.util.concurrent.ConcurrentHashMap$Node                                        | 161,219 |    5,159,008 | >= 41,160,136
java.util.concurrent.atomic.AtomicInteger                                          | 120,145 |    1,922,320 |  >= 1,922,368
org.apache.ratis.thirdparty.com.google.protobuf.ByteString$LiteralByteString       | 120,023 |    2,880,552 | >= 15,601,672
org.apache.ratis.proto.RaftProtos$LogEntryProto                                    | 119,312 |    6,681,472 | >= 29,214,712
org.apache.ratis.server.protocol.TermIndex$1                                       | 119,311 |    3,817,952 |  >= 3,818,040
org.apache.ratis.util.ReferenceCountedObject$3                                     | 119,311 |    3,817,952 | >= 34,942,304
org.apache.ratis.server.raftlog.LogEntryHeader$1                                   | 119,311 |    2,863,464 |  >= 2,863,520
org.apache.ratis.server.raftlog.segmented.LogSegment$LogRecord                     | 119,311 |    2,863,464 |  >= 5,726,928
org.apache.ratis.thirdparty.com.google.common.cache.LocalCache$StrongValueReference| 118,544 |    1,896,704 | >= 34,296,768
java.lang.String                                                                   |  85,515 |    2,052,360 |  >= 6,972,128
java.lang.Object[]                                                                 |  70,952 |   16,363,352 | >= 85,756,368
----------------------------------------------------------------------------------------------------------------------------

I think the following points are interesting:

  1. The count of TermIndex$1 and RaftProtos$LogEntryProto are similar here, but quite different in @jojochuang 's case.
  2. The count of TermIndex$1 will be very large, so the default size of Util.CACHE, which is 1 << 16, is not enough, so the cache hit rate should be quite low.

Tested by running "ozone freon ommg --operation CREATE_FILE -n 30000 -t 1 --size=100 --bucket=bucket-1 --type RATIS --replication THREE" on master branch of Ozone, heap analyzed from datanode.

@duongkame
Copy link
Contributor Author

duongkame commented Aug 26, 2024

We may have stored the same TermIndex value multiple times. Otherwise, we won't get 2.4m TermIndex objects in #1126 (comment)

Let's just revert the change by this PR and reexamine the problem with TermIndex instances somewhere else.

The change is made prematurely, without due diligence. For such memory, we must.

  1. Identify the problem root, e.g. is there duplicated, and where is the local source of keeping that huge amount of TermIndex instances? Is there anything wrong with logic? It's suspecting because we're only cache TermIndex in the RaftLogCache, where TermIndex should be unique.
    This doesn't sound right not to understand the problem and create a global cache with the hope to fix it. As per my experience, a global static cache usually brings more problems than solving them.
  2. Make a quick fix and verify if the change fixes the original problem.

Without doing the above 2 things, we will likely introduce a new problem without solving the original, just like we did. Guessing and fixing and not verifying never turns out well.

@szetszwo @jojochuang

@jojochuang
Copy link
Contributor

+1 to revert. IMO the initial problem in the original jira wasn't as a huge issue as the problem it introduces.

@szetszwo
Copy link
Contributor

@jojochuang , thanks for the confirmation. Let's revert it.

@duongkame , let's use the git revert command for reverting.

@szetszwo
Copy link
Contributor

Reverted.

commit 520ecab157c9c6aff87ed5c5978c08f98cd4ec6c (origin/master, origin/HEAD, master)
Author: Tsz-Wo Nicholas Sze <[email protected]>
Date:   Mon Aug 26 11:07:29 2024 -0700

    Revert "RATIS-2099. Cache TermIndexImpl instead of using anonymous class (#1100)"
    
    This reverts commit 428ce4ae3d5a0349f3425cb85ef1a3d38dea24b1.
commit da5d508caffc4ca90b0ab962b5105785b9774daa
Author: Tsz-Wo Nicholas Sze <[email protected]>
Date:   Mon Aug 26 11:07:17 2024 -0700

    Revert "RATIS-2101. Move TermIndex.PRIVATE_CACHE to Util.CACHE (#1103)"
    
    This reverts commit 93eb32a8620fdd4e5119592ef32bc50590810c7b.

@szetszwo szetszwo closed this Aug 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants