API: add hashcode cache in StructType #11764

wzx140 · 2024-12-12T12:45:29Z

singhpk234 · 2024-12-12T18:16:20Z

Q: Does this completely mitigate the flatness observed? Could you please attach the flame graph from after the change?
Interesting find, @wzx140.

wzx140 · 2024-12-13T02:46:40Z

Q: Does this completely mitigate the flatness observed? Could you please attach the flame graph from after the change? Interesting find, @wzx140.

@singhpk234

Performance Comparison After Adding Cache

Metric	Before Adding Cache	After Adding Cache
Pre-execution Preparation Time	154s	18s
Scan Spec Time	142s	5s

Pre-execution Preparation Time: the time interval from the first table load to the start of the first stage execution
Scan Spec Time: added a timer to the method SparkPartitioningAwareScan#specs

Flame Graph
Before Adding Cache: https://drive.google.com/file/d/1o68Q6n1c-BD7xwfM7ETO6fjbC3jjrlOr/view?usp=drive_link
After Adding Cache: https://drive.google.com/file/d/1YnGnEZ06Es7xs4fGVIZgnjDFS3L4cSR2/view?usp=drive_link

singhpk234 · 2024-12-14T01:03:05Z

api/src/main/java/org/apache/iceberg/types/Types.java

+      if (hashCode == NO_HASHCODE) {
+        hashCode = Objects.hash(NestedField.class, Arrays.hashCode(fields));
+      }
+      return hashCode;


Can this be accessed from multiple threads? If so, could we use double-checked locking to avoid recomputing the hash?

In this scenario, there is no multi-threaded access, but the method structType.hashCode might be accessed by multiple threads in other contexts.

I think the main purpose of this cache is to reduce a significant amount of redundant computation. Introducing additional complexity to completely avoid redundant computation might not be necessary, as even with multi-threaded access, the redundant computation would only occur a few times (up to the number of threads), which should be negligible.

singhpk234 · 2024-12-14T01:04:33Z

Pre-execution Preparation Time: the time interval from the first table load to the start of the first stage execution
Scan Spec Time: added a timer to the method SparkPartitioningAwareScan#specs

Sounds really promising, thank you for sharing, @wzx140.

amogh-jahagirdar

Thanks @wzx140 this is a great finding! I'll hold before merging in case @aokolnychyi had any feedback.

amogh-jahagirdar · 2024-12-16T15:46:01Z

api/src/main/java/org/apache/iceberg/types/Types.java

+      if (hashCode == NO_HASHCODE) {
+        hashCode = Objects.hash(NestedField.class, Arrays.hashCode(fields));
+      }
+      return hashCode;


Agreed, I don't think it's worth the complexity of double checked locking to avoid a little bit of redundant computation in the multi-threaded case.
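
To make the trade-off concrete, here is a minimal sketch of the lock-free, single-check caching idiom being discussed. `CachedHashStruct`, its `fields` array, and the `computations` counter are hypothetical stand-ins for illustration, not Iceberg's `StructType`. The sketch also reads the field into a local before testing it, which is a common form of this idiom and my own choice rather than something taken from the PR.

```java
import java.util.Arrays;
import java.util.Objects;

// Hypothetical stand-in for a struct-like type; not Iceberg's StructType.
final class CachedHashStruct {
  private static final int NO_HASHCODE = Integer.MIN_VALUE;

  private final String[] fields;
  // Plain (non-volatile) int: int reads and writes are atomic in Java, so a
  // racing thread observes either NO_HASHCODE or a fully computed value.
  private transient int hashCode = NO_HASHCODE;
  // Demo-only counter that makes the caching observable; not thread-safe.
  int computations = 0;

  CachedHashStruct(String... fields) {
    this.fields = fields.clone();
  }

  @Override
  public boolean equals(Object other) {
    return other instanceof CachedHashStruct
        && Arrays.equals(fields, ((CachedHashStruct) other).fields);
  }

  @Override
  public int hashCode() {
    int result = hashCode; // read the shared field once
    if (result == NO_HASHCODE) {
      // Racing threads may each run this branch, but they all compute the
      // same value, so the only cost is a few redundant computations.
      computations++;
      result = Objects.hash(CachedHashStruct.class, Arrays.hashCode(fields));
      hashCode = result;
    }
    return result;
  }
}
```

No lock or `volatile` is involved: in the worst case each thread computes the hash once before the cached value becomes visible to it, which matches the "up to the number of threads" bound mentioned in the thread.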

amogh-jahagirdar · 2024-12-16T15:49:06Z

api/src/main/java/org/apache/iceberg/types/Types.java

@@ -723,6 +723,9 @@ public int hashCode() {

  public static class StructType extends NestedType {
    private static final Joiner FIELD_SEP = Joiner.on(", ");
+    private static final int NO_HASHCODE = Integer.MIN_VALUE;
+
+    private transient int hashCode = NO_HASHCODE;


Thanks for making this transient; when I saw this PR, that was the one aspect I wanted to confirm. We really don't want weird issues in a distributed setting where the cached hashcode on one struct type differs from the hashcode for the same struct type on a different node (which can easily happen, since hashcodes can differ across JVMs).

Making it transient will avoid all those kinds of issues.
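
As a rough illustration of what `transient` does here, the sketch below round-trips an object through plain Java serialization and shows that the cached value does not travel with it. `TransientCacheDemo`, `Node`, and `roundTrip` are hypothetical names, and this says nothing about how Iceberg actually serializes `StructType`. One general Java detail is visible in the result: after standard deserialization a transient `int` holds 0, because field initializers of a `Serializable` class are not rerun, so a non-zero sentinel is not restored automatically.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class TransientCacheDemo {
  // Hypothetical serializable type with a cached, transient hash field.
  static final class Node implements Serializable {
    private static final long serialVersionUID = 1L;
    static final int NO_HASHCODE = Integer.MIN_VALUE;

    final String name;
    // Not written to the stream, so a value cached in one JVM never
    // reaches another JVM.
    transient int cachedHash = NO_HASHCODE;

    Node(String name) {
      this.name = name;
    }
  }

  // Serializes the node to bytes and reads it back, as a remote node would.
  static Node roundTrip(Node node) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
        out.writeObject(node);
      }
      try (ObjectInputStream in =
          new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
        return (Node) in.readObject();
      }
    } catch (IOException | ClassNotFoundException e) {
      throw new IllegalStateException(e);
    }
  }
}
```

In this sketch, a `Node` whose `cachedHash` was set to some value before `roundTrip` comes back with `name` intact and `cachedHash == 0`. A design that relies on a sentinel other than 0 could reset it in a `readObject` or `readResolve` hook; whether that is needed depends on the serialization path in use.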

Uninitialized hashcode value

amogh-jahagirdar

Sorry for the confusion, this still looks good to me. I thought I had spotted an issue in how the default hashCode is initialized, but it's not really worth addressing, imo.

amogh-jahagirdar · 2024-12-16T16:02:34Z

api/src/main/java/org/apache/iceberg/types/Types.java

@@ -723,6 +723,9 @@ public int hashCode() {

  public static class StructType extends NestedType {
    private static final Joiner FIELD_SEP = Joiner.on(", ");
+    private static final int NO_HASHCODE = Integer.MIN_VALUE;


I was originally going to suggest we follow the same pattern we do in CharSequenceWrapper where we have two fields, the hashCode and a boolean hashIsZero flag. This way in case the hashCode is actually zero we don't recompute it.

In the current implementation, we would recompute the hashCode if it's actually Integer.MIN_VALUE but arguably not worth the complexity to handle that for this case.
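
For comparison, a minimal sketch of the two-field pattern described above, assuming a cached `hashCode` plus a `hashIsZero` flag. `FlaggedHashCache` and its `computations` counter are hypothetical and only illustrate the idea; this is not the code of `CharSequenceWrapper`.

```java
// Hypothetical wrapper that caches its hash with an explicit "hash is zero" flag.
final class FlaggedHashCache {
  private final CharSequence value;
  private int hashCode;       // 0 means "not cached yet" unless hashIsZero is set
  private boolean hashIsZero; // true once the real hash was computed and is 0
  // Demo-only counter that makes the caching observable.
  int computations = 0;

  FlaggedHashCache(CharSequence value) {
    this.value = value;
  }

  @Override
  public boolean equals(Object other) {
    return other instanceof FlaggedHashCache
        && value.toString().contentEquals(((FlaggedHashCache) other).value);
  }

  @Override
  public int hashCode() {
    int result = hashCode;
    if (result == 0 && !hashIsZero) {
      computations++;
      // Same polynomial as String.hashCode: h = 31 * h + c for each char.
      for (int i = 0; i < value.length(); i++) {
        result = 31 * result + value.charAt(i);
      }
      if (result == 0) {
        hashIsZero = true; // remember that zero is the real answer
      } else {
        hashCode = result;
      }
    }
    return result;
  }
}
```

With this layout every possible hash value is cached, including 0, at the cost of one extra boolean field. The single-sentinel approach in the PR gives that up for one value (`Integer.MIN_VALUE`) in exchange for less state.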

API: add hashcode cache in StructType

f25c89b

github-actions bot added the API label Dec 12, 2024

Fokko requested a review from aokolnychyi December 13, 2024 08:10

singhpk234 reviewed Dec 14, 2024

amogh-jahagirdar previously approved these changes Dec 16, 2024

amogh-jahagirdar approved these changes Dec 16, 2024
