-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API: add hashcode cache in StructType #11764
base: main
Are you sure you want to change the base?
Conversation
Q: does it completely mitigate the flatness observed ? can you please attach the flame graph now ? |
Performance Comparison After Adding Cache
Pre-execution Preparation Time: the time interval from the first table load to the start of the first stage execution Flame Graph |
if (hashCode == NO_HASHCODE) { | ||
hashCode = Objects.hash(NestedField.class, Arrays.hashCode(fields)); | ||
} | ||
return hashCode; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this have a multi-threaded access ? if yes can we have double check locking ? to avoid recompute
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this scenario, there is no multi-threaded access, but the method structType.hashCode
might be accessed by multiple threads in other contexts.
I think the main purpose of this cache is to reduce a significant amount of redundant computation. Introducing additional complexity to completely avoid redundant computation might not be necessary, as even with multi-threaded access, the redundant computation would only occur a few times (up to the number of threads), which should be negligible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, I don't think it's worth the complexity of double checked locking to avoid a little bit of redundant computation in the multi-threaded case.
sounds really promising, thank you for sharing @wzx140 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @wzx140 this is a great finding! I'll hold before merging in case @aokolnychyi had any feedback.
if (hashCode == NO_HASHCODE) { | ||
hashCode = Objects.hash(NestedField.class, Arrays.hashCode(fields)); | ||
} | ||
return hashCode; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, I don't think it's worth the complexity of double checked locking to avoid a little bit of redundant computation in the multi-threaded case.
@@ -723,6 +723,9 @@ public int hashCode() { | |||
|
|||
public static class StructType extends NestedType { | |||
private static final Joiner FIELD_SEP = Joiner.on(", "); | |||
private static final int NO_HASHCODE = Integer.MIN_VALUE; | |||
|
|||
private transient int hashCode = NO_HASHCODE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making this transient, when I saw this PR that's the one aspect I wanted to make sure was the case. We really don't want any weird issues that will happen in a distributed setting where the cached hashcode on one struct type is different than the hashcode for the same struct type that's on a different node (which can easily happen since hashcodes across JVMS can be different)
Making it transient will avoid all those kinds of issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the confusion, still looks good to me, I thought I spotted an issue in how the default hashCode is initialized but not really worth addressing imo.
@@ -723,6 +723,9 @@ public int hashCode() { | |||
|
|||
public static class StructType extends NestedType { | |||
private static final Joiner FIELD_SEP = Joiner.on(", "); | |||
private static final int NO_HASHCODE = Integer.MIN_VALUE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was originally going to suggest we follow the same pattern we do in CharSequenceWrapper
where we have two fields, the hashCode and a boolean hashIsZero
flag. This way in case the hashCode is actually zero we don't recompute it.
In the current implementation, we would recompute the hashCode if it's actually Integer.MIN_VALUE but arguably not worth the complexity to handle that for this case.
close #11763