GH-43759: [C++] Acero: Minor code enhancement for Join #43760
LGTM
@@ -323,8 +322,7 @@ Status ResizableArrayData::ResizeFixedLengthBuffers(int num_rows_new) {
 }

 Status ResizableArrayData::ResizeVaryingLengthBuffer() {
-  KeyColumnMetadata column_metadata;
-  column_metadata = ColumnMetadataFromDataType(data_type_).ValueOrDie();
+  KeyColumnMetadata column_metadata = ColumnMetadataFromDataType(data_type_).ValueOrDie();
Why not use ARROW_ASSIGN_OR_RAISE? We are able to return a Status here.
Also, it seems ColumnMetadataFromDataType(data_type_) is being called from multiple methods. Why is the result not stored somewhere on the class?
Why not use ARROW_ASSIGN_OR_RAISE? We are able to return a Status here.
I'm new to this code, but from reading it I found that the Join guarantees the type is safe to pass to ColumnMetadataFromDataType.
Also, it seems ColumnMetadataFromDataType(data_type_) is being called from multiple methods. Why is the result not stored somewhere on the class?
I don't know :-( But this metadata is lightweight, just a wrapper around data_type_. Whether we cache it or not, both are OK with me.
@pitrou Should I use column_metadata and vendor it? Or keep it here?
@zanmato1984 What do you think?
Why not use ARROW_ASSIGN_OR_RAISE? We are able to return a Status here.
Right, I would suggest the same. Also, there are other occurrences of unnecessary ValueOrDie() in this file. We'd better clean them all up.
Also, it seems ColumnMetadataFromDataType(data_type_) is being called from multiple methods. Why is the result not stored somewhere on the class?
Seems like ColumnMetadataFromDataType() is not so trivial - there are a bunch of dynamic casts. So yes, it's even better to store it in a private member of ResizableArrayData - this would save us a lot of s/ValueOrDie()/ARROW_ASSIGN_OR_RAISE work as well:
-void ResizableArrayData::Init(const std::shared_ptr<DataType>& data_type,
+Status ResizableArrayData::Init(const std::shared_ptr<DataType>& data_type,
MemoryPool* pool, int log_num_rows_min) {
+ ARROW_ASSIGN_OR_RAISE(auto metadata_after, ColumnMetadataFromDataType(data_type));
#ifndef NDEBUG
if (num_rows_allocated_ > 0) {
ARROW_DCHECK(data_type_ != NULLPTR);
- KeyColumnMetadata metadata_before =
- ColumnMetadataFromDataType(data_type_).ValueOrDie();
- ARROW_DCHECK(metadata_before.is_fixed_length == metadata_after.is_fixed_length &&
- metadata_before.fixed_length == metadata_after.fixed_length);
+ ARROW_DCHECK(metadata_.is_fixed_length == metadata_after.is_fixed_length &&
+ metadata_.fixed_length == metadata_after.fixed_length);
}
#endif
+ metadata_ = metadata_after;
Clear(/*release_buffers=*/false);
log_num_rows_min_ = log_num_rows_min;
data_type_ = data_type;
pool_ = pool;
+ return Status::OK();
}
This just shows how we can avoid instantiating KeyColumnMetadata, and the subsequent ValueOrDie().
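The suggestion above (compute the metadata once in Init, store it in a member, and return an error instead of aborting) can be sketched without depending on Arrow. Everything below is a hypothetical stand-in: SketchStatus, SketchResult, SketchMetadata, SketchType, and MetadataFromType only mimic the roles of arrow::Status, arrow::Result, KeyColumnMetadata, DataType, and ColumnMetadataFromDataType, and the explicit early return stands in for what ARROW_ASSIGN_OR_RAISE would do. It is a minimal sketch of the pattern, not the code that was merged.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-in for arrow::Status: an empty message means success.
struct SketchStatus {
  std::string message;
  bool ok() const { return message.empty(); }
  static SketchStatus OK() { return SketchStatus{}; }
  static SketchStatus Invalid(std::string msg) { return SketchStatus{std::move(msg)}; }
};

// Hypothetical stand-in for KeyColumnMetadata.
struct SketchMetadata {
  bool is_fixed_length = false;
  int fixed_length = 0;
};

// Hypothetical stand-in for arrow::Result<KeyColumnMetadata>.
struct SketchResult {
  SketchStatus status;
  std::optional<SketchMetadata> value;
};

// Hypothetical stand-in for a data type.
struct SketchType {
  int byte_width;  // > 0: fixed length; -1: varying length; anything else: unsupported
};

// Plays the role of ColumnMetadataFromDataType(): it may fail.
SketchResult MetadataFromType(const SketchType& type) {
  if (type.byte_width > 0) {
    return {SketchStatus::OK(), SketchMetadata{true, type.byte_width}};
  }
  if (type.byte_width == -1) {
    return {SketchStatus::OK(), SketchMetadata{false, 0}};
  }
  return {SketchStatus::Invalid("unsupported type"), std::nullopt};
}

class SketchArrayData {
 public:
  // Init returns a status, so a failed conversion propagates to the caller
  // instead of terminating the process as ValueOrDie() would.
  SketchStatus Init(const SketchType& type) {
    SketchResult result = MetadataFromType(type);
    if (!result.status.ok()) return result.status;  // early return on error
    assert(result.value.has_value());
    metadata_ = *result.value;  // cached once; later methods reuse it
    return SketchStatus::OK();
  }

  // Later methods read the cached member and cannot fail on conversion.
  const SketchMetadata& metadata() const { return metadata_; }

 private:
  SketchMetadata metadata_;
};
```

With the metadata cached, methods such as column_array() in the real class would no longer need their own fallible conversion, which is what removes the remaining ValueOrDie() calls.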
@@ -343,20 +341,19 @@ Status ResizableArrayData::ResizeVaryingLengthBuffer() {
 }

 KeyColumnArray ResizableArrayData::column_array() const {
-  KeyColumnMetadata column_metadata;
-  column_metadata = ColumnMetadataFromDataType(data_type_).ValueOrDie();
+  KeyColumnMetadata column_metadata = ColumnMetadataFromDataType(data_type_).ValueOrDie();
   return KeyColumnArray(column_metadata, num_rows_,
-  return KeyColumnArray(column_metadata, num_rows_,
+  return KeyColumnArray(metadata_, num_rows_,
@@ -343,20 +341,19 @@ Status ResizableArrayData::ResizeVaryingLengthBuffer() {
 }

 KeyColumnArray ResizableArrayData::column_array() const {
-  KeyColumnMetadata column_metadata;
-  column_metadata = ColumnMetadataFromDataType(data_type_).ValueOrDie();
+  KeyColumnMetadata column_metadata = ColumnMetadataFromDataType(data_type_).ValueOrDie();
KeyColumnMetadata column_metadata = ColumnMetadataFromDataType(data_type_).ValueOrDie();
I like the idea for that, but just mention that this hateful
@@ -235,6 +237,7 @@ void ResizableArrayData::Clear(bool release_buffers) {

 Status ResizableArrayData::ResizeFixedLengthBuffers(int num_rows_new) {
   ARROW_DCHECK(num_rows_new >= 0);
+  ARROW_DCHECK(data_type_ != nullptr);
Why do we need this (and several other similar) check? I think we can assume that these functions will always be called after an Init call, where data_type_ would be set to a non-null value.
Rather, we should add such a check in Init instead. The first use of the passed data_type in Init is:
https://github.com/apache/arrow/pull/43760/files/990e9a15042a31fa8e9cfc07cb62da65cd11092e..87b5d061a8fe2c027b29335b07d2c66969688a23#diff-dca2a0b71d9a10c8634649df065d2631cadf93e3a05eefa5b83ae9d55348b63fR219
and it will give an NPE if data_type is null.
Why do we need this (and several other similar) check? I think we can assume that these functions will always be called after an Init call, where data_type_ would be set to a non-null.
I just use DCHECK(data_type_) to check that it's initialized.
It just feels weird to ensure the object has been initialized by checking a not-so-related member, data_type_. But OK, I get it. If so, please also check that the data_type passed to Init is non-null - you can see that we don't have any guard on its nullability, so we may end up with an NPE.
It just feels weird to ensure having been initialized by checking a not-so-relative member data_type_.
Since we don't have an initialized flag, lol
Added in b2223ff
I agree with @zanmato1984 that this doesn't seem terribly useful.
Rest of changes LGTM though.
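The point debated in this thread, guarding the pointer once at the entry point rather than in every method, can be illustrated with a small standalone sketch. The names SketchDataType and SketchHolder are hypothetical, and the standard assert macro is used here only as an analogue of ARROW_DCHECK, on the assumption that both are debug-only checks that compile away in release builds.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical stand-in for a data type object.
struct SketchDataType {
  int id;
};

class SketchHolder {
 public:
  void Init(std::shared_ptr<SketchDataType> data_type) {
    // Debug-only guard at the single entry point: a null argument is caught
    // here, before any later method dereferences the stored pointer.
    assert(data_type != nullptr);
    data_type_ = std::move(data_type);
  }

  // Relies on the invariant established by Init; no repeated null check.
  int type_id() const { return data_type_->id; }

 private:
  std::shared_ptr<SketchDataType> data_type_;
};
```

The trade-off raised by the reviewers still applies: a check in Init validates the argument, while a check on the member in each method only detects that Init was never called.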
     ARROW_DCHECK(metadata_before.is_fixed_length == metadata_after.is_fixed_length &&
                  metadata_before.fixed_length == metadata_after.fixed_length);
   }
 #endif
+  ARROW_DCHECK(data_type != nullptr);
Can you avoid duplicating this line?
done
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 4f91c8f. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 5 possible false positives for unstable benchmarks that are known to sometimes produce them.
…43760)

### Rationale for this change
Minor style enhancement for join

### What changes are included in this PR?
Minor style enhancement for join

### Are these changes tested?
Covered by existing

### Are there any user-facing changes?
no

* GitHub Issue: apache#43759

Authored-by: mwish <[email protected]>
Signed-off-by: mwish <[email protected]>
Rationale for this change
Minor style enhancement for join
What changes are included in this PR?
Minor style enhancement for join
Are these changes tested?
Covered by existing
Are there any user-facing changes?
no