Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-43759: [C++] Acero: Minor code enhancement for Join #43760

Merged
merged 5 commits into from
Aug 29, 2024

Conversation

mapleFU
Copy link
Member

@mapleFU mapleFU commented Aug 19, 2024

Rationale for this change

Minor style enhancement for join

What changes are included in this PR?

Minor style enhancement for join

Are these changes tested?

Covered by existing

Are there any user-facing changes?

no

@mapleFU mapleFU requested a review from westonpace as a code owner August 19, 2024 19:58
Copy link

⚠️ GitHub issue #43759 has been automatically assigned in GitHub to PR creator.

Copy link
Contributor

@zanmato1984 zanmato1984 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Aug 20, 2024
@@ -323,8 +322,7 @@ Status ResizableArrayData::ResizeFixedLengthBuffers(int num_rows_new) {
}

Status ResizableArrayData::ResizeVaryingLengthBuffer() {
KeyColumnMetadata column_metadata;
column_metadata = ColumnMetadataFromDataType(data_type_).ValueOrDie();
KeyColumnMetadata column_metadata = ColumnMetadataFromDataType(data_type_).ValueOrDie();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use ARROW_ASSIGN_OR_RAISE? We are able to return a Status here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it seems ColumnMetadataFromDataType(data_type_) is being called from multiple methods. Why is the result not stored somewhere on the class?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use ARROW_ASSIGN_OR_RAISE? We are able to return a Status here.

I'm new to this code, when I'm reading this code I found the Join will gurantees the type is safe from ColumnMetadataFromDataType.

Also, it seems ColumnMetadataFromDataType(data_type_) is being called from multiple methods. Why is the result not stored somewhere on the class?

I don't know :-( But this metadata is light weight, just a working around for data_type_, whether caching this is both ok for me

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pitrou Should I use column_metadata and vendor it? Or keep it here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zanmato1984 What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use ARROW_ASSIGN_OR_RAISE? We are able to return a Status here.

Right, I would suggest the same. Also, there are other occurrences of unnecessary ValueOrDie() in this file. We'd better clean them all up.

Also, it seems ColumnMetadataFromDataType(data_type_) is being called from multiple methods. Why is the result not stored somewhere on the class?

Seems like ColumnMetadataFromDataType() is not so trivial - there are a bunch of dynamic casts. So yes, it's even better to store it in a private member of ResizableArrayData - this would save us a lot of s/ValueOrDie()/ARROW_ASSIGN_OR_RAISE work as well:

-void ResizableArrayData::Init(const std::shared_ptr<DataType>& data_type,
+Status ResizableArrayData::Init(const std::shared_ptr<DataType>& data_type,
                              MemoryPool* pool, int log_num_rows_min) {
+  ARROW_ASSIGN_OR_RAISE(auto metadata_after, ColumnMetadataFromDataType(data_type));
#ifndef NDEBUG
  if (num_rows_allocated_ > 0) {
    ARROW_DCHECK(data_type_ != NULLPTR);
-    KeyColumnMetadata metadata_before =
-        ColumnMetadataFromDataType(data_type_).ValueOrDie();
-    ARROW_DCHECK(metadata_before.is_fixed_length == metadata_after.is_fixed_length &&
-                 metadata_before.fixed_length == metadata_after.fixed_length);
+    ARROW_DCHECK(metadata_.is_fixed_length == metadata_after.is_fixed_length &&
+                 metadata_.fixed_length == metadata_after.fixed_length);
  }
#endif
+  metadata_ = metadata_after;
  Clear(/*release_buffers=*/false);
  log_num_rows_min_ = log_num_rows_min;
  data_type_ = data_type;
  pool_ = pool;
+  return Status::OK();
}

Copy link
Contributor

@zanmato1984 zanmato1984 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just show how we can avoid instantiating KeyColumnMetadata, and the subsequent ValueOrDie().

@@ -343,20 +341,19 @@ Status ResizableArrayData::ResizeVaryingLengthBuffer() {
}

KeyColumnArray ResizableArrayData::column_array() const {
KeyColumnMetadata column_metadata;
column_metadata = ColumnMetadataFromDataType(data_type_).ValueOrDie();
KeyColumnMetadata column_metadata = ColumnMetadataFromDataType(data_type_).ValueOrDie();
return KeyColumnArray(column_metadata, num_rows_,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return KeyColumnArray(column_metadata, num_rows_,
return KeyColumnArray(metadata_, num_rows_,

@@ -343,20 +341,19 @@ Status ResizableArrayData::ResizeVaryingLengthBuffer() {
}

KeyColumnArray ResizableArrayData::column_array() const {
KeyColumnMetadata column_metadata;
column_metadata = ColumnMetadataFromDataType(data_type_).ValueOrDie();
KeyColumnMetadata column_metadata = ColumnMetadataFromDataType(data_type_).ValueOrDie();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
KeyColumnMetadata column_metadata = ColumnMetadataFromDataType(data_type_).ValueOrDie();

@mapleFU
Copy link
Member Author

mapleFU commented Aug 28, 2024

I like the idea for that, but just mention that this hateful ValueOrDie is called widely in row-api, sigh

@mapleFU mapleFU requested a review from zanmato1984 August 28, 2024 11:31
@@ -235,6 +237,7 @@ void ResizableArrayData::Clear(bool release_buffers) {

Status ResizableArrayData::ResizeFixedLengthBuffers(int num_rows_new) {
ARROW_DCHECK(num_rows_new >= 0);
ARROW_DCHECK(data_type_ != nullptr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this (and several other similar) check? I think we can assume that these functions will always be called after an Init call, where data_type_ would be set to a non-null.

Rather we should add such a check in Init instead: the first use of passed data_type in Init is:
https://github.com/apache/arrow/pull/43760/files/990e9a15042a31fa8e9cfc07cb62da65cd11092e..87b5d061a8fe2c027b29335b07d2c66969688a23#diff-dca2a0b71d9a10c8634649df065d2631cadf93e3a05eefa5b83ae9d55348b63fR219
will give NPE.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this (and several other similar) check? I think we can assume that these functions will always be called after an Init call, where data_type_ would be set to a non-null.

I just DCHECK(data_type_) to DCHECK it's initialized

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It just feels weird to ensure having been initialized by checking a not-so-relative member data_type_. But OK I get it. If so, please also check the passed-in data_type of Init being non-null - you can see that we don't have any guard on its nullability, so we may end up with NPE.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It just feels weird to ensure having been initialized by checking a not-so-relative member data_type_.

Since we don't have initialized, lol

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in b2223ff

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @zanmato1984 that this doesn't seem terribly useful.

Copy link
Contributor

@zanmato1984 zanmato1984 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rest of changes LGTM though.

@mapleFU mapleFU requested a review from pitrou August 29, 2024 04:48
ARROW_DCHECK(metadata_before.is_fixed_length == metadata_after.is_fixed_length &&
metadata_before.fixed_length == metadata_after.fixed_length);
}
#endif
ARROW_DCHECK(data_type != nullptr);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you avoid duplicating this line?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -235,6 +237,7 @@ void ResizableArrayData::Clear(bool release_buffers) {

Status ResizableArrayData::ResizeFixedLengthBuffers(int num_rows_new) {
ARROW_DCHECK(num_rows_new >= 0);
ARROW_DCHECK(data_type_ != nullptr);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @zanmato1984 that this doesn't seem terribly useful.

@mapleFU mapleFU requested a review from pitrou August 29, 2024 14:58
@mapleFU mapleFU merged commit 4f91c8f into apache:main Aug 29, 2024
37 of 40 checks passed
@mapleFU mapleFU deleted the minor-join-stype-enhancement branch August 29, 2024 15:38
@mapleFU mapleFU removed the awaiting committer review Awaiting committer review label Aug 29, 2024
Copy link

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 4f91c8f.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 5 possible false positives for unstable benchmarks that are known to sometimes produce them.

mapleFU added a commit to mapleFU/arrow that referenced this pull request Sep 3, 2024
…43760)

### Rationale for this change

Minor style enhancement for join

### What changes are included in this PR?

Minor style enhancement for join

### Are these changes tested?

Covered by existing

### Are there any user-facing changes?

no

* GitHub Issue: apache#43759

Authored-by: mwish <[email protected]>
Signed-off-by: mwish <[email protected]>
khwilson pushed a commit to khwilson/arrow that referenced this pull request Sep 14, 2024
…43760)

### Rationale for this change

Minor style enhancement for join

### What changes are included in this PR?

Minor style enhancement for join

### Are these changes tested?

Covered by existing

### Are there any user-facing changes?

no

* GitHub Issue: apache#43759

Authored-by: mwish <[email protected]>
Signed-off-by: mwish <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants