GH-43911: [C++] Compute Row: ListKeyEncoder Supports #43912
base: main
@@ -29,6 +29,51 @@ using internal::FirstTimeBitmapWriter;
namespace compute {
namespace internal {

Result<std::shared_ptr<KeyEncoder>> MakeKeyEncoder(const TypeHolder& column_type, std::shared_ptr<ExtensionType>* extension_type, MemoryPool* pool) {

This could also return a `unique_ptr` here; I don't see the purpose of using a `shared_ptr`. Also, this function is extracted from `RowEncoder`.

24dd410 to 72705a9

@@ -269,6 +272,190 @@ struct ARROW_EXPORT NullKeyEncoder : KeyEncoder {
  }
};

template <typename ListType>
struct ARROW_EXPORT ListKeyEncoder : KeyEncoder {

I wonder whether I should move this into the `.cc` file, since it requires a lot.

Yes, please do. It would be nice to move most of the contents of this file into the corresponding `.cc` file.

// AddLength for each list
std::vector<int32_t> child_lengthes;

This is for `AddLength`: when many values are involved, call `AddLength(child_lengthes.data(), length)` once rather than calling it repeatedly with a length of 1.

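The idea of one batched length call can be sketched as follows. This is a toy under stated assumptions: `AddLengthFixedWidth` and `ChildBytesForList` are hypothetical names, the element encoder is assumed to be fixed-width, and the signature is simplified; it is not Arrow's `KeyEncoder::AddLength`.

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Toy length accumulator for a fixed-width element encoder (hypothetical,
// not Arrow's API): adds `byte_width` to each of the first `batch_length`
// entries of `lengths`.
inline void AddLengthFixedWidth(int64_t batch_length, int32_t byte_width,
                                int32_t* lengths) {
  for (int64_t i = 0; i < batch_length; ++i) lengths[i] += byte_width;
}

// Total encoded size of one list's elements, computed with a single batched
// call instead of `element_count` calls that each use a batch length of 1.
inline int64_t ChildBytesForList(int64_t element_count, int32_t byte_width) {
  std::vector<int32_t> child_lengths(static_cast<std::size_t>(element_count), 0);
  AddLengthFixedWidth(element_count, byte_width, child_lengths.data());
  return std::accumulate(child_lengths.begin(), child_lengths.end(), int64_t{0});
}
```

The scratch vector starts at zero because the accumulator adds to existing values; the real encoder may have additional per-element overhead that this sketch ignores.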
if (list_scalar.is_valid && list_scalar.value->length() > 0) {
  auto element_count = static_cast<int32_t>(list_scalar.value->length());
  // Counting the size of the encoded list
  std::vector<int32_t> child_lengthes(element_count, 0);

Ditto.

RETURN_NOT_OK(
    this->element_encoder_->Encode(ExecValue{tmp_child_data}, 1, &encoded_ptr));

This part is a bit tricky: `Encode` has no interface for encoding a single element, so this calls `Encode` with a length of 1.

  ARROW_ASSIGN_OR_RAISE(auto element_array, ::arrow::Concatenate(child_datas, pool));
  element_data = element_array->data();
} else {
  // If there are no elements, we need to create an empty array

This requires an "empty" `ArrayData`.

@pitrou @zanmato1984 @felipecrv I've written a basic implementation. The performance might be poor, but it implements the basic logic. Would you mind taking a look? (I'll be on vacation from 9.14 to 9.21, so my responses may be late.)

auto raw_offsets = offset_buf->mutable_span_as<Offset>();
Offset element_sum = 0;
raw_offsets[0] = 0;
std::vector<std::shared_ptr<Array>> child_datas;

This is a bit tricky: it always decodes one element at a time from the child, so we end up with an "element-size" number of `Array`s here...

I think with something like a callback:

I didn't find an efficient interface for the encoder; I may go through other code for help.

Incomplete review for now
@@ -514,7 +514,7 @@ std::vector<std::shared_ptr<Array>> GenRandomUniqueRecords(
     val_types.push_back(result[i]->type());
   }
   RowEncoder encoder;
-  encoder.Init(val_types, ctx);
+  auto s = encoder.Init(val_types, ctx);

Please at least use `DCHECK_OK`.

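The concern is that a returned status assigned to an unused local is effectively dropped. A minimal sketch of checking it instead, using stand-ins: `MiniStatus`, `InitEncoder`, `MINI_DCHECK_OK`, and `InitForTest` are hypothetical names for illustration and are not Arrow's `Status` or `DCHECK_OK`.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Stand-in status type for illustration only; this is NOT arrow::Status.
class [[nodiscard]] MiniStatus {
 public:
  static MiniStatus OK() { return MiniStatus(true, ""); }
  static MiniStatus Invalid(std::string msg) {
    return MiniStatus(false, std::move(msg));
  }
  bool ok() const { return ok_; }
  const std::string& message() const { return message_; }

 private:
  MiniStatus(bool ok, std::string msg) : ok_(ok), message_(std::move(msg)) {}
  bool ok_;
  std::string message_;
};

// Hypothetical initializer that can fail, standing in for an Init() that
// returns a status.
inline MiniStatus InitEncoder(int num_key_columns) {
  if (num_key_columns <= 0) return MiniStatus::Invalid("no key columns");
  return MiniStatus::OK();
}

// Debug-only check in the spirit of a DCHECK-style macro: asserts success
// when assertions are enabled and still evaluates the expression otherwise.
#define MINI_DCHECK_OK(expr)       \
  do {                             \
    const MiniStatus st_ = (expr); \
    assert(st_.ok());              \
    (void)st_;                     \
  } while (false)

// Test-helper usage: the result of InitEncoder is checked, not dropped.
inline void InitForTest(int num_key_columns) {
  MINI_DCHECK_OK(InitEncoder(num_key_columns));
}
```

In test utilities, a debug-only check like this is usually enough; code that can fail at runtime would normally propagate the status to its caller instead.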
@@ -18,10 +18,13 @@
#pragma once

#include <cstdint>
#include <iostream>

Why?

It's for debugging; I will remove it.

@@ -269,6 +272,190 @@ struct ARROW_EXPORT NullKeyEncoder : KeyEncoder {
  }
};

template <typename ListType>
struct ARROW_EXPORT ListKeyEncoder : KeyEncoder {

Can you add a comment explaining what the encoding looks like?

Ah, I see you added a comment below.

VisitBitBlocksVoid(
    validity, data.array.offset, data.array.length,
    [&](int64_t i) {
      ARROW_UNUSED(i);

`i` is used, so `ARROW_UNUSED(i)` is not needed.

    validity, data.array.offset, data.array.length,
    [&](int64_t i) {
      ARROW_UNUSED(i);
      child_lengthes.clear();

No need to call `clear`, as you're calling `resize` below.

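One caveat worth checking before dropping the `clear` call: `std::vector::resize` alone keeps the values of elements that already exist, whereas `clear` followed by `resize` leaves every element value-initialized. Whether that matters depends on whether the surrounding code relies on the lengths starting at zero, which is an assumption about code not fully shown here. A small sketch of the difference, with hypothetical helper names:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Reuses `lengths` for a list with `n` elements and zeroes every slot.
// Equivalent to clear() followed by resize(n), in a single call.
inline void ResetLengths(std::vector<int32_t>& lengths, std::size_t n) {
  lengths.assign(n, 0);
}

// resize() alone: existing elements keep their old values; only newly
// appended elements are zero.
inline void ResizeOnly(std::vector<int32_t>& lengths, std::size_t n) {
  lengths.resize(n);
}
```

If the scratch vector is accumulated into, `assign(n, 0)` keeps the reset explicit while still avoiding the separate `clear` call.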
const uint8_t* validity = data.array.buffers[0].data;
const auto* offsets = data.array.GetValues<Offset>(1);
// AddLength for each list
std::vector<int32_t> child_lengthes;

Suggested change:
- std::vector<int32_t> child_lengthes;
+ std::vector<int32_t> child_lengths;

for (int64_t i = 0; i < child_array.length; i++) {
  ArraySpan tmp_child_data(child_array);
  tmp_child_data.SetSlice(child_array.offset + i, 1);
  RETURN_NOT_OK(
      this->element_encoder_->Encode(ExecValue{tmp_child_data}, 1, &encoded_ptr));
}

Why do it one element at a time? Why not instead:

- for (int64_t i = 0; i < child_array.length; i++) {
-   ArraySpan tmp_child_data(child_array);
-   tmp_child_data.SetSlice(child_array.offset + i, 1);
-   RETURN_NOT_OK(
-       this->element_encoder_->Encode(ExecValue{tmp_child_data}, 1, &encoded_ptr));
- }
+ RETURN_NOT_OK(
+     this->element_encoder_->Encode(ExecValue{child_array}, child_array.length, &encoded_ptr));

RETURN_NOT_OK(VisitBitBlocks(
    validity, data.array.offset, data.array.length,
    [&](int64_t i) {
      ARROW_UNUSED(i);

`i` is used, so `ARROW_UNUSED(i)` is not needed.

  for (int64_t i = 0; i < batch_length; i++) {
    RETURN_NOT_OK(handle_valid_value(span));
  }
} else {
  for (int64_t i = 0; i < batch_length; i++) {
    handle_null_value();
  }

You could instead call `handle_valid_value` or `handle_null_value` once and then `memcpy` the result `batch_length - 1` times.

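A rough sketch of that encode-once-then-copy idea for the null case. It assumes each row has its own output cursor in `encoded_bytes`, as the `*encoded_bytes++` pattern in this PR suggests. `kNullMarker`, its value, `EncodeNullRow`, and `EncodeNullRows` are hypothetical names chosen for illustration, and a 4-byte element count is assumed.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Marker byte for a null row; the name and value are assumptions for this sketch.
constexpr uint8_t kNullMarker = 1;

// Encodes one null list as a marker byte plus a zero element count,
// advancing `cursor` past the bytes written.
inline void EncodeNullRow(uint8_t*& cursor) {
  *cursor++ = kNullMarker;
  const int32_t zero = 0;
  std::memcpy(cursor, &zero, sizeof(zero));
  cursor += sizeof(zero);
}

// Encodes the first row once, then copies its bytes into the remaining
// batch_length - 1 rows. Rows are assumed not to overlap.
inline void EncodeNullRows(uint8_t** encoded_bytes, int64_t batch_length) {
  if (batch_length <= 0) return;
  uint8_t* first_begin = encoded_bytes[0];
  EncodeNullRow(encoded_bytes[0]);
  const std::size_t row_size =
      static_cast<std::size_t>(encoded_bytes[0] - first_begin);
  for (int64_t i = 1; i < batch_length; ++i) {
    std::memcpy(encoded_bytes[i], first_begin, row_size);
    encoded_bytes[i] += row_size;
  }
}
```

The same pattern would apply to a valid scalar: encode it into the first row, measure how far the cursor moved, and copy that many bytes into each remaining row.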
A small simplification.
auto& encoded_ptr = *encoded_bytes++;
*encoded_ptr++ = kNullByte;
util::SafeStore(encoded_ptr, static_cast<Offset>(0));
encoded_ptr += sizeof(Offset);

Suggested change:
- encoded_ptr += sizeof(Offset);
+ encoded_ptr += sizeof(Offset);
+ return Status::OK();

    [&]() {
      handle_null_value();
      return Status::OK();
    }));

Suggested change:
-     [&]() {
-       handle_null_value();
-       return Status::OK();
-     }));
+     handle_null_value));

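Once the null handler itself returns a status, it can be passed to the visitor directly and the wrapping lambda becomes unnecessary. A toy sketch of that pattern: `VisitFlags` and `CountNulls` are hypothetical, a `bool` stands in for the status type, and this is not Arrow's `VisitBitBlocks`.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy visitor: calls on_valid(i) where the flag is set and on_null()
// otherwise, stopping at the first failure.
template <typename OnValid, typename OnNull>
bool VisitFlags(const std::vector<bool>& valid, OnValid&& on_valid,
                OnNull&& on_null) {
  for (std::size_t i = 0; i < valid.size(); ++i) {
    const bool ok = valid[i] ? on_valid(static_cast<int64_t>(i)) : on_null();
    if (!ok) return false;
  }
  return true;
}

// The null handler already returns the success value, so it is handed to
// the visitor as-is instead of being wrapped in another lambda.
inline int CountNulls(const std::vector<bool>& valid) {
  int nulls = 0;
  auto handle_null_value = [&]() {
    ++nulls;
    return true;
  };
  VisitFlags(valid, [](int64_t) { return true; }, handle_null_value);
  return nulls;
}
```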
Rationale for this change
Add `ListKeyEncoder` support in `RowEncoder`.

What changes are included in this PR?
- `ListKeyEncoder` support in `RowEncoder`
- `RowEncoder::Init` changed to return `Status`

Are these changes tested?
Yes

Are there any user-facing changes?
Not currently; these are internal interfaces.