-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(parquet): Add boolean rle decoder for Parquet #11282
base: main
Are you sure you want to change the base?
Conversation
✅ Deploy Preview for meta-velox canceled.
|
358855b
to
f638683
Compare
f638683
to
a259672
Compare
f5735ba
to
163bdb3
Compare
a8f69b9
to
c25886f
Compare
e8e82c4
to
c760214
Compare
@yingsu00 can you also take a look at this PR? Thank you! |
c760214
to
5b72579
Compare
99e8783
to
4b8412d
Compare
public: | ||
using super = RleBpDecoder; | ||
RleBooleanDecoder(const char* start, const char* end, int32_t& len) | ||
: super::RleBpDecoder{start + 4, end, 1} { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The magic number 4 is used multiple times. Please make it a static const here and use an appropriate name for it.
You don't need super::
here. There is no ambiguity here.
"Received invalid length : " + std::to_string(len) + | ||
" (corrupt data page?)"); | ||
} | ||
// num_bytes will be the first 4 bytes that tell us the length of encoded |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Do we need this comment once the 4 is not magic number anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right removing comment since it is not necessary, thank you!
|
||
template <bool hasNulls> | ||
inline void skip(int32_t numValues, int32_t current, const uint64_t* nulls) { | ||
if (hasNulls) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be a constexpr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated to constexpr, thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function itself doesn't need to be constexpr. It is the if condition that should be constexpr.
if constexpr (hasNulls)
That means if the template argument is false this if expression is not generated.
numValues = bits::countNonNulls(nulls, current, current + numValues); | ||
} | ||
|
||
super::skip(numValues); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be RleBpDecoder::skip(numValues)
to disambiguate the function from this->skip(numValues)
?
int32_t current = visitor.start(); | ||
|
||
skip<hasNulls>(current, 0, nulls); | ||
int32_t toSkip; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets also initialize it to 0.
int32_t toSkip; | ||
bool atEnd = false; | ||
const bool allowNulls = hasNulls && visitor.allowNulls(); | ||
std::vector<uint64_t> outputBuffer(20); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the size of the vector 20?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry was using this output buffer for some other testing since it's not being used anymore will delete this line of code
++current; | ||
if (toSkip) { | ||
skip<hasNulls>(toSkip, current, nulls); | ||
current += toSkip; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no problem if toSkip > 0
but we already advanced current by 1 on line 97? I suppose this someting about what visitor represents? This might need a comment to explain why this is ok.
Or maybe some comment on how the algorithm works when the read occurs.
} | ||
|
||
const char* bufferStart_; | ||
uint32_t num_bytes = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be named numBytes_
.
int64_t readBitField() { | ||
auto value = | ||
dwio::common::safeLoadBits( | ||
super::bufferStart_, bitOffset_, bitWidth_, lastSafeWord_) & |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment above about using super vs the base class name to disambiguate.
4b8412d
to
b850e02
Compare
|
||
class RleBooleanDecoder : public RleBpDecoder { | ||
public: | ||
using super = RleBpDecoder; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that you've replaced super with the actual base class we don't need this anymore. I did not see that you defined super here. This explains why it was working before. But someone not familiar with Java would be confused so better to be explicit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you removed this line!
RleBooleanDecoder(const char* start, const char* end, int32_t& len) | ||
: RleBpDecoder{start + kLengthOffset, end, 1} { | ||
if (len < kLengthOffset) { | ||
VELOX_FAIL( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets replace the std::to_string with fmt provided in the VELOX_FAIL like so:
VELOX_FAIL("Received invalid length : {} (corrupt data page?)", len);
for all occurrences.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, updated!
|
||
template <bool hasNulls> | ||
inline void skip(int32_t numValues, int32_t current, const uint64_t* nulls) { | ||
if (hasNulls) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function itself doesn't need to be constexpr. It is the if condition that should be constexpr.
if constexpr (hasNulls)
That means if the template argument is false this if expression is not generated.
public: | ||
using super = RleBpDecoder; | ||
static constexpr int32_t kLengthOffset = 4; | ||
RleBooleanDecoder(const char* start, const char* end, int32_t& len) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
len
is not modified here and we don't need a reference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated, thank you!
RleBpDecoder::bufferStart_, bitOffset_, bitWidth_, lastSafeWord_) & | ||
bitMask_; | ||
bitOffset_ += bitWidth_; | ||
RleBpDecoder::bufferStart_ += bitOffset_ >> 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is not clear to me is why you modify the base class member here. You have your own bufferStart_ so why not processing and modifying this using the base class methods (which you have to some degree).
The base class member (with the same name) is initialized in the constructor. But because you declared a new member of the same name that is never used on line 118 you need to explicitly refer to the base class here when this member is inherited.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch forgot to remove my own bufferStart_ which was being used for something else cleaned up the code with removing it, thank you!
6bbaab0
to
27ad3f1
Compare
27ad3f1
to
0eba7a4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks good now.
Please update the commit and PR abstract (header line) to have feat(parquet):
.
@majetideepak please take a look when you get a chance.
--remainingValues_; | ||
} | ||
// Will increment the current by one and if value of toSkip > 0 | ||
// We will count the number of non nulls for the bpdecoding and skip |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check the formatting.
The comment says counting the number of non-null
values - but it might as well nulls because hasNulls is the template that could be called either way.
I notice that is code from other decoders as well. Basically, if it is determined something can be skipped after processing do it.
Perhaps the comment isn't necessary after all.
0eba7a4
to
7fb23f6
Compare
Co-authored-by: Minhan Cao <[email protected]>
7fb23f6
to
2d0751e
Compare
RLE/BP is an Encoding for Boolean values for Parquet Version 2 files.
https://parquet.apache.org/docs/file-format/data-pages/encodings/
Closes #10943