Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(parquet): Add boolean rle decoder for Parquet #11282

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

jkhaliqi
Copy link
Contributor

@jkhaliqi jkhaliqi commented Oct 16, 2024

RLE/BP is an Encoding for Boolean values for Parquet Version 2 files.
https://parquet.apache.org/docs/file-format/data-pages/encodings/
Closes #10943

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 16, 2024
Copy link

netlify bot commented Oct 16, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 2d0751e
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/67577509638e7a000822e6ce

@Yuhta Yuhta changed the title added in rle encoder for boolean added in rle decoder for boolean Oct 17, 2024
@Yuhta Yuhta self-requested a review October 17, 2024 21:04
@jkhaliqi jkhaliqi force-pushed the rle_encoding branch 2 times, most recently from 358855b to f638683 Compare October 29, 2024 19:22
@jkhaliqi jkhaliqi force-pushed the rle_encoding branch 3 times, most recently from f5735ba to 163bdb3 Compare October 30, 2024 18:03
@jkhaliqi jkhaliqi marked this pull request as ready for review October 30, 2024 18:17
@jkhaliqi jkhaliqi force-pushed the rle_encoding branch 2 times, most recently from a8f69b9 to c25886f Compare October 30, 2024 19:24
@minhancao minhancao force-pushed the rle_encoding branch 2 times, most recently from e8e82c4 to c760214 Compare November 1, 2024 20:45
@ethanyzhang
Copy link

@yingsu00 can you also take a look at this PR? Thank you!

@majetideepak majetideepak changed the title added in rle decoder for boolean feat: Add boolean rle decoder for Parquet Nov 12, 2024
@jkhaliqi jkhaliqi force-pushed the rle_encoding branch 2 times, most recently from 99e8783 to 4b8412d Compare November 12, 2024 22:32
public:
using super = RleBpDecoder;
RleBooleanDecoder(const char* start, const char* end, int32_t& len)
: super::RleBpDecoder{start + 4, end, 1} {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The magic number 4 is used multiple times. Please make it a static const here and use an appropriate name for it.

You don't need super:: here. There is no ambiguity here.

"Received invalid length : " + std::to_string(len) +
" (corrupt data page?)");
}
// num_bytes will be the first 4 bytes that tell us the length of encoded
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Do we need this comment once the 4 is not magic number anymore?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right removing comment since it is not necessary, thank you!


template <bool hasNulls>
inline void skip(int32_t numValues, int32_t current, const uint64_t* nulls) {
if (hasNulls) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be a constexpr.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated to constexpr, thank you!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function itself doesn't need to be constexpr. It is the if condition that should be constexpr.

if constexpr (hasNulls)

That means if the template argument is false this if expression is not generated.

numValues = bits::countNonNulls(nulls, current, current + numValues);
}

super::skip(numValues);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be RleBpDecoder::skip(numValues) to disambiguate the function from this->skip(numValues)?

int32_t current = visitor.start();

skip<hasNulls>(current, 0, nulls);
int32_t toSkip;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets also initialize it to 0.

int32_t toSkip;
bool atEnd = false;
const bool allowNulls = hasNulls && visitor.allowNulls();
std::vector<uint64_t> outputBuffer(20);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the size of the vector 20?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry was using this output buffer for some other testing since it's not being used anymore will delete this line of code

++current;
if (toSkip) {
skip<hasNulls>(toSkip, current, nulls);
current += toSkip;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no problem if toSkip > 0 but we already advanced current by 1 on line 97? I suppose this someting about what visitor represents? This might need a comment to explain why this is ok.

Or maybe some comment on how the algorithm works when the read occurs.

}

const char* bufferStart_;
uint32_t num_bytes = 0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be named numBytes_.

int64_t readBitField() {
auto value =
dwio::common::safeLoadBits(
super::bufferStart_, bitOffset_, bitWidth_, lastSafeWord_) &
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment above about using super vs the base class name to disambiguate.


class RleBooleanDecoder : public RleBpDecoder {
public:
using super = RleBpDecoder;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that you've replaced super with the actual base class we don't need this anymore. I did not see that you defined super here. This explains why it was working before. But someone not familiar with Java would be confused so better to be explicit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you removed this line!

RleBooleanDecoder(const char* start, const char* end, int32_t& len)
: RleBpDecoder{start + kLengthOffset, end, 1} {
if (len < kLengthOffset) {
VELOX_FAIL(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets replace the std::to_string with fmt provided in the VELOX_FAIL like so:

VELOX_FAIL("Received invalid length : {} (corrupt data page?)", len);

for all occurrences.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, updated!


template <bool hasNulls>
inline void skip(int32_t numValues, int32_t current, const uint64_t* nulls) {
if (hasNulls) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function itself doesn't need to be constexpr. It is the if condition that should be constexpr.

if constexpr (hasNulls)

That means if the template argument is false this if expression is not generated.

public:
using super = RleBpDecoder;
static constexpr int32_t kLengthOffset = 4;
RleBooleanDecoder(const char* start, const char* end, int32_t& len)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

len is not modified here and we don't need a reference.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated, thank you!

RleBpDecoder::bufferStart_, bitOffset_, bitWidth_, lastSafeWord_) &
bitMask_;
bitOffset_ += bitWidth_;
RleBpDecoder::bufferStart_ += bitOffset_ >> 3;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is not clear to me is why you modify the base class member here. You have your own bufferStart_ so why not processing and modifying this using the base class methods (which you have to some degree).
The base class member (with the same name) is initialized in the constructor. But because you declared a new member of the same name that is never used on line 118 you need to explicitly refer to the base class here when this member is inherited.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch forgot to remove my own bufferStart_ which was being used for something else cleaned up the code with removing it, thank you!

@jkhaliqi jkhaliqi force-pushed the rle_encoding branch 2 times, most recently from 6bbaab0 to 27ad3f1 Compare November 23, 2024 00:09
Copy link
Collaborator

@czentgr czentgr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good now.
Please update the commit and PR abstract (header line) to have feat(parquet):.

@majetideepak please take a look when you get a chance.

--remainingValues_;
}
// Will increment the current by one and if value of toSkip > 0
// We will count the number of non nulls for the bpdecoding and skip
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check the formatting.
The comment says counting the number of non-null values - but it might as well nulls because hasNulls is the template that could be called either way.
I notice that is code from other decoders as well. Basically, if it is determined something can be skipped after processing do it.
Perhaps the comment isn't necessary after all.

@jkhaliqi jkhaliqi changed the title feat: Add boolean rle decoder for Parquet feat(parquet): Add boolean rle decoder for Parquet Dec 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add decoder for RLE parquet encoding
5 participants