Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add decimal column writer for ORC file format #11431

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

rui-mo
Copy link
Collaborator

@rui-mo rui-mo commented Nov 5, 2024

'SelectiveDecimalColumnReader' is typically used for reading decimal column in
ORC file. This PR supports corresponding writer for decimal column in ORC file.
Adds 'format' in 'WriterOptions' to distinguish between DWRF and ORC when
writing.

#11067 (comment)

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 5, 2024
Copy link

netlify bot commented Nov 5, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 717711b
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/673d8782bbaf85000768f036

velox/dwio/dwrf/common/IntEncoder.h Outdated Show resolved Hide resolved
velox/dwio/dwrf/common/IntEncoder.h Outdated Show resolved Hide resolved
velox/dwio/dwrf/writer/ColumnWriter.cpp Outdated Show resolved Hide resolved
velox/dwio/dwrf/writer/ColumnWriter.cpp Outdated Show resolved Hide resolved
velox/dwio/dwrf/test/ColumnWriterTest.cpp Outdated Show resolved Hide resolved
velox/dwio/dwrf/writer/ColumnWriter.cpp Outdated Show resolved Hide resolved
auto [_, scale] = getDecimalPrecisionScale(*type_);

for (auto& pos : ranges) {
if (!decodedVector.mayHaveNulls() || !decodedVector.isNullAt(pos)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move the mayHaveNulls out of for loop

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we could also add fast path for flat-encoded vector. Will add these fast branches after #11431 (comment) is decided.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's have one loop for may have nulls and one without? thanks!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion. Added separate paths according to 'mayHaveNulls' and the decimal type.

Comment on lines 2121 to 2156
case TypeKind::BIGINT: {
if (type.type()->isDecimal()) {
return std::make_unique<DecimalColumnWriter>(
context, type, sequence, onRecordPosition);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noob question, I saw reader having a different filter, is this expected?

if (type.format() == DwrfFormat::kOrc &&

Copy link
Collaborator Author

@rui-mo rui-mo Nov 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for noticing. I assume we need to fix the reader base so as to get the correct fileType. In the ColumnWriterTest, the reader is created with 'ColumnReader::build' using the test data type so it does not trigger above mismatch.

if (fileType->type()->isDecimal()) {
return std::make_unique<DecimalColumnReader<int64_t>>(
requestedType->type(),
fileType,
stripe,
streamLabels,
std::move(flatMapContext));

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correction: dwrf does not include a 'decimal' file type, so the decimal column reader and writer are both for ORC file format. This PR adds 'format' in 'WriterOptions' to distinguish between DWRF and ORC when writing. Thanks!

@rui-mo rui-mo force-pushed the wip_decimal_writer branch from 12dbc0e to 1488742 Compare November 7, 2024 09:42
if (type_->isShortDecimal()) {
auto val = decodedVector.valueAt<int64_t>(pos);
unscaledValues_->writeValue(val);
scales_->writeValue(scale);
Copy link
Collaborator Author

@rui-mo rui-mo Nov 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we need to truncate the zeros before writing, since I notice that the decimal reader could recover a decimal value to the target scale with the recorded 'scales'. For example, 123.000 of scale 3 could be written as 123'000 with scale 3 or 123 with scale 0. To truncate the tailing zeros might benefits for the compression ratio and helps reduce file size, but it requires extra conversion for each value and could increase the write time. @Yuhta Do you have any suggestion? Thanks!

inline static void fillDecimals(
T* decimals,
const uint64_t* nullsPtr,
const T* values,
const int64_t* scales,
int32_t numValues,
int32_t targetScale) {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does the other file format do for this? We could keep this simple to optimize processing time.

Copy link
Contributor

@Yuhta Yuhta Nov 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From compute engine standpoint we would shorter prefer process time over small storage size. Both alternatives generate valid files though, I think we can use whatever is simpler to implement for now and come back later if we see there is need to change.

Copy link
Collaborator Author

@rui-mo rui-mo Nov 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I notice Parquet does not contain a scales vector. Thanks for your suggestion! I keep it this way.

@rui-mo rui-mo force-pushed the wip_decimal_writer branch from 1488742 to 43d046a Compare November 12, 2024 06:36
Copy link
Contributor

@xiaoxmeng xiaoxmeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rui-mo LGTM % minors. Thanks!

const TypeWithId& type,
uint32_t sequence,
std::function<void(IndexBuilder&)> onRecordPosition)
: BaseColumnWriter{context, type, sequence, onRecordPosition},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::move(onRecordPosition)

newStream(StreamKind::StreamKind_DATA),
getConfig(Config::USE_VINTS),
type_->isShortDecimal() ? LONG_BYTE_SIZE : 2 * LONG_BYTE_SIZE)},
scales_{createRleEncoder</* isSigned = */ true>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: /*isSigned=*/

FOLLY_ALWAYS_INLINE void writeVsHugeInt(int128_t val) {
writeVuHugeInt(ZigZag::encodeInt128(val));
}
FOLLY_ALWAYS_INLINE void writeHugeIntLE(int128_t val);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leave an empty line in between

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. And I removed 'writeHugeIntLE' method due to #11431 (comment). Thanks.

@@ -117,6 +117,18 @@ class IntEncoder {
}
}

void writeHugeInt(int128_t value) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have tests for these new APIs? thanks!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added TEST_F(DirectTest, hugeInts). Since the write buffer is not accessible, I test it by encoding, decoding, and comparing the values.

@@ -273,6 +273,10 @@ class ZigZag {
return (static_cast<uint64_t>(val) << 1) ^ (val >> 63);
}

static __uint128_t encodeInt128(__int128_t val) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have a test for this? Thanks!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added TEST_F(ZigZagTest, hugeInt). Thanks.

writeNulls(decodedVector, ranges);

size_t count = 0;
auto [_, scale] = getDecimalPrecisionScale(*type_);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: const

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scale_ = getDecimalPrecisionScale(*inputTypes[0]).second

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a local variable scale_ to avoid extracting it in each call.

for (auto& pos : ranges) {
if (!decodedVector.mayHaveNulls() || !decodedVector.isNullAt(pos)) {
if (type_->isShortDecimal()) {
auto val = decodedVector.valueAt<int64_t>(pos);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const auto val =

unscaledValues_->writeValue(val);
scales_->writeValue(scale);
} else {
auto val = decodedVector.valueAt<int128_t>(pos);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto


for (auto& pos : ranges) {
if (!decodedVector.mayHaveNulls() || !decodedVector.isNullAt(pos)) {
if (type_->isShortDecimal()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and type condition

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added separate branch according to isShortDecimal. Thanks.

if (type_->isShortDecimal()) {
auto val = decodedVector.valueAt<int64_t>(pos);
unscaledValues_->writeValue(val);
scales_->writeValue(scale);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does the other file format do for this? We could keep this simple to optimize processing time.

@rui-mo rui-mo force-pushed the wip_decimal_writer branch from 43d046a to aaa8789 Compare November 15, 2024 12:30
@rui-mo rui-mo changed the title Add decimal column writer for dwrf file format feat: Add decimal column writer for dwrf file format Nov 15, 2024
unscaledValues_{createDirectEncoder<true /*isSigned*/>(
newStream(StreamKind::StreamKind_DATA),
// IntDecoder and IntEncoder only support vInts for huge ints.
isShortDecimal_ ? getConfig(Config::USE_VINTS) : true /*useVInts*/,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turns out the int decoder only supports vInts. And I removed the support for fixed-length and always use vInts for long decimal. @xiaoxmeng Do you think it makes sense? Thanks.

if constexpr (std::is_same_v<T, int128_t>) {
VELOX_NYI();
}

valuesPtr[index++] = val.value();
}
return std::make_shared<FlatVector<T>>(
pool, type, nullptr, data.size(), values, std::vector<BufferPtr>{});
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When no null, use nullptr as null buffer. This helps test the !mayHaveNulls fast path.

@rui-mo
Copy link
Collaborator Author

rui-mo commented Nov 15, 2024

@xiaoxmeng @Yuhta Above comments are fixed. Would you like to take a review again? Thank you.

@rui-mo rui-mo changed the title feat: Add decimal column writer for dwrf file format feat: Add decimal column writer for orc file format Nov 19, 2024
@rui-mo rui-mo marked this pull request as draft November 19, 2024 08:42
@rui-mo rui-mo changed the title feat: Add decimal column writer for orc file format feat: Add decimal column writer for ORC file format Nov 20, 2024
@rui-mo rui-mo force-pushed the wip_decimal_writer branch from aaa8789 to e64fdc5 Compare November 20, 2024 06:41
@rui-mo rui-mo marked this pull request as ready for review November 20, 2024 06:41
@rui-mo rui-mo force-pushed the wip_decimal_writer branch from e64fdc5 to 717711b Compare November 20, 2024 06:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants