feat: Add decimal column writer for ORC file format #11431

rui-mo · 2024-11-05T03:23:47Z

'SelectiveDecimalColumnReader' is typically used to read decimal columns in
ORC files. This PR adds the corresponding writer for decimal columns in ORC
files. It also adds 'format' to 'WriterOptions' to distinguish between DWRF and
ORC when writing.

#11067 (comment)

velox/dwio/dwrf/common/IntEncoder.h

velox/dwio/dwrf/writer/ColumnWriter.cpp

velox/dwio/dwrf/test/ColumnWriterTest.cpp

velox/dwio/dwrf/writer/ColumnWriter.cpp

jinchengchenghh · 2024-11-05T07:47:14Z

velox/dwio/dwrf/writer/ColumnWriter.cpp

+    auto [_, scale] = getDecimalPrecisionScale(*type_);
+
+    for (auto& pos : ranges) {
+      if (!decodedVector.mayHaveNulls() || !decodedVector.isNullAt(pos)) {


Move the mayHaveNulls check out of the for loop.

Yes, we could also add a fast path for flat-encoded vectors. I will add these fast branches after #11431 (comment) is decided.

Could we have one loop for the may-have-nulls case and one for the no-nulls case? Thanks!

Thanks for the suggestion. Added separate paths according to 'mayHaveNulls' and the decimal type.
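
For readers following the thread, here is a minimal sketch of the "separate paths" idea: hoist the mayHaveNulls check out of the per-row loop so the no-null case runs without a branch per row. SketchEncoder, SketchColumn and writeShortDecimals are hypothetical stand-ins invented for illustration; they are not the Velox classes or the code merged in this PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an encoder stream; it only collects values.
struct SketchEncoder {
  std::vector<int64_t> values;
  void writeValue(int64_t v) { values.push_back(v); }
};

// Hypothetical stand-in for a decoded column: values plus an optional null mask.
struct SketchColumn {
  std::vector<int64_t> values;
  std::vector<bool> nulls; // Empty means the column has no nulls at all.
  bool mayHaveNulls() const { return !nulls.empty(); }
  bool isNullAt(size_t i) const { return nulls[i]; }
};

// Writes every non-null value with its scale; returns the number written.
size_t writeShortDecimals(
    const SketchColumn& column,
    int64_t scale,
    SketchEncoder& unscaledValues,
    SketchEncoder& scales) {
  size_t count = 0;
  if (!column.mayHaveNulls()) {
    // Fast path: the null check is decided once, outside the loop.
    for (size_t i = 0; i < column.values.size(); ++i) {
      unscaledValues.writeValue(column.values[i]);
      scales.writeValue(scale);
      ++count;
    }
    return count;
  }
  // Slow path: nulls are possible, so each row is checked.
  for (size_t i = 0; i < column.values.size(); ++i) {
    if (!column.isNullAt(i)) {
      unscaledValues.writeValue(column.values[i]);
      scales.writeValue(scale);
      ++count;
    }
  }
  return count;
}
```

The same split can be repeated for the short versus long decimal check, giving four straight-line loops instead of one loop with two branches per row.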

kewang1024 · 2024-11-06T01:01:53Z

velox/dwio/dwrf/writer/ColumnWriter.cpp

+    case TypeKind::BIGINT: {
+      if (type.type()->isDecimal()) {
+        return std::make_unique<DecimalColumnWriter>(
+            context, type, sequence, onRecordPosition);
+      }


Noob question: I saw that the reader uses a different filter. Is this expected?

velox/velox/dwio/dwrf/reader/ReaderBase.cpp

Line 288 in bfc199f

if (type.format() == DwrfFormat::kOrc &&

Thanks for noticing. I assume we need to fix the reader base to get the correct fileType. In ColumnWriterTest, the reader is created with 'ColumnReader::build' using the test data type, so it does not trigger the above mismatch.

velox/velox/dwio/dwrf/reader/ColumnReader.cpp

Lines 2459 to 2465 in d1bf9da

if (fileType->type()->isDecimal()) {
  return std::make_unique<DecimalColumnReader<int64_t>>(
      requestedType->type(),
      fileType,
      stripe,
      streamLabels,
      std::move(flatMapContext));

Correction: DWRF does not include a 'decimal' file type, so the decimal column reader and writer are both for the ORC file format. This PR adds 'format' to 'WriterOptions' to distinguish between DWRF and ORC when writing. Thanks!

rui-mo · 2024-11-07T09:58:35Z

velox/dwio/dwrf/writer/ColumnWriter.cpp

+        if (type_->isShortDecimal()) {
+          auto val = decodedVector.valueAt<int64_t>(pos);
+          unscaledValues_->writeValue(val);
+          scales_->writeValue(scale);


I wonder if we need to strip trailing zeros before writing, since I noticed that the decimal reader can restore a decimal value to the target scale using the recorded 'scales'. For example, 123.000 could be written as the unscaled value 123000 with scale 3, or as 123 with scale 0. Stripping the trailing zeros might improve the compression ratio and reduce file size, but it requires an extra conversion for each value and could increase write time. @Yuhta Do you have any suggestions? Thanks!

velox/velox/type/DecimalUtil.h

Lines 108 to 114 in d1bf9da

inline static void fillDecimals(
    T* decimals,
    const uint64_t* nullsPtr,
    const T* values,
    const int64_t* scales,
    int32_t numValues,
    int32_t targetScale) {

What do the other file formats do for this? We could keep this simple to optimize processing time.

From a compute engine standpoint, we would prefer shorter processing time over smaller storage size. Both alternatives generate valid files, though, so I think we can use whatever is simpler to implement for now and come back later if we see a need to change.

I noticed that Parquet does not contain a scales vector. Thanks for your suggestion! I will keep it this way.
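
To make the trade-off discussed above concrete, here is a minimal sketch of the alternative that was not adopted: stripping trailing zeros on write, and rescaling back to a target scale on read. stripTrailingZeros and rescaleUp are hypothetical helpers written for illustration; they are not Velox functions, and rescaleUp assumes targetScale is at least the stored scale and that the multiplication does not overflow.

```cpp
#include <cstdint>
#include <utility>

// Drops trailing decimal zeros from the unscaled value, lowering the scale
// by one for each zero removed. Example: (123000, 3) becomes (123, 0).
std::pair<int64_t, int32_t> stripTrailingZeros(
    int64_t unscaled,
    int32_t scale) {
  while (scale > 0 && unscaled != 0 && unscaled % 10 == 0) {
    unscaled /= 10;
    --scale;
  }
  return {unscaled, scale};
}

// Restores a stored value to the column's target scale on the read side.
// Assumes targetScale >= scale and that the result fits in int64_t.
int64_t rescaleUp(int64_t unscaled, int32_t scale, int32_t targetScale) {
  for (; scale < targetScale; ++scale) {
    unscaled *= 10;
  }
  return unscaled;
}
```

The division loop in stripTrailingZeros is the per-value write cost mentioned in the thread; writing the value with the column's declared scale, as this PR does, skips it entirely.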

velox/dwio/dwrf/writer/ColumnWriter.cpp

xiaoxmeng

@rui-mo LGTM % minors. Thanks!

xiaoxmeng · 2024-11-14T05:56:28Z

velox/dwio/dwrf/writer/ColumnWriter.cpp

+      const TypeWithId& type,
+      uint32_t sequence,
+      std::function<void(IndexBuilder&)> onRecordPosition)
+      : BaseColumnWriter{context, type, sequence, onRecordPosition},


std::move(onRecordPosition)
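
A minimal sketch of what this suggestion amounts to, assuming the callback is taken by value and then forwarded to a base class: moving it avoids an extra copy of the std::function. SketchBase and SketchDecimalWriter are hypothetical names for illustration, and the int& parameter stands in for IndexBuilder&.

```cpp
#include <functional>
#include <utility>

// Hypothetical base class that stores the callback.
class SketchBase {
 public:
  explicit SketchBase(std::function<void(int&)> onRecordPosition)
      : onRecordPosition_{std::move(onRecordPosition)} {}

  void record(int& position) const {
    if (onRecordPosition_) {
      onRecordPosition_(position);
    }
  }

 private:
  std::function<void(int&)> onRecordPosition_;
};

// Hypothetical derived writer: the by-value parameter is moved, not copied,
// when it is handed to the base class constructor.
class SketchDecimalWriter : public SketchBase {
 public:
  explicit SketchDecimalWriter(std::function<void(int&)> onRecordPosition)
      : SketchBase{std::move(onRecordPosition)} {}
};
```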

xiaoxmeng · 2024-11-14T05:59:18Z

velox/dwio/dwrf/writer/ColumnWriter.cpp

+            newStream(StreamKind::StreamKind_DATA),
+            getConfig(Config::USE_VINTS),
+            type_->isShortDecimal() ? LONG_BYTE_SIZE : 2 * LONG_BYTE_SIZE)},
+        scales_{createRleEncoder</* isSigned = */ true>(


nit: /*isSigned=*/

xiaoxmeng · 2024-11-14T06:05:15Z

velox/dwio/dwrf/common/IntEncoder.h

+  FOLLY_ALWAYS_INLINE void writeVsHugeInt(int128_t val) {
+    writeVuHugeInt(ZigZag::encodeInt128(val));
+  }
+  FOLLY_ALWAYS_INLINE void writeHugeIntLE(int128_t val);


Leave an empty line in between

Updated. I also removed the 'writeHugeIntLE' method due to #11431 (comment). Thanks.

xiaoxmeng · 2024-11-14T06:06:48Z

velox/dwio/dwrf/common/IntEncoder.h

@@ -117,6 +117,18 @@ class IntEncoder {
    }
  }

+  void writeHugeInt(int128_t value) {


Can we have tests for these new APIs? Thanks!

Added TEST_F(DirectTest, hugeInts). Since the write buffer is not accessible, I test it by encoding, decoding, and comparing the values.
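
The encode, decode, and compare approach described above can be sketched as follows, assuming a base-128 variable-length integer layout (7 payload bits per byte, high bit set on every byte except the last). writeVarint128 and readVarint128 are hypothetical helpers, not the IntEncoder or IntDecoder methods, and the sketch relies on the GCC/Clang `unsigned __int128` extension and on well-formed input.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using uint128_t = unsigned __int128; // GCC/Clang extension.

// Appends val to out as a base-128 varint, least significant group first.
inline void writeVarint128(uint128_t val, std::vector<uint8_t>& out) {
  while (val >= 0x80) {
    // Low 7 bits plus a continuation flag in the high bit.
    out.push_back(static_cast<uint8_t>(val & 0x7f) | 0x80);
    val >>= 7;
  }
  out.push_back(static_cast<uint8_t>(val));
}

// Reads one varint starting at pos and advances pos past it.
// Assumes the bytes were produced by writeVarint128.
inline uint128_t readVarint128(const std::vector<uint8_t>& in, size_t& pos) {
  uint128_t result = 0;
  int shift = 0;
  while (true) {
    const uint8_t byte = in[pos++];
    result |= static_cast<uint128_t>(byte & 0x7f) << shift;
    if ((byte & 0x80) == 0) {
      return result;
    }
    shift += 7;
  }
}
```

A round-trip check then writes several values into one buffer, reads them back in order, and compares each against its input.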

xiaoxmeng · 2024-11-14T06:07:44Z

velox/common/encode/Coding.h

@@ -273,6 +273,10 @@ class ZigZag {
    return (static_cast<uint64_t>(val) << 1) ^ (val >> 63);
  }

+  static __uint128_t encodeInt128(__int128_t val) {


Can we have a test for this? Thanks!

Added TEST_F(ZigZagTest, hugeInt). Thanks.
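
For context, a minimal sketch of 128-bit ZigZag coding, extending the 64-bit formula shown in the hunk above from a shift of 63 to a shift of 127. zigZagEncode128 and zigZagDecode128 are hypothetical names, not the functions added to Coding.h, and the sketch assumes the GCC/Clang `__int128` extension with arithmetic right shift on signed values.

```cpp
#include <cstdint>

using int128_t = __int128; // GCC/Clang extension.
using uint128_t = unsigned __int128;

// Interleaves negative and non-negative values so that small magnitudes map
// to small unsigned codes: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3.
inline uint128_t zigZagEncode128(int128_t val) {
  return (static_cast<uint128_t>(val) << 1) ^
      static_cast<uint128_t>(val >> 127);
}

// Inverse mapping: the low bit carries the sign, the rest the magnitude.
inline int128_t zigZagDecode128(uint128_t val) {
  return static_cast<int128_t>((val >> 1) ^ (0 - (val & 1)));
}
```

Small negative decimals therefore encode to small unsigned values, which keeps their variable-length encoding short.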

xiaoxmeng · 2024-11-14T06:24:12Z

velox/dwio/dwrf/writer/ColumnWriter.cpp

+    writeNulls(decodedVector, ranges);
+
+    size_t count = 0;
+    auto [_, scale] = getDecimalPrecisionScale(*type_);


scale_ = getDecimalPrecisionScale(*inputTypes[0]).second

Added a member variable scale_ to avoid extracting it in each call.

xiaoxmeng · 2024-11-14T06:24:27Z

velox/dwio/dwrf/writer/ColumnWriter.cpp

+    for (auto& pos : ranges) {
+      if (!decodedVector.mayHaveNulls() || !decodedVector.isNullAt(pos)) {
+        if (type_->isShortDecimal()) {
+          auto val = decodedVector.valueAt<int64_t>(pos);


const auto val =

xiaoxmeng · 2024-11-14T06:24:35Z

velox/dwio/dwrf/writer/ColumnWriter.cpp

+          unscaledValues_->writeValue(val);
+          scales_->writeValue(scale);
+        } else {
+          auto val = decodedVector.valueAt<int128_t>(pos);


xiaoxmeng · 2024-11-14T06:24:56Z

velox/dwio/dwrf/writer/ColumnWriter.cpp

+
+    for (auto& pos : ranges) {
+      if (!decodedVector.mayHaveNulls() || !decodedVector.isNullAt(pos)) {
+        if (type_->isShortDecimal()) {


Also move the type condition out of the loop.

Added separate branches according to isShortDecimal. Thanks.

xiaoxmeng · 2024-11-14T06:34:38Z

velox/dwio/dwrf/writer/ColumnWriter.cpp

+        if (type_->isShortDecimal()) {
+          auto val = decodedVector.valueAt<int64_t>(pos);
+          unscaledValues_->writeValue(val);
+          scales_->writeValue(scale);


What do the other file formats do for this? We could keep this simple to optimize processing time.

rui-mo · 2024-11-15T12:45:42Z

velox/dwio/dwrf/writer/ColumnWriter.cpp

+        unscaledValues_{createDirectEncoder<true /*isSigned*/>(
+            newStream(StreamKind::StreamKind_DATA),
+            // IntDecoder and IntEncoder only support vInts for huge ints.
+            isShortDecimal_ ? getConfig(Config::USE_VINTS) : true /*useVInts*/,


It turns out the int decoder only supports vInts for huge ints, so I removed fixed-length support and always use vInts for long decimals. @xiaoxmeng Do you think this makes sense? Thanks.

velox/velox/dwio/common/DirectDecoder.cpp

Lines 58 to 60 in 8aeb51c

if constexpr (std::is_same_v<T, int128_t>) {
  VELOX_NYI();
}

rui-mo · 2024-11-15T13:08:15Z

velox/dwio/dwrf/test/ColumnWriterTest.cpp

+      valuesPtr[index++] = val.value();
+    }
+    return std::make_shared<FlatVector<T>>(
+        pool, type, nullptr, data.size(), values, std::vector<BufferPtr>{});


When there are no nulls, use nullptr as the null buffer. This helps test the !mayHaveNulls fast path.

rui-mo · 2024-11-15T13:10:45Z

@xiaoxmeng @Yuhta The above comments are addressed. Could you take another look? Thank you.

facebook-github-bot added the CLA Signed label Nov 5, 2024

jinchengchenghh reviewed Nov 5, 2024

Yuhta requested review from kewang1024, xiaoxmeng and HuamengJiang November 5, 2024 15:36

kewang1024 reviewed Nov 6, 2024

rui-mo force-pushed the wip_decimal_writer branch from 12dbc0e to 1488742 Compare November 7, 2024 09:42

rui-mo commented Nov 7, 2024

rui-mo force-pushed the wip_decimal_writer branch from 1488742 to 43d046a Compare November 12, 2024 06:36

jinchengchenghh reviewed Nov 12, 2024

xiaoxmeng reviewed Nov 14, 2024

rui-mo force-pushed the wip_decimal_writer branch from 43d046a to aaa8789 Compare November 15, 2024 12:30

rui-mo requested review from assignUser and majetideepak as code owners November 15, 2024 12:30

rui-mo changed the title ~~Add decimal column writer for dwrf file format~~ feat: Add decimal column writer for dwrf file format Nov 15, 2024

rui-mo commented Nov 15, 2024

rui-mo changed the title ~~feat: Add decimal column writer for dwrf file format~~ feat: Add decimal column writer for orc file format Nov 19, 2024

rui-mo marked this pull request as draft November 19, 2024 08:42

feat: Add decimal column writer for dwrf file format

1ac4c1c

rui-mo changed the title ~~feat: Add decimal column writer for orc file format~~ feat: Add decimal column writer for ORC file format Nov 20, 2024

Fix

717711b

rui-mo force-pushed the wip_decimal_writer branch from aaa8789 to e64fdc5 Compare November 20, 2024 06:41

rui-mo marked this pull request as ready for review November 20, 2024 06:41

rui-mo force-pushed the wip_decimal_writer branch from e64fdc5 to 717711b Compare November 20, 2024 06:53
