feat(connector): Support reading Iceberg split with equality deletes #11088
Conversation
@yingsu00 Can you please rebase again? There is a conflict with other changes.
@Yuhta Just rebased, appreciate your review again. Thanks!
cc: @liujiayi771 Would you like to review this PR? Thanks.
@liujiayi771 Can you take a look at the PR if possible? Should we add something on the Gluten side after this PR is merged? It's requested by a Gluten customer.
@FelixYBW Yes, Gluten needs to make some minor changes to accommodate this PR. However, Spark cannot produce equality delete files, so we need to use Flink to generate Iceberg tables with equality delete files for testing. I will perform some tests this week.
Thanks. Added some nits.
      nullAllowed = true;
    } else {
      if constexpr (std::is_same_v<U, Timestamp>) {
        values.emplace_back(simpleValues->valueAt(i).toMillis());
Do we need toMicros for Spark? cc: @liujiayi771
@rui-mo It will create a BigIntRange to filter data values. I'm not quite sure why we need to convert the Timestamp type to bigint here; shouldn't we be using TimestampRange instead?
@rui-mo @liujiayi771 this toValues() function is copied from velox/functions/prestosql/InPredicate.cpp. We could extract it to velox/common but there isn't a good folder yet. Any good ideas? cc @Yuhta
@yingsu00 I see. The InPredicate is also registered by Spark SQL and might need to be fixed by providing configurable behavior. For the one in 'hive/iceberg/FilterUtil.cpp', I assume a configuration 'kReadTimestampUnit' in the hiveConfig might help adapt to the different precisions used by Presto and Spark. I'm fine with leaving a TODO here and focusing on the primary implementation first. Thanks.
@rui-mo Is the issue here that the precision of a Timestamp in Spark is in microseconds, and converting it to milliseconds for comparison would result in a loss of precision? For example, Timestamp(1, 999000) and Timestamp(1, 998000) would be considered the same?
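To make the concern concrete, here is a minimal standalone sketch. The Timestamp struct below is a simplified stand-in for Velox's seconds-plus-nanoseconds Timestamp, not the real class: truncating to milliseconds collapses the two values from the example above, while microseconds keep them distinct.

    // Simplified stand-in for Velox's Timestamp (seconds + nanoseconds);
    // illustrates why toMillis() can lose precision that toMicros() keeps.
    #include <cassert>
    #include <cstdint>

    struct Timestamp {
      int64_t seconds;
      uint64_t nanos;

      int64_t toMillis() const {
        return seconds * 1'000 + static_cast<int64_t>(nanos) / 1'000'000;
      }
      int64_t toMicros() const {
        return seconds * 1'000'000 + static_cast<int64_t>(nanos) / 1'000;
      }
    };

    int main() {
      Timestamp a{1, 999'000};  // 1.000999 seconds
      Timestamp b{1, 998'000};  // 1.000998 seconds
      assert(a.toMillis() == b.toMillis());  // both truncate to 1000 ms
      assert(a.toMicros() != b.toMicros());  // 1000999 us vs 1000998 us
      return 0;
    }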
Overall this looks good. I raised one discussion about the bookkeeping of temporary scan spec nodes and filters; the rest is just wiring cleanup.
@@ -672,6 +672,7 @@ bool applyPartitionFilter(
     VELOX_FAIL(
         "Bad type {} for partition value: {}", type->kind(), partitionValue);
   }
+  return true;
Do not need this
@@ -87,6 +90,8 @@ class SplitReader {

   void resetSplit();
+
+  std::shared_ptr<const dwio::common::TypeWithId> baseFileSchema();
This is only needed inside IcebergSplitReader
@@ -215,7 +215,10 @@ std::unique_ptr<SplitReader> HiveDataSource::createSplitReader() {
       ioStats_,
       fileHandleFactory_,
       executor_,
-      scanSpec_);
+      scanSpec_,
+      remainingFilterExprSet_,
You are not really using this inside IcebergSplitReader (instead you have your own deleteExprSet_); let's avoid sharing and passing it.
    filters_.push_back(std::move(filter));
  }

  void updateFilter(std::unique_ptr<Filter> newFilter) {
Let's call it pushFilter and popFilter to make it more explicit that such filters are temporary.
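A minimal sketch of what such an API could look like. Filter here is a stub standing in for velox::common::Filter, and the pushFilter/popFilter names follow the suggestion above; this is not the actual ScanSpec code.

    #include <memory>
    #include <vector>

    struct Filter {};  // stub for velox::common::Filter

    class ScanSpecNode {
     public:
      // Installs a temporary (per-split) filter on top of the permanent one.
      void pushFilter(std::unique_ptr<Filter> filter) {
        tempFilters_.push_back(std::move(filter));
      }

      // Removes the most recently pushed temporary filter once the split
      // is done, leaving the permanent filter untouched.
      void popFilter() {
        tempFilters_.pop_back();
      }

     private:
      std::unique_ptr<Filter> permanentFilter_;
      std::vector<std::unique_ptr<Filter>> tempFilters_;
    };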
-  ScanSpec* getOrCreateChild(const Subfield& subfield);
+  ScanSpec* getOrCreateChild(const Subfield& subfield, bool isTempNode = false);

+  void deleteTempNodes();
Since all the temp nodes are top level, can we do it less intrusively by keeping a list of temp node/filter names inside IcebergSplitReader and removing them after we're done with them? Then we don't need the extra state of isTempNode_ and hasTempFilter_ here.
@Yuhta Thanks for the suggestion. Yes, in our last discussion we thought we could just add/remove top-level columns, but when I looked again I found it might still be necessary to mark the nodes with these temp tags, because the Iceberg delete file may contain subfields. For example, consider this query:
-- create table t1 (c_row(c1 integer, c2 integer, c3 integer), c_char char);
-- insert some rows
select c_char
from t1
where c_row.c2 > 2;
The base ScanSpec would contain the c_row child, which in turn has three children for c1, c2 and c3. In this case, all three subfields would be null constants, but only the c2 node would have the filter c_row.c2 > 2, while the c1 and c3 nodes don't have any filters. Now, if the Iceberg equality delete file contains the predicate c_row.c1 = 1 && c_row.c2 IN {2, 3}, then we need to remove these values by adding a filter c_row.c1 <> 1 to the c_row.c1 node, and merge the filter c_row.c2 <> 2 && c_row.c2 <> 3 with the existing filter c_row.c2 > 2 on the c_row.c2 node, so it becomes c_row.c2 > 3. When this split finishes, we need to remove the filter c_row.c1 <> 1 from c_row.c1 and restore the filter for c_row.c2 to its original state, but not delete the whole c_row column from the ScanSpec. Therefore we need to tag the nodes in the ScanSpec to check whether a node is a temp node and has temp filters. Do you think this makes sense?
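As an illustration of the save-and-restore bookkeeping described above (all types and names below are simplified stand-ins, not the PR's code): the permanent filter on c_row.c2 is swapped out for the merged filter while the split is read, then restored when the split finishes.

    #include <memory>
    #include <string>
    #include <utility>

    // Stub types standing in for velox::common::Filter and a ScanSpec node.
    struct Filter {
      std::string expr;
    };

    struct ScanSpecNode {
      std::unique_ptr<Filter> filter;
    };

    int main() {
      ScanSpecNode c2;
      c2.filter = std::make_unique<Filter>(Filter{"c_row.c2 > 2"});

      // Split starts: save the permanent filter and install the merge of
      // the permanent filter with the equality-delete predicate.
      std::unique_ptr<Filter> saved = std::move(c2.filter);
      c2.filter = std::make_unique<Filter>(
          Filter{"(" + saved->expr + ") AND c_row.c2 <> 2 AND c_row.c2 <> 3"});

      // ... read the split with the merged filter ...

      // Split ends: restore the permanent filter. A node that existed only
      // for the delete predicate (e.g. c_row.c1) would be removed instead.
      c2.filter = std::move(saved);
      return 0;
    }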
velox/expression/Expr.h (outdated)
@@ -722,6 +724,12 @@ class ExprSet {
       core::ExecCtx* execCtx,
       bool enableConstantFolding = true);

+  ExprSet(
It seems these two are no longer needed now that the delete expr is separated from the remaining filter expr.
This PR introduces EqualityDeleteFileReader, which is used to read Iceberg splits with equality delete files.

Co-authored-by: Naveen Kumar Mahadevuni <[email protected]>
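For readers new to the feature, a minimal sketch of what reading a split with an equality delete file amounts to. The hash-set approach below is illustrative only; as discussed above, the actual reader builds Velox Filter objects (e.g. value ranges) from the delete values and pushes them into the ScanSpec.

    #include <cstdint>
    #include <iostream>
    #include <unordered_set>
    #include <vector>

    int main() {
      // Values read from the equality delete file for one column: any base
      // file row whose column value matches one of these is deleted.
      std::vector<int64_t> deletedValues = {2, 3};
      std::unordered_set<int64_t> excluded(
          deletedValues.begin(), deletedValues.end());

      // Scan the base file column and keep only rows that survive.
      std::vector<int64_t> baseFileColumn = {1, 2, 3, 4, 5};
      for (int64_t v : baseFileColumn) {
        if (excluded.count(v) == 0) {
          std::cout << v << '\n';  // prints 1, 4, 5
        }
      }
      return 0;
    }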