[Gluten-core][VL] Supports Delta Lake Read #2902

Shirosakirukia · 2023-08-25T08:34:41Z

What changes were proposed in this pull request?

Supports Delta scan in Velox.
Delta 2.x supports Column Mapping, which this PR also supports.
Deletion Vectors, a feature introduced in Delta 2.3 and later, are not supported.

(Fixes: #ISSUE-2891)

How was this patch tested?

TPC-DS test

github-actions · 2023-08-25T08:34:59Z

Thanks for opening a pull request!

Could you open an issue for this pull request on Github Issues?

https://github.com/oap-project/gluten/issues

Then could you also rename commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

See also:

Other pull requests

github-actions · 2023-08-25T08:35:12Z

Run Gluten Clickhouse CI

felipepessoto · 2023-08-31T21:46:01Z

@Shirosakirukia, do you know why Delta doesn't just work, given that it is implemented as an extension of ParquetFileFormat?

It is not clear to me why some things work (for example, it scans the correct set of Parquet files rather than all the files in the folder) while others don't.

You are re-implementing column mapping here. Ideally, we shouldn't duplicate the Delta implementation: it would be impossible to maintain, and it would also miss many other features: optimize command, DV, reorg command, optimize write, auto compact, invariants, check constraints, etc.

YannByron · 2023-09-01T03:16:29Z

@felipepessoto We need to distinguish between these features (in OSS Delta or Databricks Delta) and identify which ones need Gluten/Velox support. For example, the optimize-related features (auto compaction, optimize write) redistribute data across files, and constraints only affect whether incoming data satisfies them and how unqualified data is handled. So IMO, these features don't need to be taken into account when making Gluten/Velox support Delta Lake.

Features like column mapping and DV, however, do need support. But the two are still different. Essentially, column mapping is just a mapping between the table schema and the file schema, so we can add a ProjectExec before FileScanExec (as this PR does) to make the native ParquetScan work for Delta scans.
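To make the column-mapping idea concrete, here is a minimal sketch in plain Python. It is not the PR's code and does not use Spark or Gluten APIs: the function names and the physical column names (`col-a1`, `col-b2`) are hypothetical, chosen only to show a scan that reads physical names followed by a projection that exposes the logical names.

```python
def build_projection(table_schema, logical_to_physical):
    """Pair each logical column with the physical name to read from the file.

    Columns without an entry in the mapping are assumed to use the same
    name in the file (an assumption made for this illustration).
    """
    return [(logical_to_physical.get(col, col), col) for col in table_schema]


def scan_with_column_mapping(file_rows, projection):
    """Simulate scan + project: read physical columns, expose logical names."""
    return [
        {logical: row.get(physical) for physical, logical in projection}
        for row in file_rows
    ]


# Hypothetical table: logical names "id"/"name" stored under physical names.
projection = build_projection(
    ["id", "name"], {"id": "col-a1", "name": "col-b2"}
)
rows = scan_with_column_mapping([{"col-a1": 1, "col-b2": "x"}], projection)
```

The scan itself stays a plain Parquet read; only the renaming step knows about Delta, which is the point of handling column mapping in the plan rather than in the reader.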
DV is more complicated. Theoretically, we could also transform the Deletion Vector into a FilterExec, perhaps holding a bitmap, and put it before FileScanExec, but this is not a good approach and it also hurts read efficiency. So I prefer a solution in which Velox supports Delta scans with DV.
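The filter-based alternative mentioned above can be sketched as follows. This is an illustration under stated assumptions, not Delta's or Velox's implementation: the function name is hypothetical, and a plain Python set stands in for the deletion-vector bitmap of row positions.

```python
def filter_deleted_rows(rows, deleted_positions):
    """Drop rows whose position within the file is marked as deleted.

    `deleted_positions` stands in for the deletion-vector bitmap; every row
    is still read and then checked one by one.
    """
    deleted = set(deleted_positions)
    return [row for pos, row in enumerate(rows) if pos not in deleted]


# Rows at positions 1 and 3 are marked deleted in this made-up example.
kept = filter_deleted_rows(["a", "b", "c", "d"], [1, 3])
```

Every row is materialized before it is tested against the set, which illustrates the read-efficiency concern: a reader that understands the deletion vector could skip those positions instead of decoding and discarding them.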

YannByron · 2023-09-01T03:20:54Z

@felipepessoto Based on this, we prefer to support Delta column mapping by rewriting the plan, and to support DV later by having Velox support DeltaFileFormat.

github-actions · 2023-09-01T06:32:53Z

Run Gluten Clickhouse CI

github-actions · 2023-09-01T07:13:19Z

Run Gluten Clickhouse CI

github-actions · 2023-09-07T03:57:12Z

Run Gluten Clickhouse CI

YannByron · 2023-09-13T07:13:29Z

@zhouyuan could you take a look, please?

github-actions · 2023-09-14T04:50:51Z

Run Gluten Clickhouse CI

felipepessoto · 2023-09-29T20:50:46Z

The Iceberg PRs can also provide some ideas:

Gluten
#3043

They implemented it differently: instead of changing FileSourceScanExecTransformer.scala to return ParquetReadFormat, they changed BatchScanExecTransformer.fileFormat to return ParquetReadFormat (or ORC in their case).

I wonder if we could use one simple approach for both cases. I don't know which one is better, though.

Velox
facebookincubator/velox#5977 - facebookincubator/velox#5897

YannByron · 2023-10-07T06:36:15Z

@felipepessoto
I know about #3043. The key reason the two PR implementations differ is not the lake format (one for Delta Lake, one for Iceberg) but the Spark datasource interface used: Delta Lake uses Spark DS V1, while Iceberg uses Spark DS V2.

For Spark datasource V2, we are working on a better design: a generic interface in Gluten to support datasources that use Spark DS V2 (like Iceberg and Paimon), and a project framework that makes it easier to support more formats, as @liujiayi771 said in #3043 (comment).

Looking forward to your reply.

felipepessoto · 2023-10-07T06:44:56Z

Got it. I don’t have much to add here. I just started with Gluten and am still learning it.

I hope to see this merged soon, as I mostly use Delta tables.

felipepessoto · 2023-10-10T23:43:23Z

@Shirosakirukia the build is failing. Any idea why it can't find the Delta classes? Maybe you need to specify the version so that it uses Delta 2.2.0, which is compatible with Spark 3.3.

java.lang.NoClassDefFoundError: org/apache/spark/sql/delta/DeltaParquetFileFormat
at io.glutenproject.extension.RewritePlanIfNeeded.io$glutenproject$extension$RewritePlanIfNeeded$$isDeltaColumnMappingFileFormat(ColumnarOverrides.scala:70)
at io.glutenproject.extension.RewritePlanIfNeeded$$anonfun$apply$1.applyOrElse(ColumnarOverrides.scala:63)

Shirosakirukia · 2023-10-11T01:58:07Z

@felipepessoto Sure. The Gluten-core build was successful, with no functional issues. The error message indicates that the velox-backend test is unable to find the DeltaParquetFileFormat class. @YannByron Could you take a look, please?

YannByron · 2023-10-11T04:54:35Z

I will take over this PR soon, maybe opening another one to address the CI failure and add some unit tests.

github-actions · 2023-11-26T01:47:41Z

This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions · 2023-12-07T01:45:55Z

This PR was auto-closed because it has been stalled for 10 days with no activity. Please feel free to reopen if it is still valid. Thanks.

[Gluten-core][VL] Supports Delta 2.2 Read

8631d33

YannByron mentioned this pull request Sep 1, 2023

[GLUTEN-2891][VL]fix: Add support to scan Delta Lake tables #2892

Closed

[Gluten-core] Support Delta scan

8a2b0d6

Merge branch 'main' into delta

08e76d2

fix code style

748f5f6

remove unused import

42f9503

YannByron mentioned this pull request Oct 11, 2023

[Gluten-core][VL] Supports DeltaLake 2.2 Read #3376

Closed

yma11 mentioned this pull request Oct 11, 2023

[VL] Unified design for data lake read support in Gluten + Velox #3378

Open

github-actions bot added the stale label Nov 26, 2023

github-actions bot closed this Dec 7, 2023
