[Gluten-core][VL] Supports Delta Lake Read #2902
Thanks for opening a pull request! Could you open an issue for this pull request on GitHub Issues? https://github.com/oap-project/gluten/issues Then could you also rename the commit message and pull request title in the following format?
Run Gluten Clickhouse CI
@Shirosakirukia, do you know why Delta doesn't just work, given that it is implemented as an extension of ParquetFileFormat? It is not clear to me why some things work (for example, it scans the correct set of Parquet files instead of all the files in the folder) while others don't. You are re-implementing column mapping here. Ideally, we shouldn't duplicate the Delta implementation, as that would be impossible to maintain and would also miss many other features: the optimize command, DV, the reorg command, optimize write, auto compact, invariants, check constraints, etc.
@felipepessoto We need to distinguish between these features (including OSS Delta or Databricks Delta) and identify which ones need Gluten/Velox support. For example, some features related to While these features, like
@felipepessoto Based on this, we prefer to support Delta Column Mapping by rewriting the plan, and support
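To illustrate what resolving column mapping at the plan level involves, here is a minimal sketch. It is not code from this PR. It relies on the fact that Delta records a column's physical Parquet name in the field metadata under the key `delta.columnMapping.physicalName`; the `Field` class, the `toPhysicalNames` helper, and the sample names are hypothetical stand-ins rather than Spark or Gluten APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ColumnMappingSketch {
    // Metadata key under which Delta stores a column's physical name.
    static final String PHYSICAL_NAME_KEY = "delta.columnMapping.physicalName";

    /** Hypothetical stand-in for a schema field: logical name plus metadata. */
    static final class Field {
        final String name;
        final Map<String, String> metadata;

        Field(String name, Map<String, String> metadata) {
            this.name = name;
            this.metadata = metadata;
        }
    }

    /**
     * Maps each logical column to the name a Parquet reader should request.
     * Fields without mapping metadata keep their logical name.
     */
    static List<String> toPhysicalNames(List<Field> schema) {
        List<String> physical = new ArrayList<>();
        for (Field field : schema) {
            physical.add(field.metadata.getOrDefault(PHYSICAL_NAME_KEY, field.name));
        }
        return physical;
    }

    public static void main(String[] args) {
        // Illustrative schema: one mapped column, one unmapped column.
        List<Field> schema = List.of(
            new Field("id", Map.of(PHYSICAL_NAME_KEY, "col-1a2b")),
            new Field("name", Map.of()));
        System.out.println(toPhysicalNames(schema));
    }
}
```

In a plan rewrite, the scan would be issued with the physical names and a projection would restore the logical names on top of it, so the native reader never needs to know about Delta metadata.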
Run Gluten Clickhouse CI
Run Gluten Clickhouse CI
Run Gluten Clickhouse CI
@zhouyuan could you take a look, please?
Run Gluten Clickhouse CI
The Iceberg PRs can also provide some ideas: Gluten They implemented it differently: instead of changing FileSourceScanExecTransformer.scala to return ParquetReadFormat, they changed BatchScanExecTransformer.fileFormat to return ParquetReadFormat (or ORC in their case). I wonder if we could use a single simple approach for both cases. I don't know which one is better, though. Velox
@felipepessoto For Spark DataSource V2, we are working on a better design: a generic interface in Gluten to support data sources that use Spark DS V2 (like Iceberg and Paimon), and a project framework that makes it easier to support more formats, as @liujiayi771 said in #3043 (comment). Looking forward to your reply.
Got it. I don’t have much to add here. I just started with Gluten and am still learning it. I hope to see this merged soon, as I mostly use Delta tables.
@Shirosakirukia the build is failing. Any idea why it can't find the Delta classes? Maybe you need to specify the version to use Delta 2.2.0, which is compatible with Spark 3.3.
java.lang.NoClassDefFoundError: org/apache/spark/sql/delta/DeltaParquetFileFormat
@felipepessoto Sure. The Gluten-core build was successful with no functional issues. The error message indicates that the velox-backend test cannot find the DeltaParquetFileFormat class. @YannByron could you take a look, please?
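One way to narrow down a NoClassDefFoundError like the one above is to probe the test classpath directly. The following is a minimal sketch, not part of this PR: the `DeltaClasspathCheck` class and the `isOnClasspath` helper are hypothetical names, and only the standard `Class.forName` call is used.

```java
public class DeltaClasspathCheck {
    /**
     * Returns true if the named class can be located by this class's loader.
     * The class is looked up without being initialized, so no static
     * initializers run as a side effect of the probe.
     */
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, DeltaClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Not found, or found but unloadable (e.g. a missing dependency).
            return false;
        }
    }

    public static void main(String[] args) {
        // Class name taken from the error message in this thread.
        String deltaClass = "org.apache.spark.sql.delta.DeltaParquetFileFormat";
        System.out.println(deltaClass + " on classpath: " + isOnClasspath(deltaClass));
    }
}
```

If the probe returns false only in the velox-backend test module, that would suggest the Delta dependency is missing from that module's test scope rather than from the build as a whole.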
I will take over this PR soon, maybe open another one to address the CI failure, and add some unit tests.
This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.
This PR was auto-closed because it has been stalled for 10 days with no activity. Please feel free to reopen if it is still valid. Thanks.
What changes were proposed in this pull request?
(Fixes: #ISSUE-2891)
How was this patch tested?
TPC-DS test