[BUG] Filters on partition columns not taking effect | Spark 3.5.0 | com.crealytics:spark-excel_2.12:3.5.0_0.20.2/3 and 3.5.1_0.20.4 #907

Open
2 tasks done
minnieshi opened this issue Dec 17, 2024 · 6 comments

Comments

minnieshi commented Dec 17, 2024

Am I using the newest version of the library?

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

A filter on a partition column (a column derived from the partition folder) does not take effect with the following version combinations:

spark-excel_2.12-3.5.0_0.20.2 + Spark 3.5.0
spark-excel_2.12-3.5.0_0.20.3 + Spark 3.5.0
spark-excel_2.12-3.5.1_0.20.4 + Spark 3.5.0
(I did not list 3.5.0_0.20.1 here because it has a different problem: like some older versions, it fails with the same packaging error,
SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] Failed to find the data source: excel
)

image

Spark 3.5 here means Databricks Runtime 15.4.

image

The spark-excel library

image

databricks notebook (scala) filter code:

image

Expected Behavior

DataFrame filters should work on partition folders.
P.S. The following version combinations work:
spark-excel_2.12-3.2.4_0.20.4 + Spark 3.3.2
spark-excel_2.12-3.2.2_0.18.5 + Spark 3.3.2

image

Steps To Reproduce

See the notebook screenshot. The read code:
val df = spark.read
  .format("excel") // for V2 implementation
  .option("dataAddress", "0!A3") // Optional, default: "A1"
  .option("header", "true") // Required
  .option("inferSchema", "true") // Optional, default: false
  .option("treatEmptyValuesAsNulls", "true")
  .load(excelPath)
I also tried filtering with an integer literal:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.functions._
display(df.where(col("execution_date") === lit(20231218)).select("execution_date").distinct)
The filter did not take effect:
image
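
For anyone who wants to check the symptom without the screenshots, here is a minimal sketch that applies the same filter and reports any partition values that survive it. The helper name `unexpectedPartitionValues` is hypothetical (not part of spark-excel or Spark), and it assumes `df` was loaded as above. Values are compared as strings on the driver, so it behaves the same whether the partition column was inferred as an integer or a string.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}

// Hypothetical helper for this report, not a spark-excel API.
// Applies `partitionCol === expected`, then returns every distinct value of
// that column still present in the result and different from `expected`.
def unexpectedPartitionValues(df: DataFrame, partitionCol: String, expected: Int): Seq[String] = {
  val filtered = df.where(col(partitionCol) === lit(expected))
  // Collect to the driver so the comparison does not go through Spark's optimizer again.
  val seen = filtered.select(partitionCol).distinct().collect().map { row =>
    if (row.isNullAt(0)) "null" else row.get(0).toString
  }
  seen.filterNot(_ == expected.toString).toSeq
}
```

If the filter works, `unexpectedPartitionValues(df, "execution_date", 20231218)` should come back empty; on the failing combinations I would expect it to list the other partition folders.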

Environment

- Spark version:
- Spark-Excel version:
- OS:
- Cluster environment

Anything else?

No response

Please check these potential duplicates:

Author

minnieshi commented Dec 17, 2024

It is not a duplicate; it is similar.
What do you think, @nightscape? I can provide the full test-matrix notebooks, which include the code and the rerun results, if that helps.

excel_reader_filter_poc-ISSUES-com.crealytics:spark-excel_2.12-3.2.2_0.18.5 + Spark 3.4.1
excel_reader_filter_poc-ISSUES-com.crealytics:spark-excel_2.12-3.2.2_0.18.5 + Spark 3.5.0
excel_reader_filter_poc-ISSUES-com.crealytics:spark-excel_2.12-3.2.4_0.20.4 + Spark 3.4.1
excel_reader_filter_poc-ISSUES-com.crealytics:spark-excel_2.12-3.2.4_0.20.4 + Spark 3.5.0
excel_reader_filter_poc-ISSUES-com.crealytics:spark-excel_2.12-3.3.1_0.18.7 + Spark 3.5.0
excel_reader_filter_poc-ISSUES-com.crealytics:spark-excel_2.12-3.3.2_0.19.0 + Spark 3.3.2
excel_reader_filter_poc-ISSUES-com.crealytics:spark-excel_2.12-3.3.2_0.19.0 + Spark 3.4.1
excel_reader_filter_poc-ISSUES-com.crealytics:spark-excel_2.12-3.3.2_0.19.0 + Spark 3.5.0
excel_reader_filter_poc-ISSUES-com.crealytics:spark-excel_2.12-3.3.3_0.20.3 + Spark 3.5.0
excel_reader_filter_poc-ISSUES-com.crealytics:spark-excel_2.12-3.3.3_0.20.3 + spark 3.4.1
excel_reader_filter_poc-ISSUES-com.crealytics:spark-excel_2.12-3.3.4_0.20.4 + Spark 3.4.1
excel_reader_filter_poc-ISSUES-com.crealytics:spark-excel_2.12-3.3.4_0.20.4 + Spark 3.5.0
excel_reader_filter_poc-ISSUES-com.crealytics:spark-excel_2.12-3.4.1_0.19.0 + Spark 3.4.1
excel_reader_filter_poc-ISSUES-com.crealytics:spark-excel_2.12-3.4.1_0.19.0 + Spark 3.5.0
excel_reader_filter_poc-ISSUES-com.crealytics:spark-excel_2.12-3.4.1_0.20.1 + Spark 3.4.1
excel_reader_filter_poc-ISSUES-com.crealytics:spark-excel_2.12-3.4.1_0.20.1 + Spark 3.5.0
excel_reader_filter_poc-ISSUES-com.crealytics:spark-excel_2.12-3.4.1_0.20.2 + Spark 3.4.1
excel_reader_filter_poc-ISSUES-com.crealytics:spark-excel_2.12-3.4.1_0.20.2 + Spark 3.5.0
excel_reader_filter_poc-ISSUES-com.crealytics:spark-excel_2.12-3.4.1_0.20.3 + Spark 3.5.0
excel_reader_filter_poc-ISSUES-com.crealytics:spark-excel_2.12-3.4.1_0.20.3 + spark 3.4.1
excel_reader_filter_poc-ISSUES-com.crealytics:spark-excel_2.12-3.4.1_0.20.4 + Spark 3.5.0
excel_reader_filter_poc-ISSUES-com.crealytics:spark-excel_2.12-3.4.1_0.20.4 + spark 3.4.1
excel_reader_filter_poc-ISSUES-com.crealytics:spark-excel_2.12-3.4.3_0.20.4 + Spark 3.5.0
excel_reader_filter_poc-ISSUES-com.crealytics:spark-excel_2.12-3.5.0_0.20.1 + Spark 3.5.0
excel_reader_filter_poc-ISSUES-com.crealytics:spark-excel_2.12-3.5.0_0.20.2 + Spark 3.5.0
excel_reader_filter_poc-ISSUES-com.crealytics:spark-excel_2.12-3.5.0_0.20.3 + Spark 3.5.0
excel_reader_filter_poc-ISSUES-com.crealytics:spark-excel_2.12-3.5.1_0.20.4 + Spark 3.5.0
excel_reader_filter_poc-WORKS-com.crealytics:spark-excel_2.12-3.2.2_0.18.5 + Spark 3.3.2
excel_reader_filter_poc-WORKS-com.crealytics:spark-excel_2.12-3.2.4_0.20.4 + Spark 3.3.2

@nightscape
Owner

@minnieshi I guess there might have been some change in the internal handling of predicate push-down in 3.5.
That would be interesting to find out.
I had quite some success in a similar case by asking Perplexity to read the Spark changelogs for relevant entries.
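
A rough way to look for such a change is to compare the physical plan of the filtered query on a working cluster (Spark 3.3.2) and a failing one (Spark 3.5.0). The sketch below uses a hypothetical helper, `filterLinesInPlan`, that keeps only the plan lines mentioning a filter. Exactly how the scan node labels partition or pushed filters is an assumption here and may differ between Spark versions, so treat the output as something to diff by eye rather than a definitive check.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper, not part of Spark or spark-excel.
// Returns the lines of the physical plan whose text contains "Filter",
// which covers Filter operators as well as filter details printed on scan nodes.
def filterLinesInPlan(df: DataFrame): Seq[String] =
  df.queryExecution.executedPlan.toString
    .split("\n")
    .filter(_.contains("Filter"))
    .toSeq
```

For example, `filterLinesInPlan(df.where(col("execution_date") === lit(20231218))).foreach(println)` could be run in both environments and the two outputs compared. `filtered.explain(true)` prints the full set of plans if more context is needed.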
