[BUG] Filters on partition columns don't work | Spark 3.3.1 | com.crealytics:spark-excel_2.12:3.3.1_0.18.5 #727
Comments
Not sure if this is a typo, but afaik you need to use …
@nightscape apologies, that was a typo :) edited the original question
Update: I downgraded the library to … This is definitely a bug in the latest version on Spark 3.3.1
Ok, interesting!
As a temporary workaround, I save the dataframe as Parquet and reload it whenever I want to apply a filter: df.Write(); df.Unpersist(); df = spark.Read(); df = df.Filter("condition");
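For anyone who wants to try the same workaround from Scala, here is a rough sketch of the round trip described above. The temporary path, the object name and the helper name are made up for illustration, and the format("excel") read with the header option is my assumption about how the original dataframe was loaded; adjust both to your setup.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object ParquetRoundTrip {
  // Writes the dataframe to a Parquet location and reads it back, so that
  // later filters are evaluated by Spark's built-in Parquet source rather
  // than by the Excel reader.
  def roundTrip(spark: SparkSession, df: DataFrame, tmpPath: String): DataFrame = {
    df.write.mode("overwrite").parquet(tmpPath)
    spark.read.parquet(tmpPath)
  }

  def main(args: Array[String]): Unit = {
    // On Databricks this returns the existing session.
    val spark = SparkSession.builder().appName("excel-filter-workaround").getOrCreate()

    // Assumption: the Excel files are read roughly like this.
    val excelDf = spark.read
      .format("excel")
      .option("header", "true")
      .load("/landing/excel")

    // Hypothetical scratch location; any writable path should do.
    val reloaded = roundTrip(spark, excelDf, "/tmp/excel_as_parquet")

    // After the round trip, version is an ordinary column of the Parquet data.
    reloaded.filter(col("version") === 1).show()
  }
}
```

This costs a full write and read of the data, so it probably only makes sense as a stopgap for small inputs.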
Is there an existing issue for this?
Current Behavior
Filtering on partition columns of a dataframe produced by the Excel reader behaves unexpectedly.
I have some Excel files, partitioned in an Azure Storage account, and I am trying to run a simple read from Databricks (Runtime 12.1, Spark 3.3.1).
Example path on the storage account:
/landing/excel/version=x/day=x
where version and day become partition columns on read.
I have version=1, version=2 and day=1 as sample partitions.
The read stores 2 rows in dataframe df (schema inferred).
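A read along the following lines should match the setup described here. This is only a sketch: the header and inferSchema options are assumptions, since the exact options used are not shown, and it is written as a notebook cell or spark-shell snippet.

```scala
import org.apache.spark.sql.SparkSession

// In a Databricks notebook or spark-shell this returns the existing session.
val spark = SparkSession.builder().getOrCreate()

// Load from the root folder so that version=... and day=... directories
// are discovered as partition columns.
val df = spark.read
  .format("excel")
  .option("header", "true")       // assumption: first row holds column names
  .option("inferSchema", "true")  // assumption: let the reader infer column types
  .load("/landing/excel")

// version and day should show up in the schema next to the sheet's own columns.
df.printSchema()
println(df.count())
```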
Now, if you filter the df on version=1, it always returns all rows:
df.filter(col("version") === 1) returns 2 rows (version=1 and version=2).
I also tried the following variants:
df.filter(col("version") === lit(1))
df.filter($"version" === 1)
Filtering on a value of version that doesn't exist also returns all rows:
df.filter(col("version") === 100) returns 2 rows.
Note: Filters on other, non-partition columns work fine, so there seems to be something wrong with predicate pushdown.
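To check whether the predicate is being lost during pushdown, it may help to print the query plan for the filtered dataframe. The helper below is a hypothetical sketch (the function name is mine); it assumes df was produced by the Excel reader and has a version partition column.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Prints the parsed, analyzed, optimized and physical plans for a filter on
// the version column, then the number of rows the filter actually returns.
def inspectPartitionFilter(df: DataFrame, version: Int): Unit = {
  val filtered = df.filter(col("version") === version)
  filtered.explain(true)
  println(s"rows for version=$version: ${filtered.count()}")
}

// Usage, e.g. in a notebook cell:
// inspectPartitionFilter(df, 1)
// inspectPartitionFilter(df, 100)
```

If the version predicate shows up neither as a Filter node nor in the scan's partition filters of the physical plan, that would point at the reader dropping it. The exact labels in the plan output depend on the data source and Spark version.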
Expected Behavior
A filter on a dataframe partition column should return only the rows from the matching partition.
Steps To Reproduce
Environment
Anything else?
No response