Spark 3.5: Support default values in vectorized reads #11815
Conversation
@@ -49,7 +49,6 @@
 import org.apache.parquet.schema.MessageType;
 import org.apache.parquet.schema.Type;
 import org.apache.spark.sql.vectorized.ColumnarBatch;
-import org.junit.jupiter.api.Disabled;
I switched this to use assumptions like the other tests that are based on AvroDataTest. I just wanted to be consistent.
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedReaderBuilder.java
+1 pending successful CI run
@@ -248,17 +269,6 @@ public void testMixedTypes() throws IOException {
   writeAndValidate(schema);
 }

-  @Test
-  public void testTimestampWithoutZone() throws IOException {
Removing this test for TimestampNTZ by adding the type to SUPPORTED_PRIMITIVES (so that it is handled like any other primitive) is what broke the ORC tests. It looks like the problem is that Spark 3.5's ColumnarRow doesn't support TimestampNTZType. As a temporary workaround, I've added validation code that checks the value by accessing it as a TimestampType instead.

This isn't a change to read behavior, just how we access the data to validate it. I expect to be able to remove this workaround in the next Spark version.
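For context, the workaround relies on the fact that Spark physically stores both TimestampType and TimestampNTZType values as a long counting microseconds since the epoch, so the same underlying value can be validated through either logical type. The following is a minimal self-contained sketch of that shared representation (illustrative only, not Iceberg or Spark code):

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;

public class TimestampMicrosSketch {
  static final long MICROS_PER_SECOND = 1_000_000L;
  static final long NANOS_PER_MICRO = 1_000L;

  // TimestampType view: micros since epoch interpreted as an instant.
  static Instant microsToInstant(long micros) {
    long seconds = Math.floorDiv(micros, MICROS_PER_SECOND);
    long microOfSecond = Math.floorMod(micros, MICROS_PER_SECOND);
    return Instant.ofEpochSecond(seconds, microOfSecond * NANOS_PER_MICRO);
  }

  // TimestampNTZType view: the same micros interpreted as a local
  // date-time with no zone attached.
  static LocalDateTime microsToLocalDateTime(long micros) {
    return LocalDateTime.ofEpochSecond(
        Math.floorDiv(micros, MICROS_PER_SECOND),
        (int) (Math.floorMod(micros, MICROS_PER_SECOND) * NANOS_PER_MICRO),
        ZoneOffset.UTC);
  }

  public static void main(String[] args) {
    long micros = 0L; // the same stored value, read through both views
    System.out.println(microsToInstant(micros));       // 1970-01-01T00:00:00Z
    System.out.println(microsToLocalDateTime(micros)); // 1970-01-01T00:00
  }
}
```

Because the stored long is identical for both types, validating through TimestampType checks the same bytes the NTZ reader produced.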
Yeah, I noticed that too and was planning on fixing it in Spark. I've opened https://issues.apache.org/jira/browse/SPARK-50624
…rRow

### What changes were proposed in this pull request?
Noticed that this was missing when using this in Iceberg. See additional details in apache/iceberg#11815 (comment)

### Why are the changes needed?
To be able to read `TimestampNTZType` when using `ColumnarRow`

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added some unit tests that failed without the fix

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #49437 from nastra/SPARK-50624.

Authored-by: Eduard Tudenhoefner <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
This follows up on #11803 and adds default value support to vectorized reads.
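The core idea behind default values in a vectorized read path can be sketched as follows: when a requested column is missing from a data file, the reader builder substitutes a "constant" reader that fills every position in the batch with the column's default value instead of reading from the file. This is a hypothetical illustration of the technique; the class and method names below are not Iceberg's actual VectorizedReaderBuilder API.

```java
import java.util.Arrays;

public class ConstantColumnSketch {
  // Hypothetical stand-in for a per-column vectorized reader that is
  // used when the column does not exist in the data file: every slot
  // in the returned batch holds the column's default value.
  static final class ConstantIntReader {
    private final int defaultValue;

    ConstantIntReader(int defaultValue) {
      this.defaultValue = defaultValue;
    }

    int[] readBatch(int numRows) {
      int[] batch = new int[numRows];
      Arrays.fill(batch, defaultValue);
      return batch;
    }
  }

  public static void main(String[] args) {
    // Column absent from the file, with a default of 42: the batch is
    // populated without touching the file at all.
    ConstantIntReader reader = new ConstantIntReader(42);
    System.out.println(Arrays.toString(reader.readBatch(3))); // [42, 42, 42]
  }
}
```

The appeal of this shape is that the rest of the batch pipeline never needs to know the column was missing; it consumes the constant vectors exactly like vectors decoded from the file.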