Spark 3.5: Support default values in vectorized reads #11815

rdblue · 2024-12-18T22:51:10Z

This follows on #11803 and adds default value support to vectorized reads.
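
For readers unfamiliar with the idea, here is a minimal, self-contained sketch of what default value support means for a batch read: when a column was added to the table after a data file was written, the reader fills the whole batch with the column's default instead of reading stored values. This is only an illustrative model. `DefaultValueReadSketch`, `readBatch`, and the map standing in for a data file are all hypothetical names, not Iceberg or Spark APIs, and the actual change in this PR may be structured differently.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Illustrative model only; none of these names are Iceberg or Spark classes.
final class DefaultValueReadSketch {

  /** Reads one batch of a long column, falling back to a default when the file lacks it. */
  static long[] readBatch(
      Map<String, long[]> fileColumns, String name, long defaultValue, int numRows) {
    long[] stored = fileColumns.get(name);
    if (stored != null) {
      // The column exists in the data file: return the stored values.
      return Arrays.copyOf(stored, numRows);
    }
    // The column is missing from the file (assumed to have been added later):
    // fill the whole batch with the default value.
    long[] constant = new long[numRows];
    Arrays.fill(constant, defaultValue);
    return constant;
  }

  public static void main(String[] args) {
    // Hypothetical "data file" that only contains the "id" column.
    Map<String, long[]> file = new HashMap<>();
    file.put("id", new long[] {1L, 2L, 3L});

    System.out.println(Arrays.toString(readBatch(file, "id", -1L, 3)));
    System.out.println(Arrays.toString(readBatch(file, "added_later", 34L, 3)));
  }
}
```

In this model the default is materialized as a full array for simplicity; a real vectorized reader would more likely expose a constant vector without copying the value per row.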

rdblue · 2024-12-18T22:53:32Z

...c/test/java/org/apache/iceberg/spark/data/parquet/vectorized/TestParquetVectorizedReads.java

@@ -49,7 +49,6 @@
 import org.apache.parquet.schema.MessageType;
 import org.apache.parquet.schema.Type;
 import org.apache.spark.sql.vectorized.ColumnarBatch;
-import org.junit.jupiter.api.Disabled;


I switched this to use assumptions like the other tests that are based on AvroDataTest. I just wanted to be consistent.

arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedReaderBuilder.java

nastra

+1 pending successful CI run

rdblue · 2024-12-19T16:40:00Z

spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/data/AvroDataTest.java

@@ -248,17 +269,6 @@ public void testMixedTypes() throws IOException {
    writeAndValidate(schema);
  }

-  @Test
-  public void testTimestampWithoutZone() throws IOException {


Removing this test for TimestampNTZ by adding the type to SUPPORTED_PRIMITIVES (so that it is handled like any other primitive) is what broke the ORC tests. It looks like the problem is that Spark 3.5's ColumnarRow doesn't support TimestampNTZType. As a temporary workaround, I've added validation code that checks the value by accessing it as a TimestampType instead.

This isn't a change to read behavior, just how we access the data to validate it. I expect to be able to remove this workaround in the next Spark version.
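
The shape of that workaround can be sketched in isolation. The model below is a rough illustration, not the validation code from this PR: `Row`, `ColumnType`, and `readForValidation` are hypothetical stand-ins, and the only assumption carried over from Spark is that both timestamp types are stored as a `long` of microseconds, so reading a TimestampNTZ value through the TimestampType accessor yields the same stored number.

```java
// Illustrative model only; these types are stand-ins, not Spark or Iceberg classes.
final class TimestampNtzValidationSketch {

  enum ColumnType { LONG, TIMESTAMP, TIMESTAMP_NTZ }

  /** Stand-in for a columnar row whose typed getter has no TIMESTAMP_NTZ branch. */
  static final class Row {
    private final long[] values;

    Row(long... values) {
      this.values = values;
    }

    Object get(int ordinal, ColumnType type) {
      switch (type) {
        case LONG:
        case TIMESTAMP:
          // Both are backed by a long; timestamps are microseconds.
          return values[ordinal];
        default:
          throw new UnsupportedOperationException("Datatype not supported " + type);
      }
    }
  }

  /** Test-side helper: read TIMESTAMP_NTZ through the TIMESTAMP accessor instead. */
  static Object readForValidation(Row row, int ordinal, ColumnType type) {
    ColumnType accessType = type == ColumnType.TIMESTAMP_NTZ ? ColumnType.TIMESTAMP : type;
    return row.get(ordinal, accessType);
  }

  public static void main(String[] args) {
    Row row = new Row(1_734_562_800_000_000L);
    // Asking the row for TIMESTAMP_NTZ directly would throw; the helper avoids that.
    System.out.println(readForValidation(row, 0, ColumnType.TIMESTAMP_NTZ));
  }
}
```

Only the access path used for validation changes in this sketch; the stored value is untouched, which matches the point that read behavior is not affected.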

Yeah, I noticed that too and was planning on fixing it in Spark. I've opened https://issues.apache.org/jira/browse/SPARK-50624

…rRow

### What changes were proposed in this pull request?
Noticed that this was missing when using this in Iceberg. See additional details in apache/iceberg#11815 (comment)

### Why are the changes needed?
To be able to read `TimestampNTZType` when using `ColumnarRow`

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
Added some unit tests that failed without the fix

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #49437 from nastra/SPARK-50624.

Authored-by: Eduard Tudenhoefner <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

…rRow

### What changes were proposed in this pull request?
Noticed that this was missing when using this in Iceberg. See additional details in apache/iceberg#11815 (comment)

### Why are the changes needed?
To be able to read `TimestampNTZType` when using `ColumnarRow`

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
Added some unit tests that failed without the fix

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #49437 from nastra/SPARK-50624.

Authored-by: Eduard Tudenhoefner <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit d7545d0)
Signed-off-by: Wenchen Fan <[email protected]>

Spark: Support default values in vectorized reads.

a54dd2e

rdblue requested review from nastra and Fokko December 18, 2024 22:51

github-actions bot added spark arrow labels Dec 18, 2024

rdblue commented Dec 18, 2024

Fokko approved these changes Dec 19, 2024

arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedReaderBuilder.java (outdated, resolved)

nastra approved these changes Dec 19, 2024

rdblue added 2 commits December 19, 2024 08:28

Remove unnecessary whitespace changes.

f21fe7e

Fix ORC tests for TimestampNTZType.

9c48b0b

rdblue commented Dec 19, 2024

rdblue merged commit 7033667 into apache:main Dec 19, 2024
43 checks passed

This was referenced Jan 10, 2025

[SPARK-50624][SQL] Add TimestampNTZType to ColumnarRow/MutableColumnarRow apache/spark#49437

Closed

[SPARK-50624][SQL] Add TimestampNTZType to ColumnarRow/MutableColumnarRow apache/spark#49244

Closed

rdblue added a commit to rdblue/iceberg that referenced this pull request Jan 17, 2025

Spark 3.4: Support default values in vectorized reads (apache#11815)

f09e61b

rdblue added a commit to rdblue/iceberg that referenced this pull request Jan 17, 2025

Spark 3.3: Support default values in vectorized reads (apache#11815)

815f227

rdblue added a commit to rdblue/iceberg that referenced this pull request Jan 17, 2025

Spark 3.3: Support default values in vectorized reads (apache#11815)

045fd7d

This was referenced Jan 17, 2025

Spark 3.4: Backport support for default values #11987

Merged

Spark 3.3: Backport support for default values #11988

Merged

rdblue added a commit to rdblue/iceberg that referenced this pull request Jan 17, 2025

Spark 3.4: Support default values in vectorized reads (apache#11815)

7efe77e

rdblue added a commit to rdblue/iceberg that referenced this pull request Jan 17, 2025

Spark 3.3: Support default values in vectorized reads (apache#11815)

0208adb
