Spark 3.5: Support default values in vectorized reads #11815

Merged 3 commits on Dec 19, 2024

@rdblue rdblue commented Dec 18, 2024

This follows on #11803 and adds default value support to vectorized reads.
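
As a rough illustration of what default value support means for a batch-oriented reader: when a data file was written before a column was added, the file holds no values for that column, so the reader can fill every row of the batch with the column's default instead. The sketch below uses only the Java standard library. `DefaultValueBatchSketch` and `readColumn` are hypothetical names chosen for illustration; they are not Iceberg or Spark APIs, and the actual change in this PR may be structured quite differently.

```java
import java.util.Arrays;

// Hypothetical sketch, not Iceberg code: models one column of a batch as an Object[].
class DefaultValueBatchSketch {

  // fileValues is null when the data file does not contain the column at all.
  // In that case every row in the batch receives the column's default value.
  static Object[] readColumn(Object[] fileValues, Object defaultValue, int numRows) {
    if (fileValues != null) {
      // The column is present in the file: use the stored values as-is.
      return Arrays.copyOf(fileValues, numRows);
    }

    // The column is missing from the file: produce a constant column.
    Object[] column = new Object[numRows];
    Arrays.fill(column, defaultValue);
    return column;
  }

  public static void main(String[] args) {
    // Column missing from the file, so the default (34, an arbitrary value) is used.
    System.out.println(Arrays.toString(readColumn(null, 34, 3)));
    // Column present in the file, so the default is ignored.
    System.out.println(Arrays.toString(readColumn(new Object[] {1, 2, 3}, 34, 3)));
  }
}
```

A real vectorized reader would likely avoid materializing the repeated value for every row, but the observable result per batch is the same.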

@@ -49,7 +49,6 @@
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.Type;
import org.apache.spark.sql.vectorized.ColumnarBatch;
import org.junit.jupiter.api.Disabled;

@rdblue rdblue commented (Contributor, Author)

I switched this to use assumptions like the other tests that are based on AvroDataTest. I just wanted to be consistent.

@nastra nastra left a comment (Contributor)

+1 pending successful CI run

@@ -248,17 +269,6 @@ public void testMixedTypes() throws IOException {
writeAndValidate(schema);
}

@Test
public void testTimestampWithoutZone() throws IOException {

@rdblue rdblue commented (Contributor, Author)

Removing this test for TimestampNTZ by adding the type to SUPPORTED_PRIMITIVES (so that it is handled like any other primitive) is what broke the ORC tests. It looks like the problem is that Spark 3.5's ColumnarRow doesn't support TimestampNTZType. As a temporary workaround, I've added validation code that checks the value by accessing it as a TimestampType instead.

This isn't a change to read behavior, just how we access the data to validate it. I expect to be able to remove this workaround in the next Spark version.
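
One way to see why validating through TimestampType does not change what is being checked: as far as I know, Spark stores both TimestampType and TimestampNTZType values as a long count of microseconds (this is an assumption stated here, not something the thread spells out), so reading the NTZ column through the timestamp accessor returns the same raw number. The sketch below uses only `java.time` and does not call Spark; `TimestampNtzSketch`, `toMicros`, and `fromMicros` are hypothetical names for illustration.

```java
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;

// Hypothetical sketch, not Spark or Iceberg code.
class TimestampNtzSketch {
  private static final LocalDateTime EPOCH = LocalDateTime.of(1970, 1, 1, 0, 0);

  // Encode a timestamp without zone as microseconds from 1970-01-01T00:00.
  static long toMicros(LocalDateTime timestamp) {
    return ChronoUnit.MICROS.between(EPOCH, timestamp);
  }

  // Decode the same long back into a timestamp without zone.
  static LocalDateTime fromMicros(long micros) {
    return EPOCH.plus(micros, ChronoUnit.MICROS);
  }

  public static void main(String[] args) {
    LocalDateTime expected = LocalDateTime.of(2024, 12, 19, 10, 30);

    // Stand-in for the raw long a row accessor would return for the column.
    long raw = toMicros(expected);

    // The test compares against the expected NTZ value, whichever accessor produced the long.
    System.out.println(fromMicros(raw).equals(expected));
  }
}
```

Only the accessor used by the test differs; the stored value and the expected value being compared stay the same.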

@nastra nastra commented Dec 19, 2024 (Contributor)

Yeah, I noticed that too and was planning to fix it in Spark. I've opened https://issues.apache.org/jira/browse/SPARK-50624 for it.

@rdblue rdblue merged commit 7033667 into apache:main Dec 19, 2024
43 checks passed
cloud-fan pushed a commit to apache/spark that referenced this pull request Jan 13, 2025
…rRow

### What changes were proposed in this pull request?

Noticed that this was missing when using this in Iceberg. See additional details in apache/iceberg#11815 (comment)

### Why are the changes needed?

To be able to read `TimestampNTZType` when using `ColumnarRow`

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

Added some unit tests that failed without the fix

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #49437 from nastra/SPARK-50624.

Authored-by: Eduard Tudenhoefner <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit to apache/spark that referenced this pull request Jan 13, 2025
…rRow

(cherry picked from commit d7545d0)
Signed-off-by: Wenchen Fan <[email protected]>
rdblue added a commit to rdblue/iceberg that referenced this pull request Jan 17, 2025