Spark 3.3: Backport support for default values #11988
Conversation
Force-pushed from 526f807 to 8f4c8e5
@@ -65,7 +91,7 @@ public abstract class AvroDataTest {
          required(117, "dec_38_10", Types.DecimalType.of(38, 10)) // Spark's maximum precision
      );

-  @Rule public TemporaryFolder temp = new TemporaryFolder();
+  @TempDir protected Path temp;
As I mentioned on the PR for 3.4, this JUnit 4 temp folder wasn't working for JUnit 5 parameterized tests. I made some tests independent of this (to keep the backport small) and ended up porting subclasses of AvroDataTest to JUnit 5 in a larger backport.
These test changes were the only significant deviations from the original PRs.
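For context, a minimal sketch of the difference (the class and method names below are made up, not the actual AvroDataTest code): a JUnit 4 @Rule is only applied by the JUnit 4 runner, so under the Jupiter engine the temporary folder is never created, while @TempDir is injected by JUnit 5 itself and works with @ParameterizedTest.

// Illustrative only; ExampleAvroDataTest and writesToTempDir are hypothetical names.
import java.nio.file.Path;
import org.junit.jupiter.api.io.TempDir;
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.ValueSource;

class ExampleAvroDataTest {
  // JUnit 4 equivalent: @Rule public TemporaryFolder temp = new TemporaryFolder();
  // That rule is ignored by the Jupiter engine, so the folder is never created.
  // JUnit 5 injects a fresh temporary directory here before each test:
  @TempDir protected Path temp;

  @ParameterizedTest
  @ValueSource(strings = {"avro", "parquet"})
  void writesToTempDir(String format) {
    Path file = temp.resolve("data." + format);
    // write test records to `file` and read them back here
  }
}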
              DateTimeUtil.isoTimestamptzToMicros("2024-12-17T23:59:59.999999+00:00")),
      // Arguments.of(
      //     Types.TimestampType.withoutZone(),
      //     DateTimeUtil.isoTimestampToMicros("2024-12-17T23:59:59.999999")),
Spark 3.3 doesn't support TimestampNTZ without a flag, so this 3.3 backport doesn't remove withSQLConf below or the testTimestampWithoutZone case. It also doesn't use TimestampType.withoutZone() in default tests or in tests that use SUPPORTED_PRIMITIVES.
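For readers unfamiliar with the utility, here is a rough sketch of what the retained withSQLConf wrapping accomplishes: set a session config, run the test body, then restore the previous values. This is not the Iceberg test utility itself, and the property key shown is an assumption about the Spark 3.3 integration rather than something taken from this PR.

// Sketch of a withSQLConf-style helper; names and the flag key are assumptions.
import java.util.HashMap;
import java.util.Map;
import org.apache.spark.sql.SparkSession;

public class WithSqlConfSketch {
  // Apply the given configs, run the action, then restore the previous session state.
  static void withSQLConf(SparkSession spark, Map<String, String> conf, Runnable action) {
    Map<String, String> previous = new HashMap<>();
    conf.forEach(
        (key, value) -> {
          if (spark.conf().contains(key)) {
            previous.put(key, spark.conf().get(key));
          }
          spark.conf().set(key, value);
        });
    try {
      action.run();
    } finally {
      conf.keySet()
          .forEach(
              key -> {
                if (previous.containsKey(key)) {
                  spark.conf().set(key, previous.get(key));
                } else {
                  spark.conf().unset(key);
                }
              });
    }
  }

  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().master("local[1]").getOrCreate();
    withSQLConf(
        spark,
        // Assumed flag name for the Spark 3.3 module; Spark 3.4+ reads TimestampNTZ natively.
        Map.of("spark.sql.iceberg.handle-timestamp-without-timezone", "true"),
        () -> {
          // queries over timestamp-without-zone columns would run here
        });
    spark.stop();
  }
}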
Do we still want to backport new features to Spark 3.3 given its support is deprecated?
I think it is best to keep the Spark versions as close as possible 👍
Here's what we say on "Deprecated".
Isn't this PR to achieve feature parity?
Yeah, I think @manuzhang is technically correct; generally we wouldn't backport to 3.3, and we'd remove 3.3 support in the 1.9 release anyway.
All that said, I'm not really opposed to getting this in for 3.3 (though I'd say we should document that it is supported for 3.3) unless there are strong objections. Going forward we should just be mindful of this, to make sure we don't increase our maintenance burden.
@manuzhang, I think this is a good idea. While we don't really expect people to use default values yet, Spark versions stay around a long time. Having this support helps ensure that there aren't correctness issues when people use this version with Spark 3.3 a few years from now. It's not strictly necessary, but since it wasn't very difficult (just porting the 3.4 changes) I thought it would be a good idea to do it. If you're against it, we can discuss more.
@amogh-jahagirdar @rdblue I agree with your rationale, but I'm confused about the criteria here. Shall we backport other features from 3.4 / 3.5 since they are also nice and not difficult to have? It might also be confusing to contributors / users that the meaning of deprecation seems arbitrary.
@manuzhang, this could be a correctness issue with Spark 3.3 and v3 tables, so I think it is an important fix. The language you're referencing is also trying to set expectations for other people, not limit what we will commit:

"People who are still interested in the version can backport any necessary feature or bug fix from newer versions, but the community will not spend effort in achieving feature parity."

I'm the one interested in backporting this to avoid potential problems, but there should still not be an expectation that the Iceberg community will backport everything just because the branch is still there.
Thanks for the explanation @rdblue, I missed the statement in the docs:

"People who are still interested in the version can backport any necessary feature or bug fix from newer versions, but the community will not spend effort in achieving feature parity."

Given that, and that default values probably should go in to avoid any future correctness issues if people use this version with Spark 3.3, I think it makes sense to get this in.
This backports support for default values from 3.5.
Each PR is backported as a separate commit: #11299, #11803, #11811, #11815, and #11832.
This contains the same changes as #11987.