Conversation

@gaborkaszab
Collaborator

No description provided.


/**
- * @deprecated will be removed in 2.0.0; use {@link PlannedDataReader} instead.
+ * @deprecated will be removed in 1.12.0; use {@link PlannedDataReader} instead.
Contributor

I would like to ask you to check this with the community first.

Contributor

deprecating this earlier makes sense to me

@gaborkaszab
Collaborator Author

FYI, if there is community support for this, I can make the same changes for older Spark versions within this PR.

@Test
public void testPositionDeletes() throws Exception {
@ParameterizedTest
@ValueSource(strings = {"avro", "parquet", "orc"})
Contributor

not sure why we're changing the test here when switching to PlannedDataReader

Collaborator Author

The reason is that the code I changed had no test coverage at all: we only tested rewriting Parquet pos deletes, but the change was on the Avro path. I can split that into a separate PR.

Contributor

hm ok I see, thanks for the clarification. We should probably do the parameterization at the class level, since we'll also need to run this test class with different format versions. Btw there's also #14351, which is working on some aspects of this, so it might be good to sync up the efforts on this test class

Collaborator Author

Thanks for pointing me to that PR! I left my thoughts there. I can wait for that one to be merged and add the file format dimension afterwards. If we do both in parallel, someone will have to resolve git conflicts, and there may be overlapping work.

Collaborator Author

@nastra I gave this a second thought: file format doesn't matter much for most of the tests in this suite, because they rewrite metadata files (only Avro atm), and only some tests also rewrite positional delete files. I observed that this suite runs pretty long, over 5 mins without parameterization, and once I added the 3 file formats as params, the runtime grew to almost 15 mins.
So I'd be careful about how many suite-level dimensions we introduce here, as they can significantly increase runtime and probably don't add much test coverage. I'd prefer to add the file format param only to the tests that are relevant for pos deletes. WDYT?
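
To make the runtime trade-off concrete, here is a rough back-of-the-envelope sketch. It assumes every test in the suite takes about the same time, which is a simplification, and the class name, the method names, and the test counts (40 total, 4 relevant) are hypothetical placeholders rather than numbers measured on this suite. Only the 5-minute baseline and the 3 formats come from the comment above.

```java
public class SuiteRuntimeSketch {
  // Suite-level parameterization: every test runs once per file format.
  static double suiteLevelMinutes(double baselineMinutes, int formats) {
    return baselineMinutes * formats;
  }

  // Targeted parameterization: only `relevantTests` out of `totalTests`
  // run once per format; all other tests still run exactly once.
  // Assumes uniform per-test runtime (a simplification).
  static double targetedMinutes(
      double baselineMinutes, int formats, int relevantTests, int totalTests) {
    double minutesPerTest = baselineMinutes / totalTests;
    return baselineMinutes + minutesPerTest * relevantTests * (formats - 1);
  }

  public static void main(String[] args) {
    double baseline = 5.0; // "over 5 mins" without parameterization
    int formats = 3; // avro, parquet, orc

    System.out.println("suite-level: " + suiteLevelMinutes(baseline, formats));
    // 4 relevant tests out of 40 total are hypothetical placeholder counts.
    System.out.println("targeted: " + targetedMinutes(baseline, formats, 4, 40));
  }
}
```

Under these assumptions the suite-level estimate triples with three formats, in line with the observed growth from about 5 to about 15 minutes, while the targeted variant adds only the extra runs of the format-sensitive tests.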

Contributor

my main issue is that using @ParameterizedTest for one test and @TestTemplate for everything else is not compatible (we eventually want to run this test with format versions 2 and 3, which is being added by #14351). Alternatively, maybe do the parameterization differently: have a testPositionDeletes(String format) without a @Test annotation, and then have @Test testPositionDeletesWithAvro() / @Test testPositionDeletesWithParquet(), which both call testPositionDeletes(String format)
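
The delegation pattern suggested above could look roughly like the sketch below. This is not the code from the PR: the class name and the `executedFormats` list are hypothetical, the shared body only records the format instead of rewriting delete files, and the ORC wrapper is an assumption based on the `"orc"` value in the diff. The JUnit annotations appear only in comments so that the snippet compiles without JUnit on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

public class PositionDeletesDelegationSketch {
  // Hypothetical bookkeeping so the example has an observable result.
  static final List<String> executedFormats = new ArrayList<>();

  // Shared test body. It deliberately carries no test annotation, so the
  // test engine would never discover it on its own. The real body would
  // rewrite position delete files written in `format`.
  static void testPositionDeletes(String format) {
    executedFormats.add(format);
  }

  // In the real test class each wrapper below would be annotated with
  // @Test (or @TestTemplate once the class is parameterized by format version).
  static void testPositionDeletesWithAvro() {
    testPositionDeletes("avro");
  }

  static void testPositionDeletesWithParquet() {
    testPositionDeletes("parquet");
  }

  // Assumed third wrapper; the comment above only names Avro and Parquet.
  static void testPositionDeletesWithOrc() {
    testPositionDeletes("orc");
  }

  public static void main(String[] args) {
    testPositionDeletesWithAvro();
    testPositionDeletesWithParquet();
    testPositionDeletesWithOrc();
    System.out.println(executedFormats);
  }
}
```

Because each wrapper is an ordinary test method, the format dimension stays local to this one test and does not depend on which mechanism parameterizes the rest of the class.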

Collaborator Author

Thanks for the feedback!
I found 4 tests that use positional deletes. I changed them to exercise all 3 formats. Let me know if this is what you meant!

Contributor

I think just having testPositionDeletes(String format) and running it across formats is fine IMO

Collaborator Author

sure, I reduced the scope of the test change to a single test.

@gaborkaszab force-pushed the main_remove_usage_avro_datareader branch 4 times, most recently from 5f5fcc3 to 6e6ef08 on October 27, 2025 15:25
@gaborkaszab
Collaborator Author

Thanks for taking a look @nastra, @huaxingao, @pvary!
Do you think I should add the same to earlier Spark version(s) within this PR or in a separate PR?
Spark 3.4 support is deprecated, but I see some people arguing to keep it for additional releases. So, to be able to drop avro/DataReader in 1.12 (and keep Spark 3.4 support for longer), we might still want to backport this PR to that version too.

@gaborkaszab force-pushed the main_remove_usage_avro_datareader branch from 6e6ef08 to ed2be06 on October 29, 2025 12:37
@gaborkaszab requested a review from nastra on October 30, 2025 13:22
@nastra merged commit 8685985 into apache:main on Oct 31, 2025
42 checks passed
@gaborkaszab
Collaborator Author

Thanks for the reviews @nastra @huaxingao !

4 participants