Consolidate Parquet Metadata handling into its own module and struct DFParquetMetadata #17127

Open: @alamb wants to merge 1 commit into main from alamb/extract_parquet_metadata_handling

Conversation

alamb
Contributor

@alamb alamb commented Aug 11, 2025

Which issue does this PR close?

Rationale for this change

As suggested by @nuno-faria here: #17022 (comment)

The number of options and flags being passed around to the various metadata handling
functions in the parquet code is getting somewhat out of hand.

For example, in #17022 from @shehabgamin, a significant portion
of the PR adds new parameters to existing functions (and to their tests) just to thread the new options through.
If this code were organized better it would be easier to maintain and extend.

Also, as we rely on the metadata cache more, it is important to ensure it is used in all the right places.

What changes are included in this PR?

Proposal:

  1. Extract the options into a struct DFParquetMetadata
  2. Deprecate the old functions
  3. Update the functions / tests to create the struct
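
A minimal, self-contained sketch of the idea behind step 1, not the code in this PR. All names here (`MetadataRequest`, `MetadataCache`, `FileMetadata`, `load_from_storage`) are hypothetical stand-ins, and the sketch is synchronous to stay dependency-free, whereas the real functions are async. It shows how options become builder methods on one struct, and how a single `fetch_metadata` entry point ensures the cache is consulted for every caller:

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for parsed Parquet footer metadata.
#[derive(Debug, Clone, PartialEq)]
pub struct FileMetadata {
    pub num_rows: u64,
}

/// Hypothetical metadata cache keyed by file location.
#[derive(Default)]
pub struct MetadataCache {
    entries: HashMap<String, FileMetadata>,
    pub hits: usize,
    pub misses: usize,
}

/// Bundles everything needed to load metadata for one file, so a new
/// option becomes a new builder method rather than a new argument on
/// every function in the call chain.
pub struct MetadataRequest<'a> {
    location: &'a str,
    size_hint: Option<usize>,
    cache: Option<&'a mut MetadataCache>,
}

impl<'a> MetadataRequest<'a> {
    pub fn new(location: &'a str) -> Self {
        Self { location, size_hint: None, cache: None }
    }

    pub fn with_size_hint(mut self, size_hint: Option<usize>) -> Self {
        self.size_hint = size_hint;
        self
    }

    pub fn with_cache(mut self, cache: Option<&'a mut MetadataCache>) -> Self {
        self.cache = cache;
        self
    }

    /// Single entry point: the cache lookup lives here, so no caller can skip it.
    pub fn fetch_metadata(&mut self) -> FileMetadata {
        if let Some(cache) = self.cache.as_deref_mut() {
            if let Some(hit) = cache.entries.get(self.location).cloned() {
                cache.hits += 1;
                return hit;
            }
            cache.misses += 1;
        }
        let loaded = load_from_storage(self.location, self.size_hint);
        if let Some(cache) = self.cache.as_deref_mut() {
            cache.entries.insert(self.location.to_string(), loaded.clone());
        }
        loaded
    }
}

/// Placeholder for the real I/O and decoding; the returned value is made up.
fn load_from_storage(location: &str, _size_hint: Option<usize>) -> FileMetadata {
    FileMetadata { num_rows: location.len() as u64 }
}
```

With this shape, a caller builds a request such as `MetadataRequest::new("a.parquet").with_cache(Some(&mut cache))` and calls `fetch_metadata()` on it; a second request for the same location is then served from the cache.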

Are these changes tested?

Yes, it is all covered by existing unit tests (updated in this PR).

Are there any user-facing changes?

@github-actions bot added the core (Core DataFusion crate) and datasource (Changes to the datasource crate) labels on Aug 11, 2025
@alamb changed the title from "Consolidate Parquet Metadata handling" to "Consolidate Parquet Metadata handling into its own module and struct DFParquetMetadata" on Aug 11, 2025
@alamb force-pushed the alamb/extract_parquet_metadata_handling branch from 8c2a99f to d993b04 on August 11, 2025 18:49
Some(ctx.runtime_env().cache_manager.get_file_metadata_cache()),
)
.await?;
let file_metadata_cache =
Contributor Author

This shows the key API difference: instead of calling a bunch of free functions, you now construct a DFParquetMetadata and call methods on that struct.

Contributor

It looks way cleaner now.

// Increases by 3 because cache has no entries yet
fetch_parquet_metadata(
Contributor Author

I think the new struct makes it much clearer what is being tested versus what is test setup, and I find the updated tests much easier to read.

@@ -306,30 +301,6 @@ fn clear_metadata(
})
}

async fn fetch_schema_with_location(
Contributor Author

Most of this PR moves code from this module into metadata.rs.

@@ -1038,98 +1015,32 @@ impl MetadataFetch for ObjectStoreFetch<'_> {
/// through [`ParquetFileReaderFactory`].
///
/// [`ParquetFileReaderFactory`]: crate::ParquetFileReaderFactory
pub async fn fetch_parquet_metadata<F: MetadataFetch>(
fetch: F,
#[deprecated(
Contributor Author

I left all the existing public APIs in place, deprecated them, and updated them to call the new DFParquetMetadata struct.
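
As an illustration of that deprecation pattern, here is a small sketch under stated assumptions: `MetadataLoader` and `initial_read_len` are hypothetical names, and the `since` value is a placeholder rather than a real release number. The old free function keeps its signature, gains a `#[deprecated]` attribute, and simply forwards to the struct, so existing callers keep compiling (with a warning) and get identical behavior:

```rust
/// Hypothetical struct-based API: options are fields set through builder methods.
#[derive(Debug, Default, Clone, Copy)]
pub struct MetadataLoader {
    size_hint: Option<usize>,
}

impl MetadataLoader {
    pub fn new() -> Self {
        Self::default()
    }

    pub fn with_size_hint(mut self, size_hint: Option<usize>) -> Self {
        self.size_hint = size_hint;
        self
    }

    /// Number of trailing bytes to request first: at least the 8-byte
    /// Parquet footer, at most the whole file.
    pub fn initial_read_len(&self, file_size: usize) -> usize {
        self.size_hint.unwrap_or(8).max(8).min(file_size)
    }
}

/// Old free function, kept for backwards compatibility.
/// The `since` value below is a placeholder, not a real version.
#[deprecated(since = "0.0.0", note = "use `MetadataLoader::initial_read_len` instead")]
pub fn initial_read_len(file_size: usize, size_hint: Option<usize>) -> usize {
    // Delegate so that old and new entry points cannot drift apart.
    MetadataLoader::new()
        .with_size_hint(size_hint)
        .initial_read_len(file_size)
}
```

Because the wrapper only delegates, the logic lives in one place, and the deprecated functions can later be removed without touching the implementation.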

@@ -1935,40 +1688,9 @@ async fn output_single_parquet_file_parallelized(
Ok(file_metadata)
}

/// Min/max aggregation can take Dictionary encode input but always produces unpacked
Contributor Author

I am quite pleased that most of the statistics handling is now consolidated into its own module.

file_meta.object_meta.location,
))
})
// TODO should there be metadata prefetch hint here?
Contributor Author

The metadata prefetch hint isn't passed here (it isn't on main either) but this refactor leads me to believe it might be helpful to do so 🤔

From a user's perspective, I think it makes sense that the metadata prefetch option should apply everywhere metadata is fetched. It can be quite confusing when you change an option and either see no change at all (positive, negative, system resource usage, etc.) or, perhaps even worse, see inconsistent changes depending on the workflow (e.g. "Why do queries for table X use twice the network hops, but table Y uses 50% more bandwidth?").

Contributor Author

@alamb alamb Aug 11, 2025

Yeah, in theory it should be controlled by a config option: https://datafusion.apache.org/user-guide/configs.html

datafusion.execution.parquet.metadata_size_hint (default: NULL)

I haven't traced down why that one is not used here.

Contributor

At the time I only set the hint in inner = inner.with_footer_size_hint(hint), and then in get_metadata we would read it like so: reader.try_load(&mut self.inner, object_meta.size).await?; Yes, it's better if we pass it to DFParquetMetadata.
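
To make the round-trip versus bandwidth trade-off in this thread concrete, here is a rough, self-contained sketch, not DataFusion's or the parquet crate's implementation. `CountingStore`, `fake_parquet_file`, and `fetch_metadata_bytes` are hypothetical helpers. It relies only on the Parquet file layout, in which a file ends with the metadata, a 4-byte little-endian metadata length, and the magic bytes `PAR1`. The fake file omits the leading magic bytes and real column data, and the code assumes a well-formed file of at least 8 bytes:

```rust
/// In-memory stand-in for an object store that counts range requests.
pub struct CountingStore {
    bytes: Vec<u8>,
    pub requests: usize,
}

impl CountingStore {
    pub fn new(bytes: Vec<u8>) -> Self {
        Self { bytes, requests: 0 }
    }

    pub fn len(&self) -> usize {
        self.bytes.len()
    }

    /// Each call models one network round trip.
    pub fn get_range(&mut self, start: usize, end: usize) -> Vec<u8> {
        self.requests += 1;
        self.bytes[start..end].to_vec()
    }
}

/// Length of the fixed footer: 4-byte metadata length + 4-byte magic.
const FOOTER_LEN: usize = 8;

/// Builds `data | metadata | metadata_len (u32 LE) | "PAR1"`.
pub fn fake_parquet_file(data_len: usize, metadata: &[u8]) -> Vec<u8> {
    let mut file = vec![0u8; data_len];
    file.extend_from_slice(metadata);
    file.extend_from_slice(&(metadata.len() as u32).to_le_bytes());
    file.extend_from_slice(b"PAR1");
    file
}

/// Fetches the raw metadata bytes, optionally prefetching `size_hint`
/// trailing bytes in the first request.
pub fn fetch_metadata_bytes(store: &mut CountingStore, size_hint: Option<usize>) -> Vec<u8> {
    let file_len = store.len();
    // Always read at least the fixed footer, never more than the file.
    let first_len = size_hint.unwrap_or(FOOTER_LEN).max(FOOTER_LEN).min(file_len);
    let suffix_start = file_len - first_len;
    let suffix = store.get_range(suffix_start, file_len);

    let tail = &suffix[suffix.len() - FOOTER_LEN..];
    assert_eq!(&tail[4..], b"PAR1", "missing Parquet magic bytes");
    let metadata_len = u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]) as usize;
    let metadata_start = file_len - FOOTER_LEN - metadata_len;

    if metadata_start >= suffix_start {
        // The prefetched suffix already contains the metadata: one request total.
        let offset = metadata_start - suffix_start;
        suffix[offset..offset + metadata_len].to_vec()
    } else {
        // Hint absent or too small: a second request is needed.
        store.get_range(metadata_start, file_len - FOOTER_LEN)
    }
}
```

In this model, a hint that covers the metadata plus footer turns two requests into one, at the cost of possibly reading more bytes than needed. That is why applying the hint on some code paths but not others can produce the inconsistent behavior described above.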

@alamb force-pushed the alamb/extract_parquet_metadata_handling branch from d993b04 to ef90d05 on August 11, 2025 19:03
@github-actions bot added the common (Related to common crate) label on Aug 11, 2025
@alamb marked this pull request as ready for review on August 11, 2025 19:07
@nuno-faria
Contributor

LGTM, it's a much cleaner API.

Labels
common Related to common crate core Core DataFusion crate datasource Changes to the datasource crate
3 participants