-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Consolidate Parquet Metadata handling into its own module and struct DFParquetMetadata
#17127
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
DFParquetMetadata
8c2a99f
to
d993b04
Compare
Some(ctx.runtime_env().cache_manager.get_file_metadata_cache()), | ||
) | ||
.await?; | ||
let file_metadata_cache = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this shows the key API difference -- instead of calling a bunch of free functions, you now construct a DFParquetMetadata
and call methods on that struct instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks way cleaner now.
// Increases by 3 because cache has no entries yet | ||
fetch_parquet_metadata( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the new struct makes it much clearer what is being tested vs what is test setup functionality and I find the updated tests to be much easier to read
@@ -306,30 +301,6 @@ fn clear_metadata( | |||
}) | |||
} | |||
|
|||
async fn fetch_schema_with_location( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of this PR is moving code in this module into metadata.rs
@@ -1038,98 +1015,32 @@ impl MetadataFetch for ObjectStoreFetch<'_> { | |||
/// through [`ParquetFileReaderFactory`]. | |||
/// | |||
/// [`ParquetFileReaderFactory`]: crate::ParquetFileReaderFactory | |||
pub async fn fetch_parquet_metadata<F: MetadataFetch>( | |||
fetch: F, | |||
#[deprecated( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left all the existing public APIs and deprecated them, and updated them to call the new DFParquetMetadata structure
@@ -1935,40 +1688,9 @@ async fn output_single_parquet_file_parallelized( | |||
Ok(file_metadata) | |||
} | |||
|
|||
/// Min/max aggregation can take Dictionary encode input but always produces unpacked |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am quite please that most of the statistics handling is now consolidated into its own module
file_meta.object_meta.location, | ||
)) | ||
}) | ||
// TODO should there be metadata prefetch hint here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The metadata prefetch hint isn't passed here (it isn't on main
either) but this refactor leads me to believe it might be helpful to do so 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From a user's perspective, I think it makes sense that the metadata prefetch option should apply everywhere metadata is fetched. It can be quite confusing when you change an option and either see no change at all (positive, negative, system resource usage etc.), or perhaps even worse, inconsistent change based on a specific workflow (e.g. "Why do queries for table X use twice the network hops, but table Y uses 50% more bandwidth?")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, in theory it should be controlled by a config option: https://datafusion.apache.org/user-guide/configs.html
datafusion.execution.parquet.metadata_size_hint | NULL |
---|
I haven't traced down why that one is not used here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the time I only set the hint in inner = inner.with_footer_size_hint(hint)
, and then in get_metadata
we would read it like so: reader.try_load(&mut self.inner, object_meta.size).await?;
. Yes its better if we pass it to DFParquetMetadata
.
d993b04
to
ef90d05
Compare
LGTM, its a much cleaner API. |
Which issue does this PR close?
Rationale for this change
As suggested by @nuno-faria here: #17022 (comment)
The number of options and flags that are being passed around to the various metadata handling
function in the parquet code is getting somewhat out of hand
For example in #17022 from @shehabgamin a significant portion
of the PR is adding new options to existing functions to thread through the new options
and the tests. If we had this code organized better it would be easier to maintain and extend.
Also, as we use the caching more it is important to ensure it is used in all the right places.
What changes are included in this PR?
Proposal:
DFParquetMetadata
Are these changes tested?
yes, it is all covered by existing unit tests (changed in this PR)
Are there any user-facing changes?