Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ArrowToParquetSchemaConverter, deprecate arrow_to_parquet_schema #6840

Merged
merged 21 commits into from
Dec 16, 2024

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Dec 5, 2024

Which issue does this PR close?

Rationale for this change

While working on the upgrade to arrow 54 in DataFusion apache/datafusion#13663 I found it was not clear what the coerce_types argument in arrow_to_parquet_schema meant or what value to pick to keep the existing behavior.

I had to go digging on #6313 to find what the default / old behavior was (true or false)

Turns out the answer is

    let schema_desc = arrow_to_parquet_schema(&schema, false)?;

It is clearly documented on WriterProperties but not on the lower level apis

Thus I would like to propose making a new API for converting schemas where the defaults are clearer and deprecate arrow_to_parquet_schema (and revert the unnecessary breaking API change)

What changes are included in this PR?

  1. Add ArrowToParquetSchemaConverter,
  2. Lots of comments / examples
  3. Revert the changes to arrow_to_parquet_schema made in feat(parquet)!: coerce_types flag for date64 #6313
  4. Deprecate arrow_to_parquet_schema with a clear message about what to use instread

Are there any user-facing changes?

A new API and a deprecation of an existing one

@github-actions github-actions bot added the parquet Changes to the parquet crate label Dec 5, 2024
@alamb alamb force-pushed the alamb/parquet_coverter branch from b0d96be to 07a754c Compare December 5, 2024 22:09
@alamb alamb marked this pull request as draft December 5, 2024 22:10
arrow_to_parquet_schema_with_root(schema, "arrow_schema", coerce_types)
}
#[deprecated(since = "54.0.0", note = "Use `ArrowToParquetSchemaConverter` instead")]
pub fn arrow_to_parquet_schema(schema: &Schema) -> Result<SchemaDescriptor> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


/// Convert arrow schema to parquet schema specifying the name of the root schema element
pub fn arrow_to_parquet_schema_with_root(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turns out this function is not actually exported (it is pub in this module, but not pub exported):

https://docs.rs/parquet/latest/parquet/?search=arrow_to_parquet_schema_with_root

Returns no results

The compiler told me it was unused once I switched everything over to use ArrowToParquetSchemaConverter

///
/// Setting this option to `true` will result in parquet files that can be
/// read by more readers, but may lose precision for arrow types such as
/// [`DataType::Date64`] which have no direct corresponding Parquet type.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also felt it would help if we described more explicitly what the impact of enabling this option was (and left a link to the longer backstory)

@alamb alamb force-pushed the alamb/parquet_coverter branch from 29d2509 to 97d8991 Compare December 5, 2024 22:39
@etseidl
Copy link
Contributor

etseidl commented Dec 5, 2024

@alamb could you take a look at #6828? Some of the doc changes will conflict, at least. +1 on the idea, though 😄

Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Big improvement, thanks!

@@ -225,27 +225,130 @@ pub(crate) fn add_encoded_arrow_schema_to_metadata(schema: &Schema, props: &mut
}
}

/// Convert arrow schema to parquet schema
/// Converter for arrow schema to parquet schema
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docs are pretty inconsistent on whether "arrow" and "parquet" are capitalized or not. Do you have a preference?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should strive for always capitalizing them Arrow and Parquet -- i will fix

parquet/src/arrow/schema/mod.rs Outdated Show resolved Hide resolved
parquet/src/arrow/schema/mod.rs Outdated Show resolved Hide resolved
parquet/src/arrow/schema/mod.rs Outdated Show resolved Hide resolved
parquet/src/arrow/schema/mod.rs Outdated Show resolved Hide resolved
Comment on lines 784 to 786
/// Setting this option to `true` will result in parquet files that can be
/// read by more readers, but may lose precision for arrow types such as
/// [`DataType::Date64`] which have no direct corresponding Parquet type.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should still mention the naming impacts here. There seem to be a number of users who want Parquet naming, and I think this is the first place they'd look to figure out how to do so. I'm not sure I'd think to follow the link to the converter when I want my lists named properly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree - I pushed e7b7d20 to try and clarify -- let me know what you think

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

Comment on lines 1490 to 1594
let converted_arrow_schema = arrow_to_parquet_schema(&arrow_schema, true).unwrap();
let converted_arrow_schema = ArrowToParquetSchemaConverter::new(&arrow_schema)
.with_coerce_types(true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love how this makes what's going on more explicit.

@alamb alamb marked this pull request as ready for review December 6, 2024 14:45
Copy link
Contributor Author

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the review @etseidl

parquet/src/arrow/schema/mod.rs Outdated Show resolved Hide resolved
Comment on lines 784 to 786
/// Setting this option to `true` will result in parquet files that can be
/// read by more readers, but may lose precision for arrow types such as
/// [`DataType::Date64`] which have no direct corresponding Parquet type.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree - I pushed e7b7d20 to try and clarify -- let me know what you think

Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just the doc CI failure to fix. Otherwise looks ready to go (mod capitalization which can be a follow on 😉). Thanks again @alamb.

parquet/src/file/properties.rs Outdated Show resolved Hide resolved
@alamb
Copy link
Contributor Author

alamb commented Dec 7, 2024

Just the doc CI failure to fix. Otherwise looks ready to go (mod capitalization which can be a follow on 😉). Thanks again @alamb.

Thank you 🙏

I fixed a few more capitalization inconsistencies and added a link to the date section of the parquet spec too

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did wonder if this should instead be an options struct that gets passed to a method (or has a member function that accepts the actual schema). This would allow using the same config for multiple files, and is perhaps more consistent with other APIs?

parquet/src/arrow/schema/mod.rs Show resolved Hide resolved
parquet/src/arrow/schema/mod.rs Outdated Show resolved Hide resolved
@alamb
Copy link
Contributor Author

alamb commented Dec 16, 2024

I did wonder if this should instead be an options struct that gets passed to a method (or has a member function that accepts the actual schema). This would allow using the same config for multiple files, and is perhaps more consistent with other APIs?

I agree, as there is no state held in the converter there is no reason for it to be a consuming builder. I made this change in 0ab4eb4

@alamb alamb merged commit 2808625 into apache:main Dec 16, 2024
16 checks passed
@alamb
Copy link
Contributor Author

alamb commented Dec 16, 2024

Thanks again @etseidl and @tustvold

@alamb alamb deleted the alamb/parquet_coverter branch December 16, 2024 21:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants