Add ArrowToParquetSchemaConverter, deprecate arrow_to_parquet_schema #6840
@@ -225,27 +225,133 @@ pub(crate) fn add_encoded_arrow_schema_to_metadata(schema: &Schema, props: &mut
    }
}

/// Convert arrow schema to parquet schema
/// Converter for arrow schema to parquet schema
///
/// The name of the root schema element defaults to `"arrow_schema"`, this can be
/// overridden with [`arrow_to_parquet_schema_with_root`]
pub fn arrow_to_parquet_schema(schema: &Schema, coerce_types: bool) -> Result<SchemaDescriptor> {
    arrow_to_parquet_schema_with_root(schema, "arrow_schema", coerce_types)
/// Example:
/// ```
/// # use std::sync::Arc;
/// # use arrow_schema::{Field, Schema, DataType};
/// # use parquet::arrow::ArrowToParquetSchemaConverter;
/// use parquet::schema::types::{SchemaDescriptor, Type};
/// use parquet::basic; // note there are two `Type`s in the following example
/// let arrow_schema = Schema::new(vec![
///     Field::new("a", DataType::Int64, true),
///     Field::new("b", DataType::Date32, true),
/// ]);
///
/// let parquet_schema = ArrowToParquetSchemaConverter::new(&arrow_schema)
///     .build()
///     .unwrap();
///
/// let expected_parquet_schema = SchemaDescriptor::new(
///     Arc::new(
///         Type::group_type_builder("arrow_schema")
///             .with_fields(vec![
///                 Arc::new(
///                     Type::primitive_type_builder("a", basic::Type::INT64)
///                         .build().unwrap()
///                 ),
///                 Arc::new(
///                     Type::primitive_type_builder("b", basic::Type::INT32)
///                         .with_converted_type(basic::ConvertedType::DATE)
///                         .with_logical_type(Some(basic::LogicalType::Date))
///                         .build().unwrap()
///                 ),
///             ])
///             .build().unwrap()
///     )
/// );
///
/// assert_eq!(parquet_schema, expected_parquet_schema);
/// ```
#[derive(Debug)]
pub struct ArrowToParquetSchemaConverter<'a> {
    /// The schema to convert
    schema: &'a Schema,
    /// Name of the root schema in Parquet
    schema_root: &'a str,
    /// Should we coerce arrow types to compatible Parquet types?
    ///
    /// See docs on [`Self::with_coerce_types`]
    coerce_types: bool,
}

/// Convert arrow schema to parquet schema specifying the name of the root schema element
pub fn arrow_to_parquet_schema_with_root(

Review comment: It turns out this function is not actually exported (it is pub in this module, but not pub exported): https://docs.rs/parquet/latest/parquet/?search=arrow_to_parquet_schema_with_root returns no results. The compiler told me it was unused once I switched everything over to the new converter.

    schema: &Schema,
    root: &str,
    coerce_types: bool,
) -> Result<SchemaDescriptor> {
    let fields = schema
        .fields()
        .iter()
        .map(|field| arrow_to_parquet_type(field, coerce_types).map(Arc::new))
        .collect::<Result<_>>()?;
    let group = Type::group_type_builder(root).with_fields(fields).build()?;
    Ok(SchemaDescriptor::new(Arc::new(group)))
impl<'a> ArrowToParquetSchemaConverter<'a> {
    /// Create a new converter
    pub fn new(schema: &'a Schema) -> Self {
        Self {
            schema,
            schema_root: "arrow_schema",
            coerce_types: false,
        }
    }

    /// Should arrow types be coerced into parquet native types (default `false`).
    ///
    /// Setting this option to `true` will result in parquet files that can be
    /// read by more readers, but may lose precision for arrow types such as
    /// [`DataType::Date64`] which have no direct corresponding Parquet type.
    ///
    /// By default, this converter does not coerce to native parquet types. Enabling type
    /// coercion allows for meaningful representations that do not require
    /// downstream readers to consider the embedded Arrow schema, and can allow
    /// for greater compatibility with other Parquet implementations. However,
    /// type coercion also prevents data from being losslessly round-tripped.
    ///
    /// # Discussion
    ///
    /// Some Arrow types such as `Date64`, `Timestamp` and `Interval` have no
    /// corresponding Parquet logical type. Thus, they can not be losslessly
    /// round-tripped when stored using the appropriate Parquet logical type.
    /// For example, some Date64 values may be truncated when stored with
    /// parquet's native 32 bit date type.
    ///
    /// For [`List`] and [`Map`] types, some
    /// Parquet readers expect certain schema elements to have specific names
    /// (earlier versions of the spec were somewhat ambiguous on this point).
    /// Type coercion will use the names prescribed by the Parquet specification,
    /// potentially losing naming metadata from the Arrow schema.
    ///
    /// [`List`]: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists
    /// [`Map`]: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps
    ///
    pub fn with_coerce_types(mut self, coerce_types: bool) -> Self {
        self.coerce_types = coerce_types;
        self
    }

    /// Set the root schema element name (defaults to `"arrow_schema"`).
    pub fn schema_root(mut self, schema_root: &'a str) -> Self {
        self.schema_root = schema_root;
        self
    }

    /// Build the desired parquet [`SchemaDescriptor`]
    pub fn build(self) -> Result<SchemaDescriptor> {
        let Self {
            schema,
            schema_root: root_schema_name,
            coerce_types,
        } = self;
        let fields = schema
            .fields()
            .iter()
            .map(|field| arrow_to_parquet_type(field, coerce_types).map(Arc::new))
            .collect::<Result<_>>()?;
        let group = Type::group_type_builder(root_schema_name)
            .with_fields(fields)
            .build()?;
        Ok(SchemaDescriptor::new(Arc::new(group)))
    }
}
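
For readers skimming the diff, here is a minimal sketch of how the builder is intended to be chained; the field name, root name, and `main` wrapper below are illustrative and not part of this PR:

```rust
use arrow_schema::{DataType, Field, Schema};
use parquet::arrow::ArrowToParquetSchemaConverter;

fn main() -> parquet::errors::Result<()> {
    // A schema with a Date64 column, which has no exact Parquet equivalent
    let arrow_schema = Schema::new(vec![Field::new("ts", DataType::Date64, false)]);

    // Coerce Date64 to Parquet's native 32-bit DATE and override the root name
    let parquet_schema = ArrowToParquetSchemaConverter::new(&arrow_schema)
        .with_coerce_types(true)
        .schema_root("my_root")
        .build()?;

    assert_eq!(parquet_schema.root_schema().name(), "my_root");
    Ok(())
}
```

Since `build()` consumes the converter, the chain ends in a single call that yields the `SchemaDescriptor`.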

/// Convert arrow schema to parquet schema
///
/// The name of the root schema element defaults to `"arrow_schema"`, this can be
/// overridden with [`ArrowToParquetSchemaConverter`]
#[deprecated(since = "54.0.0", note = "Use `ArrowToParquetSchemaConverter` instead")]
pub fn arrow_to_parquet_schema(schema: &Schema) -> Result<SchemaDescriptor> {

Review comment: This reverts the changes made in https://github.com/apache/arrow-rs/pull/6313/files#diff-6a684124e78f254fa6ea23b98f04be0a3ff5a2e4b24890d9c6f4dd850ba11333L232

    ArrowToParquetSchemaConverter::new(schema).build()
}
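
A hedged migration sketch for callers of the deprecated free function: the `coerce_types` parameter added in #6313 is no longer part of the signature, so code that relied on it moves to the builder (`to_parquet_schema` below is an illustrative wrapper name, not an API in this PR):

```rust
use arrow_schema::Schema;
use parquet::arrow::ArrowToParquetSchemaConverter;
use parquet::errors::Result;
use parquet::schema::types::SchemaDescriptor;

// Before: arrow_to_parquet_schema(arrow_schema, true)
// After: spell out the coercion explicitly via the builder
fn to_parquet_schema(arrow_schema: &Schema) -> Result<SchemaDescriptor> {
    ArrowToParquetSchemaConverter::new(arrow_schema)
        .with_coerce_types(true)
        .build()
}
```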

fn parse_key_value_metadata(

@@ -1487,7 +1593,10 @@ mod tests {
        ";
        let parquet_group_type = parse_message_type(message_type).unwrap();
        let parquet_schema = SchemaDescriptor::new(Arc::new(parquet_group_type));
        let converted_arrow_schema = arrow_to_parquet_schema(&arrow_schema, true).unwrap();
        let converted_arrow_schema = ArrowToParquetSchemaConverter::new(&arrow_schema)
            .with_coerce_types(true)

Review comment: I love how this makes what's going on more explicit.

            .build()
            .unwrap();
        assert_eq!(
            parquet_schema.columns().len(),
            converted_arrow_schema.columns().len()

@@ -1511,7 +1620,10 @@ mod tests {
        ";
        let parquet_group_type = parse_message_type(message_type).unwrap();
        let parquet_schema = SchemaDescriptor::new(Arc::new(parquet_group_type));
        let converted_arrow_schema = arrow_to_parquet_schema(&arrow_schema, false).unwrap();
        let converted_arrow_schema = ArrowToParquetSchemaConverter::new(&arrow_schema)
            .with_coerce_types(false)
            .build()
            .unwrap();
        assert_eq!(
            parquet_schema.columns().len(),
            converted_arrow_schema.columns().len()

@@ -1667,7 +1779,9 @@ mod tests {
            Field::new("decimal256", DataType::Decimal256(39, 2), false),
        ];
        let arrow_schema = Schema::new(arrow_fields);
        let converted_arrow_schema = arrow_to_parquet_schema(&arrow_schema, false).unwrap();
        let converted_arrow_schema = ArrowToParquetSchemaConverter::new(&arrow_schema)
            .build()
            .unwrap();

        assert_eq!(
            parquet_schema.columns().len(),

@@ -1704,9 +1818,10 @@
            false,
        )];
        let arrow_schema = Schema::new(arrow_fields);
        let converted_arrow_schema = arrow_to_parquet_schema(&arrow_schema, true);
        let converted_arrow_schema = ArrowToParquetSchemaConverter::new(&arrow_schema)
            .with_coerce_types(true)
            .build();

        assert!(converted_arrow_schema.is_err());
        converted_arrow_schema.unwrap();
    }

@@ -1976,7 +2091,9 @@ mod tests {
        // don't pass metadata so field ids are read from Parquet and not from serialized Arrow schema
        let arrow_schema = crate::arrow::parquet_to_arrow_schema(&schema_descriptor, None)?;

        let parq_schema_descr = crate::arrow::arrow_to_parquet_schema(&arrow_schema, true)?;
        let parq_schema_descr = crate::arrow::ArrowToParquetSchemaConverter::new(&arrow_schema)
            .with_coerce_types(true)
            .build()?;
        let parq_fields = parq_schema_descr.root_schema().get_fields();
        assert_eq!(parq_fields.len(), 2);
        assert_eq!(parq_fields[0].get_basic_info().id(), 1);

@@ -16,14 +16,13 @@
// under the License.

//! Configuration via [`WriterProperties`] and [`ReaderProperties`]
use std::str::FromStr;
use std::{collections::HashMap, sync::Arc};

use crate::basic::{Compression, Encoding};
use crate::compression::{CodecOptions, CodecOptionsBuilder};
use crate::file::metadata::KeyValue;
use crate::format::SortingColumn;
use crate::schema::types::ColumnPath;
use std::str::FromStr;
use std::{collections::HashMap, sync::Arc};

/// Default value for [`WriterProperties::data_page_size_limit`]
pub const DEFAULT_PAGE_SIZE: usize = 1024 * 1024;

@@ -780,22 +779,16 @@ impl WriterPropertiesBuilder {
        self
    }

    /// Sets flag to control if type coercion is enabled (defaults to `false`).
    /// Should the writer coerce types to parquet native types (defaults to `false`).
    ///
    /// # Notes
    /// Some Arrow types do not have a corresponding Parquet logical type.
    /// Affected Arrow data types include `Date64`, `Timestamp` and `Interval`.
    /// Also, for [`List`] and [`Map`] types, Parquet expects certain schema elements
    /// to have specific names to be considered fully compliant.
    /// Writers have the option to coerce these types and names to match those required
    /// by the Parquet specification.
    /// This type coercion allows for meaningful representations that do not require
    /// downstream readers to consider the embedded Arrow schema, and can allow for greater
    /// compatibility with other Parquet implementations. However, type
    /// coercion also prevents the data from being losslessly round-tripped.
    ///
    /// [`List`]: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists
    /// [`Map`]: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps
    /// Setting this option to `true` will result in parquet files that can be
    /// read by more readers, but may lose precision for arrow types such as
    /// [`DataType::Date64`] which have no direct corresponding Parquet type.

Review comment: I think we should still mention the naming impacts here. There seem to be a number of users who want Parquet naming, and I think this is the first place they'd look to figure out how to do so. I'm not sure I'd think to follow the link to the converter when I want my lists named properly.

Reply: I agree - I pushed e7b7d20 to try and clarify -- let me know what you think.

Reply: 💯

    ///
    /// See [`ArrowToParquetSchemaConverter::with_coerce_types`] for more details
    ///
    /// [`DataType::Date64`]: arrow_schema::DataType::Date64
    /// [`ArrowToParquetSchemaConverter::with_coerce_types`]: crate::arrow::ArrowToParquetSchemaConverter::with_coerce_types
    pub fn set_coerce_types(mut self, coerce_types: bool) -> Self {
        self.coerce_types = coerce_types;
        self
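
For context, a minimal sketch (not part of this diff) of flipping this flag when building writer properties; handing the resulting `props` to an `ArrowWriter` is assumed:

```rust
use parquet::file::properties::WriterProperties;

fn main() {
    // Opt in to Parquet-native types (e.g. 32-bit DATE for Date64) and
    // spec-prescribed List/Map element names, at the cost of lossless
    // Arrow round-tripping.
    let props = WriterProperties::builder()
        .set_coerce_types(true)
        .build();

    // `props` would then be passed to an ArrowWriter when writing a file.
    let _ = props;
}
```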

Review comment: The docs are pretty inconsistent on whether "arrow" and "parquet" are capitalized or not. Do you have a preference?

Reply: I think we should strive for always capitalizing them: `Arrow` and `Parquet` -- I will fix.