Parquet: clear metadata and project fields of ParquetRecordBatchStream::schema #5135
Conversation
Whilst I agree with the spirit of this, stripping this metadata would be a major breaking change that would definitely break IOx and possibly some other workloads. Perhaps we could just document this potential disparity?
No problem, updated the docstring
```rust
/// Note that unlike its synchronous counterpart [`ParquetRecordBatchReader`], the [`SchemaRef`]
/// returned here will contain the original metadata, whereas [`ParquetRecordBatchReader`]
/// strips this metadata.
```
Are you sure about this? They use the same logic; the difference is that the returned RecordBatches lack the metadata.
Yes, see how the schema is created for ParquetRecordBatchReader
here:

```rust
schema: Arc::new(Schema::new(levels.fields.clone())),
```
And here:
In arrow-rs/parquet/src/arrow/arrow_reader/mod.rs (lines 622 to 625 at 093a10e):

```rust
let schema = match array_reader.get_data_type() {
    ArrowType::Struct(ref fields) => Schema::new(fields.clone()),
    _ => unreachable!("Struct array reader's data type is not struct!"),
};
```
They only account for the fields, ignoring the metadata.
OK, sorry for going back and forth on this; I hadn't quite grasped what the issue here was.
I think this demonstrates the issue: in particular, the schema returned by ParquetRecordBatchStream is incorrect; it should return the projected schema with the metadata removed.
I'll revert back to the initial change, and also account for projection, which I wasn't aware it didn't handle either.
I think if you determine the schema from the ParquetField on ReaderFactory (which should be a DataType::Struct), you might be able to get both in one.
Re-implemented the metadata stripping for
```diff
@@ -385,13 +385,28 @@ impl<T: AsyncFileReader + Send + 'static> ParquetRecordBatchStreamBuilder<T> {
             offset: self.offset,
         };

+        // Ensure schema of ParquetRecordBatchStream respects projection, and does
+        // not store metadata (same as for ParquetRecordBatchReader and emitted RecordBatches)
+        let projected_fields = match reader.fields.as_deref().map(|pf| &pf.arrow_type) {
```
This isn't quite correct in the case of a projection of a nested schema; it's possible this isn't actually determined until the ArrayReader is constructed...
Although https://docs.rs/parquet/latest/parquet/arrow/fn.parquet_to_arrow_field_levels.html takes a ProjectionMask and should apply it already?
I was a bit worried about this, as I couldn't find a straightforward way in which the schema is constructed from ParquetField + ProjectionMask; it does indeed seem to happen during ArrayReader construction.
Edit: I wasn't aware of https://docs.rs/parquet/latest/parquet/arrow/fn.parquet_to_arrow_field_levels.html, will check it out 👍
I'll have a quick play, stand by
I've reworked it to make use of the function introduced by #5149. Hopefully I've used it correctly here.
Which issue does this PR close?
Closes #4023
Rationale for this change
What changes are included in this PR?
Ensure the metadata hashmap of the Schema returned from
ParquetRecordBatchStream::schema
is empty, so that it exactly matches the schema of the RecordBatches it outputs.
Are there any user-facing changes?