concat_batches errors with "schema mismatch" error when only metadata differs #4799

Closed
alamb opened this issue Sep 8, 2023 · 3 comments · Fixed by #4815
Labels
arrow Changes to the arrow crate bug

Comments


alamb commented Sep 8, 2023

Describe the bug
When concatenating multiple RecordBatches whose schemas differ only in metadata, concat_batches returns a "schema mismatch" error.

To Reproduce
Run this test:

diff --git a/arrow-select/src/concat.rs b/arrow-select/src/concat.rs
index 31846ee1fd..045bb313bc 100644
--- a/arrow-select/src/concat.rs
+++ b/arrow-select/src/concat.rs
@@ -142,7 +142,7 @@ mod tests {
     use super::*;
     use arrow_array::cast::AsArray;
     use arrow_schema::{Field, Schema};
-    use std::sync::Arc;
+    use std::{sync::Arc, collections::HashMap};
 
     #[test]
     fn test_concat_empty_vec() {
@@ -604,6 +604,41 @@ mod tests {
         assert!(!new.values().to_data().ptr_eq(&com.values().to_data()));
     }
 
+    #[test]
+    fn concat_record_batches_different_metadata() {
+        let metadata = HashMap::from([("foo".to_string(), "bar".to_string())]);
+        let field = Field::new("a", DataType::Int32, false);
+
+        let schema1 = Arc::new(Schema::new(vec![
+            field.clone(),
+        ]));
+
+        let batch1 = RecordBatch::try_new(
+            schema1,
+            vec![
+                Arc::new(Int32Array::from(vec![1])),
+            ],
+        )
+        .unwrap();
+
+        let schema2 = Arc::new(Schema::new(vec![
+            field.with_metadata(metadata)
+        ]));
+
+        let batch2 = RecordBatch::try_new(
+            schema2,
+            vec![
+                Arc::new(Int32Array::from(vec![3])),
+            ],
+        )
+        .unwrap();
+
+        // should be able to concat batches with different metadata
+        let new_batch = concat_batches(&batch1.schema(), [&batch1, &batch2]).unwrap();
+        assert_eq!(new_batch.schema(), batch1.schema());
+        assert_eq!(2, new_batch.num_rows());
+    }
+
     #[test]
     fn concat_record_batches() {
         let schema = Arc::new(Schema::new(vec![

This fails with this error:

thread 'concat::tests::concat_record_batches_different_metadata' panicked at 'called `Result::unwrap()` on an `Err` value: InvalidArgumentError("batches[1] schema is different with argument schema.\n            batches[1] schema: Schema { fields: [Field { name: \"a\", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {\"foo\": \"bar\"} }], metadata: {} },\n            argument schema: Schema { fields: [Field { name: \"a\", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }\n            ")', arrow-select/src/concat.rs:636:78

Expected behavior
I expect the test to pass: batches that differ only in metadata should concatenate successfully.
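
To make the expected behavior concrete, here is a minimal sketch of a schema comparison that ignores metadata. It uses toy stand-in types (`ToyField`, `ToySchema`, `toy_field`, and `same_ignoring_metadata` are all hypothetical names invented for illustration), not the real `arrow_schema` types, and it is not necessarily how the fix in #4815 works:

```rust
use std::collections::HashMap;

/// Toy stand-in for an Arrow field (hypothetical, not `arrow_schema::Field`).
#[derive(Debug, Clone, PartialEq)]
struct ToyField {
    name: String,
    data_type: String,
    nullable: bool,
    metadata: HashMap<String, String>,
}

/// Toy stand-in for an Arrow schema (hypothetical, not `arrow_schema::Schema`).
#[derive(Debug, Clone, PartialEq)]
struct ToySchema {
    fields: Vec<ToyField>,
    metadata: HashMap<String, String>,
}

/// Builds a non-nullable toy field with the given metadata pairs.
fn toy_field(name: &str, data_type: &str, metadata: &[(&str, &str)]) -> ToyField {
    ToyField {
        name: name.to_string(),
        data_type: data_type.to_string(),
        nullable: false,
        metadata: metadata
            .iter()
            .map(|(k, v)| (k.to_string(), v.to_string()))
            .collect(),
    }
}

/// Returns true when both schemas have the same fields (name, type,
/// nullability) in the same order, ignoring field and schema metadata.
fn same_ignoring_metadata(a: &ToySchema, b: &ToySchema) -> bool {
    a.fields.len() == b.fields.len()
        && a.fields.iter().zip(&b.fields).all(|(x, y)| {
            x.name == y.name && x.data_type == y.data_type && x.nullable == y.nullable
        })
}
```

With these toy types, two schemas that each hold a single non-nullable `Int32` field `a` are unequal under derived `PartialEq` as soon as one field carries `{"foo": "bar"}`, which mirrors the strict check that produces the error above. The metadata-insensitive comparison treats them as compatible.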

Additional context

@alamb alamb added the bug label Sep 8, 2023
@alamb alamb self-assigned this Sep 8, 2023

alamb commented Sep 8, 2023

I will have a proposed PR for this shortly

@tustvold

label_issue.py automatically added labels {'arrow'} from #4815

@tustvold tustvold added the arrow Changes to the arrow crate label Sep 18, 2023

setop commented May 27, 2024

Same issue with version 51.0.0

Starting from one large CSV split into two halves, I created the first Parquet file with parquet-cpp-arrow:

Metadata for file: E2021.parquet

version: 1
num of rows: 4283692
created by: parquet-cpp-arrow version 5.0.0
metadata:
  ARROW:schema: /////4ABAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAYAAAAUAQAA0AAAAJwAAABoAAAAOAAAAAQAAAAU////AAABAxAAAAAcAAAABAAAAAAAAAAFAAAAcHJpY2UABgAIAAYABgAAAAAAAgBE////AAABAhAAAAAUAAAABAAAAAAAAAADAAAAZGF5ADD///8AAAABQAAAAHD///8AAAECEAAAABgAAAAEAAAAAAAAAAUAAABtb250aAAAAGD///8AAAABQAAAAKD///8AAAECEAAAABgAAAAEAAAAAAAAAAQAAAB5ZWFyAAAAAJD///8AAAABQAAAAND///8AAAECEAAAABgAAAAEAAAAAAAAAAQAAABmdWVsAAAAAMD///8AAAABQAAAABAAFAAIAAYABwAMAAAAEAAQAAAAAAABAhAAAAAgAAAABAAAAAAAAAAFAAAAcGR2aWQAAAAIAAwACAAHAAgAAAAAAAABQAAAAAAAAAA=
message schema {
  OPTIONAL INT64 pdvid;
  OPTIONAL INT64 fuel;
  OPTIONAL INT64 year;
  OPTIONAL INT64 month;
  OPTIONAL INT64 day;
  OPTIONAL DOUBLE price;
}

Then, using the same schema (message schema { ... } in a file), I created the second half with parquet-rs:

Metadata for file: E2022.parquet

version: 1
num of rows: 5044596
created by: parquet-rs version 51.0.0
metadata:
  ARROW:schema: /////5ABAAAQAAAAAAAKAAwACgAJAAQACgAAABAAAAAAAQQACAAIAAAABAAIAAAABAAAAAYAAAAoAQAA5AAAALAAAAB8AAAATAAAABQAAAAQABYAEAAOAA8ABAAAAAgAEAAAABgAAAAcAAAAAAABAxgAAAAAAAYACAAGAAYAAAAAAAIAAAAAAAUAAABwcmljZQAAAET///8QAAAAGAAAAAAAAQIUAAAANP///0AAAAAAAAABAAAAAAMAAABkYXkAcP///xAAAAAYAAAAAAABAhQAAABg////QAAAAAAAAAEAAAAABQAAAG1vbnRoAAAAoP///xAAAAAYAAAAAAABAhQAAACQ////QAAAAAAAAAEAAAAABAAAAHllYXIAAAAA0P///xAAAAAYAAAAAAABAhQAAADA////QAAAAAAAAAEAAAAABAAAAGZ1ZWwAAAAAEAAUABAADgAPAAQAAAAIABAAAAAYAAAAIAAAAAAAAQIcAAAACAAMAAQACwAIAAAAQAAAAAAAAAEAAAAABQAAAHBkdmlkAAAA
message arrow_schema {
  OPTIONAL INT64 pdvid;
  OPTIONAL INT64 fuel;
  OPTIONAL INT64 year;
  OPTIONAL INT64 month;
  OPTIONAL INT64 day;
  OPTIONAL DOUBLE price;
}

When I try to concat them, I get Error: General("inputs must have the same schema ...

The only difference is the name of the root schema: schema vs arrow_schema.

--- a.txt       2024-05-27 09:32:48.409232203 +0200
+++ b.txt       2024-05-27 09:32:55.073572763 +0200
@@ -1,6 +1,6 @@
 GroupType {
     basic_info: BasicTypeInfo {
-        name: \"schema\",
+        name: \"arrow_schema\",
         repetition: None,
         converted_type: NONE,
         logical_type: None,
