Allow JSON deserialization of StructArray from JSON List #6558
Labels: enhancement (any new improvement worthy of an entry in the changelog)
Comments
jagill added the enhancement label on Oct 14, 2024.
Provided this functionality is gated by an option on the reader that is disabled by default, and avoids regressing any of the existing benchmarks, this seems reasonable to support.
jagill added a commit to jagill/arrow-rs that referenced this issue on Oct 29, 2024:
Currently, a StructArray can only be deserialized from a JSON object (e.g. `{a: 1, b: "c"}`), but some services (e.g. Presto and Trino) encode ROW types as JSON lists (e.g. `[1, "c"]`) because this is more compact, and the schema is known.
This PR adds the ability to parse JSON lists into StructArrays, if the StructParseMode is set to ListOnly. In ListOnly mode, object-encoded structs raise an error. Setting to ObjectOnly (the default) has the original parsing behavior.
Some notes/questions/points for discussion:
1. I've made a JsonParseMode struct instead of a bool flag for two reasons. One is that it's self-descriptive (what would `true` be?), and the other is that it allows a future Mixed mode that could deserialize either. The latter isn't currently requested by anyone.
2. I kept the error messages as similar to the old messages as possible. I considered having more specific error messages (like "Encountered a '[' when parsing a Struct, but the StructParseMode is ObjectOnly" or similar), but wanted to hear opinions before I went that route.
3. I'm not attached to any name/code-style/etc, so happy to modify to fit local conventions.
Fixes apache#6558
Added PR #6643 to implement this (or start a discussion).
jagill added a commit to jagill/arrow-rs that referenced this issue on Dec 18, 2024.
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Currently, to deserialize a StructArray from JSON, you need to use a JSON object. For example, deserializing a `Struct<a: i32, b: string>` would need something like `{"a": 1, "b": "c"}`. This is also true of top-level RecordBatches. Some services, such as Presto and Trino, serialize ROW fields as lists. The example above would be serialized as `[1, "c"]`. If you already know the schema, this is a more compact representation that reduces the data on the wire.
I would like the ability for arrow_json to deserialize these list-encoded structs and record batches, perhaps under an option flag.
Describe the solution you'd like
When a StructArrayDecoder encounters a `[` (around here), it switches to a parsing mode that does not look for field names and requires a closing `]` for completion. It should return a parsing error if either the number of entries in the list differs from the number of fields in the struct, or any of the sub-parsers encounters the wrong type. This requires the fields of the struct to be in the same order as the JSON list, whereas the current object parsing accepts fields that appear in a different order.
Describe alternatives you've considered
I currently parse the results with serde_json, then recursively walk the JSON to convert lists to objects using the schema. I then re-serialize the top-level JSON and read it using arrow_json. This is not very efficient.
Alternatively, I could reproduce a lesser copy of the version of arrow_json that deserialized using serde_json, either by writing serde_json decoders directly or by taking a serde_json::Value and populating the ArrayBuilders myself. This is a lot of code duplication.
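For illustration, the first workaround (rewriting list-encoded structs into objects before handing the JSON to arrow_json) might look roughly like the sketch below. To stay self-contained it uses a stand-in `Json` enum rather than `serde_json::Value`, and a simplified `Schema` enum rather than Arrow's `DataType`; both types and the `lists_to_objects` function are hypothetical names, not part of any existing crate.

```rust
use std::collections::BTreeMap;

/// Stand-in for a parsed JSON value (an assumption; real code would
/// likely use serde_json::Value).
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Null,
    Bool(bool),
    Number(f64),
    String(String),
    Array(Vec<Json>),
    Object(BTreeMap<String, Json>),
}

/// Simplified stand-in for the known schema (an assumption; real code
/// would walk Arrow's DataType).
#[derive(Debug, Clone)]
enum Schema {
    Scalar,
    List(Box<Schema>),
    Struct(Vec<(String, Schema)>),
}

/// Recursively rewrites list-encoded structs into JSON objects, using
/// the schema to supply the field names.
fn lists_to_objects(value: Json, schema: &Schema) -> Result<Json, String> {
    match (schema, value) {
        // A struct encoded as a list: pair each entry with its field name.
        (Schema::Struct(fields), Json::Array(items)) => {
            if items.len() != fields.len() {
                return Err(format!(
                    "expected {} struct fields, found {} list entries",
                    fields.len(),
                    items.len()
                ));
            }
            let mut object = BTreeMap::new();
            for ((name, field_schema), item) in fields.iter().zip(items) {
                object.insert(name.clone(), lists_to_objects(item, field_schema)?);
            }
            Ok(Json::Object(object))
        }
        // A genuine list: convert each element with the element schema.
        (Schema::List(element), Json::Array(items)) => {
            let converted: Result<Vec<Json>, String> = items
                .into_iter()
                .map(|item| lists_to_objects(item, element))
                .collect();
            Ok(Json::Array(converted?))
        }
        // Scalars, nulls, and already object-encoded values pass through
        // unchanged in this sketch.
        (_, other) => Ok(other),
    }
}
```

With a schema of `Struct<a, b>`, the input `[1, "c"]` becomes an object with keys `a` and `b`. The whole document still has to be parsed, rebuilt, and re-serialized before arrow_json sees it, which is the inefficiency described above.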
Additional context
I don't personally need the ability to serialize a StructArray/RecordBatch into a List, although that would seem symmetrical.
I am happy to make an RFC PR implementing this functionality.
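As a rough sketch of the behavior requested under "Describe the solution you'd like", the code below decodes one struct value either by field name or by position. It is not the arrow_json implementation: the `Json`, `FieldType`, and `StructParseMode` types and the `decode_struct` function are hypothetical stand-ins operating on already-parsed values, whereas the real decoder works on a token stream. Missing object fields are treated as errors here purely for brevity.

```rust
/// Stand-in for a parsed JSON value (an assumption for this sketch).
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Number(f64),
    String(String),
    Array(Vec<Json>),
    Object(Vec<(String, Json)>),
}

/// Simplified field types (an assumption; Arrow has many more).
#[derive(Debug, Clone, Copy, PartialEq)]
enum FieldType {
    Number,
    String,
}

/// Hypothetical option controlling how structs are encoded in the input.
#[derive(Debug, Clone, Copy, PartialEq)]
enum StructParseMode {
    ObjectOnly,
    ListOnly,
}

/// Plays the role of a sub-parser rejecting a value of the wrong type.
fn check_type(value: &Json, expected: FieldType, name: &str) -> Result<(), String> {
    match (value, expected) {
        (Json::Number(_), FieldType::Number) | (Json::String(_), FieldType::String) => Ok(()),
        _ => Err(format!("field '{name}' expected {expected:?}, found {value:?}")),
    }
}

/// Returns the struct's child values in schema order, or a parse error.
fn decode_struct(
    value: &Json,
    fields: &[(&str, FieldType)],
    mode: StructParseMode,
) -> Result<Vec<Json>, String> {
    match (mode, value) {
        // List mode: positional, so the entry count must match exactly.
        (StructParseMode::ListOnly, Json::Array(items)) => {
            if items.len() != fields.len() {
                return Err(format!(
                    "expected {} entries, found {}",
                    fields.len(),
                    items.len()
                ));
            }
            for ((name, ty), item) in fields.iter().zip(items) {
                check_type(item, *ty, name)?;
            }
            Ok(items.clone())
        }
        // Object mode: look fields up by name, so input order is free.
        (StructParseMode::ObjectOnly, Json::Object(entries)) => {
            let mut row = Vec::with_capacity(fields.len());
            for (name, ty) in fields {
                let (_, item) = entries
                    .iter()
                    .find(|(key, _)| key.as_str() == *name)
                    .ok_or_else(|| format!("missing field '{name}'"))?;
                check_type(item, *ty, name)?;
                row.push(item.clone());
            }
            Ok(row)
        }
        (StructParseMode::ListOnly, _) => Err("expected '[' for a struct".to_string()),
        (StructParseMode::ObjectOnly, _) => Err("expected '{' for a struct".to_string()),
    }
}
```

The two successful arms make the ordering trade-off concrete: object mode can reorder fields because it matches on names, while list mode has only position to go on, so the list must follow the schema order.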