-
Notifications
You must be signed in to change notification settings - Fork 952
Finish implementing Variant::Object and Variant::List #7666
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I merged up from main to resolve a conflict with this branch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much @scovich 🙏 -- this is looking quite close. I think we should fix the sorted dictionary thing and ensure the objects from parquet-testing
can be read before merging this PR but otherwise everything else could be done as a follow on
parquet-variant/src/variant.rs
Outdated
pub metadata: &'m VariantMetadata<'m>, | ||
pub value: &'v [u8], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It wasn't added in this PR but I think we should probably make these fields non public (we can add accessors or something) so we can
- Hint people use the nicer APIs
- Potentially change the implementation if needed
We can do this in some other PR too, I just wanted to point this out
pub metadata: &'m VariantMetadata<'m>, | ||
pub value: &'v [u8], | ||
header: VariantObjectHeader, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a good design -- parse / validate the relevant fields from the header once and then save them to be used in subsequent passes
pub fn field(&self, name: &str) -> Result<Option<Variant<'m, 'v>>, ArrowError> { | ||
// Binary search through the sorted field IDs to find the field | ||
let (field_ids, field_offsets) = self.parse_field_arrays()?; | ||
let search_result = try_binary_search_by(&field_ids, &name, |&field_id| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is only correct to use binary search for the field name if the metadata has the fields sorted.
Perhaps for now we can just update this PR to return a NotYetImplemented error if the dictionary is not sorted.
let search_result = try_binary_search_by(&field_ids, &name, |&field_id| { | |
if !self.metadata.is_sorted() { | |
return Err(ArrowError::NotYetImplemented( | |
"Cannot search for fields in an unsorted VariantObject".to_string(), | |
)); | |
} | |
let search_result = try_binary_search_by(&field_ids, &name, |&field_id| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also confused me at first... I probably should have added a code comment.
This binary search is over the field names of the variant object itself, which are indirectly referenced via the metadata dictionary. And the spec does require those to be lexically ordered:
The field ids and field offsets must be in lexicographical order of the corresponding field names in the metadata dictionary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(basically, if the requested field name actually exists, it must match the name referenced by one of the struct's field ids... and we can binary search them because those ids are in lexical order according to their backing dictionary entries)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aside: It would be incorrect to directly search the metadata dictionary, because that could "find" a field name that doesn't actually exist in the current object.
pub fn get(&self, index: usize) -> Result<Variant<'m, 'v>, ArrowError> { | ||
pub fn field(&self, name: &str) -> Result<Option<Variant<'m, 'v>>, ArrowError> { | ||
// Binary search through the sorted field IDs to find the field | ||
let (field_ids, field_offsets) = self.parse_field_arrays()?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can probably optimize this code in a future PR -- I don't think we need to create a whole Vec<..> just to search
Maybe we can implement a OffsetSize::try_binary_search_by
type method that directly computes the offsets during the search.
Similarly, we could also add a OffsetSize::try_linear_search_by
method that directly does the linear search when the dictionary is not sorted
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fully agree this is not an optimal API -- I just implemented the existing stub methods to give a starting point we can iterate on.
As for try_linear_search_by
-- we may eventually need to define it as part of the work to support unshredding of shredded variants (because then we have to find the name of each shredded field in the possibly unordered dictionary), but I think we can defer that for now (see other comment thread).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fully agree this is not an optimal API -- I just implemented the existing stub methods to give a starting point we can iterate on.
I think this is a good plan
parquet-variant/src/variant.rs
Outdated
} | ||
|
||
#[derive(Clone, Debug, PartialEq)] | ||
pub struct VariantListHeader { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this needs to be pub
or if it is ok for it just to be pub(crate)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO everything is too visible right now... it should all be pub(crate)
or less until we see actual reasons to make it pub
. But that seems like a piece of general follow-on work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup -- a follow on would be great
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Meanwhile, I made the new header structs and their methods pub(crate)
. We can fix the others later once use cases are clearer.
@@ -717,4 +980,257 @@ mod tests { | |||
"unexpected error: {err:?}" | |||
); | |||
} | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some other things to test here would be:
- objects that have "is_large" set (aka have more than 256 distinct field names)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we actually need 256+ field names in the test? Or just verify our ability to process the wider offsets correctly?
let last_field_offset_byte = | ||
field_offsets_start_byte + (num_elements + 1) * field_offset_size as usize; | ||
if last_field_offset_byte > value.len() { | ||
return Err(ArrowError::InvalidArgumentError(format!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would be great eventually to cover these error cases with tests too (aka verify invalid inputs). I don't think it is needed for this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed as #7681
parquet-variant/src/variant.rs
Outdated
// NOTE: We differ from the variant spec and call it "list" instead of "array" in order to be | ||
// consistent with parquet and arrow type naming. Otherwise, the name would conflict with the | ||
// `VariantArray : Array` we must eventually define for variant-typed arrow arrays. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// NOTE: We differ from the variant spec and call it "list" instead of "array" in order to be | |
// consistent with parquet and arrow type naming. Otherwise, the name would conflict with the | |
// `VariantArray : Array` we must eventually define for variant-typed arrow arrays. | |
/// Represents an Variant `Array` | |
/// | |
/// NOTE: The `List` naming differs from the variant spec, which uses "array" in order to be | |
/// consistent with parquet and arrow type naming. Otherwise, the name would conflict with the | |
/// `VariantArray : Array` we must eventually define for variant-typed arrow arrays. |
|
||
/// Returns the offset size in bytes | ||
pub fn offset_size(&self) -> usize { | ||
self.offset_size as _ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we ever need to optimize the size of the VariantObjectHeader
we can potentially just store the header byte and re-extract the appropriate bits
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The OffsetSizeBytes
enum occupies 1 byte in practice, so I don't think we'd save much?
The offsets (2x usize
) will anyway dwarf it.
@@ -99,7 +99,7 @@ fn variant_non_primitive() -> Result<(), ArrowError> { | |||
assert_eq!(dict_val, "int_field"); | |||
} | |||
"array_primitive" => match variant { | |||
Variant::Array(arr) => { | |||
Variant::List(arr) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is important to make sure this code works for the pre-existing test data from parqet-testing
-- I will try and make a PR to update these tests and verify this PR's implementation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Merged it in, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made a PR to your branch that I think shows this code works well
self.len() == 0 | ||
} | ||
|
||
pub fn values(&self) -> Result<impl Iterator<Item = Variant<'m, 'v>>, ArrowError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something I noticed here was that it would be really nice if:
- This returned an Iterator rather than a
Result
- We could implement
iter()
andIntoIterator
for VariantList
That would make using it more ergonomic, though it would require either panic'ing or else validating the offsets on construction 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, this is tricky. Ideally we wouldn't even materialize the result just to iterate over it, but that would require an iterator with Item = Result
which is yet more annoying.
Definitely not a fan of allowing untrusted input to cause a panic.
But if we want to make this method infallible, we'd need to pay the validation cost in the constructor.
So it seems like we have two choices:
- Pay O(n) to validate the offsets in the constructor, and
unwrap
here.- PRO: Cleaner API, allows to iterate without materializing the result first
- Keep as-is or even return an iterator of result instead of result of iterator
- PRO: Provably panic-free (no
unwrap
to reason about)
- PRO: Provably panic-free (no
If we went for 1/, I'd favor an internal method that returns an iterator of result. The constructor would instantiate the iterator and verify it's all-ok, at which point we know values
can safely invoke map(Result::unwrap)
because we already consumed the iterator once. Otherwise, I worry the checks in the constructor could diverge from the checks we trigger here, causing an unexpected panic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think in general this is hitting the tension you mentioned in the earlier PRs of early vs late validation.
The more I think about it the more I think we should move the validation to construction because:
- The only reason to create a
Variant
in the first place is to access its data, so I think we would end up validating it almost immediately on read Variants
are constructed once but read many times so validating up front is probably faster- We could offer an
unchecked
variant for construction if performance overhead is hight that skips the validation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@scovich -- what do you recommend for next steps?
I would like to get it merged in sooner rather than later to minimize conflicts such as #7670 (review) |
I addressed most of your review points and pushed; the only big outstanding question is how to handle the iterators. But we can probably tackle that in a separate PR? |
// Binary search through the sorted field IDs to find the field | ||
// Binary search through the field IDs of this object to find the requested field name. | ||
// | ||
// NOTE: This does not require a sorted metadata dictionary, because the variant spec |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Let's get this one in and keep iterating. THank you @scovich |
Which issue does this PR close?
VariantObject::field
andVariantObject::fields
#7665Rationale for this change
Continuing the ongoing variant implementation effort.
What changes are included in this PR?
As per title -- implement fairly complete support for variant objects and arrays. Also add some unit tests.
Note: This PR renames
VariantArray
asVariantList
to align with parquet and arrow terminology, and to not conflict with theVariantArray
we will eventually need to define for holding an arrow array of variant-typed data.Are there any user-facing changes?
Those variant subtypes should now be usable.