Skip to content

Finish implementing Variant::Object and Variant::List #7666

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Jun 17, 2025

Conversation

scovich
Copy link
Contributor

@scovich scovich commented Jun 13, 2025

Which issue does this PR close?

Rationale for this change

Continuing the ongoing variant implementation effort.

What changes are included in this PR?

As per title -- implement fairly complete support for variant objects and arrays. Also add some unit tests.

Note: This PR renames VariantArray as VariantList to align with parquet and arrow terminology, and to not conflict with the VariantArray we will eventually need to define for holding an arrow array of variant-typed data.

Are there any user-facing changes?

Those variant subtypes should now be usable.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jun 13, 2025
@scovich
Copy link
Contributor Author

scovich commented Jun 13, 2025

Attn @alamb @mkarbo since I can't request reviewers.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jun 16, 2025
@github-actions github-actions bot removed the arrow Changes to the arrow crate label Jun 16, 2025
@alamb
Copy link
Contributor

alamb commented Jun 16, 2025

I merged up from main to resolve a conflict with this branch

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much @scovich 🙏 -- this is looking quite close. I think we should fix the sorted dictionary thing and ensure the objects from parquet-testing can be read before merging this PR but otherwise everything else could be done as a follow on

Comment on lines 386 to 387
pub metadata: &'m VariantMetadata<'m>,
pub value: &'v [u8],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It wasn't added in this PR but I think we should probably make these fields non public (we can add accessors or something) so we can

  1. Hint people use the nicer APIs
  2. Potentially change the implementation if needed

We can do this in some other PR too, I just wanted to point this out

pub metadata: &'m VariantMetadata<'m>,
pub value: &'v [u8],
header: VariantObjectHeader,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good design -- parse / validate the relevant fields from the header once and then save them to be used in subsequent passes

pub fn field(&self, name: &str) -> Result<Option<Variant<'m, 'v>>, ArrowError> {
// Binary search through the sorted field IDs to find the field
let (field_ids, field_offsets) = self.parse_field_arrays()?;
let search_result = try_binary_search_by(&field_ids, &name, |&field_id| {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is only correct to use binary search for the field name if the metadata has the fields sorted.

Perhaps for now we can just update this PR to return a NotYetImplemented error if the dictionary is not sorted.

Suggested change
let search_result = try_binary_search_by(&field_ids, &name, |&field_id| {
if !self.metadata.is_sorted() {
return Err(ArrowError::NotYetImplemented(
"Cannot search for fields in an unsorted VariantObject".to_string(),
));
}
let search_result = try_binary_search_by(&field_ids, &name, |&field_id| {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also confused me at first... I probably should have added a code comment.

This binary search is over the field names of the variant object itself, which are indirectly referenced via the metadata dictionary. And the spec does require those to be lexically ordered:

The field ids and field offsets must be in lexicographical order of the corresponding field names in the metadata dictionary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(basically, if the requested field name actually exists, it must match the name referenced by one of the struct's field ids... and we can binary search them because those ids are in lexical order according to their backing dictionary entries)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside: It would be incorrect to directly search the metadata dictionary, because that could "find" a field name that doesn't actually exist in the current object.

pub fn get(&self, index: usize) -> Result<Variant<'m, 'v>, ArrowError> {
pub fn field(&self, name: &str) -> Result<Option<Variant<'m, 'v>>, ArrowError> {
// Binary search through the sorted field IDs to find the field
let (field_ids, field_offsets) = self.parse_field_arrays()?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can probably optimize this code in a future PR -- I don't think we need to create a whole Vec<..> just to search

Maybe we can implement a OffsetSize::try_binary_search_by type method that directly computes the offsets during the search.

Similarly, we could also add a OffsetSize::try_linear_search_by method that directly does the linear search when the dictionary is not sorted

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fully agree this is not an optimal API -- I just implemented the existing stub methods to give a starting point we can iterate on.

As for try_linear_search_by -- we may eventually need to define it as part of the work to support unshredding of shredded variants (because then we have to find the name of each shredded field in the possibly unordered dictionary), but I think we can defer that for now (see other comment thread).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fully agree this is not an optimal API -- I just implemented the existing stub methods to give a starting point we can iterate on.

I think this is a good plan

}

#[derive(Clone, Debug, PartialEq)]
pub struct VariantListHeader {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this needs to be pub or if it is ok for it just to be pub(crate)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO everything is too visible right now... it should all be pub(crate) or less until we see actual reasons to make it pub. But that seems like a piece of general follow-on work?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup -- a follow on would be great

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meanwhile, I made the new header structs and their methods pub(crate). We can fix the others later once use cases are clearer.

@@ -717,4 +980,257 @@ mod tests {
"unexpected error: {err:?}"
);
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some other things to test here would be:

  1. objects that have "is_large" set (aka have more than 256 distinct field names)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we actually need 256+ field names in the test? Or just verify our ability to process the wider offsets correctly?

let last_field_offset_byte =
field_offsets_start_byte + (num_elements + 1) * field_offset_size as usize;
if last_field_offset_byte > value.len() {
return Err(ArrowError::InvalidArgumentError(format!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be great eventually to cover these error cases with tests too (aka verify invalid inputs). I don't think it is needed for this PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed as #7681

Comment on lines 593 to 595
// NOTE: We differ from the variant spec and call it "list" instead of "array" in order to be
// consistent with parquet and arrow type naming. Otherwise, the name would conflict with the
// `VariantArray : Array` we must eventually define for variant-typed arrow arrays.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// NOTE: We differ from the variant spec and call it "list" instead of "array" in order to be
// consistent with parquet and arrow type naming. Otherwise, the name would conflict with the
// `VariantArray : Array` we must eventually define for variant-typed arrow arrays.
/// Represents an Variant `Array`
///
/// NOTE: The `List` naming differs from the variant spec, which uses "array" in order to be
/// consistent with parquet and arrow type naming. Otherwise, the name would conflict with the
/// `VariantArray : Array` we must eventually define for variant-typed arrow arrays.


/// Returns the offset size in bytes
pub fn offset_size(&self) -> usize {
self.offset_size as _
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we ever need to optimize the size of the VariantObjectHeader we can potentially just store the header byte and re-extract the appropriate bits

Copy link
Contributor Author

@scovich scovich Jun 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The OffsetSizeBytes enum occupies 1 byte in practice, so I don't think we'd save much?
The offsets (2x usize) will anyway dwarf it.

@@ -99,7 +99,7 @@ fn variant_non_primitive() -> Result<(), ArrowError> {
assert_eq!(dict_val, "int_field");
}
"array_primitive" => match variant {
Variant::Array(arr) => {
Variant::List(arr) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is important to make sure this code works for the pre-existing test data from parqet-testing -- I will try and make a PR to update these tests and verify this PR's implementation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merged it in, thanks!

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a PR to your branch that I think shows this code works well

self.len() == 0
}

pub fn values(&self) -> Result<impl Iterator<Item = Variant<'m, 'v>>, ArrowError> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something I noticed here was that it would be really nice if:

  1. This returned an Iterator rather than a Result
  2. We could implement iter() and IntoIterator for VariantList

That would make using it more ergonomic, though it would require either panic'ing or else validating the offsets on construction 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, this is tricky. Ideally we wouldn't even materialize the result just to iterate over it, but that would require an iterator with Item = Result which is yet more annoying.

Definitely not a fan of allowing untrusted input to cause a panic.

But if we want to make this method infallible, we'd need to pay the validation cost in the constructor.

So it seems like we have two choices:

  1. Pay O(n) to validate the offsets in the constructor, and unwrap here.
    • PRO: Cleaner API, allows to iterate without materializing the result first
  2. Keep as-is or even return an iterator of result instead of result of iterator
    • PRO: Provably panic-free (no unwrap to reason about)

If we went for 1/, I'd favor an internal method that returns an iterator of result. The constructor would instantiate the iterator and verify it's all-ok, at which point we know values can safely invoke map(Result::unwrap) because we already consumed the iterator once. Otherwise, I worry the checks in the constructor could diverge from the checks we trigger here, causing an unexpected panic.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think in general this is hitting the tension you mentioned in the earlier PRs of early vs late validation.

The more I think about it the more I think we should move the validation to construction because:

  1. The only reason to create a Variant in the first place is to access its data, so I think we would end up validating it almost immediately on read
  2. Variants are constructed once but read many times so validating up front is probably faster
  3. We could offer an unchecked variant for construction if performance overhead is hight that skips the validation

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb
Copy link
Contributor

alamb commented Jun 16, 2025

@scovich -- what do you recommend for next steps?

  1. Shall I merge this PR and we can keep iterating on main?
  2. WOuld you like to work on it some more?

I would like to get it merged in sooner rather than later to minimize conflicts such as #7670 (review)

@scovich
Copy link
Contributor Author

scovich commented Jun 16, 2025

@scovich -- what do you recommend for next steps?

1. Shall I merge this PR and we can keep iterating on main?

2. WOuld you like to work on it some more?

I would like to get it merged in sooner rather than later to minimize conflicts such as #7670 (review)

I addressed most of your review points and pushed; the only big outstanding question is how to handle the iterators. But we can probably tackle that in a separate PR?

// Binary search through the sorted field IDs to find the field
// Binary search through the field IDs of this object to find the requested field name.
//
// NOTE: This does not require a sorted metadata dictionary, because the variant spec
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@alamb
Copy link
Contributor

alamb commented Jun 17, 2025

Let's get this one in and keep iterating. THank you @scovich

@alamb alamb merged commit f5f09ea into apache:main Jun 17, 2025
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Variant] Implement VariantObject::field and VariantObject::fields
2 participants