Skip to content

Convert JSON to VariantArray without copying #7911

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Jul 11, 2025

Draft while I pull out the benchmarks to a new PR:

Which issue does this PR close?

Rationale for this change

In a quest to have the fastest and most efficient Variant implementation I would like to avoid copies if at all possible
Right now, to make a VariantArray first requires completing an individual buffer and appending it
to the array.

Let's make that faster by having the VariantBuilder append directly into the buffer

What changes are included in this PR?

  1. Add VariantBuilder::new_from_existing
  2. Add a VariantArrayBuilder::variant_builder that reuses the buffers

TODO:

  • Tests for internal builder (both with and without calling drop)
  • Doc examples
  • Performance tests

Are these changes tested?

  1. New unit tests
  2. Yes by existing tests

Are there any user-facing changes?

Hopefully faster performance

Copy link
Contributor

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is remarkably simpler than I had imagined it would need to be. Handing off ownership back and forth was a very useful trick.

My only concern is whether we might ever need to support a builder that isn't backed by Vec? I'm guessing not, but wanted to double check.

Comment on lines 213 to 268
/// Note if you do not call finish, the struct will be reset and the buffers
/// will not be updated.
///
pub fn finish(mut self) {
let metadata_offset = self.metadata_offset;
let value_offset = self.value_offset;

// get the buffers back
let (metadata_buffer, value_buffer) = std::mem::take(&mut self.variant_builder).finish();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Random tired Friday afternoon thoughts:

  • Now we have four possible levels of forgetting to finish. Maybe the protections we came up for the other three could apply here as well so we don't need a fancy impl Drop?
  • I also wonder if the BuilderExt concept could be helpful, given that this is really just a fourth thing that acts like a variant builder and you can stuff values into?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this sounds like a good idea

The fancy Drop impl is probably needed to give the ownership back to the VariantArray builder

The BuilderExt sounds like a good idea too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I implemented the VariantBuilderExt and it looks good.

The lifetimes get a little wonky, so I still needed to allow access to the underlying builder. But maybe I can figure that out as a follwo on PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I figured out how to clean up the lifetimes of VariantBuilderExt so now it works quite well.

I also simplified the drop impl to try and make it clearer what is going on (and it avoids duplication in finish and drop which I think makes things much clearer)

@@ -315,29 +343,29 @@ impl MetadataBuilder {
let string_start = offset_start + (nkeys + 1) * offset_size as usize;
let metadata_size = string_start + total_dict_size;

let mut metadata = Vec::with_capacity(metadata_size);
metadata_buffer.reserve(metadata_size);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we reserve metadata_size more bytes, rather than that many total bytes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 174 to 205
let metadata_offset = metadata_buffer.len();
let value_offset = value_buffer.len();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: seems like these could just fold into the constructor call below

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 262 to 310
metadata_buffer.truncate(self.metadata_offset);
value_buffer.truncate(self.value_offset);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

part of me observes that we don't need to truncate if we ensure the "garbage" bytes are a suffix of the previous entry rather than a prefix of the following entry. But it's probably better to not leave garbage lying around. Especially since garbage while attempting the first entry is necessarily a prefix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think truncate is very expensive

In this case all it is doing is rewinding the vec so it doesn't contain any bytes from the non finalized Variant builder

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point -- truncate doesn't free any memory, it just has to drop elements (but u8 doesn't have a destructor)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, thinking more about this, I wonder rollback-in-drop is a better approach than the one I had come up with. Just remember the starting pos on every buffer and revert back on unfinished drop.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, thinking more about this, I wonder rollback-in-drop is a better approach than the one I had come up with. Just remember the starting pos on every buffer and revert back on unfinished drop.

yes, exactly, that is precise the behavior I was what I was trying to implement here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some comments to try and clarify this

@alamb
Copy link
Contributor Author

alamb commented Jul 12, 2025

My only concern is whether we might ever need to support a builder that isn't backed by Vec? I'm guessing not, but wanted to double check.

I think eventually we might, but I think the only way to really do so is via some sort of trait and a templated builder. I think we can pretty far without doing so and Vec seems to be the best / fastes / etc thing for managing owned memory in Rust

And there are zero copy APIs to/from Vec for the underlying Arrow arrays which I think is a pretty nice property too

@alamb alamb force-pushed the alamb/append_variant_builder branch from bb1502a to e07069d Compare July 12, 2025 20:31
}

/// Construct a ValueBuffer from an existing buffer
fn from_existing(existing: Vec<u8>) -> Self {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason not to just impl From and be done with it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have implemented this change in #7912

alamb added a commit that referenced this pull request Jul 13, 2025
… buffers (#7912)

# Which issue does this PR close?

- closes #7805
- part of #6736
- part of #7911

# Rationale for this change

I would like to be able to write Variants directly into the target
buffer when writing multiple variants However, the current
VariantBuilder allocates a new bufffer for each variant

# What changes are included in this PR?

1. Add `VariantBuilder::new_with_buffers` and docs and tests

You can see how this API can be used to write directly into a buffer in
VariantArrayBuilder in this PR:
- #7911

# Are these changes tested?
Yes new tests

# Are there any user-facing changes?
New API
@alamb alamb force-pushed the alamb/append_variant_builder branch from e07069d to 8166cb6 Compare July 13, 2025 12:08
@github-actions github-actions bot removed the parquet Changes to the parquet crate label Jul 13, 2025
@alamb alamb force-pushed the alamb/append_variant_builder branch from 8166cb6 to e10d41d Compare July 13, 2025 12:29
@github-actions github-actions bot added the parquet Changes to the parquet crate label Jul 13, 2025
@alamb
Copy link
Contributor Author

alamb commented Jul 13, 2025

Update here is I think I have incorporated @scovich's comments and I am quite pleased with how it is looking

I think this code needs a few more tests and a benchmark or two and we'll be good.

I'll try and work on those in the next few days

@alamb alamb self-assigned this Jul 14, 2025
impl<'a> Drop for VariantArrayVariantBuilder<'a> {
/// If the builder was not finished, roll back any changes made to the
/// underlying buffers (by truncating them)
fn drop(&mut self) {
Copy link
Contributor

@scovich scovich Jul 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like this approach. I was thinking over the weekend that we may want to rework the other builders to follow a similar approach:

  • They can truncate the metadata dictionary on rollback, which would eliminate the false allocations that survive a rollback today
  • We can allocate the value bytes directly in the base buffer (instead of using a separate Vec)
    • On rollback, just truncate (like here)
    • On success, use Vec::splice to insert value offset and field id arrays, which slides over all the other bytes
  • Once we're using splice, it opens the door to pre-allocate the space for the value offset and field arrays, in case the caller knows how many fields or array elements there are.
    • If the prediction was correct, splice just replaces the pre-allocated space.
    • If incorrect, the pre-allocation is wasted (but we're no worse off than before -- the bytes just inject in)
    • The main complication would be guessing how many bytes to encode each offset with.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They can truncate the metadata dictionary on rollback, which would eliminate the false allocations that survive a rollback today

That is an excellent point

We can allocate the value bytes directly in the base buffer (instead of using a separate Vec)

That sounds like a great way to avoid the extra allocation

Once we're using splice, it opens the door to pre-allocate the space for the value offset and field arrays, in case the caller knows how many fields or array elements there are.

This is also a great idea 🤯

@alamb
Copy link
Contributor Author

alamb commented Jul 14, 2025

I added some benchmarks and my local results suggest that avoiding the allocations makes parsing small repeated json objects about 10% faster.

I think once we stop copying stuff around in the sub builders, the other bencmarks will be quite a bit faster too

Gnuplot not found, using plotters backend
small_repeated_json 8k string
                        time:   [5.3628 ms 5.3743 ms 5.3862 ms]
                        change: [−11.062% −10.867% −10.654%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

Benchmarking small_random_json 8k string: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.2s, or reduce sample count to 60.
small_random_json 8k string
                        time:   [71.132 ms 71.245 ms 71.364 ms]
                        change: [−1.7432% −1.4915% −1.2496%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  9 (9.00%) high mild
  1 (1.00%) high severe

@@ -1047,16 +1047,16 @@ impl Drop for ObjectBuilder<'_> {
///
/// Allows users to append values to a [`VariantBuilder`], [`ListBuilder`] or
/// [`ObjectBuilder`]. using the same interface.
pub trait VariantBuilderExt<'m, 'v> {
fn append_value(&mut self, value: impl Into<Variant<'m, 'v>>);
pub trait VariantBuilderExt {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no reason for the lifetimes to be attached to the trait itself -- if it is that means that the lifetimes trickle into the values -- since this trait is for actually constructing variant values (and copying the underlying bytes) I moved the lifetimes to just the arguments that need it

// TODO make this more efficient by avoiding the intermediate buffers
let mut variant_builder = VariantBuilder::new();
variant_builder.append_value(variant);
let (metadata, value) = variant_builder.finish();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole point of this PR is to avoid this copy here and instead write directly into the output

@alamb alamb force-pushed the alamb/append_variant_builder branch from 30ad86b to 3b6aef6 Compare July 16, 2025 20:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants