Add `shrink_to_fit` to `Array` #6360

teh-cmc · 2024-09-05T13:39:33Z

We keep a lot of Arrow data in long-lived memory storage, and therefore must be guaranteed that the data has been optimized for space before it gets committed to storage, regardless of how it got built or how it got there.

This would also provide a strongly typed non-trait version, similar to what is done e.g. for slice.

See this PR for more background context:

Fix MutableBuffer::into_buffer leaking its extra capacity into the final buffer #6300

The text was updated successfully, but these errors were encountered:

emilk · 2024-11-21T19:53:23Z

I've been thinking a bit on how to implement this. It seems to be a lot of work, but doable.

Ideally we should be able to take any ArrayRef (Arc<dyn Array>) and slim it down using a shrink_to_fit method.
The desired capacity should probably be the required size rounded up to an even multiple of 64 bytes (following the lead of MutableBuffer::shrink_to_fit).

One thing we need to ensure is to only allocate and copy memory if the desired capacity is lower than the current capacity. That is: shrink_to_fit should be idempotent, and fast on the second call.

`fn shrink_to_fit(&self) -> ArrayRef`

In #6300 (comment), @tustvold wrote:

I think that would be generally useful, and would also handle the case of sliced arrays. I'd recommend something that returns a new Array, e.g. fn shrink_to_fit(&self) -> ArrayRef.

In other words:

trait Array {
    …
    /// Return a clone of self using the minimal amount of memory,
    /// rounded up to the nearest multiple of 64 bytes.
    ///
    /// If the array is already using a minimal amount of memory, this returns a clone,
    /// which is usually cheap because everything bottoms out into a [`Buffer`] which is just a ref-counted [`Bytes`].
    fn shrink_to_fit(&self) -> ArrayRef;
}

This would still require a small memory-allocation for each call (a call to Arc::new), but the actual data would only need copying when there is actual opportunity for slimming down (e.g. when there is extra capacity and/or when the Array is a slice of some bigger chunk of memory).

In practice, this would look something like this:

impl Array for BooleanArray {
    …
    fn shrink_to_fit(&self) -> ArrayRef {
        Arc::new(Self{
            values: self.values.shrink_to_fit(),
            nulls: self.nulls.as_ref().map(|b| b.shrink_to_fit()),
        }}
    }
}

…assuming we also implement fn shrink_to_fit(&self) -> Self on BooleanBuffer, NullBuffer etc, which all bottom out in implementing

fn Buffer {
    fn shrik_to_fit(&self) -> Self {
        let new_capacity = bit_util::round_upto_multiple_of_64(self.len());
        if self.data.capacity <= new_capacity {
            self.clone() // fast path
        } else {
            // allocate new memory (`Bytes`) using the `std::alloc::alloc` and copy to it
        }
    }
}

Upsides

You can call .shrink_to_fit on basically anything (all arrays and all buffers).
It's all immutable by default.

Downsides

Each call will require at least one memory allocation, even in the fast case.
For something like StructArray it will require more, since even a shallow clone of StructArray requires cloning

arrow-rs/arrow-array/src/array/struct_array.rs

Line 81 in 853626e

fields: Vec<ArrayRef>,

There is not easy way to opt-out of this, though a user could use a heuristic like:

fn maybe_shrink(array: ArrayRef) -> ArrayRef {
    if array.get_buffer_memory_size() < 1024 {
        array // not worth the overhead of calling shrink_to_fit
    } else {
        array.shrink_to_fit()
    }
}

Perhaps this is good enough?

Alternative designs

I don't really have any, except for the one in #6300

I am new to the code base though, so I might be missing something.

Request for feedback

Thoughts @alamb @tustvold @teh-cmc ?

If the fn shrink_to_fit(&self) -> ArrayRef plan sounds good, I can start implementing it.

alamb · 2024-11-22T18:53:03Z

Each call will require at least one memory allocation, even in the fast case.

What if you made shrink_to_fit consuming. Something like this:

impl Array for BooleanArray {
    …
    fn shrink_to_fit(self) -> Self {
        Self{
            values: self.values.shrink_to_fit(),
            nulls: self.nulls.as_ref().map(|b| b.shrink_to_fit()),
        }
    }
}

Then if you already had an owned array no clone would be required at all.

It would take some finagling to get this out of an ArrayRef - maybe something like

let arr: ArrayRef  = ...;
if let bool_array_arc: Arc<BooleanArray> = arr.downcast() {
  bool_array_arc.unwrap_or_clone().shrink_to_fit()
}

Though I suppose you still need to rewrap into an Arc to get an array ref 🤔

alamb · 2024-11-22T18:54:31Z

If you made the API consuming then you wouldn't force a clone if you were working with the array types directly (though if you started with an ArrayRef and wanted to end with an ArrayRef it would probably end up with the same amount of allocations as your proposal)

emilk · 2024-11-25T10:33:29Z

You are right, a consuming API seems better, on all levels. I'll probably start work on this later this week.

emilk · 2024-11-25T11:39:13Z

Actually, no. We can make shrink_to_fit consuming everywhere except on trait Array, but fn shrink_to_fit(self) -> Self is not object safe.

tustvold · 2024-11-25T11:49:16Z

I wonder if we should give some thought to how this may integrate with types such as DictionaryArray or StringViewArray that can be compacted/GC.

I worry a little that a shrink_to_fit API might be constraining, and whether we would be better adding a first-class compact kernel within e.g. arrow-select? This would then be able to take an options struct to control its behaviour.

emilk · 2024-11-25T13:08:15Z

The actual problem that we need to solve is the capacity of repeated concat:ination. (details: #6300 (comment) and rerun-io/rerun#7222). This is currently a blocker for us for switching away from arrow2 to arrow-rs.

I don't know the arrow codebase well enough to even know what arrow-select is. concat lives there, so maybe it makes sense for shrink_to_fit to live there too? But how would it work in practice - dynamically downcasting a dyn Array to all possible subtypes that we have implemented shrink_to_fit for? Is there something similar I can take a look at and use as a template?

tustvold · 2024-11-25T13:29:05Z

that we need to solve is the capacity of repeated concat:ination

It sounds like #6692 might be what you're actually wanting as a long-term API, although this will require someone very familiar with the arrow codebase to meaningfully action.

But how would it work in practice - dynamically downcasting a dyn Array to all possible subtypes that we have implemented shrink_to_fit for?

Pretty much, there are macros to make this slightly less obnoxious.

Is there something similar I can take a look at and use as a template?

Pretty much all the kernels do some variation of this

That all being said, shrink_to_fit can just implement the basic logic, and if/when we add a compact kernel that can simply call shrink_to_fit. This could allow progress to be made on this now.

Regarding the consuming vs non-consuming versions, etc... one idea might be to just have:

/// Shrink any unshared buffers
fn shrink_to_fit(&mut self);

And then use Arc::get_mut to chain through reference counts, this would prohibit shrinking a shared array, I couldn't tell from the links whether the desire is to get a compact deep clone, or just a best-effort optimisation of unshared buffers

emilk · 2024-11-25T13:34:43Z

I have a working version of fn shrink_to_fit(&self) -> ArrayRef now in #6787 (PTAL!). However, it has the (big) drawback that I failed to implement it for TypedDictionaryArray and TypedRunArray, which are types wrapping references.

fn shrink_to_fit(&mut self) may indeed be a better approach.

emilk · 2024-11-25T15:00:48Z

I think trait Array { fn shrink_to_fit(&mut self) {} } would also require having trait Array { fn clone(&self) -> ArrayRef; } so that shrink_to_fit can be implemented for ArrayRef in the case of a shared reference

tustvold · 2024-11-25T15:56:32Z

Yeah, I thought we might be able to get around this using Arc::get_mut for ArrayRef, but this won't work for the typed wrappers which only have access to an immutable ref.

I think a compact kernel may be the only consistent way to proceed, it is certainly the approach that is more consistent with other functionality

emilk · 2024-11-25T16:03:18Z

The new approach of fn shrink_to_fit(&mut self) works well: #6790

The only thing left to implement is to handle the case of a shared ArrayRef, which I see three paths forward on:

A) ignore it (if it is shared, we can't free up memory)
B) Clone the contents with a call to self.slice(0, self.len()) and then call .shrink_to_fit on the results (will only work if slice returns a non-shared ArrayRef)
C) Add trait Table { fn clone(&self) -> Box<dyn Array> and then call that in the shared case, and call .shrink_to_fit on the result (bullet-proof).

I'm actually happy with any of these three paths.

Please take a look at #6790 and tell me what you think. I find it a rather clean approach, as it doesn't require the dynamic downcasting of arrow-select

This was referenced Sep 5, 2024

Fix MutableBuffer::into_buffer leaking its extra capacity into the final buffer #6300

Closed

Migrate from re_arrow2 to arrow rerun-io/rerun#3741

Open

emilk mentioned this issue Nov 25, 2024

Add Array::shrink_to_fit(&self) -> ArrayRef #6787

Closed

emilk mentioned this issue Nov 25, 2024

Add Array::shrink_to_fit(&mut self) #6790

Merged

tustvold closed this as completed in #6790 Nov 29, 2024

alamb added the arrow Changes to the arrow crate label Dec 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add `shrink_to_fit` to `Array` #6360

Add `shrink_to_fit` to `Array` #6360

teh-cmc commented Sep 5, 2024

emilk commented Nov 21, 2024 •

edited

Loading

alamb commented Nov 22, 2024

alamb commented Nov 22, 2024

emilk commented Nov 25, 2024

emilk commented Nov 25, 2024 •

edited

Loading

tustvold commented Nov 25, 2024 •

edited

Loading

emilk commented Nov 25, 2024 •

edited

Loading

tustvold commented Nov 25, 2024 •

edited

Loading

emilk commented Nov 25, 2024 •

edited

Loading

emilk commented Nov 25, 2024

tustvold commented Nov 25, 2024

emilk commented Nov 25, 2024

Add shrink_to_fit to Array #6360

Add shrink_to_fit to Array #6360

Comments

teh-cmc commented Sep 5, 2024

emilk commented Nov 21, 2024 • edited Loading

fn shrink_to_fit(&self) -> ArrayRef

Upsides

Downsides

Alternative designs

Request for feedback

alamb commented Nov 22, 2024

alamb commented Nov 22, 2024

emilk commented Nov 25, 2024

emilk commented Nov 25, 2024 • edited Loading

tustvold commented Nov 25, 2024 • edited Loading

emilk commented Nov 25, 2024 • edited Loading

tustvold commented Nov 25, 2024 • edited Loading

emilk commented Nov 25, 2024 • edited Loading

emilk commented Nov 25, 2024

tustvold commented Nov 25, 2024

emilk commented Nov 25, 2024

Add `shrink_to_fit` to `Array` #6360

Add `shrink_to_fit` to `Array` #6360

emilk commented Nov 21, 2024 •

edited

Loading

`fn shrink_to_fit(&self) -> ArrayRef`

emilk commented Nov 25, 2024 •

edited

Loading

tustvold commented Nov 25, 2024 •

edited

Loading

emilk commented Nov 25, 2024 •

edited

Loading

tustvold commented Nov 25, 2024 •

edited

Loading

emilk commented Nov 25, 2024 •

edited

Loading