Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add shrink_to_fit to Array #6360

Closed
teh-cmc opened this issue Sep 5, 2024 · 12 comments · Fixed by #6790
Closed

Add shrink_to_fit to Array #6360

teh-cmc opened this issue Sep 5, 2024 · 12 comments · Fixed by #6790
Labels
arrow Changes to the arrow crate

Comments

@teh-cmc
Copy link

teh-cmc commented Sep 5, 2024

We keep a lot of Arrow data in long-lived memory storage, and therefore must be guaranteed that the data has been optimized for space before it gets committed to storage, regardless of how it got built or how it got there.

This would also provide a strongly typed non-trait version, similar to what is done e.g. for slice.

See this PR for more background context:

@emilk
Copy link
Contributor

emilk commented Nov 21, 2024

I've been thinking a bit on how to implement this. It seems to be a lot of work, but doable.

Ideally we should be able to take any ArrayRef (Arc<dyn Array>) and slim it down using a shrink_to_fit method.
The desired capacity should probably be the required size rounded up to an even multiple of 64 bytes (following the lead of MutableBuffer::shrink_to_fit).

One thing we need to ensure is to only allocate and copy memory if the desired capacity is lower than the current capacity. That is: shrink_to_fit should be idempotent, and fast on the second call.

fn shrink_to_fit(&self) -> ArrayRef

In #6300 (comment), @tustvold wrote:

I think that would be generally useful, and would also handle the case of sliced arrays. I'd recommend something that returns a new Array, e.g. fn shrink_to_fit(&self) -> ArrayRef.

In other words:

trait Array {/// Return a clone of self using the minimal amount of memory,
    /// rounded up to the nearest multiple of 64 bytes.
    ///
    /// If the array is already using a minimal amount of memory, this returns a clone,
    /// which is usually cheap because everything bottoms out into a [`Buffer`] which is just a ref-counted [`Bytes`].
    fn shrink_to_fit(&self) -> ArrayRef;
}

This would still require a small memory-allocation for each call (a call to Arc::new), but the actual data would only need copying when there is actual opportunity for slimming down (e.g. when there is extra capacity and/or when the Array is a slice of some bigger chunk of memory).

In practice, this would look something like this:

impl Array for BooleanArray {fn shrink_to_fit(&self) -> ArrayRef {
        Arc::new(Self{
            values: self.values.shrink_to_fit(),
            nulls: self.nulls.as_ref().map(|b| b.shrink_to_fit()),
        }}
    }
}

…assuming we also implement fn shrink_to_fit(&self) -> Self on BooleanBuffer, NullBuffer etc, which all bottom out in implementing

fn Buffer {
    fn shrik_to_fit(&self) -> Self {
        let new_capacity = bit_util::round_upto_multiple_of_64(self.len());
        if self.data.capacity <= new_capacity {
            self.clone() // fast path
        } else {
            // allocate new memory (`Bytes`) using the `std::alloc::alloc` and copy to it
        }
    }
}

Upsides

You can call .shrink_to_fit on basically anything (all arrays and all buffers).
It's all immutable by default.

Downsides

Each call will require at least one memory allocation, even in the fast case.
For something like StructArray it will require more, since even a shallow clone of StructArray requires cloning

fields: Vec<ArrayRef>,

There is not easy way to opt-out of this, though a user could use a heuristic like:

fn maybe_shrink(array: ArrayRef) -> ArrayRef {
    if array.get_buffer_memory_size() < 1024 {
        array // not worth the overhead of calling shrink_to_fit
    } else {
        array.shrink_to_fit()
    }
}

Perhaps this is good enough?

Alternative designs

I don't really have any, except for the one in #6300

I am new to the code base though, so I might be missing something.

Request for feedback

Thoughts @alamb @tustvold @teh-cmc ?

If the fn shrink_to_fit(&self) -> ArrayRef plan sounds good, I can start implementing it.

@alamb
Copy link
Contributor

alamb commented Nov 22, 2024

Each call will require at least one memory allocation, even in the fast case.

What if you made shrink_to_fit consuming. Something like this:

impl Array for BooleanArray {fn shrink_to_fit(self) -> Self {
        Self{
            values: self.values.shrink_to_fit(),
            nulls: self.nulls.as_ref().map(|b| b.shrink_to_fit()),
        }
    }
}

Then if you already had an owned array no clone would be required at all.

It would take some finagling to get this out of an ArrayRef - maybe something like

let arr: ArrayRef  = ...;
if let bool_array_arc: Arc<BooleanArray> = arr.downcast() {
  bool_array_arc.unwrap_or_clone().shrink_to_fit()
}

Though I suppose you still need to rewrap into an Arc to get an array ref 🤔

@alamb
Copy link
Contributor

alamb commented Nov 22, 2024

If you made the API consuming then you wouldn't force a clone if you were working with the array types directly (though if you started with an ArrayRef and wanted to end with an ArrayRef it would probably end up with the same amount of allocations as your proposal)

@emilk
Copy link
Contributor

emilk commented Nov 25, 2024

You are right, a consuming API seems better, on all levels. I'll probably start work on this later this week.

@emilk
Copy link
Contributor

emilk commented Nov 25, 2024

Actually, no. We can make shrink_to_fit consuming everywhere except on trait Array, but fn shrink_to_fit(self) -> Self is not object safe.

@tustvold
Copy link
Contributor

tustvold commented Nov 25, 2024

I wonder if we should give some thought to how this may integrate with types such as DictionaryArray or StringViewArray that can be compacted/GC.

I worry a little that a shrink_to_fit API might be constraining, and whether we would be better adding a first-class compact kernel within e.g. arrow-select? This would then be able to take an options struct to control its behaviour.

@emilk
Copy link
Contributor

emilk commented Nov 25, 2024

The actual problem that we need to solve is the capacity of repeated concat:ination. (details: #6300 (comment) and rerun-io/rerun#7222). This is currently a blocker for us for switching away from arrow2 to arrow-rs.

I don't know the arrow codebase well enough to even know what arrow-select is. concat lives there, so maybe it makes sense for shrink_to_fit to live there too? But how would it work in practice - dynamically downcasting a dyn Array to all possible subtypes that we have implemented shrink_to_fit for? Is there something similar I can take a look at and use as a template?

@tustvold
Copy link
Contributor

tustvold commented Nov 25, 2024

that we need to solve is the capacity of repeated concat:ination

It sounds like #6692 might be what you're actually wanting as a long-term API, although this will require someone very familiar with the arrow codebase to meaningfully action.

But how would it work in practice - dynamically downcasting a dyn Array to all possible subtypes that we have implemented shrink_to_fit for?

Pretty much, there are macros to make this slightly less obnoxious.

Is there something similar I can take a look at and use as a template?

Pretty much all the kernels do some variation of this


That all being said, shrink_to_fit can just implement the basic logic, and if/when we add a compact kernel that can simply call shrink_to_fit. This could allow progress to be made on this now.

Regarding the consuming vs non-consuming versions, etc... one idea might be to just have:

/// Shrink any unshared buffers
fn shrink_to_fit(&mut self);

And then use Arc::get_mut to chain through reference counts, this would prohibit shrinking a shared array, I couldn't tell from the links whether the desire is to get a compact deep clone, or just a best-effort optimisation of unshared buffers

@emilk
Copy link
Contributor

emilk commented Nov 25, 2024

I have a working version of fn shrink_to_fit(&self) -> ArrayRef now in #6787 (PTAL!). However, it has the (big) drawback that I failed to implement it for TypedDictionaryArray and TypedRunArray, which are types wrapping references.

fn shrink_to_fit(&mut self) may indeed be a better approach.

@emilk
Copy link
Contributor

emilk commented Nov 25, 2024

I think trait Array { fn shrink_to_fit(&mut self) {} } would also require having trait Array { fn clone(&self) -> ArrayRef; } so that shrink_to_fit can be implemented for ArrayRef in the case of a shared reference

@tustvold
Copy link
Contributor

Yeah, I thought we might be able to get around this using Arc::get_mut for ArrayRef, but this won't work for the typed wrappers which only have access to an immutable ref.

I think a compact kernel may be the only consistent way to proceed, it is certainly the approach that is more consistent with other functionality

@emilk
Copy link
Contributor

emilk commented Nov 25, 2024

The new approach of fn shrink_to_fit(&mut self) works well: #6790

The only thing left to implement is to handle the case of a shared ArrayRef, which I see three paths forward on:

A) ignore it (if it is shared, we can't free up memory)
B) Clone the contents with a call to self.slice(0, self.len()) and then call .shrink_to_fit on the results (will only work if slice returns a non-shared ArrayRef)
C) Add trait Table { fn clone(&self) -> Box<dyn Array> and then call that in the shared case, and call .shrink_to_fit on the result (bullet-proof).

I'm actually happy with any of these three paths.

Please take a look at #6790 and tell me what you think. I find it a rather clean approach, as it doesn't require the dynamic downcasting of arrow-select

@alamb alamb added the arrow Changes to the arrow crate label Dec 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
4 participants