Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimization of filter operation on ByteViewArray #6154

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

chloro-pn
Copy link
Contributor

Which issue does this PR close?

Closes #.

Rationale for this change

Performing a filter operation on ByteViewArray only filters its view data members, but clones the entire buffers, which may result in the ByteViewArray holding some unnecessary buffers, which may prevent some memory resources from being released in a timely manner.
This PR records whether each buffer needs to be retained while filtering view data members, and updates the buffers of ByteViewArray using the recorded information.
The reason why this PR replaces unwanted buffers with empty buffers instead of deleting them directly is that this processing preserves the mapping relationship between views and buffers.

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jul 29, 2024
@alamb
Copy link
Contributor

alamb commented Jul 29, 2024

Thanks @chloro-pn. 🙏 We have gone back and forth on this idea while integrating StringView into datafusion

The StringViewArray has a gc method but this does require an extra copy of the views

https://docs.rs/arrow/latest/arrow/array/struct.GenericByteViewArray.html#method.gc

In fact @XiangpengHao used this API in apache/datafusion#11587 to solve exactly the problem you are describing (too much unused data in the buffers)

However, what I worry about is that the heuristic to determine when to compact the string data / buffers will not be ideal for any particular usecase and that one princple of this crate is to give the user maximal control over performance

So I would like to propose we support two different modes for filter kernels:

  1. Filter only views
  2. Filter the views and copy matchings strings to a new buffer

@XiangpengHao I wonder if you have any thoughts to add here?

@chloro-pn
Copy link
Contributor Author

chloro-pn commented Jul 29, 2024

From the documentation of gc, it appears to do at least two things:

  • Compress data buffer (expensive)
  • Clean up the buffers based on the view list (cheap)

The modification of this PR does not perform expensive operations of compressing data buffers, it only releases the reference count of buffer that no longer need to be held (the resources held by buffer are managed based on Arc's reference counting mechanism)
So this PR cannot completely solve the problem you mentioned above (because compared to the gc method, it only does the less expensive part, for example, if all the buffer in the buffers hold different parts of the same Bytes, then this PR is meaningless because Byte will always be held), but it also does not introduce expensive overhead.

@XiangpengHao
Copy link
Contributor

XiangpengHao commented Jul 29, 2024

I think this is a clever PR and is complementary to the gc approach. If I understand correctly, it replaces unused buffers with empty buffers so they can be released.

There's a hierarchy of gcs: (1) replace unused buffer with empty buffer, without touching the view array (2) directly remove the unused buffers (so that the buffer id array is shorter), but need to correct the buffer ids in the view array (3) complete gc, release all underlying buffers and create a new compact buffer, as implemented in the current gc function.

This PR makes the first approach, which is transparent to users and has relatively low overhead.

Moving forward, I think we need some tests to cover the case that unused buffers are actually replaced by empty buffers.

As a side note, you might want to check #6094 as well, as we found that excessive amount of buffers can significantly slows down common operations like concat or get_array_memory_size, in which case the second type of gc may help.

@chloro-pn
Copy link
Contributor Author

My suggestion is that we can treat the second type of GC like the current GC (third type GC) as an independent method that users can choose to call.
The first type of GC in this PR can be placed in a new filter method, temporarily called filter2, to distinguish the old filter method. The filter2 method can return some statistical information obtained during the filtering process, which can help users choose whether to perform other GC operations in the future. For example, we can return the number of empty buffers and the proportion of empty buffers in the buffers to guide users whether to perform the second type of GC.
Here, we can also obtain statistical information to guide users on whether to call the third type of GC, but I currently do not understand the implementation details of the third type of GC. I need to look at the source code later.
The reason for encapsulating the first type of GC into filter2 instead of providing it as a standalone method like other types of GC is:
It is lightweight and can be processed together during the filtering process. Putting the second type of GC in the filter method is not more advantageous than calling it as an independent method.

@alamb
Copy link
Contributor

alamb commented Jul 29, 2024

My concern with this approach (and with any of the "gc while filtering" approaches) is that it will work well for some use cases but be unavoidable overhead with other use cases.

I think we need an API that allows callers to specify how they want the GC to be done

What I have talked about with @XiangpengHao and a few others is making some sort of stateful filter API; Among other things such an API would give us a place to put options for filtering

So maybe it would look like

// create a new filterer 
let filterer = ArrayFilterer::new()
  // specify that GC should look for empty buffers
  .with_gc_strategy(GcStrategy::FindEmptyBuffers);

// feed in arrays and their filters (BooleanArray)
filterer.push(array1, filter1)?;
filterer.push(array2, filter2)?;

// Produce the final output array
let filtered_array = filterer.build();

@chloro-pn
Copy link
Contributor Author

My concern with this approach (and with any of the "gc while filtering" approaches) is that it will work well for some use cases but be unavoidable overhead with other use cases.

I think we need an API that allows callers to specify how they want the GC to be done

What I have talked about with @XiangpengHao and a few others is making some sort of stateful filter API; Among other things such an API would give us a place to put options for filtering

So maybe it would look like

// create a new filterer 
let filterer = ArrayFilterer::new()
  // specify that GC should look for empty buffers
  .with_gc_strategy(GcStrategy::FindEmptyBuffers);

// feed in arrays and their filters (BooleanArray)
filterer.push(array1, filter1)?;
filterer.push(array2, filter2)?;

// Produce the final output array
let filtered_array = filterer.build();

Provide a new API instead of modifying the existing one, agree with this and leave the choice to the user.

@alamb
Copy link
Contributor

alamb commented Jul 29, 2024

Provide a new API instead of modifying the existing one, agree with this and leave the choice to the user.

I will try and write this up over the next day or two into a ticket that is more coherent

@alamb
Copy link
Contributor

alamb commented Jul 31, 2024

I still have writing up the ideas into a more coherent form on my list, but I probably won't get to it for the next few days (I have too too many other projects in flight at the moment)

@alamb
Copy link
Contributor

alamb commented Aug 15, 2024

I have not had a chance to write up these ideas yet, I apologize

@chloro-pn
Copy link
Contributor Author

It's okay, this is not an urgent matter.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants