Another attempt at introducing a columnated merge batcher #418

Merged

Conversation

antiguru (Member)

This builds on #414 but uses the recently merged Layout feature.

antiguru force-pushed the columnated_merge_batcher_new branch from 6eef06b to e232749 on November 16, 2023 at 21:54
antiguru force-pushed the columnated_merge_batcher_new branch from fe34a22 to fe33ecd on November 17, 2023 at 18:47
@@ -163,11 +161,13 @@ impl<T: Columnation> RetainFrom<T> for TimelyStack<T> {

 /// An immutable collection of update tuples, from a contiguous interval of logical times.
 #[derive(Abomonation)]
-pub struct OrdValBatch<L: Layout> {
+pub struct OrdValBatch<L: Layout, C> {
Member

I'm reading this, and .. while I can probably figure it out I have no clue what C is meant to be. Would you be willing to add some doc text about it? It's a bit awkward that it plays no role for some time, leaving no hints about what it is.

Member

I've read a bit more, and .. it seems like this is mostly a newtype hack? Like, C could be bool and String and line up with the Vec and TimelyStack merge batchers, if the impls said that. It feels a bit weird!
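
A minimal sketch of that reading, with invented names (VecMarker, ColMarker, BatchSketch) rather than the PR's types: the parameter carries no data, and only matters because separate impls can key off it to select a merge batcher.

    use std::marker::PhantomData;

    // Marker types standing in for "pair with the Vec merge batcher" and
    // "pair with the TimelyStack merge batcher".
    struct VecMarker;
    struct ColMarker;

    // A batch whose extra parameter `C` never appears in any field.
    struct BatchSketch<C> {
        _marker: PhantomData<C>,
    }

    // The marker is only used to select an associated batcher per impl.
    trait SelectBatcher {
        type Batcher;
    }

    struct VecMergeBatcher;
    struct ColumnMergeBatcher;

    impl SelectBatcher for BatchSketch<VecMarker> {
        type Batcher = VecMergeBatcher;
    }
    impl SelectBatcher for BatchSketch<ColMarker> {
        type Batcher = ColumnMergeBatcher;
    }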

Member Author

I think we could make the merge batcher part of the layout and it would resolve all these issues. But it also feels odd because it's not really part of the layout... What I can't wrap my head around is how to get from an update container on the layout to a merge batcher implementation.

Member Author

Nope, I can't see how to make it part of the layout. We don't know the Batch yet, and the batcher needs to be parametrized with Self a.k.a. the batch type :/

src/trace/mod.rs — outdated review thread, resolved
antiguru force-pushed the columnated_merge_batcher_new branch from 487b738 to cbd1dad on November 19, 2023 at 01:33
frankmcsherry (Member) left a comment

I left several comments, only a few of which are "important" to chase down and most are either questions or stylistic nits. The main one I'd like to figure out is whether you want to double the capacity of pending to ensure you ship full containers. That seems relatively impactful. Several other discussion points that we can chase down here or 1:1.

    keep.copy(datum);
}
else {
    builder.push((key.clone(), val.clone(), time.clone(), diff.clone()));
Member

Note to self that we should add a copy method to the builder trait.
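
A hypothetical shape for such a method — the trait below is a stand-in, not the actual builder trait: alongside push, which consumes owned components, the builder would accept borrowed components and copy them, avoiding the clones in the snippet above.

    trait BuilderSketch {
        type Key;
        type Val;
        type Time;
        type Diff;

        /// Adds an element from owned components.
        fn push(&mut self, element: (Self::Key, Self::Val, Self::Time, Self::Diff));

        /// Adds an element by copying borrowed components, so callers holding
        /// references do not need to clone before pushing.
        fn copy(&mut self, key: &Self::Key, val: &Self::Val, time: &Self::Time, diff: &Self::Diff);
    }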

src/trace/implementations/merge_batcher_col.rs — outdated review thread, resolved
}
}

struct TimelyStackQueue<T: Columnation> {
Member

The name is weird. :D At least, a stack and a queue are different things. I guess this is a queue implemented over a stack, though.
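
For orientation, a minimal sketch of the idea, assuming timely's columnation module; names and details are approximate rather than the PR's exact code. It is a FIFO reader layered over an append-only TimelyStack.

    use timely::container::columnation::{Columnation, TimelyStack};

    // A front-to-back reader over an append-only columnar container.
    struct QueueSketch<T: Columnation> {
        list: TimelyStack<T>,
        head: usize,
    }

    impl<T: Columnation> QueueSketch<T> {
        fn new(list: TimelyStack<T>) -> Self {
            Self { list, head: 0 }
        }

        /// Returns the next unread element and advances the cursor.
        fn pop(&mut self) -> Option<&T> {
            if self.head < self.list.len() {
                self.head += 1;
                self.list.get(self.head - 1)
            } else {
                None
            }
        }

        /// Returns the next unread element without advancing.
        fn peek(&self) -> Option<&T> {
            self.list.get(self.head)
        }
    }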

src/trace/implementations/merge_batcher_col.rs — outdated review thread, resolved
Comment on lines +172 to +182
    /// Buffer size (number of elements) to use for new/empty buffers.
    const fn buffer_size() -> usize {
        let size = std::mem::size_of::<(D, T, R)>();
        if size == 0 {
            Self::BUFFER_SIZE_BYTES
        } else if size <= Self::BUFFER_SIZE_BYTES {
            Self::BUFFER_SIZE_BYTES / size
        } else {
            1
        }
    }
Member

I think this is closely coupled to constants you've picked over in lg_alloc. It's probably wise to jot that down, in case someone changes that constant and then doesn't understand why nothing pages anymore.

Member

Also, and worth discussing: I think this size has roughly nothing to do with what lg_alloc will have to absorb? It indicates how large a Vec<(D, T, R)> will be, but the container will instead hold on to the spilled data, which can be arbitrarily larger (e.g. if D is mz_repr::Row), or arbitrarily smaller. It might be a good set-up for when you are able to back the Vec itself in lg_alloc, but .. is it really doing the right thing until then?

Seems not terrible ofc, but idk if we are sure that the backing allocations will be able to be paged out using this logic. Perhaps it's fine, in that it ensures that we take a large enough bite that the Vec would be that large, and if the alloc is smaller then we have at most 2x the Vec in memory (it plus a bit less in the region, in the worst case), which is decent insurance to provide (vs a smaller Vec which might have proportionally more in the region, and still not spill out).
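
A small illustration of that point, using String as a stand-in for a spilling type like mz_repr::Row: size_of measures only the inline portion of an element, so this byte budget says nothing about the heap data a region will actually hold.

    fn main() {
        // 24-byte String header plus two 8-byte integers on a 64-bit target.
        println!("{}", std::mem::size_of::<(String, u64, i64)>());

        let small = (String::from("a"), 0u64, 0i64);
        let large = ("x".repeat(1 << 20), 0u64, 0i64);

        // Both tuples report the same inline size, even though `large` owns
        // roughly a megabyte of heap data that a region would also absorb.
        assert_eq!(std::mem::size_of_val(&small), std::mem::size_of_val(&large));
    }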

Member Author

I added a pending_buffer_size function that returns 2 * Self::buffer_size(), so that pending contains enough data to fill a whole buffer. I still need to test the performance impact, but it seems like the right change to make.
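
A free-standing sketch of that change; the 8 KiB constant below is an assumption for illustration, standing in for BUFFER_SIZE_BYTES. Sizing pending at twice a regular buffer means draining it can always fill a whole output container.

    /// Illustrative byte budget, standing in for `Self::BUFFER_SIZE_BYTES`.
    const BUFFER_SIZE_BYTES: usize = 8 << 10;

    /// Number of `(D, T, R)` elements that fit the byte budget.
    const fn buffer_size<D, T, R>() -> usize {
        let size = std::mem::size_of::<(D, T, R)>();
        if size == 0 {
            BUFFER_SIZE_BYTES
        } else if size <= BUFFER_SIZE_BYTES {
            BUFFER_SIZE_BYTES / size
        } else {
            1
        }
    }

    /// Capacity for `pending`: twice a regular buffer, so compacting it can
    /// always emit at least one full container.
    const fn pending_buffer_size<D, T, R>() -> usize {
        buffer_size::<D, T, R>() * 2
    }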

src/trace/implementations/merge_batcher_col.rs — five outdated review threads, resolved
frankmcsherry (Member)

I have another question too: could the merger be for arbitrary BatchContainer types, or does it just work for TimelyStack? Is there some BC::prefered_capacity() method we might add to make it general?

Signed-off-by: Moritz Hoffmann <[email protected]>
antiguru (Member, Author)

> I have another question too: could the merger be for arbitrary BatchContainer types, or does it just work for TimelyStack? Is there some BC::prefered_capacity() method we might add to make it general?

I'd prefer a single general implementation, but it's not easy to get there:

  • For the vector-based implementation, we can recycle buffers from the input and don't need to traverse the pending buffer. We could, but we first would need to do the measurements to see what impact it might have.
  • For Columnation-based containers, we have a nice copy_destructured API, which allows us to avoid allocating a tuple just because we have no alternative way to represent its parts (a brief usage sketch follows this list). We don't have this API for Vec, so we'd need to invent something.
  • The rest seems to be mostly typing work, so not very difficult, and it might create some clarity around the C parameter to OrdValBatch.
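
A rough usage sketch of the copy_destructured point from the list above, assuming timely's columnation module (the exact signature may differ): the parts of an update are copied into the container's regions without first assembling an owned tuple.

    use timely::container::columnation::TimelyStack;

    fn main() {
        // A columnar container of (key, diff) pairs.
        let mut stack: TimelyStack<(String, i64)> = TimelyStack::with_capacity(4);

        let key = String::from("some key");
        let diff = 1i64;

        // Copies the borrowed parts directly; no owned (String, i64) tuple
        // needs to be built just to push one element.
        stack.copy_destructured(&key, &diff);

        assert_eq!(stack.len(), 1);
        assert_eq!(stack[0].0, "some key");
    }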

Signed-off-by: Moritz Hoffmann <[email protected]>
antiguru force-pushed the columnated_merge_batcher_new branch from 0a68e4f to 8fb8e61 on November 20, 2023 at 02:14
frankmcsherry (Member)

Looks good, and let's merge!

frankmcsherry merged commit 9163a29 into TimelyDataflow:master on November 20, 2023
1 check passed
antiguru deleted the columnated_merge_batcher_new branch on November 20, 2023 at 15:09