Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revert "Backport 0.57.9 reverts" #20202

Closed
wants to merge 10 commits into from

Conversation

antiguru
Copy link
Member

@antiguru antiguru commented Jun 28, 2023

Reverts #20124.

Only merge after TimelyDataflow/differential-dataflow#398 is fixed. We selected to have a solution purely within Materialize to address the issue.

Best viewed commit-by-commit:

  1. Add the arrangement heap size feature, clean revert.
  2. Remove the consolidate functionality introduced by the first PR. This avoids the memory regression we encountered.
  3. Add the MzArranged wrapper to track trace usage. This resolves holding on to the trace, even in other situations than consolidate.
  4. Move compaction, which should only be a cosmetic change.

@antiguru antiguru force-pushed the revert-20124-backport_57.9_reverts branch 3 times, most recently from 1ba6c81 to 5d4f5c8 Compare July 11, 2023 14:46
@antiguru antiguru marked this pull request as ready for review July 11, 2023 16:30
@antiguru antiguru requested review from a team July 11, 2023 16:30
@antiguru antiguru requested review from a team as code owners July 11, 2023 16:30
@antiguru antiguru removed request for a team July 11, 2023 17:00
@uce
Copy link
Contributor

uce commented Jul 12, 2023

Great to see this back on track to main. 🎉

Did you have a chance to verify that the test cases we added to detect increased memory usage fail without your additional changes (bc0ae72, 5d4f5c8)?

Copy link
Contributor

@vmarcos vmarcos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new approach looks reasonable to me, but I feel that the size and complexity of the change after revert/backport/revert rounds is making reviewing more difficult here (at least for me). For example, there is a change below regarding AHash that appears to have been necessary before, but is no longer now?

Perhaps we should use complementary techniques to increase confidence in addition to reviewing. A few suggestions for your consideration of what is feasible:

  1. It would be good to get a full run of the feature benchmark for this PR, now that it includes memory measurements.
  2. It would also be interesting to see if this PR changes the results of testdrive: Check mz_arrangement_sharing #20180.
  3. It would be good to run this change in staging with a large-ish workload (e.g., the TPC-H load generator with updates and all 22 queries maintained as indexed views) and contrast memory requirements for the same without this change.
  4. Ask the QA team for what else they could throw at this.

src/compute/Cargo.toml Outdated Show resolved Hide resolved
@antiguru
Copy link
Member Author

Did you have a chance to verify that the test cases we added to detect increased memory usage fail without your additional changes (bc0ae72, 5d4f5c8)?

I manually ran the feature benchmark with CustomerWorkload1, and it shows an increase in memory but doesn't report a regression. A change of 6.7% seems to be under the threshold for reporting regressions, so we might need to tune regression reporting on memory stats.

@antiguru antiguru force-pushed the revert-20124-backport_57.9_reverts branch from 5d4f5c8 to 4e94ec2 Compare July 12, 2023 14:17
@antiguru
Copy link
Member Author

I restructured the PR to have a clean revert at the beginning, then removal of mz_consolidate[_if], followed by MzArrange and a compaction nit change. I kicked off a feature benchmark run on the initial commit to show that there was a memory regression, which should have disappeared now.

https://buildkite.com/materialize/nightlies/builds/2800

@antiguru
Copy link
Member Author

The memory regression test would have caught the increased memory utilization: https://buildkite.com/materialize/nightlies/builds/2800#01894a9a-869b-4320-bf7b-d33ed2ef0351

This is a great signal and means that once this PR passes the feature benchmark, we have much higher confidence that it doesn't cause the same regression again.

@antiguru antiguru force-pushed the revert-20124-backport_57.9_reverts branch 4 times, most recently from ba93001 to a723330 Compare July 14, 2023 15:55
@teskje teskje self-requested a review July 17, 2023 09:11
@def-
Copy link
Contributor

def- commented Jul 17, 2023

This seems like it causes some additional arrangements:

[2023-07-17T09:53:11Z] mz-arrangement-sharing.td:804:1: error: non-matching rows: expected:
[2023-07-17T09:53:11Z] [["Arrange Timely(Reachability)"], ["ArrangeByKey Compute(DataflowCurrent)"], ["ArrangeByKey Compute(DataflowDependency)"], ["ArrangeByKey Compute(FrontierCurrent)"], ["ArrangeByKey Compute(FrontierDelay)"], ["ArrangeByKey Compute(ImportFrontierCurrent)"], ["ArrangeByKey Compute(PeekCurrent)"], ["ArrangeByKey Compute(PeekDuration)"], ["ArrangeByKey Differential(ArrangementBatches)"], ["ArrangeByKey Differential(ArrangementRecords)"], ["ArrangeByKey Differential(Sharing)"], ["ArrangeByKey Timely(Addresses)"], ["ArrangeByKey Timely(Channels)"], ["ArrangeByKey Timely(Elapsed)"], ["ArrangeByKey Timely(Histogram)"], ["ArrangeByKey Timely(MessagesReceived)"], ["ArrangeByKey Timely(MessagesSent)"], ["ArrangeByKey Timely(Operates)"], ["ArrangeByKey Timely(Parks)"]]
[2023-07-17T09:53:11Z] got:
[2023-07-17T09:53:11Z] [["Arrange Timely(Reachability)"], ["ArrangeByKey Compute(ArrangementHeapAllocations)"], ["ArrangeByKey Compute(ArrangementHeapCapacity)"], ["ArrangeByKey Compute(ArrangementHeapSize)"], ["ArrangeByKey Compute(DataflowCurrent)"], ["ArrangeByKey Compute(DataflowDependency)"], ["ArrangeByKey Compute(FrontierCurrent)"], ["ArrangeByKey Compute(FrontierDelay)"], ["ArrangeByKey Compute(ImportFrontierCurrent)"], ["ArrangeByKey Compute(PeekCurrent)"], ["ArrangeByKey Compute(PeekDuration)"], ["ArrangeByKey Compute(ShutdownDuration)"], ["ArrangeByKey Differential(ArrangementBatches)"], ["ArrangeByKey Differential(ArrangementRecords)"], ["ArrangeByKey Differential(Sharing)"], ["ArrangeByKey Timely(Addresses)"], ["ArrangeByKey Timely(Channels)"], ["ArrangeByKey Timely(Elapsed)"], ["ArrangeByKey Timely(Histogram)"], ["ArrangeByKey Timely(MessagesReceived)"], ["ArrangeByKey Timely(MessagesSent)"], ["ArrangeByKey Timely(Operates)"], ["ArrangeByKey Timely(Parks)"]]
[2023-07-17T09:53:11Z] Poor diff:
[2023-07-17T09:53:11Z] + "ArrangeByKey Compute(ArrangementHeapAllocations)"
[2023-07-17T09:53:11Z] + "ArrangeByKey Compute(ArrangementHeapCapacity)"
[2023-07-17T09:53:11Z] + "ArrangeByKey Compute(ArrangementHeapSize)"
[2023-07-17T09:53:11Z] + "ArrangeByKey Compute(ShutdownDuration)"

Seen in #20590 Is this expected?

Edit: This is in the introspection debugging views, so probably fine.

@teskje
Copy link
Contributor

teskje commented Jul 17, 2023

Edit: This is in the introspection debugging views, so probably fine.

Just confirming that these are expected!

Copy link
Contributor

@def- def- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All other mz_arrangement_sharings were the same as before. Please rebase this on main, otherwise we'll have merge skew with the mz_arrangement_sharing checking in testdrive.

@antiguru antiguru force-pushed the revert-20124-backport_57.9_reverts branch 2 times, most recently from 8b895eb to e25d57f Compare July 17, 2023 11:59
@ggnall
Copy link
Contributor

ggnall commented Jul 17, 2023

antiguru and others added 2 commits July 18, 2023 11:10
MzArranged. This tracks whether any client uses the trace, and only then
installs the arrangement size logging operator.

Signed-off-by: Moritz Hoffmann <[email protected]>
This commit adds the enable_arrangement_size_logging feature flag and wires
it up such that clusters construct dataflows with or without arrangement
size logging.

Signed-off-by: Moritz Hoffmann <[email protected]>
Signed-off-by: Moritz Hoffmann <[email protected]>
Signed-off-by: Moritz Hoffmann <[email protected]>
@antiguru antiguru force-pushed the revert-20124-backport_57.9_reverts branch from e25d57f to 5dc6505 Compare July 18, 2023 09:11
Copy link
Contributor

@teskje teskje left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using unsafe to protect against the memory leak might be considered an antipattern, since unsafe is meant to protect against memory safety violations and memory leaks are considered safe in Rust. But I don't think we have a better way to force callers to double-check, so this seems fine. We should, however, also make MzArranged::trace unsafe, I think.

The plumbing for the feature flag is quite horrible, but I suppose we can live with it since we don't plan to keep it around forever. Is there a follow-up issue for removing it after a while, so we don't forget?

src/compute/src/extensions/arrange.rs Outdated Show resolved Hide resolved
src/compute/src/extensions/arrange.rs Outdated Show resolved Hide resolved
Arranged {
trace: arranged.trace,
stream,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to check my understanding: We now throw away the output of the unary operator instead of passing through its input. That means we will perform additional copies of the stream updates, right?

If we don't use the output anymore, should we make the operator a sink?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your understanding is correct. The problem is that we need to attach the logging operator to the stream potentially after someone else attached themselves to the stream already, so we need to clone anyways. The clones are cheap, because we send Rc<Batch>, so I'm not worried about this.

We can't turn the operator into a sink because sink doesn't give us access to the OperatorInfo, which we need to determine the operator's address to be able to activate it from the outside. We could rewrite it in terms of builder_rc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think unary is fine. Though should we maybe remove the output.session(&time).give_container(&mut buffer); to not make readers think that the output will be used?

src/compute/src/extensions/arrange.rs Outdated Show resolved Hide resolved
src/compute/src/extensions/arrange.rs Outdated Show resolved Hide resolved
src/compute/src/extensions/arrange.rs Show resolved Hide resolved
src/compute-client/src/protocol/command.rs Outdated Show resolved Hide resolved
src/compute/src/logging/reachability.rs Outdated Show resolved Hide resolved
src/compute/src/logging/timely.rs Outdated Show resolved Hide resolved
{
// Reuseable allocation for unpacking.
let mut datums = DatumVec::new();
let mut row_builder = Row::default();

// Safety: all `join_impl`s holds on to the trace.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Safety: all `join_impl`s holds on to the trace.
// Safety: all `join_impl`s hold on to the trace.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The linear join impls drop their trace handles when the corresponding streams reach the empty frontier. I assume that's allowed under the safety requirements?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's the expected behavior. In this case, the trace should only contain a single empty batch.

Signed-off-by: Moritz Hoffmann <[email protected]>
@antiguru antiguru closed this Aug 11, 2023
@antiguru antiguru deleted the revert-20124-backport_57.9_reverts branch December 22, 2023 20:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants