Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

compute: simplify mz_join_core result computation #19667

Merged
merged 1 commit into from
Jun 14, 2023

Conversation

teskje
Copy link
Contributor

@teskje teskje commented Jun 2, 2023

Previously, the mz_join_core code wrapped the result closure into another work_result closure, to deal with the fact that matches from the todo1 list had their value and diff fields swapped.

Instead of having a closure to unswap the fields, we can simply pass them to the Deferred constructor in the correct order instead. That is, the Deferred constructor should always receive the cursor/storage from input 1 first and from input 2 second. If we make this change, no work_result closure is necessary and the code become easier to reason about.

Motivation

  • This PR refactors existing code.

This is something I stumbled over while working on support for a fueled merge join strategy.

Tips for reviewer

Part of this PR is renaming the fields of Deferred to reflect that (first, second) shouldn't be (trace, batch) but (input 1, input 2). The rest of the PR removes the work_result wrapper closure by inlining its logic into Deferred::work.

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • This PR includes the following user-facing behavior changes:
    • N/A

@teskje teskje marked this pull request as ready for review June 2, 2023 12:29
@teskje teskje requested review from a team, antiguru and vmarcos June 2, 2023 12:29
Copy link
Contributor

@vmarcos vmarcos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed from my side, thanks!

@teskje
Copy link
Contributor Author

teskje commented Jun 5, 2023

On request from @antiguru I opened a backport PR for DD as well: TimelyDataflow/differential-dataflow#392

Previously, the mz_join_core code wrapped the `result` closure into
another `work_result` closure, to deal with the fact that matches from
the `todo1` list had their value and diff fields swapped.

Instead of having a closure to unswap the fields, we can simply pass
them to the `Deferred` constructor in the correct order instead. That
is, the `Deferred` constructor should always receive the cursor/storage
from input 1 first and from input 2 second. If we make this change, no
`work_result` closure is necessary and the code become easier to reason
about.
@teskje teskje force-pushed the simplify-mz_join_core-result branch from 870de8c to 8b6451b Compare June 5, 2023 15:56
Copy link
Member

@antiguru antiguru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I think this mostly looks good. One thing I'm not 100% sure about is renaming trace/batch to cursor 1/2. Can you confirm that they're used symmetrically, i.e., we could swap the arguments and produce the same result?

@teskje
Copy link
Contributor Author

teskje commented Jun 12, 2023

Here are the things that changes for new batches received from input 1 (nothing changes for batches from input 2):

  • We iterate over the batch in the outer loop and the trace in the inner loop. Previously that was flipped. That only affects the order in which outputs are produced, but we don't care about that (and the subsequent consolidation should sort anyway).
  • For an update we produce the time as (batch_t (join) meet) (join) trace_t whereas previously it was (trace_t (join) meet) (join) batch_t. AFAICT, join is associative and commutative, so we get the same results.
  • For an update we produce the diff as batch_diff.multiply(trace_diff), whereas previously it was trace_diff.multipy(batch_diff). Nothing states that multiply is a commutative operation, so we might get differences here. However, I would argue that it is more correct to always multiply input 1 and input 2 in the same order, rather than switching the order around depending on when updates arrive on the inputs.

@teskje teskje merged commit 3c33bef into MaterializeInc:main Jun 14, 2023
@teskje teskje deleted the simplify-mz_join_core-result branch June 14, 2023 08:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants