Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor Join Requirements for Aggregate All Ever #36

Open
bcodell opened this issue Apr 13, 2023 · 3 comments · May be fixed by #39
Open

Refactor Join Requirements for Aggregate All Ever #36

bcodell opened this issue Apr 13, 2023 · 3 comments · May be fixed by #39
Assignees
Labels
bug Something isn't working

Comments

@bcodell
Copy link
Collaborator

bcodell commented Apr 13, 2023

Description

The aggregate_all_ever relationship can cause exploding joins if self-joins an activity with high cardinality. For example, on a customer stream with a placed_order activity, a dataset with primary activity placed_order and appended activity aggregate_all_ever_placed_order would create such an issue.

Proposal

Add conditional logic in the dataset macro:

  • If the relationship is aggregate_all_ever, aggregate the appended activity directly from the stream without joining to the primary activity CTE (change in here)
  • If the relationship is aggregate_all_ever, alter the join in the final rejoin CTE to join on entity_id instead of activity_id (change in here)
@bcodell
Copy link
Collaborator Author

bcodell commented Apr 13, 2023

@tnightengale ran into this in my own version of the project, and the proposed fix worked great - a dataset with >30MM rows that wasn't materializing in 20+ minutes is now running successfully in 30 seconds. This issue definitely creates a scaling problem for users, so good to resolve quickly. I can take a stab at a fix, but let me know if you have any questions.

@bcodell bcodell added the bug Something isn't working label Apr 13, 2023
@bcodell bcodell self-assigned this Apr 13, 2023
@tnightengale
Copy link
Owner

100% let's fix this asap 👍

@bcodell
Copy link
Collaborator Author

bcodell commented Apr 26, 2023

Sounds good! I'm on vacation the rest of this week but can tackle next week. If you want to get this fixed sooner, though, have at it! For reference, My approach in my own project was to add conditional logic based on the relationship of the appended activity. For everything but aggregate_all, use the existing join logic. Otherwise, don't join back to the primary CTE - just select the appended activity and aggregate on its own customer column. Then add the same conditional logic in the final join statement so that those CTEs join on customer instead of activity_id

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants