Revert to unsorted Dataframe for Delta and Hudi #109

Merged 1 commit on Jan 14, 2025


istreeter
Collaborator

In #102 I changed how we partition and sort the DataFrame before writing to the Lake.

The new partitioning is working well for all output formats. But the extra sort step is not strictly necessary for Delta and Hudi. This commit removes the sort step for Delta/Hudi only, to slightly improve CPU utilization.

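
To illustrate the idea described above, here is a minimal, dependency-free sketch of "always partition, but only sort when the output format benefits from it". It is not the loader's actual code: the function name `prepare_for_write`, the column names `event_name` and `load_tstamp`, and the use of plain Python lists in place of a DataFrame are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]

# Formats for which the in-partition sort is skipped, per this PR's description.
UNSORTED_FORMATS = {"delta", "hudi"}


def prepare_for_write(
    rows: Iterable[Row],
    output_format: str,
    partition_key: str = "event_name",  # hypothetical column name
    sort_key: str = "load_tstamp",      # hypothetical column name
) -> Dict[Any, List[Row]]:
    """Group rows into partitions; sort each partition unless the format is Delta/Hudi."""
    partitions: Dict[Any, List[Row]] = defaultdict(list)

    # Partitioning step: applied for every output format.
    for row in rows:
        partitions[row[partition_key]].append(row)

    # Sort step: skipped for Delta and Hudi to save CPU.
    if output_format.lower() not in UNSORTED_FORMATS:
        for part in partitions.values():
            part.sort(key=lambda r: r[sort_key])

    return dict(partitions)
```

With `output_format="delta"` the rows in each partition keep their arrival order; with any other format name they come back ordered by the sort key.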
@istreeter istreeter force-pushed the remove-redundant-sort branch from e5d775c to ddcf87c Compare January 13, 2025 20:12
@istreeter
Collaborator Author

Sharing a graph which demonstrates this has a measurable impact. The metric shown is processing_latency_millis. The left-hand side (yellow) is after this change. The right-hand side (green) is before this change. There was a controlled, constant stream of incoming events, at ~156K events per second. The loader was running with 4 available CPUs.

The metric shows that the current PR improves the processing latency by a small but consistent amount.

[Graph: processing-latency-millis]

There was also a slight but measurable improvement in CPU utilization.

Contributor

@spenes spenes left a comment


👍

@istreeter istreeter merged commit 404aac2 into develop Jan 14, 2025
2 checks passed
@istreeter istreeter deleted the remove-redundant-sort branch January 14, 2025 10:18