Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Get streaming hash aggregate back #913

Merged
merged 1 commit into from
Feb 11, 2025
Merged

Conversation

avamingli
Copy link
Contributor

This PR reintroduces streaming hash aggregation functionality to CBDB (Cloudberry Database). While it is inspired by the implementation in GPDB, it is not a straightforward cherry-pick due to significant differences between GPDB and CBDB, particularly because CBDB is based on the PostgreSQL 14 kernel. As a result, I had to make several adjustments to resolve conflicts and adapt the code to the CBDB codebase.

For example:
The lookup_hash_entry function, which was modified in GPDB, has been removed in PostgreSQL 14. I adapted the code to work with the current CBDB implementation.

Some fault injection mechanisms from GPDB could not be ported directly due to incompatibilities or missing dependencies in CBDB. For now, these fault injection points have been removed to ensure that streaming hash aggregation works correctly in CBDB. Future work will focus on reintroducing these mechanisms once the necessary infrastructure is in place.

Checklist

  • Code changes reviewed and approved.
  • Regression tests passed.
  • Performance tests conducted for streaming hash aggregation.
  • Documentation updated (if applicable).

This PR lays the groundwork for streaming hash aggregation in CBDB and ensures compatibility with the PostgreSQL 14 kernel. Your feedback and review are greatly appreciated!


Let me know if you need further adjustments!

Fixes #ISSUE_Number

What does this PR do?

Type of Change

  • Bug fix (non-breaking change)
  • New feature (non-breaking change)
  • Breaking change (fix or feature with breaking changes)
  • Documentation update

Breaking Changes

Test Plan

  • Unit tests added/updated
  • Integration tests added/updated
  • Passed make installcheck
  • Passed make -C src/test installcheck-cbdb-parallel

Impact

Performance:

User-facing changes:

Dependencies:

Checklist

Additional Context

CI Skip Instructions


In multi-phase aggregate, this feature streams entries of the bottom
phase when out of memory instead of spilling to disk, it avoids the disk
I/O and roughly deduplicates to save the network I/O.

The planner always supports streaming aggregate and there are lots of
cases, but the executor part was lost while merging PostgreSQL 12, this
commit gets it back.
@avamingli
Copy link
Contributor Author

avamingli commented Feb 10, 2025

I retain the original commit message from GPDB since most of the work was done by them, not me.

@reshke
Copy link
Contributor

reshke commented Feb 10, 2025

Looks like CI need simple restart

@reshke
Copy link
Contributor

reshke commented Feb 10, 2025

I guess this is related to #914 so posting to link them

Copy link
Contributor

@my-ship-it my-ship-it left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@yjhjstz yjhjstz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@my-ship-it my-ship-it merged commit 36868ee into apache:main Feb 11, 2025
23 checks passed
@avamingli avamingli deleted the streaming_hash branch February 11, 2025 08:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants