Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Separate null count and minmax from column stats #11860

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

zation99
Copy link
Contributor

Summary:
PR to address #11741

  • Removed the use of columnHasNulls in RowContainer and replaced them with row stats
  • Separate null count/sumBytes from minmax. In the case of rows erasure, only min/max is invalidated.

Differential Revision: D67229925

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Dec 14, 2024
Copy link

netlify bot commented Dec 14, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 1b5122a
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/67628605a4d98900082619cd

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D67229925

zation99 added a commit to zation99/velox that referenced this pull request Dec 14, 2024
…ator#11860)

Summary:

PR to address facebookincubator#11741

- Removed the use of columnHasNulls in RowContainer and replaced them with row stats
- Separate null count/sumBytes from minmax. In the case of rows erasure, only min/max is invalidated.

Differential Revision: D67229925
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D67229925

Copy link
Contributor

@xiaoxmeng xiaoxmeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zation99 thanks for the improvement % comments. Thanks!

velox/exec/HashProbe.cpp Outdated Show resolved Hide resolved
@@ -303,7 +303,7 @@ std::optional<uint64_t> HashProbe::estimatedRowSize(
varSizeListColumnsStats.reserve(varSizedColumns.size());
for (uint32_t i = 0; i < varSizedColumns.size(); ++i) {
auto statsOpt = columnStats(varSizedColumns[i]);
if (!statsOpt.has_value()) {
if (!statsOpt.has_value() || !statsOpt->isMinMaxColumnStatsValid()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/isMinMaxColumnStatsValid/minMaxColumnStatsValid/

@@ -336,16 +334,38 @@ char* RowContainer::initializeRow(char* row, bool reuse) {
return row;
}

/**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put comment in header.

minMaxStatsValid_ = false;
}

bool isMinMaxColumnStatsValid() const {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/isMinMaxColumnStatsValid/minMaxColumnStatsValid/

@@ -229,6 +250,7 @@ class RowColumn {

uint32_t nonNullCount_{0};
uint32_t nullCount_{0};
bool minMaxStatsValid_{true};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move this right after min max bytes?

inline bool columnHasNulls(int32_t columnIndex) const {
return columnHasNulls_[columnIndex];
return columnStats(columnIndex).has_value() &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we always has column stats now so can we change columnStats() to const RowColumn::Stats& RowContainer::columnStats? thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's unlikely but is it possible that it is initialized with empty rows?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The nullopt is to indicate invalidated column stats. Now the invalidation will be of finer granularity, So the column stats for each column will always be valid (not nullopt) but just some fields inside column stats invalidated under some conditions.

// fields. Index aligns with 'rowColumns_'.
std::vector<RowColumn::Stats> rowColumnsStats_;
// Indicates if the min/max column stats are valid. This is set to false
// whenever a row is erased.
bool rowColumnsStatsMinMaxValid_ = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/minMaxColumnStatsValid_/

We need to set it back to true in RowContainer::clear() and please also comment about this

void RowContainer::eraseRows(folly::Range<char**> rows) {
freeRowsExtraMemory(rows, /*freeNextRowVector=*/true);
for (auto* row : rows) {
VELOX_CHECK(!bits::isBitSet(row, freeFlagOffset_), "Double free of row");
bits::setBit(row, freeFlagOffset_);
nextFree(row) = firstFreeRow_;
firstFreeRow_ = row;

removeRowFromColumnStats(row);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/removeRowFromColumnStats/removeRowColumnStats/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... it is actually meant to remove some rows info from the rowColumnStats. removeRowColumnStats sounds like we removed all the stats. wdyt?

// columnHasNulls_ flag is false by default.
inline void updateColumnHasNulls(int32_t columnIndex, bool hasNulls) {
columnHasNulls_[columnIndex] = columnHasNulls_[columnIndex] || hasNulls;
void invalidateMinMaxColumnStats() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we use this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we have eraseRows or setAllNull

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe some additional ask. And maybe we can discuss this and potentially have a followup: Fixed column size stats can immune from erasure. Do we want to indicate in the column stats if this column is fixed/variable length column? We just need to set the stats size at construction time and skip updating them. Might save some cycles.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. I'll make a follow up task for this.

velox/exec/tests/RowContainerTest.cpp Show resolved Hide resolved
zation99 added a commit to zation99/velox that referenced this pull request Dec 14, 2024
…ator#11860)

Summary:

PR to address facebookincubator#11741

- Removed the use of columnHasNulls in RowContainer and replaced them with row stats
- Separate null count/sumBytes from minmax. In the case of rows erasure, only min/max is invalidated.

Differential Revision: D67229925
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D67229925

zation99 added a commit to zation99/velox that referenced this pull request Dec 14, 2024
…ator#11860)

Summary:

PR to address facebookincubator#11741

- Removed the use of columnHasNulls in RowContainer and replaced them with row stats
- Separate null count/sumBytes from minmax. In the case of rows erasure, only min/max is invalidated.

Differential Revision: D67229925
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D67229925

zation99 added a commit to zation99/velox that referenced this pull request Dec 15, 2024
…ator#11860)

Summary:

PR to address facebookincubator#11741

- Removed the use of columnHasNulls in RowContainer and replaced them with row stats
- Separate null count/sumBytes from minmax. In the case of rows erasure, only min/max is invalidated.

Differential Revision: D67229925
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D67229925

zation99 added a commit to zation99/velox that referenced this pull request Dec 15, 2024
…ator#11860)

Summary:

PR to address facebookincubator#11741

- Removed the use of columnHasNulls in RowContainer and replaced them with row stats
- Separate null count/sumBytes from minmax. In the case of rows erasure, only min/max is invalidated.

Differential Revision: D67229925
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D67229925

zation99 added a commit to zation99/velox that referenced this pull request Dec 15, 2024
…ator#11860)

Summary:

PR to address facebookincubator#11741

- Removed the use of columnHasNulls in RowContainer and replaced them with row stats
- Separate null count/sumBytes from minmax. In the case of rows erasure, only min/max is invalidated.

Differential Revision: D67229925
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D67229925

zation99 added a commit to zation99/velox that referenced this pull request Dec 15, 2024
…ator#11860)

Summary:

PR to address facebookincubator#11741

- Removed the use of columnHasNulls in RowContainer and replaced them with row stats
- Separate null count/sumBytes from minmax. In the case of rows erasure, only min/max is invalidated.

Differential Revision: D67229925
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D67229925

zation99 added a commit to zation99/velox that referenced this pull request Dec 15, 2024
…ator#11860)

Summary:

PR to address facebookincubator#11741

- Removed the use of columnHasNulls in RowContainer and replaced them with row stats
- Separate null count/sumBytes from minmax. In the case of rows erasure, only min/max is invalidated.

Differential Revision: D67229925
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D67229925

@zation99 zation99 requested a review from xiaoxmeng December 16, 2024 05:42
@@ -316,15 +336,28 @@ class RowContainer {
: 0);
}

/**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use //

velox/exec/tests/RowContainerTest.cpp Outdated Show resolved Hide resolved
zation99 added a commit to zation99/velox that referenced this pull request Dec 17, 2024
…ator#11860)

Summary:

PR to address facebookincubator#11741

- Removed the use of columnHasNulls in RowContainer and replaced them with row stats
- Separate null count/sumBytes from minmax. In the case of rows erasure, only min/max is invalidated.

Differential Revision: D67229925
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D67229925

/// Merges multiple aggregated stats of the same column into a single one.
static Stats merge(const std::vector<Stats>& statsList);

private:
// Aggregated stats for non-null rows of the column.
int32_t minBytes_{0};
int32_t maxBytes_{0};
bool minMaxStatsValid_{true};
Copy link
Contributor

@tanjialiang tanjialiang Dec 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make the name of this variable more general? We could potentially have some more stats that could be invalidable.
And shall we add comments to the part of the stats that are invalidable and the other part of the stats that are not? Something like:

int32_t minBytes_{0};
int32_t maxBytes_{0};

// Flag that ... above stats ...
bool fragileStatsValid_{true};

uint64_t sumBytes_{0};
uint32_t nonNullCount_{0};
uint32_t nullCount_{0};

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tend to leave minMax name here for now to make it more explicit so people easily know what stats are invalidated. But in the future, if there are more stats that cannot be tracked, yes I think we should rename it.

Comment on lines 343 to 344
/// @param keepNullCount - If true, the null count is kept if originally null
/// or increase if originally not null.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If true, the stats is updated as if the row is replaced with an all-null row.

@tanjialiang
Copy link
Contributor

Thanks @zation99 for the changes! Left some comments and we can have some discussions offline.

zation99 added a commit to zation99/velox that referenced this pull request Dec 18, 2024
…ator#11860)

Summary:

PR to address facebookincubator#11741

- Removed the use of columnHasNulls in RowContainer and replaced them with row stats
- Separate null count/sumBytes from minmax. In the case of rows erasure, only min/max is invalidated.

Differential Revision: D67229925
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D67229925

zation99 added a commit to zation99/velox that referenced this pull request Dec 18, 2024
…ator#11860)

Summary:

PR to address facebookincubator#11741

- Removed the use of columnHasNulls in RowContainer and replaced them with row stats
- Separate null count/sumBytes from minmax. In the case of rows erasure, only min/max is invalidated.

Differential Revision: D67229925
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D67229925

zation99 added a commit to zation99/velox that referenced this pull request Dec 18, 2024
…ator#11860)

Summary:

PR to address facebookincubator#11741

- Removed the use of columnHasNulls in RowContainer and replaced them with row stats
- Separate null count/sumBytes from minmax. In the case of rows erasure, only min/max is invalidated.

Differential Revision: D67229925
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D67229925

…ator#11860)

Summary:

PR to address facebookincubator#11741

- Removed the use of columnHasNulls in RowContainer and replaced them with row stats
- Separate null count/sumBytes from minmax. In the case of rows erasure, only min/max is invalidated.

Differential Revision: D67229925
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D67229925

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. fb-exported
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants