-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Separate null count and minmax from column stats #11860
base: main
Are you sure you want to change the base?
Conversation
✅ Deploy Preview for meta-velox canceled.
|
This pull request was exported from Phabricator. Differential Revision: D67229925 |
…ator#11860) Summary: PR to address facebookincubator#11741 - Removed the use of columnHasNulls in RowContainer and replaced them with row stats - Separate null count/sumBytes from minmax. In the case of rows erasure, only min/max is invalidated. Differential Revision: D67229925
c6c4c8c
to
329f9a9
Compare
This pull request was exported from Phabricator. Differential Revision: D67229925 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zation99 thanks for the improvement % comments. Thanks!
velox/exec/HashProbe.cpp
Outdated
@@ -303,7 +303,7 @@ std::optional<uint64_t> HashProbe::estimatedRowSize( | |||
varSizeListColumnsStats.reserve(varSizedColumns.size()); | |||
for (uint32_t i = 0; i < varSizedColumns.size(); ++i) { | |||
auto statsOpt = columnStats(varSizedColumns[i]); | |||
if (!statsOpt.has_value()) { | |||
if (!statsOpt.has_value() || !statsOpt->isMinMaxColumnStatsValid()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/isMinMaxColumnStatsValid/minMaxColumnStatsValid/
velox/exec/RowContainer.cpp
Outdated
@@ -336,16 +334,38 @@ char* RowContainer::initializeRow(char* row, bool reuse) { | |||
return row; | |||
} | |||
|
|||
/** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Put comment in header.
velox/exec/RowContainer.h
Outdated
minMaxStatsValid_ = false; | ||
} | ||
|
||
bool isMinMaxColumnStatsValid() const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/isMinMaxColumnStatsValid/minMaxColumnStatsValid/
velox/exec/RowContainer.h
Outdated
@@ -229,6 +250,7 @@ class RowColumn { | |||
|
|||
uint32_t nonNullCount_{0}; | |||
uint32_t nullCount_{0}; | |||
bool minMaxStatsValid_{true}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move this right after min max bytes?
velox/exec/RowContainer.h
Outdated
inline bool columnHasNulls(int32_t columnIndex) const { | ||
return columnHasNulls_[columnIndex]; | ||
return columnStats(columnIndex).has_value() && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since we always has column stats now so can we change columnStats() to const RowColumn::Stats& RowContainer::columnStats? thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's unlikely but is it possible that it is initialized with empty rows?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The nullopt is to indicate invalidated column stats. Now the invalidation will be of finer granularity, So the column stats for each column will always be valid (not nullopt) but just some fields inside column stats invalidated under some conditions.
velox/exec/RowContainer.h
Outdated
// fields. Index aligns with 'rowColumns_'. | ||
std::vector<RowColumn::Stats> rowColumnsStats_; | ||
// Indicates if the min/max column stats are valid. This is set to false | ||
// whenever a row is erased. | ||
bool rowColumnsStatsMinMaxValid_ = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/minMaxColumnStatsValid_/
We need to set it back to true in RowContainer::clear() and please also comment about this
velox/exec/RowContainer.cpp
Outdated
void RowContainer::eraseRows(folly::Range<char**> rows) { | ||
freeRowsExtraMemory(rows, /*freeNextRowVector=*/true); | ||
for (auto* row : rows) { | ||
VELOX_CHECK(!bits::isBitSet(row, freeFlagOffset_), "Double free of row"); | ||
bits::setBit(row, freeFlagOffset_); | ||
nextFree(row) = firstFreeRow_; | ||
firstFreeRow_ = row; | ||
|
||
removeRowFromColumnStats(row); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/removeRowFromColumnStats/removeRowColumnStats/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... it is actually meant to remove some rows info from the rowColumnStats. removeRowColumnStats sounds like we removed all the stats. wdyt?
// columnHasNulls_ flag is false by default. | ||
inline void updateColumnHasNulls(int32_t columnIndex, bool hasNulls) { | ||
columnHasNulls_[columnIndex] = columnHasNulls_[columnIndex] || hasNulls; | ||
void invalidateMinMaxColumnStats() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we use this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we have eraseRows or setAllNull
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe some additional ask. And maybe we can discuss this and potentially have a followup: Fixed column size stats can immune from erasure. Do we want to indicate in the column stats if this column is fixed/variable length column? We just need to set the stats size at construction time and skip updating them. Might save some cycles.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea. I'll make a follow up task for this.
…ator#11860) Summary: PR to address facebookincubator#11741 - Removed the use of columnHasNulls in RowContainer and replaced them with row stats - Separate null count/sumBytes from minmax. In the case of rows erasure, only min/max is invalidated. Differential Revision: D67229925
329f9a9
to
8f9663b
Compare
This pull request was exported from Phabricator. Differential Revision: D67229925 |
…ator#11860) Summary: PR to address facebookincubator#11741 - Removed the use of columnHasNulls in RowContainer and replaced them with row stats - Separate null count/sumBytes from minmax. In the case of rows erasure, only min/max is invalidated. Differential Revision: D67229925
8f9663b
to
592bcf9
Compare
This pull request was exported from Phabricator. Differential Revision: D67229925 |
…ator#11860) Summary: PR to address facebookincubator#11741 - Removed the use of columnHasNulls in RowContainer and replaced them with row stats - Separate null count/sumBytes from minmax. In the case of rows erasure, only min/max is invalidated. Differential Revision: D67229925
592bcf9
to
4f90106
Compare
This pull request was exported from Phabricator. Differential Revision: D67229925 |
…ator#11860) Summary: PR to address facebookincubator#11741 - Removed the use of columnHasNulls in RowContainer and replaced them with row stats - Separate null count/sumBytes from minmax. In the case of rows erasure, only min/max is invalidated. Differential Revision: D67229925
4f90106
to
cb4e8af
Compare
This pull request was exported from Phabricator. Differential Revision: D67229925 |
…ator#11860) Summary: PR to address facebookincubator#11741 - Removed the use of columnHasNulls in RowContainer and replaced them with row stats - Separate null count/sumBytes from minmax. In the case of rows erasure, only min/max is invalidated. Differential Revision: D67229925
cb4e8af
to
605a4e2
Compare
This pull request was exported from Phabricator. Differential Revision: D67229925 |
…ator#11860) Summary: PR to address facebookincubator#11741 - Removed the use of columnHasNulls in RowContainer and replaced them with row stats - Separate null count/sumBytes from minmax. In the case of rows erasure, only min/max is invalidated. Differential Revision: D67229925
605a4e2
to
067ec67
Compare
This pull request was exported from Phabricator. Differential Revision: D67229925 |
…ator#11860) Summary: PR to address facebookincubator#11741 - Removed the use of columnHasNulls in RowContainer and replaced them with row stats - Separate null count/sumBytes from minmax. In the case of rows erasure, only min/max is invalidated. Differential Revision: D67229925
067ec67
to
8425802
Compare
This pull request was exported from Phabricator. Differential Revision: D67229925 |
velox/exec/RowContainer.h
Outdated
@@ -316,15 +336,28 @@ class RowContainer { | |||
: 0); | |||
} | |||
|
|||
/** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use //
…ator#11860) Summary: PR to address facebookincubator#11741 - Removed the use of columnHasNulls in RowContainer and replaced them with row stats - Separate null count/sumBytes from minmax. In the case of rows erasure, only min/max is invalidated. Differential Revision: D67229925
8425802
to
108558f
Compare
This pull request was exported from Phabricator. Differential Revision: D67229925 |
/// Merges multiple aggregated stats of the same column into a single one. | ||
static Stats merge(const std::vector<Stats>& statsList); | ||
|
||
private: | ||
// Aggregated stats for non-null rows of the column. | ||
int32_t minBytes_{0}; | ||
int32_t maxBytes_{0}; | ||
bool minMaxStatsValid_{true}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make the name of this variable more general? We could potentially have some more stats that could be invalidable.
And shall we add comments to the part of the stats that are invalidable and the other part of the stats that are not? Something like:
int32_t minBytes_{0};
int32_t maxBytes_{0};
// Flag that ... above stats ...
bool fragileStatsValid_{true};
uint64_t sumBytes_{0};
uint32_t nonNullCount_{0};
uint32_t nullCount_{0};
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tend to leave minMax name here for now to make it more explicit so people easily know what stats are invalidated. But in the future, if there are more stats that cannot be tracked, yes I think we should rename it.
velox/exec/RowContainer.h
Outdated
/// @param keepNullCount - If true, the null count is kept if originally null | ||
/// or increase if originally not null. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If true, the stats is updated as if the row is replaced with an all-null row.
Thanks @zation99 for the changes! Left some comments and we can have some discussions offline. |
108558f
to
c1e7abe
Compare
…ator#11860) Summary: PR to address facebookincubator#11741 - Removed the use of columnHasNulls in RowContainer and replaced them with row stats - Separate null count/sumBytes from minmax. In the case of rows erasure, only min/max is invalidated. Differential Revision: D67229925
This pull request was exported from Phabricator. Differential Revision: D67229925 |
…ator#11860) Summary: PR to address facebookincubator#11741 - Removed the use of columnHasNulls in RowContainer and replaced them with row stats - Separate null count/sumBytes from minmax. In the case of rows erasure, only min/max is invalidated. Differential Revision: D67229925
c1e7abe
to
94828f7
Compare
This pull request was exported from Phabricator. Differential Revision: D67229925 |
…ator#11860) Summary: PR to address facebookincubator#11741 - Removed the use of columnHasNulls in RowContainer and replaced them with row stats - Separate null count/sumBytes from minmax. In the case of rows erasure, only min/max is invalidated. Differential Revision: D67229925
94828f7
to
c7d3840
Compare
This pull request was exported from Phabricator. Differential Revision: D67229925 |
…ator#11860) Summary: PR to address facebookincubator#11741 - Removed the use of columnHasNulls in RowContainer and replaced them with row stats - Separate null count/sumBytes from minmax. In the case of rows erasure, only min/max is invalidated. Differential Revision: D67229925
c7d3840
to
1b5122a
Compare
This pull request was exported from Phabricator. Differential Revision: D67229925 |
Summary:
PR to address #11741
Differential Revision: D67229925