Avoid recreating empty output vector repeatedly in HiveDataSource (#6942)

Summary:
Pull Request resolved: #6942

In some low-selectivity queries with huge struct columns, the empty
output vector is destroyed and recreated repeatedly, making the query more
than 4 times slower.  Fix this by caching the empty output vector.
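
The caching idea can be illustrated outside Velox with a minimal, self-contained sketch. Everything here is hypothetical and for illustration only: `FakeRowVector`, `FakeDataSource`, the construction counter and `constructionsFor` are not Velox APIs, and the "huge struct column" is approximated by a vector with many children.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for an expensive-to-build row vector
// (not Velox's RowVector).
struct FakeRowVector {
  std::vector<std::string> children;
};
using FakeRowVectorPtr = std::shared_ptr<FakeRowVector>;

// Hypothetical data source that lazily caches its empty output.
class FakeDataSource {
 public:
  explicit FakeDataSource(std::size_t numChildren)
      : numChildren_(numChildren) {}

  // Builds the empty vector on first use and returns the same
  // instance on every later call.
  const FakeRowVectorPtr& getEmptyOutput() {
    if (!emptyOutput_) {
      emptyOutput_ = std::make_shared<FakeRowVector>();
      emptyOutput_->children.resize(numChildren_);
      ++constructions_;
    }
    return emptyOutput_;
  }

  // Number of times the empty vector was actually constructed.
  int constructions() const {
    return constructions_;
  }

 private:
  const std::size_t numChildren_;
  FakeRowVectorPtr emptyOutput_;
  int constructions_{0};
};

// Simulates `batches` consecutive batches in which no row passes the
// filter and reports how many empty vectors were constructed.
inline int constructionsFor(int batches) {
  FakeDataSource source(1000);
  for (int i = 0; i < batches; ++i) {
    FakeRowVectorPtr out = source.getEmptyOutput();
    assert(out->children.size() == 1000);
  }
  return source.constructions();
}
```

In this sketch every caller receives a pointer to the same cached instance, so the approach presumably relies on consumers treating the empty output as read-only.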

Reviewed By: oerling

Differential Revision: D50017249

fbshipit-source-id: 5c387ad1ee48ed7268b2c15570040cf7854c7aa9
Yuhta authored and facebook-github-bot committed Oct 6, 2023
1 parent e2a61da commit 4364ac5
Showing 2 changed files with 10 additions and 2 deletions.
4 changes: 2 additions & 2 deletions velox/connectors/hive/HiveDataSource.cpp
@@ -776,7 +776,7 @@ std::optional<RowVectorPtr> HiveDataSource::next(
   auto rowsRemaining = output_->size();
   if (rowsRemaining == 0) {
     // no rows passed the pushed down filters.
-    return RowVector::createEmpty(outputType_, pool_);
+    return getEmptyOutput();
   }

   auto rowVector = std::dynamic_pointer_cast<RowVector>(output_);
@@ -791,7 +791,7 @@ std::optional<RowVectorPtr> HiveDataSource::next(
   VELOX_CHECK_LE(rowsRemaining, rowsScanned);
   if (rowsRemaining == 0) {
     // No rows passed the remaining filter.
-    return RowVector::createEmpty(outputType_, pool_);
+    return getEmptyOutput();
   }

   if (rowsRemaining < rowVector->size()) {
8 changes: 8 additions & 0 deletions velox/connectors/hive/HiveDataSource.h
@@ -148,6 +148,13 @@ class HiveDataSource : public DataSource {
   void parseSerdeParameters(
       const std::unordered_map<std::string, std::string>& serdeParameters);

+  const RowVectorPtr& getEmptyOutput() {
+    if (!emptyOutput_) {
+      emptyOutput_ = RowVector::createEmpty(outputType_, pool_);
+    }
+    return emptyOutput_;
+  }
+
   const RowTypePtr outputType_;
   // Column handles for the partition key columns keyed on partition key column
   // name.
@@ -160,6 +167,7 @@ class HiveDataSource : public DataSource {
   std::unique_ptr<dwio::common::Reader> reader_;
   std::unique_ptr<exec::ExprSet> remainingFilterExprSet_;
   bool emptySplit_;
+  RowVectorPtr emptyOutput_;

   dwio::common::RuntimeStatistics runtimeStats_;
