SplitReader refactor #8995
67dc076 to 227571f
@@ -93,7 +94,7 @@ class SplitReader {
       std::shared_ptr<common::MetadataFilter> metadataFilter,
       dwio::common::RuntimeStatistics& runtimeStats);

-  virtual uint64_t next(int64_t size, VectorPtr& output);
+  virtual uint64_t next(uint64_t size, VectorPtr& output);
Leave it as signed for safer comparison
@Yuhta Both HiveDataSource::next(uint64_t size, velox::ContinueFuture& /*future*/) and Reader::next(uint64_t size, velox::VectorPtr& result, ..) use uint64_t, and size is passed directly to the downstream functions without ever being changed. The uint64_t -> int64_t -> uint64_t conversion may cause problems instead. Shall we align all three function signatures?
In fact, I should have used uint64_t in SplitReader::next() when I created it, but I don't remember why I made it int64_t. I now realize that was not desirable, hence this fix.
OK, let's use uint64_t to keep it consistent with the data source and file readers in this case. In general, for new interfaces we should prefer signed types for any quantity variable, because they are much safer when subtraction and comparison are combined.
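To illustrate the point about subtraction and comparison, here is a minimal sketch. The two helper functions are hypothetical and not part of Velox; they only show how the same expression behaves with unsigned and signed operands when more rows have been consumed than the total.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers (not Velox APIs). Both answer the question
// "are more than `limit` rows still left to read?".

// Unsigned version: if consumed > total, the subtraction wraps around to a
// very large value, so the comparison is true even though nothing is left.
inline bool moreThanLimitLeftUnsigned(
    uint64_t total,
    uint64_t consumed,
    uint64_t limit) {
  return total - consumed > limit;
}

// Signed version: the difference becomes negative and the comparison
// gives the intuitive answer.
inline bool moreThanLimitLeftSigned(
    int64_t total,
    int64_t consumed,
    int64_t limit) {
  return total - consumed > limit;
}
```

With total = 10, consumed = 12 and limit = 5, the unsigned helper returns true while the signed one returns false. For in-range inputs such as total = 10, consumed = 2, both agree.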
@Yuhta I have addressed all comments except the int64_t -> uint64_t one, which is waiting for your reply. Can you please review again? Thanks!
@Yuhta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@yingsu00 thanks for the refactor % some minors.
@@ -141,6 +141,81 @@ void SplitReader::configureReaderOptions(
void SplitReader::prepareSplit(
    std::shared_ptr<common::MetadataFilter> metadataFilter,
    dwio::common::RuntimeStatistics& runtimeStats) {
  createReader();

  if (testEmptySplit(runtimeStats)) {
Is it good or reliable to do the empty check based on runtime stats?
s/testEmptySplit/checkIfEmptySplit/
@xiaoxmeng The check is not based on runtimeStats, but it needs to update runtimeStats if the split can be skipped.
Regarding the name, would isEmptySplit, hasEmptySplit, splitIsEmpty, or isSplitEmpty be better? These names leave no doubt about whether returning true means empty or not empty.

// Note that this doesn't apply to Hudi tables.
bool SplitReader::testEmptySplit(
    dwio::common::RuntimeStatistics& runtimeStats) {
  emptySplit_ = false;
createReader might have set emptySplit_ to true; do we want to revert that? And do we support reusing the same SplitReader on the same split? If not, let's add more sanity checks here. Thanks!
@xiaoxmeng Thanks for catching this. One SplitReader object is created for each split, and prepareSplit() is supposed to be called only once. I have made the following change here:
bool SplitReader::splitIsEmpty(dwio::common::RuntimeStatistics& runtimeStats) {
  // emptySplit_ may already be set if the data file is not found. In this case
  // we don't need to test further.
  if (emptySplit_) {
    return true;
  }
  ...
    return DataSource::kUnknownRowSize;
  }

  auto size = baseRowReader_->estimatedRowSize();
nit:
const auto sizeOr = baseRowReader_->estimatedRowSize();
return sizeOr.valueOr(DataSource::kUnknownRowSize);
Would you update this? Thanks!
done
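For readers unfamiliar with the nit, the sketch below shows the general pattern it relies on: replacing an explicit branch on an optional value with a single fallback call. It uses std::optional and its value_or member as a stand-in; the actual return type of estimatedRowSize() and the exact spelling of the fallback method in Velox are assumptions here, and the constant's value is assumed too.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Stand-in for DataSource::kUnknownRowSize; the value -1 is an assumption.
constexpr int64_t kUnknownRowSize = -1;

// Branching shape: check the optional, then unwrap it.
inline int64_t rowSizeWithBranch(std::optional<int64_t> size) {
  if (size.has_value()) {
    return size.value();
  }
  return kUnknownRowSize;
}

// Shape suggested by the nit: one call that supplies the fallback.
inline int64_t rowSizeWithFallback(std::optional<int64_t> sizeOr) {
  return sizeOr.value_or(kUnknownRowSize);
}
```

Both functions return the same result for every input; the second simply removes the branch and lets the local variable be const.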
    const std::shared_ptr<velox::connector::hive::HiveConnectorSplit>&
        hiveSplit,
    const std::shared_ptr<HiveTableHandle>& hiveTableHandle,
    const std::shared_ptr<common::ScanSpec>& scanSpec,
    const RowTypePtr readerOutputType,
nit: const RowTypePtr&
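A likely motivation for this nit is that passing a shared pointer by value copies it and bumps the reference count, while passing by const reference does not. The sketch below demonstrates the difference in a single-threaded setting; RowType is an empty stand-in and the alias is assumed to be a std::shared_ptr to a const RowType.

```cpp
#include <cassert>
#include <memory>

// Empty stand-in for the real RowType; the alias below is an assumption.
struct RowType {};
using RowTypePtr = std::shared_ptr<const RowType>;

// By value: the parameter is a copy, so the count seen inside is one higher
// than at the call site when an lvalue is passed.
inline long useCountByValue(const RowTypePtr type) {
  return type.use_count();
}

// By const reference: no copy is made and the count is unchanged.
inline long useCountByRef(const RowTypePtr& type) {
  return type.use_count();
}
```

Given a single owner `auto t = std::make_shared<const RowType>();`, `useCountByValue(t)` observes a count of 2 and `useCountByRef(t)` observes 1.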
e0040c1 to cbbfa10
@yingsu00 LGTM % nits. Thanks for the update!
    const std::shared_ptr<const HiveConfig>& hiveConfig,
    const RowTypePtr& readerOutputType,
    const std::shared_ptr<io::IoStatistics>& ioStats,
    FileHandleFactory* const fileHandleFactory,
s/FileHandleFactory* const fileHandleFactory/FileHandleFactory* fileHandleFactory/
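The comment does not state a reason, but one plausible rationale is that a top-level const on a by-value pointer parameter is not part of the function's type, so it adds nothing for callers when written in a declaration. The sketch below checks this at compile time; FileHandleFactory is an empty stand-in and the two function names are hypothetical.

```cpp
#include <cassert>
#include <type_traits>

// Empty stand-in for the real class.
struct FileHandleFactory {};

// Hypothetical declarations that differ only in the top-level const.
void withConst(FileHandleFactory* const fileHandleFactory);
void withoutConst(FileHandleFactory* fileHandleFactory);

// Top-level cv-qualifiers on parameters are dropped from the function type,
// so both declarations have the type void(FileHandleFactory*).
constexpr bool kSameSignature =
    std::is_same_v<decltype(withConst), decltype(withoutConst)>;

static_assert(kSameSignature, "top-level const does not change the signature");
```

The const still matters inside a function definition, where it prevents reassigning the parameter, which is why it is sometimes kept in the .cpp file only.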
velox/connectors/hive/SplitReader.h
Outdated
        partitionKeys,
    FileHandleFactory* fileHandleFactory,
    const std::shared_ptr<io::IoStatistics>& ioStats,
    FileHandleFactory* const fileHandleFactory,
ditto
done
velox/connectors/hive/SplitReader.h
Outdated
        partitionKeys,
    FileHandleFactory* fileHandleFactory,
    const std::shared_ptr<io::IoStatistics>& ioStats,
    FileHandleFactory* const fileHandleFactory,
ditto
done
  if (baseReader_->numberOfRows() == 0) {
    emptySplit_ = true;
    return;
bool SplitReader::splitIsEmpty(dwio::common::RuntimeStatistics& runtimeStats) {
Shall we call this checkIfSplitEmpty, since we update emptySplit_ in this function and SplitReader already has an emptySplit() method that returns emptySplit_? Thanks!
Ok, renamed to checkIfSplitIsEmpty()
  createReader();

  if (splitIsEmpty(runtimeStats)) {
    return;
nit: VELOX_CHECK(emptySplit_);
done
    const std::shared_ptr<const HiveConfig>& hiveConfig,
    const RowTypePtr& readerOutputType,
    const std::shared_ptr<io::IoStatistics>& ioStats,
    FileHandleFactory* const fileHandleFactory,
ditto
done
  createReader();

  if (splitIsEmpty(runtimeStats)) {
    return;
ditto
done
1f88646 to 411c788
To prepare for the upcoming equality delete file read, we need to refactor the SplitReader and break prepareSplit() into several parts. This commit does so and also includes a couple of other code cleanups.
This commit reorders the function implementations in SplitReader.cpp to match the order of their declarations in SplitReader.h, which are organized by access level and call order.
Reorder SplitReader class fields
@Yuhta @xiaoxmeng All comments addressed, will you please reimport? Many thanks!
@Yuhta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Conbench analyzed the 1 benchmark run on commit There were no benchmark performance regressions. 🎉 The full Conbench report has more details.
Summary: To prepare for the upcoming equality delete file read, we need to refactor the SplitReader a bit. Pull Request resolved: facebookincubator#8995 Reviewed By: xiaoxmeng Differential Revision: D55072998 Pulled By: Yuhta fbshipit-source-id: 662459041b947d51ffaa98b57a50e4ebdd5b36e3
This reverts commit b5ea2d7.
Summary: To prepare for the upcoming equality delete file read, we need to refactor the SplitReader a bit. Pull Request resolved: facebookincubator#8995 Reviewed By: xiaoxmeng Differential Revision: D55072998 Pulled By: Yuhta fbshipit-source-id: 662459041b947d51ffaa98b57a50e4ebdd5b36e3
To prepare for the upcoming equality delete file read, we need to refactor the SplitReader a bit.