SplitReader refactor #8995

Closed
wants to merge 4 commits into from

Conversation

yingsu00
Collaborator

@yingsu00 yingsu00 commented Mar 7, 2024

To prepare for the upcoming equality delete file read, we need to refactor the SplitReader a bit.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 7, 2024

netlify bot commented Mar 7, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit db8e894
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/660bfb7d954a0e0007f965c3

@yingsu00 yingsu00 force-pushed the refactor branch 6 times, most recently from 67dc076 to 227571f Compare March 12, 2024 10:42
@yingsu00 yingsu00 marked this pull request as ready for review March 12, 2024 15:38
@yingsu00 yingsu00 requested a review from Yuhta March 12, 2024 15:39
@mbasmanova mbasmanova requested a review from xiaoxmeng March 13, 2024 07:16
@@ -93,7 +94,7 @@ class SplitReader {
std::shared_ptr<common::MetadataFilter> metadataFilter,
dwio::common::RuntimeStatistics& runtimeStats);

-  virtual uint64_t next(int64_t size, VectorPtr& output);
+  virtual uint64_t next(uint64_t size, VectorPtr& output);
Contributor

Leave it as signed for safer comparison

Collaborator Author

@yingsu00 yingsu00 Mar 16, 2024

> Leave it as signed for safer comparison

@Yuhta HiveDataSource::next(uint64_t size, velox::ContinueFuture& /*future*/) and Reader::next(uint64_t size, velox::VectorPtr& result, ..) both use uint64_t, and size is passed directly to downstream functions and never changed. The uint64_t -> int64_t -> uint64_t conversion may cause problems instead. Shall we align all three function signatures?

In fact, I should have used uint64_t in SplitReader::next() when I created it, but I don't remember why I made it int64_t. I now realize that was not desirable, hence this fix.

Contributor

OK let's use uint64_t to keep it consistent with data source and file readers in this case then. In general for new interfaces we should prefer signed types for any quantity variable, because it's much safer with the combination of subtraction and comparison
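
The point about subtraction combined with comparison can be illustrated with a small sketch. The helper names below are hypothetical and not part of Velox; they only show how an unsigned difference wraps around where a signed one goes negative.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: is there room for `needed` more rows?
bool hasRoomUnsigned(uint64_t capacity, uint64_t used, uint64_t needed) {
  // If used > capacity, the subtraction wraps around to a huge value, so
  // this comparison passes when it should fail.
  return capacity - used >= needed;
}

bool hasRoomSigned(int64_t capacity, int64_t used, int64_t needed) {
  // The difference can go negative, so the comparison fails as intended.
  return capacity - used >= needed;
}

void demoSignedVsUnsigned() {
  assert(hasRoomUnsigned(10, 12, 1));  // wrapped: wrongly reports room
  assert(!hasRoomSigned(10, 12, 1));   // negative difference: no room
}
```

In this PR the size is only forwarded unchanged to readers that already take uint64_t, so no subtraction is involved and matching the existing signatures avoids a round trip through int64_t.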

@yingsu00
Collaborator Author

@Yuhta I have addressed all comments except the int64_t -> uint64_t one, which is waiting for your reply. Can you please review again? Thanks!

@facebook-github-bot
Contributor

@Yuhta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Contributor

@xiaoxmeng xiaoxmeng left a comment

@yingsu00 thanks for the refactor % some minors.

@@ -141,6 +141,81 @@ void SplitReader::configureReaderOptions(
void SplitReader::prepareSplit(
std::shared_ptr<common::MetadataFilter> metadataFilter,
dwio::common::RuntimeStatistics& runtimeStats) {
createReader();

if (testEmptySplit(runtimeStats)) {
Contributor

Is it good or reliable to do empty check based on runtime stats?

s/testEmptySplit/checkIfEmptySplit/

Collaborator Author

@yingsu00 yingsu00 Mar 20, 2024

@xiaoxmeng The check is not BASED on runtimeStats; it only needs to update runtimeStats if the split can be skipped.
Regarding the name, would isEmptySplit, hasEmptySplit, splitIsEmpty, or isSplitEmpty be better? With these names it is clear that returning true means the split is empty.


// Note that this doesn't apply to Hudi tables.
bool SplitReader::testEmptySplit(
dwio::common::RuntimeStatistics& runtimeStats) {
emptySplit_ = false;
Contributor

createReader might have already set emptySplit_ to true; do we want to revert that? Also, do we support reusing the same SplitReader on the same split? If not, let's add more sanity checks here. Thanks!

Collaborator Author

@xiaoxmeng Thanks for catching this. One SplitReader object is created for each split, and prepareSplit() is supposed to be called once only. I have made the following change here:

bool SplitReader::splitIsEmpty(dwio::common::RuntimeStatistics& runtimeStats) {
  // emptySplit_ may already be set if the data file is not found. In this case
  // we don't need to test further.
  if (emptySplit_) {
    return true;
  }
...

return DataSource::kUnknownRowSize;
}

auto size = baseRowReader_->estimatedRowSize();
Contributor

nit:

const auto sizeOr = baseRowReader_->estimatedRowSize();
return sizeOr.valueOr(DataSource::kUnknownRowSize);
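
A minimal sketch of the suggested shape, assuming the estimate comes back as a std::optional (whose member is spelled value_or rather than valueOr). FakeRowReader and the constant's value are stand-ins for illustration, not the Velox types.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Stand-in for DataSource::kUnknownRowSize; the value is an assumption.
constexpr int64_t kUnknownRowSize = -1;

// Hypothetical row reader whose size estimate may be unavailable.
struct FakeRowReader {
  std::optional<int64_t> estimate;
  std::optional<int64_t> estimatedRowSize() const { return estimate; }
};

int64_t estimatedRowSize(const FakeRowReader* reader) {
  if (reader == nullptr) {
    return kUnknownRowSize;
  }
  // Collapse "no estimate" into the sentinel in a single expression.
  const auto sizeOr = reader->estimatedRowSize();
  return sizeOr.value_or(kUnknownRowSize);
}

void demoEstimatedRowSize() {
  FakeRowReader known{128};
  assert(estimatedRowSize(&known) == 128);
  assert(estimatedRowSize(nullptr) == kUnknownRowSize);
}
```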

Contributor

Would you update this? Thanks!

Collaborator Author

done

const std::shared_ptr<velox::connector::hive::HiveConnectorSplit>&
hiveSplit,
const std::shared_ptr<HiveTableHandle>& hiveTableHandle,
const std::shared_ptr<common::ScanSpec>& scanSpec,
const RowTypePtr readerOutputType,
Contributor

nit: const RowTypePtr&
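
The reasoning behind this nit, as far as it can be inferred, is that a shared_ptr taken by value is copied on every call, which bumps the reference count. The sketch below uses a local stand-in for RowType and hypothetical function names to make the difference observable.

```cpp
#include <cassert>
#include <memory>

struct RowType {};  // stand-in for the real type
using RowTypePtr = std::shared_ptr<const RowType>;

// By value: the parameter is a copy, so the count is higher during the call.
long useCountByValue(const RowTypePtr type) {
  return type.use_count();
}

// By const reference: no copy is made, so the count is unchanged.
long useCountByRef(const RowTypePtr& type) {
  return type.use_count();
}

void demoPassing() {
  auto type = std::make_shared<const RowType>();
  assert(useCountByValue(type) == 2);
  assert(useCountByRef(type) == 1);
}
```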

@yingsu00 yingsu00 force-pushed the refactor branch 2 times, most recently from e0040c1 to cbbfa10 Compare March 21, 2024 13:41
Contributor

@xiaoxmeng xiaoxmeng left a comment

@yingsu00 LGTM % nits. Thanks for the update!

const std::shared_ptr<const HiveConfig>& hiveConfig,
const RowTypePtr& readerOutputType,
const std::shared_ptr<io::IoStatistics>& ioStats,
FileHandleFactory* const fileHandleFactory,
Contributor

s/FileHandleFactory* const fileHandleFactory/FileHandleFactory* fileHandleFactory/
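
One likely motivation, stated here as an assumption about the reviewer's intent, is that a top-level const on a by-value parameter is not part of the function's type, so it adds nothing to a declaration. The sketch below checks this at compile time with a stand-in FileHandleFactory and hypothetical function names.

```cpp
#include <cassert>
#include <type_traits>

struct FileHandleFactory {};  // stand-in for the real class

void withConst(FileHandleFactory* const factory);
void withoutConst(FileHandleFactory* factory);
void withConstPointee(const FileHandleFactory* factory);

// Top-level const on the pointer itself is dropped from the function type.
constexpr bool kSameSignature =
    std::is_same<decltype(withConst), decltype(withoutConst)>::value;
static_assert(kSameSignature, "top-level const is not part of the signature");

// Const on the pointee, by contrast, does change the function type.
constexpr bool kPointeeDiffers =
    !std::is_same<decltype(withConstPointee), decltype(withoutConst)>::value;
static_assert(kPointeeDiffers, "const on the pointee changes the signature");

void checkSignatures() {
  assert(kSameSignature);
  assert(kPointeeDiffers);
}
```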

partitionKeys,
FileHandleFactory* fileHandleFactory,
const std::shared_ptr<io::IoStatistics>& ioStats,
FileHandleFactory* const fileHandleFactory,
Contributor

ditto

Collaborator Author

done

partitionKeys,
FileHandleFactory* fileHandleFactory,
const std::shared_ptr<io::IoStatistics>& ioStats,
FileHandleFactory* const fileHandleFactory,
Contributor

ditto

Collaborator Author

done

return DataSource::kUnknownRowSize;
}

auto size = baseRowReader_->estimatedRowSize();
Contributor

Would you update this? Thanks!

if (baseReader_->numberOfRows() == 0) {
emptySplit_ = true;
return;
bool SplitReader::splitIsEmpty(dwio::common::RuntimeStatistics& runtimeStats) {
Contributor

@xiaoxmeng xiaoxmeng Mar 24, 2024

Shall we call this checkIfSplitEmpty, since this function updates emptySplit_ and SplitReader already has an emptySplit() method that returns emptySplit_? Thanks!

Collaborator Author

Ok, renamed to checkIfSplitIsEmpty()
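
The naming distinction discussed in this thread can be sketched as follows: a side-effect-free accessor next to a check that may update the cached flag and the stats. MiniSplitReader and the skippedSplits counter are hypothetical simplifications, not the Velox class or its RuntimeStatistics.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-in for SplitReader.
class MiniSplitReader {
 public:
  MiniSplitReader(bool fileFound, uint64_t numRows)
      : numRows_(numRows), emptySplit_(!fileFound) {}

  // Pure accessor: reports the cached flag without side effects.
  bool emptySplit() const {
    return emptySplit_;
  }

  // Mutating check: may set emptySplit_ and bump the skipped counter.
  bool checkIfSplitIsEmpty(int64_t& skippedSplits) {
    if (emptySplit_) {
      return true;  // already decided, e.g. the data file was not found
    }
    if (numRows_ == 0) {
      emptySplit_ = true;
      ++skippedSplits;
    }
    return emptySplit_;
  }

 private:
  uint64_t numRows_;
  bool emptySplit_;
};

void demoEmptySplit() {
  int64_t skipped = 0;
  MiniSplitReader reader(/*fileFound=*/true, /*numRows=*/0);
  assert(!reader.emptySplit());
  assert(reader.checkIfSplitIsEmpty(skipped));
  assert(reader.emptySplit() && skipped == 1);
}
```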

createReader();

if (splitIsEmpty(runtimeStats)) {
return;
Contributor

nit:  VELOX_CHECK(emptySplit_);

Collaborator Author

done

const std::shared_ptr<const HiveConfig>& hiveConfig,
const RowTypePtr& readerOutputType,
const std::shared_ptr<io::IoStatistics>& ioStats,
FileHandleFactory* const fileHandleFactory,
Contributor

ditto

Collaborator Author

done

createReader();

if (splitIsEmpty(runtimeStats)) {
return;
Contributor

ditto

Collaborator Author

done

@yingsu00 yingsu00 force-pushed the refactor branch 5 times, most recently from 1f88646 to 411c788 Compare April 2, 2024 10:10
yingsu00 added 4 commits April 2, 2024 20:34
To prepare for the upcoming equality delete file read, we need to
refactor the SplitReader and break prepareSplit() into several parts.
This commit does this, and also includes a couple of other code cleanups.
This commit reorders the function implementations in SplitReader.cpp to
make it align with the order of their declarations in SplitReader.h,
which is organized with their access levels and call orders.
Reorder SplitReader class fields
@yingsu00
Collaborator Author

yingsu00 commented Apr 2, 2024

@Yuhta @xiaoxmeng All comments addressed, will you please reimport? Many thanks!

@facebook-github-bot
Contributor

@Yuhta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@Yuhta merged this pull request in b5ea2d7.


Conbench analyzed the 1 benchmark run on commit b5ea2d72.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

yanngyoung pushed a commit to yanngyoung/velox that referenced this pull request Apr 12, 2024
Summary:
To prepare for the upcoming equality delete file read, we need to refactor the SplitReader a bit.

Pull Request resolved: facebookincubator#8995

Reviewed By: xiaoxmeng

Differential Revision: D55072998

Pulled By: Yuhta

fbshipit-source-id: 662459041b947d51ffaa98b57a50e4ebdd5b36e3
marin-ma added a commit to marin-ma/velox-oap that referenced this pull request Apr 15, 2024
Joe-Abraham pushed a commit to Joe-Abraham/velox that referenced this pull request Jun 7, 2024
Summary:
To prepare for the upcoming equality delete file read, we need to refactor the SplitReader a bit.

Pull Request resolved: facebookincubator#8995

Reviewed By: xiaoxmeng

Differential Revision: D55072998

Pulled By: Yuhta

fbshipit-source-id: 662459041b947d51ffaa98b57a50e4ebdd5b36e3