Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for writing iceberg tables #10996

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

imjalpreet
Copy link
Contributor

Introduce IcebergPageSink

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Sep 13, 2024
Copy link

netlify bot commented Sep 13, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 967c556
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/6740142bc231ce000847c953


virtual ~IcebergInsertTableHandle() = default;

std::shared_ptr<const VeloxIcebergSchema> schema() const {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is this used?

@Yuhta Yuhta requested a review from xiaoxmeng September 16, 2024 18:07
#include "velox/connectors/hive/iceberg/IcebergDataSink.h"

#include <utility>
#include "velox/common/base/Counters.h"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are lots of headers unused. Please remove.

velox/connectors/hive/iceberg/IcebergDataSink.cpp Outdated Show resolved Hide resolved
Introduce IcebergPageSink
velox/connectors/hive/iceberg/IcebergDataSink.cpp Outdated Show resolved Hide resolved
velox/connectors/hive/HiveDataSink.cpp Show resolved Hide resolved
velox/connectors/hive/HiveDataSink.cpp Outdated Show resolved Hide resolved
return;
}

dataChannels_ = getDataChannels(partitionChannels_, inputType_->size());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to use insertTableHandle number of channels. partitionChannels_ come from it. I'm not sure inputType_ size is always the same as insertTableHandle's.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used inputType_ since we are already using it in HiveDataSink, I can verify how inputType_ is derived and whether it will always be the same as insertTableHandle columns.

const std::vector<column_index_t>& partitionChannels,
const column_index_t childrenSize) const {
// Create a vector of all possible channels
std::vector<column_index_t> dataChannels(childrenSize);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to add dataChannels_ field to IcebergDataSink than allocating a vector every time appendData() is called.

@@ -808,7 +786,7 @@ HiveWriterId HiveDataSink::getWriterId(size_t row) const {
return HiveWriterId{partitionId, bucketId};
}

void HiveDataSink::splitInputRowsAndEnsureWriters() {
void HiveDataSink::splitInputRowsAndEnsureWriters(RowVectorPtr input) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment out input (RowVectorPtr /* input */) {

insertTableHandle_);

for (auto i = 0; i < partitionChannels_.size(); ++i) {
auto type =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

type not used

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I pushed it by mistake in this commit, it was part of the changes for writing tables with partition transforms. I will remove it from this commit.

icebergInsertTableHandle->inputColumns()[partitionChannels_[i]]
->dataType();
auto block = input->childAt(partitionChannels_[i]);
partitionValues.insert(partitionValues.begin() + i, block->toString(row));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

partitionValues[i] = block->toString(row);

partitionValues.insert(partitionValues.begin() + i, block->toString(row));
}

partitionData_.insert(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

partitionData_[index] = std::make_shared(partitionValues);


void IcebergDataSink::extendBuffersForPartitionedTables() {
// Extends the buffer used for partition rows calculations.
partitionSizes_.emplace_back(0);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replace the 3 lines to call HiveDataSink::extendBuffersForPartitionedTables()

@imjalpreet imjalpreet force-pushed the icebergWriter branch 2 times, most recently from f3565a6 to e3aa199 Compare November 14, 2024 23:18
};

class PartitionData {
private:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move the private fields to the end.


#include "velox/common/base/Fs.h"
#include "velox/connectors/hive/TableHandle.h"
#include "velox/exec/OperatorUtils.h"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not used

@zhouyuan
Copy link
Contributor

FWIW, there's a new project to add C++ support for iceberg:
apache/iceberg-cpp#2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants