[iceberg] [native] Add support for V1 tables #21584

Merged
yingsu00 merged 4 commits into prestodb:master from the icebergPrestissimoChanges branch on Feb 19, 2024

Conversation

imjalpreet
Member

@imjalpreet imjalpreet commented Dec 20, 2023

Description

This PR includes the changes required to convert the Presto Iceberg plan and splits into the Velox plan and splits.

  1. Support converting Presto Iceberg Plan to Velox Plan
  2. Support converting Presto Iceberg Split to Velox Split
  3. Register the iceberg catalog as part of Hive Connectors

Motivation and Context

facebookincubator/velox#5977

Impact

With these changes, we will be able to read iceberg v1 tables from Prestissimo

Test Plan

Successfully ran all TPC-H queries with iceberg tables (both partitioned and non-partitioned)

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

General Changes
* Introduce Iceberg Connector in Prestissimo

@imjalpreet imjalpreet requested a review from yingsu00 December 20, 2023 23:41
@imjalpreet imjalpreet self-assigned this Dec 20, 2023
@imjalpreet imjalpreet requested review from a team as code owners December 20, 2023 23:41
@imjalpreet imjalpreet marked this pull request as draft December 20, 2023 23:41
@yingsu00 yingsu00 added the iceberg Apache Iceberg related label Dec 21, 2023
Contributor

@aditi-pandit aditi-pandit left a comment

Thanks @imjalpreet.

Would you be able to split out the protocol changes from the plan conversion changes in this PR ? That would be easier to review and maintain.

@imjalpreet
Member Author

Would you be able to split out the protocol changes from the plan conversion changes in this PR ? That would be easier to review and maintain.

@aditi-pandit can you let me know how you want to split the changes? Did you mean in a separate PR? In this PR I had kept separate commits for the protocol and Presto to Velox plan conversion changes.

@aditi-pandit
Contributor

Would you be able to split out the protocol changes from the plan conversion changes in this PR ? That would be easier to review and maintain.

@aditi-pandit can you let me know how you want to split the changes? Did you mean in a separate PR? In this PR I had kept separate commits for the protocol and Presto to Velox plan conversion changes.

@imjalpreet : Yeah, it would be great to have a single PR with just the changes you have in the commit "Add Iceberg Connector to parse Java Iceberg Plan Fragment". Once that is checked in, it's easier to look at how the protocol pieces are mapped into Prestissimo.

@majetideepak
Collaborator

Please separate the Prestissimo bits into their own PR.

@imjalpreet imjalpreet force-pushed the icebergPrestissimoChanges branch 2 times, most recently from 62163c3 to 70146b3 Compare January 24, 2024 11:11
@imjalpreet imjalpreet marked this pull request as ready for review January 24, 2024 12:22
@@ -852,6 +853,11 @@ std::vector<std::string> PrestoServer::registerConnectors(
      PRESTO_STARTUP_LOG(INFO) << "Registering catalog " << catalogName
                               << " using connector " << connectorName;

      // TODO: Temporary Change to use HiveDataSource for Iceberg Connector
      if (connectorName == "iceberg") {
Contributor

Can this really happen? Is the connectorName not from presto-native-execution/etc/catalog? There is no iceberg.properties there and in hive.properties there is only connector.name=hive

Member Author

Yes, we would need to add the iceberg catalog properties file in presto-native-execution/etc/catalog if someone wants to read iceberg tables

Member Author

If we want to add a default file, I can push one.

Contributor

@imjalpreet yes please add one

Contributor

BTW, I think we should set the connector name to hive directly in iceberg.properties. This catalog properties file is for Presto Native, which has no Iceberg connector, only the Hive connector.

connector.name=hive
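
To make the thread above concrete, here is a minimal sketch of how a `connector.name` entry could be looked up in a catalog file, assuming the file is plain `key=value` text. `parseProperties` and `connectorNameOf` are hypothetical helpers written for illustration; they are not Prestissimo functions, and the real loader may treat whitespace and escapes differently.

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper: parses "key=value" lines, skipping blank lines and
// '#' comments. Whitespace is not trimmed in this sketch.
std::map<std::string, std::string> parseProperties(const std::string& text) {
  std::map<std::string, std::string> props;
  std::istringstream in(text);
  std::string line;
  while (std::getline(in, line)) {
    if (line.empty() || line[0] == '#') {
      continue;
    }
    const auto eq = line.find('=');
    if (eq == std::string::npos) {
      continue; // Not a key=value line; ignored here.
    }
    props[line.substr(0, eq)] = line.substr(eq + 1);
  }
  return props;
}

// Hypothetical helper: returns the connector.name value, or an empty string
// when the property is absent.
std::string connectorNameOf(const std::string& catalogFileText) {
  const auto props = parseProperties(catalogFileText);
  const auto it = props.find("connector.name");
  return it == props.end() ? std::string() : it->second;
}
```

With an `iceberg.properties` file containing `connector.name=hive`, this lookup yields `hive`, so the catalog named "iceberg" would be served by the Hive connector.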

@@ -142,6 +142,18 @@ std::shared_ptr<connector::ColumnHandle> toColumnHandle(
        toRequiredSubfields(hiveColumn->requiredSubfields));
  }

  if (auto icebergColumn =
          dynamic_cast<const protocol::IcebergColumnHandle*>(column)) {
    // TODO(imjalpreet): Modify 'hiveType' argument of the 'HiveColumnHandle'
Contributor

What needs to be done on hiveType?

Member Author

In the case of the hive connector, HiveColumnHandle(Presto) also has a parameter of type HiveType which is being used while creating the HiveColumnHandle velox object. We don't have the same in IcebergColumnHandle yet.

The prestissimo change was added as part of this commit in the case of hive connector: 6570b00
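
The gap described above can be sketched as follows. All types here are hypothetical stand-ins, not the real `protocol::` or Velox classes. The sketch only shows the `dynamic_cast` dispatch and the missing `hiveType` on the Iceberg side. Reusing the data type as a placeholder for `hiveType` is an assumption made for illustration, not necessarily what the PR does.

```cpp
#include <memory>
#include <string>

// Hypothetical stand-ins for the Presto-side (protocol) column handles.
struct ProtocolColumnHandle {
  virtual ~ProtocolColumnHandle() = default;
};
struct ProtocolHiveColumn : ProtocolColumnHandle {
  std::string name;
  std::string type;     // Presto type
  std::string hiveType; // Only the Hive handle carries this.
};
struct ProtocolIcebergColumn : ProtocolColumnHandle {
  std::string name;
  std::string type;     // No hiveType field on the Iceberg handle.
};

// Hypothetical stand-in for the Velox-side Hive column handle.
struct VeloxHiveColumn {
  std::string name;
  std::string dataType;
  std::string hiveType;
};

std::shared_ptr<VeloxHiveColumn> toColumnHandleSketch(
    const ProtocolColumnHandle* column) {
  if (auto hive = dynamic_cast<const ProtocolHiveColumn*>(column)) {
    return std::make_shared<VeloxHiveColumn>(
        VeloxHiveColumn{hive->name, hive->type, hive->hiveType});
  }
  if (auto iceberg = dynamic_cast<const ProtocolIcebergColumn*>(column)) {
    // Assumption: with no hiveType available, reuse the data type as a
    // placeholder until the Iceberg handle can supply one.
    return std::make_shared<VeloxHiveColumn>(
        VeloxHiveColumn{iceberg->name, iceberg->type, iceberg->type});
  }
  return nullptr; // Unknown handle type in this sketch.
}
```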

@imjalpreet imjalpreet force-pushed the icebergPrestissimoChanges branch 2 times, most recently from f0ff681 to 2e229ac Compare February 5, 2024 10:32
@imjalpreet imjalpreet force-pushed the icebergPrestissimoChanges branch 8 times, most recently from 5f71cfe to 51050fc Compare February 14, 2024 07:52
@majetideepak
Collaborator

@imjalpreet is this ready for review?

@majetideepak
Collaborator

Can you update the description?

@imjalpreet
Member Author

@majetideepak Yes, this is ready for review. I have updated the description as well.

}
std::unordered_map<std::string, std::string> customSplitInfo;

std::shared_ptr<std::string> extraFileInfo;
Contributor

remove


std::shared_ptr<std::string> extraFileInfo;

std::optional<int> tableBucketNumber;
Contributor

remove

partitionKeys,
tableBucketNumber,
customSplitInfo,
extraFileInfo),
Contributor

just change to nullptr

icebergSplit->start,
icebergSplit->length,
partitionKeys,
tableBucketNumber,
Contributor

nullopt
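
The four review comments above amount to dropping the unused locals and passing literals at the call site. A minimal sketch of the resulting shape follows, with hypothetical stand-in structs rather than the real Presto protocol split or Velox Hive split classes. The field names follow the snippets in the diff; the struct layouts are assumptions.

```cpp
#include <cstdint>
#include <memory>
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the Presto-side Iceberg split.
struct IcebergSplitSketch {
  std::string path;
  uint64_t start;
  uint64_t length;
  std::unordered_map<std::string, std::optional<std::string>> partitionKeys;
};

// Hypothetical stand-in for the Velox-side Hive split.
struct HiveSplitSketch {
  std::string filePath;
  uint64_t start;
  uint64_t length;
  std::unordered_map<std::string, std::optional<std::string>> partitionKeys;
  std::optional<int> tableBucketNumber;
  std::unordered_map<std::string, std::string> customSplitInfo;
  std::shared_ptr<std::string> extraFileInfo;
};

HiveSplitSketch toHiveSplitSketch(const IcebergSplitSketch& icebergSplit) {
  return HiveSplitSketch{
      icebergSplit.path,
      icebergSplit.start,
      icebergSplit.length,
      icebergSplit.partitionKeys,
      std::nullopt, // tableBucketNumber: literal instead of an unset local
      {},           // customSplitInfo: empty in this sketch
      nullptr};     // extraFileInfo: literal instead of a null local
}
```

Passing `std::nullopt` and `nullptr` directly makes it visible at the call site that the Iceberg path does not populate these fields.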

Use Presto IcebergColumnHandle to create Velox HiveColumnHandle
Use Presto IcebergTableLayoutHandle/IcebergTableHandle to create Velox HiveTableHandle
@imjalpreet imjalpreet force-pushed the icebergPrestissimoChanges branch from 51050fc to f9dfb73 Compare February 15, 2024 08:28
Contributor

@yingsu00 yingsu00 left a comment

@tdcmeehan @majetideepak @aditi-pandit @mbasmanova Do you want to review this PR again?

@majetideepak
Collaborator

Successfully ran all TPC-H queries with iceberg tables (both partitioned and non-partitioned)

@imjalpreet, @yingsu00 can you share some details of the tests that were run? What scale factor was used? How many nodes? Was there any comparison made with the Hive Connector?

@majetideepak majetideepak changed the title [iceberg] [native] Introduce Iceberg Connector in Prestissimo [iceberg] [native] Add support for Iceberg Connector in Prestissimo Feb 15, 2024
@majetideepak
Collaborator

I believe only Copy-on-Write tables are supported, correct? Can you clarify the scope?

@imjalpreet imjalpreet force-pushed the icebergPrestissimoChanges branch from f9dfb73 to 1ec8e47 Compare February 15, 2024 18:59
@imjalpreet
Member Author

imjalpreet commented Feb 15, 2024

can you share some details of the tests that were run? What scale factor was used? How many nodes? Was there any comparison made with the Hive Connector?

@majetideepak Currently, we have only done correctness verification with smaller scale factors on local machines. We are generating the datasets to test with sf1k and sf10k. If you would like to see a performance comparison at a small scale factor, I can share that, or we can wait for the sf1k and sf10k tests.

I believe only Copy-on-Write tables are supported, correct? Can you clarify the scope?

Yes. This PR enables Prestissimo to read Iceberg v1 tables only. Merge-on-read tables exist only in Iceberg v2, which is not fully supported yet, so they are not covered by this PR.

@majetideepak majetideepak changed the title [iceberg] [native] Add support for Iceberg Connector in Prestissimo [iceberg] [native] Add support for Iceberg Connector (V1) in Prestissimo Feb 16, 2024
@majetideepak majetideepak changed the title [iceberg] [native] Add support for Iceberg Connector (V1) in Prestissimo [iceberg] [native] Add support for Iceberg Connector (V1) Feb 16, 2024
@majetideepak majetideepak changed the title [iceberg] [native] Add support for Iceberg Connector (V1) [iceberg] [native] Add support for Iceberg Connector V1 Feb 16, 2024
@yingsu00
Contributor

@imjalpreet, @yingsu00 can you share some details of the tests that were run? What scale factor was used? How many nodes? Was there any comparison made with the Hive Connector?

@majetideepak Reetika has a branch with the TPCH/DS tiny e2e tests here https://github.com/agrawalreetika/prestodb/commits/icebergPrestissimo-tests-tpch-v1/ . Since it depends on this PR, we will submit a PR after this one is merged.

@majetideepak majetideepak changed the title [iceberg] [native] Add support for Iceberg Connector V1 [iceberg] [native] Add support for V1 tables Feb 16, 2024
@@ -0,0 +1,4 @@
# The Presto "iceberg" catalog is handled by the hive connector in Presto native execution.
connector.name=hive
Collaborator

I think we can add a mapping from iceberg name to hive on the Velox side. Similar to how hive-hadoop2 is mapped. Not a blocker for now.

Collaborator

Created a PR here facebookincubator/velox#8765
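
The alias idea from this thread could look roughly like the sketch below. `resolveConnectorName` and its alias table are hypothetical and written only for illustration; the referenced Velox PR may implement the mapping differently.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical helper: maps connector-name aliases to the connector that
// serves them. Unknown names are returned unchanged.
std::string resolveConnectorName(const std::string& connectorName) {
  static const std::unordered_map<std::string, std::string> kAliases = {
      {"hive-hadoop2", "hive"},
      {"iceberg", "hive"},
  };
  const auto it = kAliases.find(connectorName);
  return it == kAliases.end() ? connectorName : it->second;
}
```

With such a mapping in place, a catalog file could declare `connector.name=iceberg` and still be routed to the Hive connector, instead of having to declare `connector.name=hive`.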

@imjalpreet imjalpreet force-pushed the icebergPrestissimoChanges branch from 1ec8e47 to 7aa5ab1 Compare February 16, 2024 08:06
Collaborator

@majetideepak majetideepak left a comment

Thanks, @imjalpreet

@yingsu00 yingsu00 merged commit e5ba4bb into prestodb:master Feb 19, 2024
59 checks passed
@tdcmeehan
Contributor

Just in case: if this doesn't fail hard on V2 tables, can we add a small followup to do that until V2 support is finished?

@imjalpreet
Member Author

Just in case: if this doesn't fail hard on V2 tables, can we add a small followup to do that until V2 support is finished?

@tdcmeehan Yes, sure. The follow-up PR for V2 support should also be ready soon. But I will raise the above fail-fast change in the meantime.
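
A fail-fast guard of the kind requested here might be as small as the sketch below. `checkSupportedFormatVersion` is a hypothetical name, and where the format version would be read from (table metadata, the split, or the table handle) is not specified in this thread, so it is taken as a plain parameter.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical guard: rejects any Iceberg format version other than 1 so that
// V2 tables fail with a clear error instead of being read incorrectly.
void checkSupportedFormatVersion(int formatVersion) {
  if (formatVersion != 1) {
    throw std::runtime_error(
        "Iceberg table format version " + std::to_string(formatVersion) +
        " is not supported; only version 1 tables can be read");
  }
}
```

Calling the guard before any data is scanned keeps the failure early and the message specific.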

@wanglinsong wanglinsong mentioned this pull request May 1, 2024
48 tasks
Labels
iceberg Apache Iceberg related
Projects
Status: Done

7 participants