[native] Add support for Iceberg connector in presto_protocol #21765

Merged

Conversation

imjalpreet
Member

@imjalpreet imjalpreet commented Jan 24, 2024

Description

Add Java classes from the Iceberg connector in presto_protocol.yml file.
Specialize IcebergColumnHandle to support 'operator<()'.
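
A minimal sketch of why a protocol struct might need `operator<()`, assuming the handle is used as a key in an ordered container such as `std::map`. `ColumnHandleSketch`, its fields, and `countDistinctHandles` are hypothetical names for illustration only; the real IcebergColumnHandle has different members and its comparison may be defined differently.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <tuple>

// Hypothetical stand-in for a protocol struct; not the generated class.
struct ColumnHandleSketch {
  int id;
  std::string name;

  // Strict weak ordering over (id, name), which std::map requires of its keys.
  bool operator<(const ColumnHandleSketch& other) const {
    return std::tie(id, name) < std::tie(other.id, other.name);
  }
};

// Inserts three handles (two of them equal) and returns the number of
// distinct keys the map ends up holding.
inline std::size_t countDistinctHandles() {
  const ColumnHandleSketch orderKey{1, "orderkey"};
  const ColumnHandleSketch comment{2, "comment"};
  assert(!(orderKey < orderKey)); // irreflexive: a key is never less than itself

  std::map<ColumnHandleSketch, std::string> types;
  types[orderKey] = "bigint";
  types[comment] = "varchar";
  types[ColumnHandleSketch{1, "orderkey"}] = "bigint"; // same key, no new entry
  return types.size();
}
```

Without a usable ordering, the struct could not serve as a key in an ordered map at all, which is presumably why a hand-written specialization is kept for this one class.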

Depends on #21764

Motivation and Context

facebookincubator/velox#5977

Impact

With these changes, we will be able to read Iceberg tables from Prestissimo.

Test Plan

Successfully ran all TPC-H queries with Iceberg tables (both partitioned and non-partitioned).

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

General Changes
* Add support for Iceberg connector in presto_protocol

@imjalpreet imjalpreet self-assigned this Jan 24, 2024
@imjalpreet imjalpreet requested review from a team as code owners January 24, 2024 10:15
@imjalpreet imjalpreet requested a review from presto-oss January 24, 2024 10:15
@imjalpreet imjalpreet force-pushed the icebergPrestissimoProtocolChanges branch from 5206127 to f08f85c Compare January 24, 2024 10:46
@yingsu00 yingsu00 added the iceberg Apache Iceberg related label Jan 25, 2024
@majetideepak
Collaborator

@imjalpreet is this ready for review?

@imjalpreet
Member Author

@majetideepak Yes, the changes are all in this PR, but I need to rebase on master and make some minor updates based on new changes added there. I was facing a build error locally, most likely related to #21797, so I haven't pushed the rebased version of this branch yet.

Once I verify the changes after a successful build, I will push them as well. You can still review the PR since it has almost all the changes, or we can wait for the rebased version, which has minor additions.

@majetideepak
Collaborator

Why do we need so many special changes? Are we not able to translate the Java source files directly?
We should avoid adding to special as they are harder to maintain.

@@ -852,6 +853,11 @@ std::vector<std::string> PrestoServer::registerConnectors(
  PRESTO_STARTUP_LOG(INFO) << "Registering catalog " << catalogName
                           << " using connector " << connectorName;

  // TODO: Temporary Change to use HiveDataSource for Iceberg Connector
Contributor

Moving the discussion in #21584 to here:

Add an iceberg.properties in presto-native-execution/etc/catalog with the following

# The Presto "iceberg" catalog is handled by the hive connector in Presto native execution.
connector.name=hive

cache.enabled=true

Member Author

@yingsu00 I have added the catalog file in #21584 since the final changes are part of that PR

@@ -44,6 +44,10 @@ void registerHiveConnectors() {
  registerConnector("hive-hadoop2", "hive");
}

void registerIcebergConnector() {
  registerConnector("iceberg", "iceberg");
Contributor

Let's keep this as is, but add a comment saying the connectorName and connectorKey are for Presto, not Presto native.

Member Author

@yingsu00 I have made the change we discussed earlier and moved the Iceberg registration in with the Hive connectors.

@mbasmanova
Contributor

Why do we need so many special changes? Are we not able to translate the Java source files directly? We should avoid adding to special as they are harder to maintain.

@majetideepak @tdcmeehan Deepak, Tim, do we have any ongoing efforts to switch from JSON to Thrift for communication? JSON-based protocols are pretty challenging to maintain.

@majetideepak
Collaborator

do we have any ongoing efforts to switch from JSON to Thrift for communication?

@mbasmanova There is no activity on this. Someone at Meta added the initial bits, that's all. Do you know who at Meta worked on the initial design? From the README, the incremental transition from JSON to Thrift seems to be very involved. I wonder if there is a simpler way.
https://github.com/prestodb/presto/tree/master/presto-native-execution/presto_cpp/main/thrift

@mbasmanova
Contributor

@majetideepak Tim and @ajaygeorge would have most context on this.

@tdcmeehan
Contributor

@mbasmanova For clarity, there are a few reasons this is not scalable:

  1. We must manually update the protocol every time connectors change or there is a new connector.
  2. This protocol must be built into the Prestissimo binaries.

While 1 is problematic, I believe 2 is the problem that should be solved first (if we must choose between the two), because while 1 is an inconvenience, 2 is more than an inconvenience: it's a blocker from real connector SPI support in C++.

Consider that, to solve 2, we must make it so that connector data structures are not built in, but deserialized in a custom way (which is how it works in Java, albeit tied to JSON). The idea is to make it so that connector data structures are serialized in a custom manner, and allow connectors to self-specify how to serialize and deserialize them. This may include Thrift, but it will be up to the connector how to accomplish that.

When thought of in this way, you're left with two parts to serialize and deserialize: the outer shell of the task update request, and the inner structures used by connectors. The outer shell may be moved to Thrift. We would love to get community help to migrate this as well, but I don't think it's the top priority since it changes more slowly and it doesn't block federation use cases in Prestissimo.
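
A rough sketch of the "connectors self-specify how to deserialize" idea, assuming a registry keyed by connector id. Every name here (`ConnectorPayload`, `SerdeRegistry`, `DemoTableHandle`, `demoRoundTrip`) is hypothetical and not part of Presto or Velox; the payload format is a plain string only to keep the example small.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical base type for any connector-owned structure.
struct ConnectorPayload {
  virtual ~ConnectorPayload() = default;
  virtual std::string describe() const = 0;
};

using Deserializer =
    std::function<std::unique_ptr<ConnectorPayload>(const std::string&)>;

// The engine only knows the outer shell: a connector id plus opaque bytes.
// Each connector registers its own way of decoding those bytes.
class SerdeRegistry {
 public:
  void registerConnector(const std::string& connectorId, Deserializer fn) {
    assert(fn && "deserializer must be callable");
    deserializers_[connectorId] = std::move(fn);
  }

  std::unique_ptr<ConnectorPayload> deserialize(
      const std::string& connectorId, const std::string& bytes) const {
    auto it = deserializers_.find(connectorId);
    if (it == deserializers_.end()) {
      throw std::runtime_error("no serde registered for " + connectorId);
    }
    return it->second(bytes);
  }

 private:
  std::map<std::string, Deserializer> deserializers_;
};

// Hypothetical connector-side structure; the engine never names this type.
struct DemoTableHandle : ConnectorPayload {
  explicit DemoTableHandle(std::string name) : tableName(std::move(name)) {}
  std::string describe() const override { return "demo:" + tableName; }
  std::string tableName;
};

// Registers the demo connector and decodes one payload through the registry.
inline std::string demoRoundTrip() {
  SerdeRegistry registry;
  registry.registerConnector("demo", [](const std::string& bytes) {
    return std::unique_ptr<ConnectorPayload>(new DemoTableHandle(bytes));
  });
  return registry.deserialize("demo", "lineitem")->describe();
}
```

In this shape the outer shell could move to Thrift independently, since the registry never inspects the connector's bytes.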

@tdcmeehan
Contributor

tdcmeehan commented Jan 30, 2024

To answer your question, we'll be working on 2, and I'll be directing community members toward 1 (see: #19839), but I think this PR can continue as it is for now, since a little extra code is definitely worth it to open up Prestissimo for Iceberg benchmarking.

@imjalpreet imjalpreet force-pushed the icebergPrestissimoProtocolChanges branch 2 times, most recently from 07000e6 to 4bd39ae Compare February 1, 2024 21:07
@imjalpreet imjalpreet requested a review from yingsu00 February 1, 2024 21:07
@imjalpreet imjalpreet mentioned this pull request Feb 3, 2024
6 tasks
@imjalpreet imjalpreet force-pushed the icebergPrestissimoProtocolChanges branch from 4bd39ae to e1ad4c3 Compare February 5, 2024 10:30
@steveburnett
Contributor

Nit: following the release note guidelines, perhaps change the release note entry to "Add" instead of "Introduce".

*/
namespace facebook::presto::protocol {

struct BaseHiveColumnHandle : public ColumnHandle {
Contributor

Would you add comments to explain why all these classes cannot be auto-generated?

Collaborator

@mbasmanova I am working with @imjalpreet to try and automate these as well. We will add comments if we are unable to auto-generate.

Contributor

@majetideepak Got it. Thank you for sharing. Great to know that you are trying to avoid manual overrides. Thanks.

Contributor

@aditi-pandit aditi-pandit left a comment

We should try to automate the Column(Table)Handle and TableLayout classes.

Are we blocked on them because of the use of inheritance e.g. struct HiveTableHandle : public BaseHiveTableHandle ?

@imjalpreet
Member Author

Are we blocked on them because of the use of inheritance e.g. struct HiveTableHandle : public BaseHiveTableHandle ?

@aditi-pandit Yes, we are trying to see if it would be possible to somehow incorporate multi-level inheritance in these classes (ColumnHandle, TableHandle, and TableLayoutHandle) without the introduction of special classes.

@imjalpreet imjalpreet force-pushed the icebergPrestissimoProtocolChanges branch 2 times, most recently from 63741c4 to 8294b87 Compare February 9, 2024 15:05
@imjalpreet imjalpreet force-pushed the icebergPrestissimoProtocolChanges branch 2 times, most recently from 3277916 to 6280b7c Compare February 9, 2024 15:33
@yingsu00
Contributor

yingsu00 commented Feb 9, 2024

@majetideepak @mbasmanova @tdcmeehan @aditi-pandit Thanks everyone for chiming in. @imjalpreet and I had some offline discussion today. We think there are two ways to solve this problem:

  1. Make the Presto protocol support multi-level inheritance (this PR)
    Over the last couple of days, @imjalpreet has been modifying the Ser/De code to support multi-level inheritance and was able to remove most of the extra specials. In this approach, presto_protocol.yml was modified to add a "super" member to the subclass lists, in addition to the current "name" and "key" members. We also had to move partitionColumns back from BaseHiveTableLayoutHandle to HiveTableLayoutHandle and IcebergTableLayoutHandle, because partitionColumns could not be deserialized: it was a List and the deserializer doesn't know its actual type. Jalpreet has updated this PR with this approach.

  2. Only serialize leaf classes
    In this approach, BaseHiveTableLayoutHandle and BaseHiveColumnHandle will be removed from presto_protocol, and only HiveTableLayoutHandle, IcebergTableLayoutHandle, HiveColumnHandle, and IcebergColumnHandle will be serialized/deserialized. Then we will remove the references to these base classes in presto-native-execution (there is only one real call site and it can be removed). @imjalpreet can also send a PR for this.

Both approaches need additional work. Before we proceed, we'd like your opinions on which route to take. IMHO the first approach is more generic and may benefit future changes. But we may need a slightly different approach there: instead of adding a "super" member in presto_protocol.yml, I think it's better to list the classes in a separate entry:

    ConnectorTableHandle:
      super: JsonEncodedSubclass
      subclasses:
        - { name: BaseHiveTableHandle,      key: base-hive }
        - { name: TpchTableHandle,          key: tpch }
        -
    BaseHiveTableHandle:
      super: JsonEncodedSubclass
      subclasses:
        - { name: HiveTableHandle,          key: hive }
        - { name: IcebergTableHandle,       key: hive-iceberg }

This needs additional work to build the tree structure. Your opinions on which way to go are much appreciated.
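
To illustrate what "building the tree structure" from such entries could involve, here is a small sketch that flattens the nested YAML into (parent, name, key) rows and resolves a key to its chain of classes. `SubclassEntry`, `demoEntries`, and `resolveChain` are hypothetical helpers, not part of the protocol generation scripts, and the rows simply mirror the example entries above.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// One flattened row of the proposed YAML: which class lists this subclass,
// the subclass name, and its JSON type key.
struct SubclassEntry {
  std::string parent;
  std::string name;
  std::string key;
};

// Rows mirroring the example entries in the comment above.
inline std::vector<SubclassEntry> demoEntries() {
  return {
      {"ConnectorTableHandle", "BaseHiveTableHandle", "base-hive"},
      {"ConnectorTableHandle", "TpchTableHandle", "tpch"},
      {"BaseHiveTableHandle", "HiveTableHandle", "hive"},
      {"BaseHiveTableHandle", "IcebergTableHandle", "hive-iceberg"},
  };
}

// Returns the class names from `root` down to the class registered under
// `key`, or an empty vector if the key is unknown or not reachable from root.
inline std::vector<std::string> resolveChain(
    const std::vector<SubclassEntry>& entries,
    const std::string& root,
    const std::string& key) {
  assert(!root.empty());
  std::map<std::string, const SubclassEntry*> byName;
  const SubclassEntry* current = nullptr;
  for (const auto& entry : entries) {
    byName[entry.name] = &entry;
    if (entry.key == key) {
      current = &entry;
    }
  }

  std::vector<std::string> chain;
  // Walk parent links upward; the step bound guards against cyclic input.
  for (std::size_t steps = 0; current != nullptr && steps <= entries.size();
       ++steps) {
    chain.insert(chain.begin(), current->name);
    if (current->parent == root) {
      chain.insert(chain.begin(), root);
      return chain;
    }
    auto it = byName.find(current->parent);
    current = (it == byName.end()) ? nullptr : it->second;
  }
  return {};
}
```

Resolving "hive-iceberg" against these rows walks IcebergTableHandle, then BaseHiveTableHandle, then ConnectorTableHandle, which is the multi-level lookup a generator would need before it could emit nested deserializers.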

@majetideepak
Collaborator

@imjalpreet and @yingsu00 thanks for working on this and finding solutions to avoid manual overrides.
Among the two approaches, I prefer the latter, "Only serialize leaf classes", since I see no use in maintaining the inheritance information in the protocol.
In the future, we can add support for 'super' when there is a use case for the inheritance details.

Both approaches need additional work.

Can you share what additional work is required for the second approach?

@imjalpreet imjalpreet force-pushed the icebergPrestissimoProtocolChanges branch from 6280b7c to 8be0e31 Compare February 12, 2024 15:55
@imjalpreet
Member Author

imjalpreet commented Feb 12, 2024

@majetideepak I have updated the PR to remove the usage of Base classes and excluded inheritance in the protocol.

Based on these changes I have also updated the implementation in the follow-up PR: #21584

Can you share what additional work is required for the second approach?

I think Ying was referring to the implementation changes for the current PRs. Unlike the inheritance approach, no changes are required in the protocol generation scripts.

Please have a look at the updated changes and let me know if these look good or in case there are any questions.

@yingsu00
Contributor

@majetideepak Jalpreet has created a PR for each way:

  1. Multi-level inheritance: [iceberg] [native] Introduce Iceberg Connector in Prestissimo (with multi-level inheritance)
  2. Only leaf classes: This PR.

@tdcmeehan
Contributor

I agree with @majetideepak. It sounds like for option 1, that's like trying to replicate Jackson's polymorphic type support into our protocol generation scripts, but it doesn't really solve the problem that these data structures must be known upfront in the engine. Rather than build more sophistication into the protocol, I'd rather fix the underlying issue and try to leave protocol generation scripts as unchanged as possible until we do, which makes me favor option 2 more.

Collaborator

@majetideepak majetideepak left a comment

@imjalpreet Thank you for extending the protocol without a lot of specialization.
Can you confirm the presto_protocol.h and presto_protocol.cpp files have not been manually edited?

Collaborator

@majetideepak majetideepak left a comment

@imjalpreet please fix the commit title and description. Thanks!

Contributor

@mbasmanova mbasmanova left a comment

@imjalpreet This version is looking nice, clean and simple. Curious, what was the trick that allowed removing the specializations?

@imjalpreet
Member Author

Thank you for extending the protocol without a lot of specialization. Can you confirm the presto_protocol.h and presto_protocol.cpp files have not been manually edited?

@majetideepak yes, the files have not been manually edited.

please fix the commit title and description. Thanks!

Can you please let me know what you would like to fix in the commit message?

@imjalpreet
Member Author

This version is looking nice, clean and simple. Curious, what was the trick that allowed removing the specializations?

@mbasmanova There are some enums from the Iceberg library that are required in the protocol. We have created wrappers for them on the Java side so that they can be generated automatically from the YAML file. Apart from that, earlier we were trying to replicate the multi-level inheritance present on the Java side in Prestissimo as well. But since we currently don't make much use of the inheritance details in Prestissimo, we have updated our implementation to remove their usage there. If in the future we see that inheritance will be useful, we can have a look at #21905 for one of the ways we can update the protocol generation process to simplify the inclusion of multi-level inheritance.

@mbasmanova
Contributor

@imjalpreet Got it. Thank you for explaining.

@majetideepak majetideepak changed the title [native] Iceberg prestissimo protocol changes [native] Add support for Iceberg connector in presto_protocol Feb 13, 2024
@majetideepak
Collaborator

Can you please let me know what you would like to fix in the commit message?

@imjalpreet I updated the PR title and description. We can use the same for the commit title and description as well.

@imjalpreet imjalpreet force-pushed the icebergPrestissimoProtocolChanges branch from 8be0e31 to 1cf0223 Compare February 13, 2024 09:18
@majetideepak
Collaborator

@imjalpreet can you add the description to the commit body as well? You need to add an empty line after the title and add the body.
See example below.

commit 1bd7843cd321023d04c6cba734b136b6c552ea96
Author: Deepak Majeti <[email protected]>
Date:   Mon Feb 5 01:49:44 2024 -0500

    [native] Remove RE2 dependency from centos setup script
    
    RE2 is installed in the Velox centos setup script

@majetideepak
Collaborator

Just these 2 lines in the commit body.

Add Java classes from the Iceberg connector in presto_protocol.yml file.
Specialize IcebergColumnHandle to support 'operator<()'.

@imjalpreet imjalpreet force-pushed the icebergPrestissimoProtocolChanges branch from 1cf0223 to 8def00f Compare February 13, 2024 09:25
Collaborator

@majetideepak majetideepak left a comment

Thanks, @imjalpreet

Contributor

@yingsu00 yingsu00 left a comment

@imjalpreet
I see that the JSON constructor for BaseHiveTableLayoutHandle is still there. Is it still needed? I don't see it being used. Can you please add a separate commit or PR to remove it if it's no longer needed?

BaseHiveColumnHandle fields are still JSON-annotated. Are the annotations required?

presto-native-execution/presto_cpp/main/PrestoServer.cpp (outdated, resolved)
@@ -177,6 +181,7 @@ JavaClasses:
- presto-hive-metastore/src/main/java/com/facebook/presto/hive/BucketFunctionType.java
- presto-hive-common/src/main/java/com/facebook/presto/hive/CacheQuotaRequirement.java
- presto-hive-common/src/main/java/com/facebook/presto/hive/CacheQuotaScope.java
- presto-hive-common/src/main/java/com/facebook/presto/hive/BaseHiveColumnHandle.java
Contributor

Is this still needed?

Member Author

Yes, this is still needed. ColumnType enum is part of this class and it is required to be included in the protocol.

@@ -12,6 +12,9 @@
* limitations under the License.
*/

// HiveColumnHandle is special since we require an implementation of
Contributor

nit:
HiveColumnHandle is special since it needs an implementation of

Member Author

Sure, I will make this change.

Member Author

I updated the comment

@imjalpreet
Member Author

I see that the JSON constructor for BaseHiveTableLayoutHandle is still there. Is it still needed? I don't see it being used. Can you please add a separate commit or PR to remove it if it's no longer needed?

BaseHiveColumnHandle fields are still JSON-annotated. Are the annotations required?

@yingsu00 Since this PR only includes protocol changes, I haven't refactored any Presto classes here.

I will have a look at the Base* classes in Presto, verify the need for JSON constructors and refactor them as needed in a separate PR.

Add Java classes from the Iceberg connector in presto_protocol.yml file.
Specialize IcebergColumnHandle to support 'operator<()'.
@imjalpreet imjalpreet force-pushed the icebergPrestissimoProtocolChanges branch from 8def00f to 2df519d Compare February 13, 2024 16:28
@yingsu00 yingsu00 merged commit 1a4e945 into prestodb:master Feb 14, 2024
59 checks passed
@wanglinsong wanglinsong mentioned this pull request May 1, 2024
48 tasks
Labels
iceberg Apache Iceberg related
Projects
Status: Done
7 participants