-
Notifications
You must be signed in to change notification settings - Fork 5.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[native] Add support for Iceberg connector in presto_protocol #21765
[native] Add support for Iceberg connector in presto_protocol #21765
Conversation
5206127
to
f08f85c
Compare
@imjalpreet is this ready for review? |
@majetideepak Yes the changes are all there in this PR. But need to rebase on master and make some minor updates based on new changes added in master. But I was facing a build error in my local most likely related to #21797 so I haven't pushed the rebased version of this branch yet. Once I verify the changes after a successful build, I will push them as well. You can still review the PR since it mostly has all the changes or we can wait for the rebased version which has minor additions. |
Why do we need so many |
@@ -852,6 +853,11 @@ std::vector<std::string> PrestoServer::registerConnectors( | |||
PRESTO_STARTUP_LOG(INFO) << "Registering catalog " << catalogName | |||
<< " using connector " << connectorName; | |||
|
|||
// TODO: Temporary Change to use HiveDataSource for Iceberg Connector |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moving the discussion in #21584 to here:
Add an iceberg.properties in presto-native-execution/etc/catalog with the following
# The Presto "iceberg" catalog is handled by the hive connector in Presto native execution.
connector.name=hive
cache.enabled=true
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -44,6 +44,10 @@ void registerHiveConnectors() { | |||
registerConnector("hive-hadoop2", "hive"); | |||
} | |||
|
|||
void registerIcebergConnector() { | |||
registerConnector("iceberg", "iceberg"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep this as is, but add a comment saying the connectorName and connectorKey are for Presto, not Presto native.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yingsu00 I have made the change as we had discussed earlier and moved the iceberg registration as part of the hive connectors
@majetideepak @tdcmeehan Deepak, Tim, do we have any ongoing efforts to switch from JSON to Thrift for communication? JSON-based protocols are pretty challenging to maintain. |
@mbasmanova There is no activity on this. Someone at Meta added the initial bits, that's all. Do you know who at Meta worked on the initial design? From the README, the incremental transition from JSON to thrift seems to be very involved. I wonder if there is a simpler way. |
@majetideepak Tim and @ajaygeorge would have most context on this. |
@mbasmanova For clarity, there are a few reasons this is not scalable:
While 1 is problematic, I believe 2 is the problem that should be solved first (if we must choose between the two), because while 1 is an inconvenience, 2 is more than an inconvenience: it's a blocker from real connector SPI support in C++. Consider that, to solve 2, we must make it so that connector data structures are not built, but deserialized in a custom way (which is how it works in Java, albeit tied JSON). The idea is to make it so that connector data structures are serialized in a custom manner, and allow connectors to self-specify how to serialize and deserialize them. This may include Thrift, but it will be up to the connector how to accomplish that. When thought of in this way, you're left with two parts to serialize and deserialize: the outer shell of the task update request, and the inner structures used by connectors. The outer shell may be moved to Thrift. We would love to get community help to migrate this as well, but I don't think it's the top priority since it changes more slowly and it doesn't block federation use cases in Prestissimo. |
To answer your question, we'll be working on 2, and I'll be directing community members toward 1 (see: #19839), but I think this PR can continue as it is for now as a little extra code is definitely worth it to open up Prestissimo for Iceberg benchmarking. |
07000e6
to
4bd39ae
Compare
4bd39ae
to
e1ad4c3
Compare
Nit: following the release note guidelines, perhaps change the release note entry to "Add" instead of "Introduce". |
*/ | ||
namespace facebook::presto::protocol { | ||
|
||
struct BaseHiveColumnHandle : public ColumnHandle { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you add comments to explain why all these classes cannot be auto-generated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mbasmanova I am working with @imjalpreet to try and automate these as well. We will add comments if we are unable to auto-generate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@majetideepak Got it. Thank you for sharing. Great to know that you are trying to avoid manual overrides. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should try to automate the Column(Table)Handle and TableLayout classes.
Are we blocked on them because of the use of inheritance e.g. struct HiveTableHandle : public BaseHiveTableHandle
?
presto-native-execution/presto_cpp/presto_protocol/special/ColumnType.cpp.inc
Outdated
Show resolved
Hide resolved
@aditi-pandit Yes, we are trying to see if it would be possible to somehow incorporate multi-level inheritance in these classes (ColumnHandle, TableHandle, and TableLayoutHandle) without the introduction of special classes. |
63741c4
to
8294b87
Compare
3277916
to
6280b7c
Compare
@majetideepak @mbasmanova @tdcmeehan @aditi-pandit Thanks everyone for chiming in. @imjalpreet and I had some offline discussion today. We think there are two ways to solve this problem:
Both approachs need additional work. Before we proceed, we'd like to get your opinions which route we are going to take. IMHO the first approach is more generic and may benefit future changes. But we may need to use a slightly different approach there: instead of adding a "super" member in presto_protocol.yml, I think it's better to list the classes in a separate entry:
This needs additional work in order to build the tree structure. Your opinions on which way to go is very appreciated. |
@imjalpreet and @yingsu00 thanks for working on this and finding solutions to avoid manual overrides.
Can you share what additional work is required for the second approach? |
6280b7c
to
8be0e31
Compare
@majetideepak I have updated the PR to remove the usage of Base classes and excluded inheritance in the protocol. Based on these changes I have also updated the implementation in the follow-up PR: #21584
I think Ying was referring to the implementation changes for the current PRs. There are no changes required in protocol generation scripts like the inheritance approach. Please have a look at the updated changes and let me know if these look good or in case there are any questions. |
@majetideepak Jalpreet has created a PR for each way:
|
I agree with @majetideepak. It sounds like for option 1, that's like trying to replicate Jackson's polymorphic type support into our protocol generation scripts, but it doesn't really solve the problem that these data structures must be known upfront in the engine. Rather than build more sophistication into the protocol, I'd rather fix the underlying issue and try to leave protocol generation scripts as unchanged as possible until we do, which makes me favor option 2 more. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@imjalpreet Thank you for extending the protocol without a lot of specialization.
Can you confirm the presto_protocol.h
and presto_protocol.cpp
files have not been manually edited?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@imjalpreet please fix the commit title and description. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@imjalpreet This version is looking nice, clean and simple. Curious, what was the trick that allowed to remove the specializations.
@majetideepak yes, the files have not been manually edited.
Can you please let me know what would you like to fix in the commit message? |
@mbasmanova There are some enums from the iceberg library that are required in the protocol. We have created wrappers for them on the java side so that they can be automated using the yaml file. Apart from that, earlier we were trying to replicate the multi-level inheritance present on the java side in prestissimo as well. But since currently we don't have a big usage of the inheritance details in Prestissimo, we have updated our implementation to remove their usage in Prestissimo. If in the future, we see that inheritance will be useful, we can have a look at #21905 for one of the ways we can update the protocol generation process to simplify the inclusion of multi-level inheritance. |
@imjalpreet Got it. Thank you for explaining. |
@imjalpreet I updated the PR title and description. We can use the same for the commit title and description as well. |
8be0e31
to
1cf0223
Compare
@imjalpreet can you add the description to the commit body as well? You need to add an empty line after the title and add the body.
|
Just these 2 lines in the commit body.
|
1cf0223
to
8def00f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @imjalpreet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@imjalpreet
I see that the Json constructor for BaseHiveTableLayoutHandle is still there. Is it still needed? I don't see it's being used. Can you please add a separate commit or PR to remove it If it's no longer needed?
BaseHiveColumnHandle fields are still json annotated. Are the annotations required?
@@ -177,6 +181,7 @@ JavaClasses: | |||
- presto-hive-metastore/src/main/java/com/facebook/presto/hive/BucketFunctionType.java | |||
- presto-hive-common/src/main/java/com/facebook/presto/hive/CacheQuotaRequirement.java | |||
- presto-hive-common/src/main/java/com/facebook/presto/hive/CacheQuotaScope.java | |||
- presto-hive-common/src/main/java/com/facebook/presto/hive/BaseHiveColumnHandle.java |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this still needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is still needed. ColumnType enum is part of this class and it is required to be included in the protocol.
@@ -12,6 +12,9 @@ | |||
* limitations under the License. | |||
*/ | |||
|
|||
// HiveColumnHandle is special since we require an implementation of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
HiveColumnHandle is special since it needs an implementation of
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I will make this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the comment
@yingsu00 Since, this PR only includes protocol changes I haven't refactored any presto classes here. I will have a look at the Base* classes in Presto, verify the need for JSON constructors and refactor them as needed in a separate PR. |
Add Java classes from the Iceberg connector in presto_protocol.yml file. Specialize IcebergColumnHandle to support 'operator<()'.
8def00f
to
2df519d
Compare
Description
Add Java classes from the Iceberg connector in presto_protocol.yml file.
Specialize IcebergColumnHandle to support 'operator<()'.
Depends on #21764
Motivation and Context
facebookincubator/velox#5977
Impact
With these changes, we will be able to read iceberg tables from Prestissimo
Test Plan
Successfully ran all TPC-H queries with iceberg tables (both partitioned and non-partitioned)
Contributor checklist
Release Notes
Please follow release notes guidelines and fill in the release notes below.