
[FEATURE REQUEST] Add basic HDFS storage option for catalogs #85

Open
rdsarvar opened this issue Aug 3, 2024 · 6 comments
Labels
enhancement New feature or request

Comments


rdsarvar commented Aug 3, 2024

Is your feature request related to a problem? Please describe.
The current storage options appear to be geared toward cloud providers. To support companies running on-premise, I would like to request HDFS support.

Describe the solution you'd like
Catalog read/write support for HDFS as a storage option.

For the first implementation we would like something basic:

  • The service is able to read and write against an HDFS cluster without authentication or authorization required.
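For the "open cluster" case described above, the Hadoop client configuration could stay minimal. A hypothetical `core-site.xml` fragment (NameNode host/port are placeholders, not from this issue):

```xml
<!-- Sketch of an unauthenticated on-premise setup: simple auth (no
     Kerberos) and the default filesystem pointing at the NameNode. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.internal:8020</value>
  </property>
  <property>
    <name>hadoop.security.authentication</name>
    <value>simple</value>
  </property>
</configuration>
```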
@rdsarvar rdsarvar added the enhancement label Aug 3, 2024
flyrain (Contributor) commented Aug 6, 2024

We can add HDFS support, but handling credentials might be difficult. @sfc-gh-schen, can you provide more insight? I think credential vending may not be possible for HDFS.

rdsarvar (Author) commented
Would permissive HDFS be a simpler first ask, with potential evolution in the future to support credentials? That is, start off assuming the HDFS cluster is open (no Kerberos, etc.), running on-premise and protected by networking, and then consider a proper strategy in the future?

Sorry if I'm off the mark on what you meant by your last comment, but I assumed it relates to authn/authz against HDFS (and not mapping to an internal strategy in Polaris).

flyrain (Contributor) commented Aug 13, 2024

Yup, it's reasonable to start without authentication and authorization for HDFS.

rdsarvar (Author) commented
I don't mind trying my hand at this for the simple case, providing a baseline for people to extend in the future.

I think in order for this to work, because the DFS client is created inside Iceberg core from the Hadoop configuration object we initialize, we'll need to rely on the HADOOP_USER_NAME environment variable (which means all DFS interaction would happen under the same username). Otherwise access will default to the user running the service.
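The user-resolution order described above can be sketched as follows. This is a hypothetical illustration, not Polaris or Hadoop source: with simple authentication, Hadoop's `UserGroupInformation` honors `HADOOP_USER_NAME` from the environment first and otherwise falls back to the OS user running the JVM, so every DFS call runs under a single identity.

```java
// Hypothetical sketch of Hadoop's simple-auth user lookup order.
// Class and method names here are illustrative, not real APIs.
public class HdfsUserResolution {

    /** HADOOP_USER_NAME takes precedence; otherwise the OS user is used. */
    static String effectiveUser(String hadoopUserNameEnv, String osUser) {
        if (hadoopUserNameEnv != null && !hadoopUserNameEnv.isBlank()) {
            return hadoopUserNameEnv; // one shared identity for all DFS I/O
        }
        return osUser; // default: the user running the service process
    }

    public static void main(String[] args) {
        // In the real service these would come from the runtime environment.
        System.out.println(effectiveUser(System.getenv("HADOOP_USER_NAME"),
                                         System.getProperty("user.name")));
    }
}
```

The practical consequence is the one noted in the comment: HDFS permissions cannot distinguish between Polaris principals, since the service presents a single username for all catalog I/O.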

I can see if I find time one of these days to throw something together in a PR (as a non-tested PoC) just to get feedback on whether it's something we want to move forward with.

@rdsarvar rdsarvar changed the title [FEATURE REQUEST] Add HDFS storage option for catalogs [FEATURE REQUEST] Add basic HDFS storage option for catalogs Aug 18, 2024
rdsarvar (Author) commented
@flyrain Quick question about the repo that I noticed while starting to write the changes required for this PR:

  1. Do the regtests have OpenAPI Python templates that aren't committed into the repo? When I generate the files with the commands provided in the README, the result is generated Python files without the license header, and in some cases it breaks functionality. An example diff of `regtests/client/python/polaris/management/models/aws_storage_config_info.py`:

     ```diff
          _obj = cls.model_validate({
              "storageType": obj.get("storageType"),
     -        "allowedLocations": obj.get("allowedLocations"),
     -        "roleArn": obj.get("roleArn")
     +        "allowedLocations": obj.get("allowedLocations")
          })
     ```
  2. I was thinking about how to add integration/e2e tests and noticed that there aren't really any integration tests (outside of `polaris-service/src/test/java/io/polaris/service/catalog/PolarisSparkIntegrationTest.java`). Do we rely on the e2e tests for testing against the storage providers? Curious what your preference would be for this repo.

flyrain (Contributor) commented Sep 3, 2024

cc @dennishuo @collado-mike @eric-maynard for the first question.

For 2, we can't really do that without a sponsor for cloud environments. We have discussed using MinIO to simulate it. But for HDFS, it should be OK to add integration tests.
