
[FEATURE REQUEST] Add option to use environment variables for ADLS Subscoped credentials #69

Open
cgpoh opened this issue Aug 2, 2024 · 1 comment
Labels: enhancement (New feature or request)

cgpoh commented Aug 2, 2024

Is your feature request related to a problem? Please describe.
My organization does not allow obtaining a user delegation key in Azure, so the only option for us to authenticate with Azure is to use a service principal. When my Spark job tries to write to Azure, I get the following exception on the Polaris server:

c.a.s.f.d.DataLakeServiceClient: If you are using a StorageSharedKeyCredential, and the server returned an error message that says 'Signature did not match', you can compare the string to sign with the one generated by the SDK. To log the string to sign, pass in the context key value pair 'Azure-Storage-Log-String-To-Sign': true to the appropriate method call. If you are using a SAS token, and the server returned an error message that says 'Signature did not match', you can compare the string to sign with the one generated by the SDK. To log the string to sign, pass in the context key value pair 'Azure-Storage-Log-String-To-Sign': true to the appropriate generateSas method call. Please remember to disable 'Azure-Storage-Log-String-To-Sign' before going to production as this string can potentially contain PII."
<?xml version="1.0" encoding="utf-8"?><Error><Code>AuthorizationPermissionMismatch</Code><Message>This request is not authorized to perform this operation using this permission.</Message></Error>"
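
(As an aside for anyone debugging the signature side of this: the SDK hint above maps to passing an Azure SDK Context into the generateSas call — a minimal sketch, assuming a DataLakeFileSystemClient is already in hand; the expiry and permissions are placeholders:)

import com.azure.core.util.Context;
import com.azure.storage.file.datalake.DataLakeFileSystemClient;
import com.azure.storage.file.datalake.sas.DataLakeServiceSasSignatureValues;
import com.azure.storage.file.datalake.sas.FileSystemSasPermission;

import java.time.OffsetDateTime;

final class SasDebugSketch {
    // Logs the string-to-sign while generating a SAS, per the SDK hint in the error
    // message above. Disable before production: the logged string can contain PII.
    static String generateSasWithLogging(DataLakeFileSystemClient client) {
        DataLakeServiceSasSignatureValues sasValues = new DataLakeServiceSasSignatureValues(
                OffsetDateTime.now().plusHours(1),        // expiry (placeholder)
                FileSystemSasPermission.parse("rl"));     // read + list (placeholder)
        return client.generateSas(
                sasValues,
                new Context("Azure-Storage-Log-String-To-Sign", true));
    }
}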

Describe the solution you'd like
Since ADLSFileIO falls back to DefaultAzureCredentialBuilder when there is no SAS token or storage shared key credential, we could have a default catalog option that uses the environment as the authentication type, e.g.:

{
    "name": "test",
    "type": "INTERNAL",
    "properties": {
        "default-base-location": "abfss://[email protected]/test/"
    },
    "storageConfigInfo": {
        "tenantId": "tenant-id",
        "storageType": "AZURE",
        "allowedLocations": [
            "abfss://[email protected]/test/"
        ],
        "authType": "APPLICATION_DEFAULT"
    }
}

so that any query engine can abstract away the credential, while the main credential for Azure is still governed by the Polaris catalog.
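
For reference, the fallback described above is the standard Azure SDK pattern — a minimal sketch, where DefaultAzureCredential resolves the service principal from the AZURE_CLIENT_ID, AZURE_TENANT_ID, and AZURE_CLIENT_SECRET environment variables (the storage account name is a placeholder):

import com.azure.identity.DefaultAzureCredential;
import com.azure.identity.DefaultAzureCredentialBuilder;
import com.azure.storage.file.datalake.DataLakeServiceClient;
import com.azure.storage.file.datalake.DataLakeServiceClientBuilder;

public class AdlsDefaultCredentialSketch {
    public static void main(String[] args) {
        // DefaultAzureCredential chains several credential sources; for a service
        // principal it reads AZURE_CLIENT_ID, AZURE_TENANT_ID, and AZURE_CLIENT_SECRET
        // from the environment (via EnvironmentCredential).
        DefaultAzureCredential credential = new DefaultAzureCredentialBuilder().build();

        // With no SAS token or shared key configured, the client authenticates with
        // the token credential above.
        DataLakeServiceClient client = new DataLakeServiceClientBuilder()
                .endpoint("https://mystorageaccount.dfs.core.windows.net")
                .credential(credential)
                .buildClient();

        client.listFileSystems().forEach(fs -> System.out.println(fs.getName()));
    }
}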

@cgpoh cgpoh added the enhancement New feature or request label Aug 2, 2024
@annafil annafil moved this to Triage in Basic Kanban Board Aug 2, 2024
@cgpoh cgpoh mentioned this issue Aug 2, 2024
dennishuo (Contributor) commented
Thanks for bringing this up, @cgpoh! There are a few considerations worth discussing, and some input from additional folks would help identify the best way forward:

  1. Would there ever be scenarios where we want a single deployment to use the credential-vending SAS_TOKEN pattern for some catalogs/storageConfigs, while short-circuiting it to use "application defaults" for other catalogs, or would the choice of whether to use credential-vending semantics typically be a server-wide setting?
  2. Is it better to convey the concept of "fallthrough to application defaults for credentials" with a single common syntax across all cloud providers or explicitly have such an APPLICATION_DEFAULT type separately defined for each cloud provider's storage configuration?
  3. How should the server-level configuration expose controls for whether certain AuthTypes may be set at a per-catalog level at all? For example, if the Polaris server is running in a more sensitive/privileged context, it may need to be configured to block individual catalogs from choosing APPLICATION_DEFAULT credentials.
  4. Should the "fallthrough" behavior go further to allow propagating static credential config settings in the catalog properties map directly into the initialization of a FileIO? Should the behavior just skip going through the whole StorageConfigurationInfo/StorageIntegration stack entirely, including skipping allowed-location validations?
  5. What are the semantics of returned vended credentials for the processing engine making the call? Should APPLICATION_DEFAULT only mean that Polaris itself uses local application defaults for interacting with files, ignoring any X-Iceberg-Access-Delegation settings and letting the caller engine fend for itself, or should the server be able to translate APPLICATION_DEFAULT credentials into returned config settings that the remote engine can then use to access files?

At a high level we at least need to have a strict separation of effective privileges between the personas who can configure and run the Polaris server itself and those who can call createCatalog. In a mutual-trust setting, it makes sense to have relaxed constraints on the server-level configuration, but it needs to be possible to run the server in a secure mode as well where catalog creators are in a different realm of trust than the admins of the server.

One possibility that may require fewer changes to the management API and persistence model itself would be to have some server-level configuration settings that basically just short-circuit the storage validation/subscoping logic in BasePolarisCatalog::refreshCredentials and allow the FileIO initialization to fall back to its default behavior of looking for credentials in the environment.
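
To make that concrete, here is a purely hypothetical sketch — the flag name, config interface, and method shape are illustrative assumptions, not actual Polaris APIs:

import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the short-circuit described above; nothing here is a
// real Polaris API.
final class CredentialVendingSketch {

    interface ServerConfig {
        boolean getBoolean(String key, boolean defaultValue);
    }

    private final ServerConfig serverConfig;

    CredentialVendingSketch(ServerConfig serverConfig) {
        this.serverConfig = serverConfig;
    }

    // Returns vended credential properties for the given table locations.
    Map<String, String> refreshCredentials(String tableName, Set<String> locations) {
        // Server-level short-circuit: skip allowed-location validation and credential
        // subscoping entirely, returning an empty map so FileIO initialization falls
        // back to environment defaults (e.g. DefaultAzureCredential).
        if (serverConfig.getBoolean("SKIP_CREDENTIAL_SUBSCOPING", false)) {
            return Map.of();
        }
        // ... otherwise: validate locations against allowedLocations and vend a
        // subscoped SAS token / STS credential as today ...
        throw new UnsupportedOperationException("subscoping path elided in this sketch");
    }
}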

@jbonofre @snazy @flyrain @RussellSpitzer @collado-mike
