Add auth type for Azure storage #77

Closed
wants to merge 10 commits into from

Conversation

@cgpoh cgpoh commented Aug 2, 2024

Description

This PR addresses #69.

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

This has been tested locally. With the changes in this PR, ADLS users can choose either a SAS token or DefaultAzureCredentialBuilder for authentication. To verify that it works, I sent the following curl request:

curl -i -X POST -H "Authorization: Bearer $PRINCIPAL_TOKEN" -H 'Accept: application/json' -H 'Content-Type: application/json' \
  http://${POLARIS_HOST:-localhost}:8181/api/management/v1/catalogs \
  -d '{"name": "polaris", "type": "INTERNAL", "properties": {
        "default-base-location": "abfss://[email protected]/test/"
    },"storageConfigInfo": {
        "tenantId": "long-tenant-id",
        "storageType": "AZURE",
        "allowedLocations": [
            "abfss://[email protected]/test/"
        ],
        "authType": "APPLICATION_DEFAULT"
    } }'

After following the README to create a service principal and grant roles, I was able to run my Spark job and write data to ADLS successfully.

Test Configuration:

  • Firmware version:
  • Hardware:
  • Toolchain:
  • SDK:

Checklist:

Please delete options that are not relevant.

  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • If adding new functionality, I have discussed my implementation with the community using the linked GitHub issue
  • I have signed and submitted the ICLA and if needed, the CCLA. See Contributing for details.

@cgpoh cgpoh requested a review from a team as a code owner August 2, 2024 18:16
Contributor

@dennishuo dennishuo left a comment

Thanks for taking a stab at this! It seems the general problem of needing to configure a Polaris deployment to possibly use "application defaults" is potentially common to all cloud providers, even if the mechanics of what "application defaults" entail will differ.

This could be worth some more discussion on some subtle points in your linked issue #69 -- I'll post some additional thoughts there.

required:
- tenantId
- authType
Contributor

We'll probably want to be conservative about adding required fields to the API objects, especially if they have an impact on persisted entities. In this case, the field could at least be made optional, which is minimally invasive as long as the default preserves the existing behavior.

Author

Thanks! I will make this optional and add another enum value, NONE, that falls back to SAS_TOKEN when no authType is specified.
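
A minimal sketch of that fallback, assuming a hypothetical `AzureAuthType` enum (the type name and both helper methods are illustrative and not taken from the PR). A missing or blank `authType` parses to `NONE`, which then resolves to the existing SAS-token behavior:

```java
import java.util.Locale;

/** Hypothetical enum for illustration; the helper methods are assumptions, not the PR's code. */
enum AzureAuthType {
  NONE,
  SAS_TOKEN,
  APPLICATION_DEFAULT;

  /** Parses an optional authType value; null or blank input maps to NONE. */
  static AzureAuthType fromNullable(String raw) {
    if (raw == null || raw.isBlank()) {
      return NONE;
    }
    // valueOf throws IllegalArgumentException for unknown names.
    return valueOf(raw.trim().toUpperCase(Locale.ROOT));
  }

  /** Resolves NONE to SAS_TOKEN so existing catalogs keep their current behavior. */
  AzureAuthType effective() {
    return this == NONE ? SAS_TOKEN : this;
  }
}
```

With this shape, catalogs persisted before the field existed would deserialize without an `authType` and still take the SAS-token path.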

case APPLICATION_DEFAULT:
break;
}
credentialMap.put(PolarisCredentialProperty.AZURE_SAS_TOKEN, sasToken);
Contributor

Instead of overwriting this config key with "" when the auth type is not SAS_TOKEN, we could pull this line under the SAS_TOKEN case. Then, in theory, the server could be configured either to fall through entirely to "application defaults" (which may look through environment variables, standard credential-config files, the VM "metadata server", etc.) or to inherit statically configured credential settings from a Catalog's properties.

Such an option would need to be configurable in the top-level server config, though, to specify whether individual catalogs are really allowed to force the use of such defaults.
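
One way to picture the suggestion is the sketch below. It assumes a plain `Map<String, String>` and a made-up key string standing in for `PolarisCredentialProperty.AZURE_SAS_TOKEN`; the class and method names are hypothetical. The SAS token entry is written only in the `SAS_TOKEN` case, so the key stays absent rather than empty for other auth types:

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only; not the actual AzureCredentialsStorageIntegration code. */
final class SasTokenScoping {
  enum AuthType { SAS_TOKEN, APPLICATION_DEFAULT }

  // Stand-in for PolarisCredentialProperty.AZURE_SAS_TOKEN; the string value is an assumption.
  static final String AZURE_SAS_TOKEN = "azure.sas-token";

  static Map<String, String> buildCredentialMap(AuthType authType, String sasToken) {
    Map<String, String> credentialMap = new HashMap<>();
    switch (authType) {
      case SAS_TOKEN:
        // Only the SAS_TOKEN path vends a token.
        credentialMap.put(AZURE_SAS_TOKEN, sasToken);
        break;
      case APPLICATION_DEFAULT:
        // Leave the key unset so downstream code can fall through to
        // whatever credentials the environment already provides.
        break;
    }
    return credentialMap;
  }
}
```

Leaving the key absent lets a consumer distinguish "no token was vended" from "an empty token was vended", which is what makes the fallthrough possible.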

Author

As in another RBAC rule to limit the authType?

Contributor

Actually, I'm thinking one level higher, where the server-level global config can dictate whether or not credential-vending and subscoping is used at all. Some details in this comment: #69 (comment)

In particular,

At a high level we at least need to have a strict separation of effective privileges between the personas who can configure and run the Polaris server itself and those who can call createCatalog. In a mutual-trust setting, it makes sense to have relaxed constraints on the server-level configuration, but it needs to be possible to run the server in a secure mode as well where catalog creators are in a different realm of trust than the admins of the server.

Basically, instead of complicating the API model or RBAC model, maybe it'll be easier to do all this short-circuiting in BasePolarisCatalog.java instead. In particular, this line is an example of how to define a server-level configuration setting:

And maybe you can put the short-circuit here:

tableLocations.forEach(tl -> validateLocationForTableLike(tableIdentifier, tl));

after the "validateLocationForTableLike" call and before any attempt to get a subscoped credential is made. Basically just LOGGER.atInfo and then return early.
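
A rough sketch of that short-circuit, under stated assumptions: the class, the `credentialsFor` method, the boolean flag parameter, and the `Function` standing in for the subscoped-credential lookup are all hypothetical, and `validateLocation` is only a placeholder for `validateLocationForTableLike`. Validation still runs first; the early return happens before any subscoped credential is requested:

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Illustrative only; not the actual BasePolarisCatalog code. */
final class SubscopingShortCircuit {
  private static final System.Logger LOGGER = System.getLogger("SubscopingShortCircuit");

  static Map<String, String> credentialsFor(
      boolean skipCredentialSubscoping, // assumed to come from server-level config
      List<String> tableLocations,
      Function<List<String>, Map<String, String>> subscopedCredentialSource) {
    // Location validation is never skipped.
    tableLocations.forEach(SubscopingShortCircuit::validateLocation);

    if (skipCredentialSubscoping) {
      LOGGER.log(System.Logger.Level.INFO, "Skipping credential subscoping; returning no vended credentials");
      return Map.of();
    }
    return subscopedCredentialSource.apply(tableLocations);
  }

  // Placeholder check; the real validation compares against allowed locations.
  private static void validateLocation(String location) {
    if (location == null || location.isBlank()) {
      throw new IllegalArgumentException("Table location must not be blank");
    }
  }
}
```

Because the flag is passed in from the server side rather than read from catalog properties, catalog creators cannot switch it on themselves.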

# Conflicts:
#	.gitignore
@cgpoh
Author

cgpoh commented Aug 15, 2024

> Thanks for taking a stab at this! It seems the general problem of needing to configure a Polaris deployment to possibly use "application defaults" is potentially common to all cloud providers, even if the mechanics of what "application defaults" entail will differ.
>
> This could be worth some more discussion on some subtle points in your linked issue #69 -- I'll post some additional thoughts there.

Thanks @dennishuo, agree that the “application defaults” is potentially common to all cloud providers. In fact, I’m borrowing the “application defaults” concept from Nessie.

# Conflicts:
#	docs/index.html
#	polaris-core/src/main/java/org/apache/polaris/core/storage/azure/AzureCredentialsStorageIntegration.java
#	polaris-core/src/main/java/org/apache/polaris/core/storage/azure/AzureStorageConfigurationInfo.java
@RussellSpitzer
Member

@dennishuo can you take another look at this? I notice you were reviewing most recently.

@dennishuo
Contributor

Continuing discussion from #208 (comment)

There are two use cases to consider:

  1. How to make the Polaris server itself use APPLICATION_DEFAULT credentials when reading/writing metadata files
  2. How to vend out credentials to external engines that don't go through the currently-supported subscoping flows

It seems the current state of this PR only provides a way to do (1), by letting catalog creators set per-catalog config values that direct Polaris to use APPLICATION_DEFAULT behavior when it reads/writes files itself. However, this ability is a problem when the admins who run the Polaris server are not the same people as the admins who interact with the server to create catalogs. For this scenario, it's preferable to set SKIP_CREDENTIAL_SUBSCOPING_INDIRECTION=true at the server-config level to make Polaris use APPLICATION_DEFAULT semantics in its local server environment.

For case (2), I don't think there's a proposed solution yet. The APPLICATION_DEFAULT concept itself is probably not expressive enough for this, because by nature APPLICATION_DEFAULT hides a bunch of "convenience" fallthroughs that look for credentials in the local environment, which might include standard credential files (e.g. ~/.aws/credentials), environment variables, or local cloud VM "metadata servers" (e.g. http://169.254.169.254).

Not all of these are equally suitable for credential vending, and some are not suitable at all.
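
To make the fallthrough behavior concrete, here is a toy sketch. It is not how any cloud SDK implements its default credential chain; the class, the source names, and the `Resolved` holder are all hypothetical. The first source that yields a value wins, and the sketch keeps track of which source that was:

```java
import java.util.Map;
import java.util.Optional;
import java.util.function.Supplier;

/** Toy model of an "application default" lookup; not a real SDK credential chain. */
final class CredentialFallthrough {
  /** A resolved secret plus the name of the source that produced it. */
  static final class Resolved {
    final String source;
    final String secret;

    Resolved(String source, String secret) {
      this.source = source;
      this.secret = secret;
    }
  }

  /**
   * Tries each source in the map's iteration order and returns the first hit.
   * Callers should pass an ordered map (e.g. LinkedHashMap) to fix the priority.
   */
  static Optional<Resolved> resolve(Map<String, Supplier<Optional<String>>> sources) {
    for (Map.Entry<String, Supplier<Optional<String>>> entry : sources.entrySet()) {
      Optional<String> secret = entry.getValue().get();
      if (secret.isPresent()) {
        return Optional.of(new Resolved(entry.getKey(), secret.get()));
      }
    }
    return Optional.empty();
  }
}
```

A real default chain typically hands back only the credential, so the caller cannot tell whether it came from a long-lived file or a short-lived metadata-server token. That is why the source matters when deciding what is safe to vend.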

The most plausible use case would be to have a flow that allows simply handing out VM instance metadata-based tokens for credential-vending:

I believe these are all designed to be "short-lived" credentials where security isn't compromised by handing them out, but they may lack the kinds of "downscoping" semantics normally needed in more advanced Polaris deployments.

We could explore an option where these metadata-server-based tokens are returned for credential-vending purposes.

@cgpoh
Author

cgpoh commented Aug 30, 2024

> For this scenario, it's preferable to set SKIP_CREDENTIAL_SUBSCOPING_INDIRECTION=true at the server-config level to make Polaris use APPLICATION_DEFAULT semantics in its local server environment.

@dennishuo, I don't fully understand this scenario. Does it mean declaring SKIP_CREDENTIAL_SUBSCOPING_INDIRECTION=true at the server-config level and having the compute jobs that interface with Polaris set the credentials in their own env vars?

I'm looking at how to use managed identities in Azure, and I hope to change the APPLICATION_DEFAULT option to a METADATA_SERVER option.

@cgpoh
Author

cgpoh commented Sep 4, 2024

@dennishuo, unfortunately my company policy doesn’t allow me to create a managed identity either, so I’m not able to test the behaviour. I will test skipping credential subscoping again with SKIP_CREDENTIAL_SUBSCOPING_INDIRECTION=true, as I realised that in the last test I conducted, I had my Azure credentials set in the adls cli.

@cgpoh
Author

cgpoh commented Sep 5, 2024

@dennishuo, after more testing, SKIP_CREDENTIAL_SUBSCOPING_INDIRECTION=true is sufficient for my current use case. Shall we close this PR for now?

@flyrain
Contributor

flyrain commented Sep 12, 2024

Close it now. Feel free to reopen if needed.

@flyrain flyrain closed this Sep 12, 2024