Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pyiceberg always return false for catalog.table_exists when using Snowflake managed Polaris service #96

Open
djouallah opened this issue Aug 6, 2024 · 23 comments
Labels
bug Something isn't working

Comments

@djouallah
Copy link

using polaris managed offering and pyiceberg 0.7, catalog.table_exists will return false even if the table exist

image
@eric-maynard
Copy link
Contributor

eric-maynard commented Aug 6, 2024

Hi @djouallah, this project is for OSS Polaris and not any vendor-managed deployment of Polaris. Having said that, I'm interested in taking a look -- can you share your code? I was not able to reproduce this.

I created a config file pyiceberg.json:

{
  "type": "rest",
  "base_uri": "http://localhost:8181/api/catalog",
  "warehouse": "issue_96_catalog",
  "credential": "XXXX:YYYY",
  "scope": "PRINCIPAL_ROLE:ALL"
}

I used the CLI to create the catalog, and used another engine to create a namespace and table in the catalog.

I then ran this Python script using PyIceberg:

from pyiceberg.catalog.rest import RestCatalog
import json

with open('pyiceberg.json', 'r') as f:
    config = json.load(f)

catalog = RestCatalog(
    name='issue_96_catalog',
    uri=config['base_uri'],
    warehouse=config['warehouse'],
    credential=config['credential'],
    scope=config['scope']
)

print(catalog.list_namespaces())
print(catalog.list_tables('ns'))
print(catalog.table_exists('ns.t1'))

I saw that the table_exists call behaved as expected:

$ python -m test
[('ns',)]
[('ns', 't1')]
True

@djouallah
Copy link
Author

all good, I will open a ticket i guess.

@kevinjqliu
Copy link
Contributor

Chiming into this, PyIceberg currently has an issue when parsing the table identifier with the "catalog name".

How did you initialize the catalog object in PyIceberg? The catalog name must match the one in the config.
For example, the config value ("warehouse": "issue_96_catalog") in the config file must matches the RestCatalog's parameter (name='issue_96_catalog').

@djouallah djouallah reopened this Aug 6, 2024
@djouallah
Copy link
Author

djouallah commented Aug 6, 2024

polaris
case 'polaris':
     catalog = load_catalog(
          'default',
           uri='https://xxxxxxxxxxxxxxxxxxxxxxxxxxxxx.snowflakecomputing.com/polaris/api/catalog',
           warehouse='s3',
           scope = 'PRINCIPAL_ROLE:no' ,
           credential= userdata.get("polaris_key")
        )

@kevinjqliu
Copy link
Contributor

In PyIceberg, the catalog is named default. I'm not familiar with the Polaris UI, but I assume it's named s3.

If you rename the PyIceberg catalog to s3 and retry the code above, it might work.

@djouallah
Copy link
Author

no luck :(

@kevinjqliu
Copy link
Contributor

Weird, I'll try to repro

@TomerHeber
Copy link

To check table existence pyiceberg calls "HEAD" via the API.

In OSS Polaris it currently works as expected.

If I had to guess, the managed Polaris may not be listening to HEAD requests and therefore might be returning 404 which translates to False.

@djouallah djouallah changed the title pyiceberg always return false for catalog.table_exists pyiceberg always return false for catalog.table_exists when using Snowflake managed Polaris service Aug 7, 2024
@kevinjqliu
Copy link
Contributor

@kevinjqliu
Copy link
Contributor

@eric-maynard which credentials are you using?

I tried to reproduce with a local docker deployment of Polaris

catalog = RestCatalog(
    name='polaris',
    uri='http://localhost:8181/api/catalog',
    warehouse='polaris',
    credential='e9b1a9c3ceeafcfc:885e8a756af4ebe383774877ea774b34',
    scope='PRINCIPAL_ROLE:ALL',
)

the credentials are from the logs

2024-08-05 11:39:38 realm: default-realm root principal credentials: e9b1a9c3ceeafcfc:885e8a756af4ebe383774877ea774b34

I cannot create a table due to permission issues

ForbiddenError: ForbiddenException: Principal 'root' with activated PrincipalRoles '[]' and activated ids '[2, 4]' is not authorized for op CREATE_TABLE_DIRECT_WITH_WRITE_DELEGATION

@eric-maynard
Copy link
Contributor

@kevinjqliu it shouldn't matter so long as the credentials you're using have CREATE_TABLE_DIRECT_WITH_WRITE_DELEGATION for the table you're trying to create. You should be able to use the root credentials or those of any principal with the appropriate grants.

Using those same credentials, can you create a table with Spark?

@kevinjqliu
Copy link
Contributor

With the root credentials, I was able to create tables using Pyspark, but not with PyIceberg

Repro

Run a new Polaris instance in docker

docker-compose -f docker-compose.yml up

Create a new catalog in Polaris

PRINCIPAL_TOKEN="principal:root;realm:default-realm"

# Use local filesystem by default
curl -i -X POST -H "Authorization: Bearer $PRINCIPAL_TOKEN" -H 'Accept: application/json' -H 'Content-Type: application/json' \
  http://localhost:8181/api/management/v1/catalogs \
  -d '{
        "catalog": {
          "name": "polaris",
          "type": "INTERNAL",
          "readOnly": false,
          "properties": {
            "default-base-location": "file:///tmp/polaris/"
          },
          "storageConfigInfo": {
            "storageType": "FILE",
            "allowedLocations": [
              "file:///tmp"
            ]
          }
        }
      }'

Get the root credentials in Polaris log

Edit the ROOT_CREDENTIAL variable inside jupyter notebook

Run jupyter notebook

Polaris <> PyIceberg.ipynb

@eric-maynard
Copy link
Contributor

Hi @kevinjqliu, I was able to reproduce this -- I can't create tables with PyIceberg using root credentials. The error I see is:

pyiceberg.exceptions.ForbiddenError: ForbiddenException: Principal 'root' with activated PrincipalRoles '[]' and activated ids '[63449, 63549]' is not authorized for op CREATE_TABLE_DIRECT_WITH_WRITE_DELEGATION

@sfc-gh-dhuo
Copy link

It's a bit unintuitive, but due to the intentional segmentation of "metadata management" and "content management", the default ROOT actually does not automatically have TABLE_WRITE_DATA or TABLE_READ_DATA privileges, which are required for xIcebergAccessDelegation. My guess here when Spark is able to createTable but not PyIceberg is that PyIceberg got configured to send the X-Iceberg-Access-Delegation header which is interpreted as requesting credential-vending, which requires TABLE_WRITE_DATA, while perhaps PySpark wasn't configured with spark.sql.catalog.polaris.header.X-Iceberg-Access-Delegation=true.

Some of the distinction between CONTENT and METADATA is called out here: https://polaris.io/#tag/Access-Control/Access-control-privileges

The initialization where the catalog_admin default role per-catalog is assigned privileges is here: https://github.com/polaris-catalog/polaris/blob/32eebae59fe953057a6d60d4eeaf4c1bb7e9c0b9/polaris-core/src/main/java/io/polaris/core/persistence/PolarisMetaStoreManagerImpl.java#L588 -- it only gets assigned CATALOG_MANAGE_ACCESS and CATALOG_MANAGE_METADATA without CATALOG_MANAGE_CONTENT.

This can be solved by either omitting the X-Iceberg-Access-Delegation header when creating the table or granting the catalog_admin (which is granted to the default root via the service_admin PrincipalRole by default) with the privilege CATALOG_MANAGE_CONTENT.

@kevinjqliu
Copy link
Contributor

Thanks for taking a look at this.

@sfc-gh-dhuo I think your theory is correct. PyIceberg's REST client sets the X-Iceberg-Access-Delegation header by default
https://github.com/apache/iceberg-python/blob/eca98707f30bdaf1ffd19f039058cc7e18c5f9cb/pyiceberg/catalog/rest.py#L535

Everything works after I comment that out

@kevinjqliu
Copy link
Contributor

Seems to me that PyIceberg should set the X-Iceberg-Access-Delegation header only when accessing the underlying table's data.

@kevinjqliu
Copy link
Contributor

Reading the spec for X-Iceberg-Access-Delegation
https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L1488-L1512

It sounds like clients are free to send that header on requests, as a way to signal the server of its capabilities.

PyIceberg got configured to send the X-Iceberg-Access-Delegation header which is interpreted as requesting credential-vending, which requires TABLE_WRITE_DATA

I'm interpreting the header as a signal for credential vending or remote signing capabilities. Perhaps the Polaris permission model is making the assumption that setting the header means requesting for table access.

@kevinjqliu
Copy link
Contributor

I guess the ultimate question is, what permission is required to run a create_table statement?
create_table operation does two things; it writes an entry to the metadata store and also writes a metadata JSON file to the metadata location

@dennishuo
Copy link
Contributor

dennishuo commented Aug 9, 2024

@kevinjqliu Good question! This is an area that's probably worth adding more explicit documentation for. The ground truth for privilege requirements can be found in PolarisAuthorizableOperation.java:

  CREATE_TABLE_DIRECT(TABLE_CREATE),
  CREATE_TABLE_DIRECT_WITH_WRITE_DELEGATION(EnumSet.of(TABLE_CREATE, TABLE_WRITE_DATA)),

And the relationship between direct privileges vs inheritance through "super-privileges" can be found in PolarisAuthorizer.java:

SUPER_PRIVILEGES.putAll(
    TABLE_CREATE,
    List.of(
        CATALOG_MANAGE_CONTENT, CATALOG_MANAGE_METADATA, TABLE_CREATE, TABLE_FULL_METADATA));
....
SUPER_PRIVILEGES.putAll(TABLE_WRITE_DATA, List.of(CATALOG_MANAGE_CONTENT, TABLE_WRITE_DATA));

So you need only TABLE_CREATE "or better" to createTable without setting X-Iceberg-Access-Delegation.

And you need both TABLE_CREATE and TABLE_WRITE ("or better") to createTable if you do set X-Iceberg-Access-Delegation.

In both cases Polaris will be able to write to the metadata store and write the metadata JSON file itself, but the difference is that if you specify X-Iceberg-Access-Delegation, Polaris will also return a storage credential directly in the createTable response, the same way it would for loadTable. This is because in some scenarios when Spark creates the Iceberg table, it will rely on the credential in the response to then subsequently perform further actions on the table, instead of doing a separate loadTable request after creation to obtain storage credentials.

@kevinjqliu
Copy link
Contributor

@dennishuo thanks for the explanation.

This is because in some scenarios when Spark creates the Iceberg table, it will rely on the credential in the response to then subsequently perform further actions on the table, instead of doing a separate loadTable request after creation to obtain storage credentials.

This is an interesting behavior. In this case, Polaris must vend credentials with the createTable response for the metadata JSON file, which requires additional permissions.
I found the branching logic for with/without the X-Iceberg-Access-Delegation header in the createTable function

...

Regarding the original error message. I confirmed that setting X-Iceberg-Access-Delegation in Spark also causes the same error, so the behavior is consistent across Spark and PyIceberg.

@kevinjqliu
Copy link
Contributor

kevinjqliu commented Aug 9, 2024

(Looks like CREATE_TABLE_DIRECT_WITH_WRITE_DELEGATION is a new change from #27)

To summarize, this boils down to the set of permissions the ROOT principal has by default.
The ROOT principal has TABLE_CREATE permission but not TABLE_WRITE_DATA permission, which means only createTable without X-Iceberg-Access-Delegation is allowed.

ROOT permissions

ROOT is granted the following permissions

But not CATALOG_MANAGE_CONTENT permission, as you mentioned above.

TABLE_CREATE is allowed since ROOT has CATALOG_MANAGE_METADATA permission.
https://github.com/polaris-catalog/polaris/blob/be5c29af3332e2a77a3432661626cd3f443d5f4a/polaris-core/src/main/java/io/polaris/core/auth/PolarisAuthorizer.java#L136-L139

TABLE_WRITE_DATA is not allowed since ROOT has neither CATALOG_MANAGE_CONTENT or TABLE_WRITE_DATA permissions.
https://github.com/polaris-catalog/polaris/blob/be5c29af3332e2a77a3432661626cd3f443d5f4a/polaris-core/src/main/java/io/polaris/core/auth/PolarisAuthorizer.java#L238

(I'm probably butchering the semantics of privileges/permissions, but the gist is there :) )

@kevinjqliu
Copy link
Contributor

From an access control perspective, we probably don't want to grant ROOT with CATALOG_MANAGE_CONTENT permission by default, right?

@djouallah
Copy link
Author

nice it seems pyiceberg deployed a fix, who thoughts having multiple catalog implementation is a good thing !!!
apache/iceberg-python#1096

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

6 participants