Can't read a Delta table from Azure Unity Catalog #1628

Open
MigQ2 opened this issue Sep 14, 2023 · 11 comments
Labels
binding/rust, enhancement, help wanted

Comments

MigQ2 commented Sep 14, 2023

Environment

  • Linux
  • python 3.10.10
  • deltalake==0.10.2

Environment:

  • Cloud provider: Azure Databricks

Bug

What happened:

I am trying to replicate this example from the documentation to read a Delta Table from Databricks Unity Catalog:

from deltalake import DataCatalog, DeltaTable

catalog_name = 'main'
schema_name = 'db_schema'
table_name = 'db_table'
data_catalog = DataCatalog.UNITY
dt = DeltaTable.from_data_catalog(
    data_catalog=data_catalog,
    data_catalog_id=catalog_name,
    database_name=schema_name,
    table_name=table_name,
)

but I get the following error:

OSError: Generic MicrosoftAzure error: Error performing token request: response error "request error", after 10 retries: error sending request for url (http://<SOME-IP-ADDRESS>/metadata/identity/oauth2/token?api-version=2019-08-01&resource=https%3A%2F%2Fstorage.azure.com): error trying to connect: tcp connect error: Connection refused (os error 111)

Stacktrace:

/home/vscode/.local/lib/python3.10/site-packages/deltalake/table.py:285 in from_data_catalog

    282             database_name=database_name,
    283             table_name=table_name,
    284         )
  ❱ 285         return cls(
    286             table_uri=table_uri, version=version, log_buffer_size=log_buffer_size
    287         )
    288

/home/vscode/.local/lib/python3.10/site-packages/deltalake/table.py:246 in __init__

    243
    244         """
    245         self._storage_options = storage_options
  ❱ 246         self._table = RawDeltaTable(
    247             str(table_uri),
    248             version=version,
    249             storage_options=storage_options,

What you expected to happen:

I wish I could read the Delta Table

More details:

  • I can read from the storage account where the data is located using other libraries in the same Python interpreter, so I don't think it's a firewall problem.
  • The same host and token work perfectly fine in the same interpreter to read data from the same Unity Catalog table using databricks-connect, so the URL and token are valid.
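
For reference, a minimal sketch of how the host and token are usually handed to deltalake for the Unity Catalog lookup. It assumes the DATABRICKS_WORKSPACE_URL and DATABRICKS_ACCESS_TOKEN environment variables from the deltalake documentation example; the helper names `configure_unity_env` and `read_unity_table` and the placeholder values are hypothetical, not part of the library.

```python
import os


def configure_unity_env(workspace_url: str, access_token: str) -> None:
    """Export the settings the Unity Catalog client is assumed to read."""
    os.environ["DATABRICKS_WORKSPACE_URL"] = workspace_url
    os.environ["DATABRICKS_ACCESS_TOKEN"] = access_token


def read_unity_table(catalog: str, schema: str, table: str):
    """Resolve the table location through Unity Catalog and open the table."""
    # Imported lazily so configure_unity_env works without deltalake installed.
    from deltalake import DataCatalog, DeltaTable

    return DeltaTable.from_data_catalog(
        data_catalog=DataCatalog.UNITY,
        data_catalog_id=catalog,
        database_name=schema,
        table_name=table,
    )


# Hypothetical usage (placeholders, not real values):
# configure_unity_env("https://<workspace-host>", "<access-token>")
# dt = read_unity_table("main", "db_schema", "db_table")
```

Note that these two settings only cover the catalog lookup. Reading the data files still needs storage credentials, which is where the error above comes from.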
MigQ2 added the bug label Sep 14, 2023
Member

rtyler commented Sep 15, 2023

> I wish I could read the Delta Table

😆 me too

The Unity support in delta-rs is young, I would say. I have access to a Unity environment, but not an Azure-specific Databricks+Unity environment. I'm honestly not sure where to start here. I assume the URL that was spit out to you points at a legitimate hostname that would otherwise respond to connections from wherever you are running this Python code?

Contributor

r3stl355 commented Oct 8, 2023

It looks like the current implementation works for retrieving the storage location, but data access requires additional credentials. (In addition to Azure I also tried AWS: similar story.)

I suspect this could work if the application runs on a cloud VM with sufficient rights, but I didn't test that. (That <SOME-IP-ADDRESS> is 169.254.169.254, right? That's a special IP address usually used on cloud VMs to retrieve instance metadata. This hints that credentials with sufficient rights are not available when the code tries to access the data, so it tries to obtain some via instance metadata.)
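
A quick way to check that theory is to test whether the instance metadata address accepts TCP connections from the machine running the code. This is only a sketch using the Python standard library; the `can_connect` helper is hypothetical, and port 80 is taken from the plain http:// URL in the error message.

```python
import socket

IMDS_HOST = "169.254.169.254"  # link-local instance metadata address


def can_connect(host: str = IMDS_HOST, port: int = 80, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable networks and timeouts.
        return False
```

If `can_connect()` returns False, the process has no metadata service to ask for a token, which matches the "Connection refused" seen outside a cloud VM.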

In addition to being a metadata provider, Unity on Databricks also acts as an access token provider so it can enforce ACLs, etc. Using the same pattern on local/non-Databricks compute would provide a similar experience, but I don't know if that's achievable at the moment (or ever will be).

A possible quick fix could be providing additional credentials that allow access to the storage managed by UC. For example, when I set AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_SESSION_TOKEN in my environment variables for an AWS account that can read from that S3 location, it works on AWS. (Well, it resolves that error, but I then get "The table's minimum reader version is 2 but deltalake only supports up to version 1" when I call to_pyarrow_table, but that's a different story.)

I guess this workaround may also work in Azure with the right secret/key/token/...
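
On Azure, the workaround might look roughly like the sketch below, which passes explicit storage credentials through the storage_options argument that DeltaTable accepts. Everything specific here is an assumption: the environment variable names, the option keys (`azure_storage_account_name`, `azure_storage_sas_token`, `azure_storage_account_key`, as I understand the underlying object store configuration), and the `azure_storage_options` and `open_table` helpers are illustrative and untested against a real Unity Catalog table.

```python
import os
from typing import Mapping, Optional


def azure_storage_options(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect Azure storage credentials from the environment.

    Variable names and option keys are assumptions for illustration.
    """
    env = os.environ if env is None else env
    options = {"azure_storage_account_name": env["AZURE_STORAGE_ACCOUNT_NAME"]}
    # Prefer a SAS token when present, otherwise fall back to an account key.
    if env.get("AZURE_STORAGE_SAS_TOKEN"):
        options["azure_storage_sas_token"] = env["AZURE_STORAGE_SAS_TOKEN"]
    elif env.get("AZURE_STORAGE_ACCOUNT_KEY"):
        options["azure_storage_account_key"] = env["AZURE_STORAGE_ACCOUNT_KEY"]
    else:
        raise KeyError("no Azure storage credential found in the environment")
    return options


def open_table(table_uri: str, options: dict):
    """Open a Delta table directly by its storage URI with explicit credentials."""
    # Imported lazily so the helper above can be used without deltalake installed.
    from deltalake import DeltaTable

    return DeltaTable(table_uri, storage_options=options)
```

This bypasses Unity Catalog for data access entirely, so it only helps when someone is willing to hand out credentials for the underlying storage account.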

Contributor

r3stl355 commented Oct 9, 2023

Actually, this looks like expected behavior, as mentioned in #1331 (comment)

Member

rtyler commented Oct 9, 2023

@r3stl355 This is a topic I have recently discussed with @MrPowers and some of the Databricks team. I don't have a great solution to offer at the moment other than "we're working on figuring this out" 😄

rtyler added the binding/rust and on-hold labels Oct 9, 2023
Contributor

r3stl355 commented Oct 9, 2023

@rtyler maybe you could include me in those future conversations, given I work for Databricks atm 😁

@ion-elgreco
Collaborator

> It looks like the current implementation works for retrieving the storage location, but data access requires additional credentials. (In addition to Azure I also tried AWS: similar story.)
>
> I suspect this could work if the application runs on a cloud VM with sufficient rights, but I didn't test that. (That <SOME-IP-ADDRESS> is 169.254.169.254, right? That's a special IP address usually used on cloud VMs to retrieve instance metadata. This hints that credentials with sufficient rights are not available when the code tries to access the data, so it tries to obtain some via instance metadata.)
>
> In addition to being a metadata provider, Unity on Databricks also acts as an access token provider so it can enforce ACLs, etc. Using the same pattern on local/non-Databricks compute would provide a similar experience, but I don't know if that's achievable at the moment (or ever will be).
>
> A possible quick fix could be providing additional credentials that allow access to the storage managed by UC. For example, when I set AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_SESSION_TOKEN in my environment variables for an AWS account that can read from that S3 location, it works on AWS. (Well, it resolves that error, but I then get "The table's minimum reader version is 2 but deltalake only supports up to version 1" when I call to_pyarrow_table, but that's a different story.)
>
> I guess this workaround may also work in Azure with the right secret/key/token/...

Unity Catalog is becoming a huge roadblock in my org to using Delta-RS broadly beyond internal team use. No one wants to hand out read credentials for the storage anymore, which rules out Delta-RS in this context. Besides the possible vendor lock-in 😄, it makes interoperability with Databricks less than ideal: for any data reads we currently fall back to the databricks-sql connector.

@davidvesp

I have the same problem:

OSError: Generic MicrosoftAzure error: Error performing token request: response error "request error", after 10 retries: error sending request for url (http://169.254.169.254/metadata/identity/oauth2/token?api-version=2019-08-01&resource=https%3A%2F%2Fstorage.azure.com): error trying to connect: tcp connect error: A socket operation was attempted to an unreachable network. (os error 10051)

The 169.254.169.254 is used to retrieve the authentication token
https://learn.microsoft.com/en-us/azure/active-directory/managed-identities-azure-resources/how-to-use-vm-token#get-a-token-using-http
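
To make that concrete, here is a sketch of the request the library appears to be sending, reconstructed from the URL in the error message. The `build_imds_token_request` helper is hypothetical; the Metadata: true header comes from the Microsoft page linked above, and nothing is actually sent here.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_imds_token_request(resource: str = "https://storage.azure.com") -> Request:
    """Build (but do not send) a managed-identity token request."""
    query = urlencode({"api-version": "2019-08-01", "resource": resource})
    url = f"http://169.254.169.254/metadata/identity/oauth2/token?{query}"
    # The metadata service requires this header on every request.
    return Request(url, headers={"Metadata": "true"})
```

The endpoint is only reachable from inside an Azure VM or similar managed compute, so from a laptop the connection fails before any token logic runs.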

But I don't understand why this is needed, as the Databricks documentation says we need to get a short-lived token and a signed URL:
[Image: screenshot from the Databricks documentation]

@ion-elgreco
Collaborator

> I have the same problem:
>
> OSError: Generic MicrosoftAzure error: Error performing token request: response error "request error", after 10 retries: error sending request for url (http://169.254.169.254/metadata/identity/oauth2/token?api-version=2019-08-01&resource=https%3A%2F%2Fstorage.azure.com): error trying to connect: tcp connect error: A socket operation was attempted to an unreachable network. (os error 10051)
>
> The 169.254.169.254 is used to retrieve the authentication token
> https://learn.microsoft.com/en-us/azure/active-directory/managed-identities-azure-resources/how-to-use-vm-token#get-a-token-using-http
>
> But I don't understand why this is needed, as the Databricks documentation says we need to get a short-lived token and a signed URL:
> [Image: screenshot from the Databricks documentation]

Interesting, so UC by design hands out a token to read the data from storage. Then this token should simply be returned when you query the Databricks REST API to get the table.
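
If the catalog does vend such a token, the remaining step would be translating it into storage options for the reader. The sketch below is purely illustrative: the response shape (`azure_user_delegation_sas.sas_token`, `aws_temp_credentials.*`) is my assumption of what a Unity Catalog temporary table credentials response looks like, the option keys are assumed object store settings, and `storage_options_from_vended` is a hypothetical helper, not something delta-rs provides.

```python
def storage_options_from_vended(creds: dict) -> dict:
    """Map an (assumed) vended-credentials payload to storage options."""
    if "azure_user_delegation_sas" in creds:
        # Short-lived SAS token scoped to the table location.
        sas = creds["azure_user_delegation_sas"]["sas_token"]
        return {"azure_storage_sas_token": sas}
    if "aws_temp_credentials" in creds:
        aws = creds["aws_temp_credentials"]
        return {
            "aws_access_key_id": aws["access_key_id"],
            "aws_secret_access_key": aws["secret_access_key"],
            "aws_session_token": aws["session_token"],
        }
    raise ValueError("unsupported or empty credential payload")
```

On Azure the storage account name would still have to be supplied, for example parsed from the table's storage location, and the token would need refreshing once it expires.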

ion-elgreco self-assigned this Jun 7, 2024
@tunayokumus

Hi @ion-elgreco, is this a good time to address this again, now that Unity Catalog OSS version 0.2.0 is released with credential vending support? Does this make it easier/clearer to implement?

https://github.com/unitycatalog/unitycatalog/releases/tag/v0.2.0

@ion-elgreco
Collaborator

> Hi @ion-elgreco, is this a good time to address this again, now that Unity Catalog OSS version 0.2.0 is released with credential vending support? Does this make it easier/clearer to implement?
>
> https://github.com/unitycatalog/unitycatalog/releases/tag/v0.2.0

Sure, feel free to take a stab at it

ion-elgreco removed their assignment Nov 12, 2024
ion-elgreco added the help wanted and enhancement labels and removed the on-hold and bug labels Dec 7, 2024
@tunayokumus

> > Hi @ion-elgreco, is this a good time to address this again, now that Unity Catalog OSS version 0.2.0 is released with credential vending support? Does this make it easier/clearer to implement?
> > https://github.com/unitycatalog/unitycatalog/releases/tag/v0.2.0
>
> Sure, feel free to take a stab at it

Well, this could indeed be a good excuse for me to learn Rust, so I might do that.
But I'm not yet sure whether credential vending is the key enabler here. I recently saw the issue you created on the unity-catalog-python repo: unitycatalog/unitycatalog-python#4
Is it related to this issue?
