Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for generating Azure Blob Storage signed URLs #22

Merged
merged 6 commits into from
Dec 29, 2023

Conversation

tdikland
Copy link
Contributor

This PR introduces support for sharing Delta Lake tables stored on Azure Blob Storage.

I've marked the PR as WIP, since I feel there are still a couple of open questions that we may want to address in this or subsequent issues.

  1. Is it okay to change the signature of the function signing the urls to be async?
    The crate that I used to call the Azure SDK for building a pre-signed URL has an async method to do so. I figured this was okay, since all route handlers are async anyway.
  2. What should we do if there are no credentials available to sign the urls?
    Right now the urls are passed through to the client unsigned. We may want to present an error instead.
  3. In Azure there are multiple possibilities for authentication to Azure Blob Storage. I've chosen to implement only the Shared Key method, since that brings this crate at parity with the scala implementation

On a higher level:

  1. I think it is a good idea to bring in a crate that can handle signing URLs for the different cloud object stores where the Delta Table lives. To this end I've published a crate that I used in some other personal projects to do exactly this https://crates.io/crates/cloud-file-signer. This crate defines a CloudFileSigner trait and implements it for a couple of popular object stores (S3, Azure Blob Storage, GCS) and could be extended to other stores (R2, MinIO, etc.)

Looking forward to some feedback!

Closes #18

@tdikland tdikland marked this pull request as draft December 14, 2023 15:36
@ognis1205
Copy link
Contributor

Hi Tim, thank you for submitting this PR!

I haven't delved into the details of the PR yet, but it looks great. That aside, I like your idea of the Delta Sharing Server SDK, the delta-sharing-server-rs project. Since the Delta Sharing protocol is defined through REST APIs, there may be some considerations to keep in mind; nevertheless, the idea is interesting.

As for the cloud-file-signer crate, I believe it would be more beneficial to expand it into a larger project, similar to the delta-sharing-server-rs. If your contribution here serves any purpose, please feel free to incorporate it.

@ognis1205
Copy link
Contributor

ognis1205 commented Dec 14, 2023

@tdikland

Regarding the open questions, here are my answers:

  1. Yes, it is totally fine, and it would be better to do so.
  2. According to the official protocol specification, there is no explicit description. However, I think we should return a 500 error and trace the error.
  3. For now, I would not like to introduce complexity, so I really appreciate your implementation that aligns with the reference server implementation. When we want to add further options, we should create another issue.

@tdikland
Copy link
Contributor Author

Thanks for your quick reaction and enthusiasm @ognis1205.

Regarding the delta-sharing-server-rs project, I think it would be great for users to have both the lib that implements the delta sharing protocol and the batteries included version (which this crate is currently focusing on). The user journey that I'm thinking about is a user starts by deploying the reference implementation of the delta sharing server and when customisation is needed they can take the lib and plugin custom auth, retry, other middleware, different database etc. It may be nice to support this user journey with this project. I will open a new issue to discuss this in more depth.

Regarding the support for the underlying object stores (i.e. AWS, Azure, GCP). Thanks for your insights on my questions. I will be iterating on the implementation and let you know when it's ready for review.

@ognis1205
Copy link
Contributor

ognis1205 commented Dec 15, 2023

I am very grateful for your contribution @tdikland !

This is a quick review of the current commit. Regarding the 'cloud-file-signer' crate, I believe there is operational experience in your project, but I think it still needs a bit more incubation. Therefore, I believe it's still early to use the crate. Since this project itself is also in incubation, considering the future development speed, I would prefer to have the implementation of signed URLs for each cloud provider done here as much as possible. The experience gained here can then be reflected in 'cloud-file-signer,' and eventually, I would like to use the crate in 'delta-sharing-rs' as well.

Please do not remove utilities::signed_url (or rename it utilities::cloud_signer). Instead, can we consider implementing factory methods to instantiate the cloud signer instance from the credentials of each provider (similar to what is done in the bootstrap module)? I would like a factory method to be added to the struct Utility in utilities::signed_url (or utilities::cloud_signer) module. The reason is that having information in the form of logging, indicating which provider's credentials were loaded during server bootstrap and the design and the implementation would be more consistent with other modules. What do you think?

@ognis1205
Copy link
Contributor

ognis1205 commented Dec 15, 2023

@tdikland

Regarding the 'cloud-file-signer' crate, even if, for instance, we are not currently using the crate, I am considering stating in the README that we anticipate potential future use of it as a related project.

@tdikland
Copy link
Contributor Author

Thanks for the feedback. I agree it is not yet desirable to depend on the cloud-file-signer crate. I was experimenting with it and accidentally pushed it to this PR (sorry!). I'll give a heads-up once it is ready for review!

This reverts commit 16ac210.
@ognis1205
Copy link
Contributor

You don't have to apologize for it at all! I apologize if I have caused any disruption to the development. We should have prepared a more accessible system, such as a Code of Conduct (COC), to make participation easier. It was very helpful that you made various suggestions and asked questions.

@tdikland
Copy link
Contributor Author

Allright, I think I got it down now.

@tdikland tdikland marked this pull request as ready for review December 15, 2023 16:04
@ognis1205
Copy link
Contributor

ognis1205 commented Dec 16, 2023

Thank you for your cooperation @tdikland :)

I believe we have reached a consensus on the general direction of the implementation. It seems there are still some parts that I would like to discuss though, before proceeding to the review, let's go ahead and run some tests for now. @roeap could you please approve the workflow? (I deliberately commented out the #[tokio::test] attribute in some unit tests, but I might have added comments to explain. It's my mistake. Let's discuss it later as well)

When you are ready to review, please remove [WIP] from the title. Or do you prefer to conduct detailed reviews gradually?

@tdikland
Copy link
Contributor Author

Hi @ognis1205, @roeap, I'm ready for reviewing.

@tdikland tdikland changed the title [WIP] Add support for generating Azure Blob Storage signed URLs Add support for generating Azure Blob Storage signed URLs Dec 21, 2023
Copy link
Collaborator

@roeap roeap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is a good idea to bring in a crate that can handle signing URLs for the different cloud object stores where the Delta Table lives. To this end I've published a crate that I used in some other personal projects to do exactly this https://crates.io/crates/cloud-file-signer. This crate defines a CloudFileSigner trait and implements it for a couple of popular object stores (S3, Azure Blob Storage, GCS) and could be extended to other stores (R2, MinIO, etc.)

In the larger rust-arrow ecosystem the community has been developing the object_store crate which abstracts access to a large number of object stores, without depending on all the individual SDKs to avoid bloat.

@tustvold - would adding methods to acquire pre-signed URls something you see in-scope for object store, or would that dilute the focussed API too much?

I did not see the discussion around delta-sharing-server-rs, but in the past @ognis1205 and I had discussed that we want to break this up into smaller crates. and split the different "sub-apis" such that one can easily mix-and-match a server that supports the sharing protocol with just the desired features.

Overall looking great - If @ognis1205 gives his OK, I am happy to merge - knowing that we have some way to go before we can release this project more "officially".

url: String::from(url.as_str()),
}),
"s3" | "s3a" => Ok(Self::Aws),
"abfss" | "abfs" => Ok(Self::Azure),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

had interesting discussions around this in the past :D - essentially in azure there are many ways a scheme could look like, some "official" some de-facto standards (e.g. what fsspec adopted).. Also onelake did not make things less confusing.

Technically abfs[s] refers to a specific implementation of an hdfs compliant driver...

In the end its fine to do this for now, I'll eave some comments later around the general signing question.

@tustvold
Copy link

tustvold commented Dec 21, 2023

acquire pre-signed URls

We already support this for S3. Azure and GCP are more complex but would be happy for contributions in this space - https://docs.rs/object_store/latest/object_store/signer/trait.Signer.html

@ognis1205
Copy link
Contributor

ognis1205 commented Dec 21, 2023

@tdikland @roeap
I will review this PR in the next 12 hours, here in Tokyo, it's around 24:00 midnight.

I did not see the discussion around delta-sharing-server-rs, but in the past @ognis1205 and I had discussed that
we want to break this up into smaller crates. and split the different "sub-apis" such that one can easily mix-and-match
a server that supports the sharing protocol with just the desired features.

Is there any place to discuss this issue? I have some opinions on this since the protocol is defined over REST APIs. I think this project might be helpful:

https://github.com/HeroicKatora/oxide-auth

Depending on the conclusion of our discussion, in some cases, I think it would be acceptable to change the name of this project to something like "delta-sharing-sdk" and have @tdikland and @roeap take the lead in managing the project. This would allow for more efficient development, I believe.

@tustvold
Copy link

FWIW I've created two upstream tickets on object_store for adding support for Azure and GCP, along with links to the relevant docs. I think they should be relatively straightforward if anyone wanted to pick them up:

@roeap
Copy link
Collaborator

roeap commented Dec 22, 2023

@tustvold - i know the azure apis quite well, and signing using keys should (hopefully) be straignforward to leverage the existing signing using the master key ... user delegated are just API cal li think :)

had a quick glance at GCP - might do that second - seems similar, cannonicalize request and sign ...

@roeap
Copy link
Collaborator

roeap commented Dec 29, 2023

@tdikland - just FYI, started to add url signing for azure to object store apache/arrow-rs#5259.

@roeap roeap merged commit ee20de9 into delta-incubator:main Dec 29, 2023
8 checks passed
@tdikland tdikland deleted the feat/azure branch January 2, 2024 10:25
@tdikland
Copy link
Contributor Author

tdikland commented Jan 2, 2024

@roeap @ognis1205 - FYI I have created two new issues to keep track of the discussion and next steps.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Microsoft Azure backed Delta tables
5 participants