feat: Experimental credential_provider argument for scan_parquet #19271

Merged (10 commits) on Oct 17, 2024

Conversation


@nameexhaustion nameexhaustion commented Oct 17, 2024

ref Allow Overriding Object Store Credential Provider #18979

Adds an experimental credential_provider argument to scan_parquet, which can accept a callable:

def credential_provider() -> tuple[dict[str, str | None], int | None]

Here the dictionary contains the credentials, and the optional second value is the expiry time in seconds since the Unix epoch, which lets us avoid calling the function again until the credentials have expired (if it is None, the credentials are treated as never expiring).

This enables using custom credential provisioning logic with cloud I/O scans, which allows for handling credential expiry.
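
As a rough illustration of the return contract described above, the sketch below pairs a provider with a small caching wrapper that calls it again only once the reported expiry has passed. The names `static_provider` and `CachedCredentials` are hypothetical, the credential values are placeholders, and this is not how Polars implements the caching internally; it only shows one way the `(credentials, expiry)` tuple can be interpreted.

```python
from __future__ import annotations

import time
from typing import Callable, Optional

CredentialProviderResult = tuple[dict[str, Optional[str]], Optional[int]]


def static_provider() -> CredentialProviderResult:
    # Hypothetical provider: placeholder values, reported as valid for one hour.
    expiry = int(time.time()) + 3600
    return {"aws_access_key_id": "EXAMPLE", "aws_secret_access_key": "EXAMPLE"}, expiry


class CachedCredentials:
    """Calls `provider` again only after the previous result has expired."""

    def __init__(
        self,
        provider: Callable[[], CredentialProviderResult],
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._provider = provider
        self._clock = clock
        self._creds: Optional[dict[str, Optional[str]]] = None
        self._expiry: Optional[int] = None

    def get(self) -> dict[str, Optional[str]]:
        # An expiry of None means the credentials never expire.
        expired = self._expiry is not None and self._clock() >= self._expiry
        if self._creds is None or expired:
            self._creds, self._expiry = self._provider()
        return self._creds
```

With this interpretation, a provider that returns `None` as the second value is called exactly once, while one that returns a timestamp is called again the first time credentials are requested at or after that timestamp.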

Example for AWS

import polars as pl


def get_credentials():
    import boto3

    session = boto3.Session(profile_name="profile2")
    creds = session.get_credentials()

    # No expiry is returned (None), so these credentials are
    # treated as never expiring.
    return {
        "aws_access_key_id": creds.access_key,
        "aws_secret_access_key": creds.secret_key,
        "aws_session_token": creds.token,
    }, None


lf = pl.scan_parquet(
    "s3://...",
    credential_provider=get_credentials,
)

Example for GCP

def get_credentials():
    from pathlib import Path

    import google.auth
    import google.auth.transport.requests
    import zoneinfo

    creds, _ = google.auth.load_credentials_from_file(
        Path.home() / ".config/gcloud/application_default_credentials.json"
    )

    auth_req = google.auth.transport.requests.Request()
    creds.refresh(auth_req)

    return {"bearer_token": creds.token}, int(
        creds.expiry.replace(
            # Google auth does not set this properly
            tzinfo=zoneinfo.ZoneInfo("UTC")
        ).timestamp()
    )


lf = pl.scan_parquet(
    "gs://...",
    credential_provider=get_credentials,
)

Azure

I don't have an Azure environment to test on, but the returned dictionary should contain { 'bearer_token': '...' }
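
One possible shape for an Azure provider, equally untested against a real Azure environment and assuming the `azure-identity` package, is to request a storage-scoped token and return it together with its expiry. `make_azure_provider` is a hypothetical helper name; `get_token`, `.token` and `.expires_on` are taken from azure-identity's token credential interface, and the storage scope string is an assumption about which resource the token should target.

```python
def make_azure_provider(credential=None, scope="https://storage.azure.com/.default"):
    """Build a zero-argument provider around an azure-identity style credential."""
    if credential is None:
        # Assumes the azure-identity package is installed.
        from azure.identity import DefaultAzureCredential

        credential = DefaultAzureCredential()

    def get_credentials():
        token = credential.get_token(scope)
        # expires_on is already expressed in seconds since the Unix epoch.
        return {"bearer_token": token.token}, int(token.expires_on)

    return get_credentials
```

The resulting callable would be passed as `credential_provider=make_azure_provider()`, in the same way as the AWS and GCP examples.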

Todos for follow-up PRs

  • Add the option to other cloud-enabled scan_/read_ functions
  • Add and link to a documentation page on the function return formats for different cloud types
  • Use an automatic default function for when the parameter is not specified

@nameexhaustion nameexhaustion changed the title c feat: Experimental credential_provider argument for scan_parquet Oct 17, 2024
@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars and removed title needs formatting labels Oct 17, 2024

#[derive(Debug)]
pub struct PythonFunction(pub PyObject);

I have moved PythonFunction from polars-plan/../python_udf.rs to polars-utils for re-use

codecov bot commented Oct 17, 2024

Codecov Report

Attention: Patch coverage is 25.15337% with 366 lines in your changes missing coverage. Please review.

Project coverage is 80.00%. Comparing base (7472a76) to head (0cecce0).
Report is 31 commits behind head on main.

Files with missing lines                            Patch %   Lines
crates/polars-io/src/cloud/credential_provider.rs   10.07%    339 Missing ⚠️
crates/polars-io/src/cloud/options.rs               51.85%    13 Missing ⚠️
crates/polars-utils/src/python_function.rs          84.61%    10 Missing ⚠️
py-polars/polars/io/parquet/functions.py            0.00%     2 Missing and 1 partial ⚠️
crates/polars-plan/src/dsl/expr_dyn_fn.rs           50.00%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #19271      +/-   ##
==========================================
- Coverage   80.11%   80.00%   -0.11%     
==========================================
  Files        1526     1528       +2     
  Lines      209338   209740     +402     
  Branches     2418     2419       +1     
==========================================
+ Hits       167707   167799      +92     
- Misses      41081    41390     +309     
- Partials      550      551       +1     


@ritchie46 ritchie46 merged commit 93f8902 into pola-rs:main Oct 17, 2024
29 checks passed
@c-peters c-peters added the accepted Ready for implementation label Oct 21, 2024
@nameexhaustion nameexhaustion deleted the credential-provider branch October 28, 2024 04:53
Labels
accepted Ready for implementation enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars
3 participants