Mismatch in key names between fsspec and object_store breaks pyarrow usage with AWS S3 #21076

Open
2 tasks done
lcorcodilos opened this issue Feb 4, 2025 · 1 comment
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

Comments

@lcorcodilos

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
import boto3

session = boto3.session.Session(profile_name="my-profile")
credentials = session.get_credentials().get_frozen_credentials()
storage_options = {
    "aws_access_key_id": credentials.access_key,
    "aws_secret_access_key": credentials.secret_key,
    "aws_session_token": credentials.token,
    "region_name": session.region_name,
}

pl.read_parquet(
    "s3://my-bucket/path/to/file.parquet",
    storage_options=storage_options,
    use_pyarrow=True,
)

Log output

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Projects/my-project/.venv/lib/python3.11/site-packages/polars/_utils/deprecation.py", line 92, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Projects/my-project/.venv/lib/python3.11/site-packages/polars/_utils/deprecation.py", line 92, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Projects/my-project/.venv/lib/python3.11/site-packages/polars/io/parquet/functions.py", line 203, in read_parquet
    return _read_parquet_with_pyarrow(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Projects/my-project/.venv/lib/python3.11/site-packages/polars/io/parquet/functions.py", line 278, in _read_parquet_with_pyarrow
    with prepare_file_arg(
         ^^^^^^^^^^^^^^^^^
  File "/Projects/my-project/.venv/lib/python3.11/site-packages/polars/io/_utils.py", line 232, in prepare_file_arg
    return fsspec.open(file, **storage_options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Projects/my-project/.venv/lib/python3.11/site-packages/fsspec/core.py", line 452, in open
    out = open_files(
          ^^^^^^^^^^^
  File "/Projects/my-project/.venv/lib/python3.11/site-packages/fsspec/core.py", line 280, in open_files
    fs, fs_token, paths = get_fs_token_paths(
                          ^^^^^^^^^^^^^^^^^^^
  File "/Projects/my-project/.venv/lib/python3.11/site-packages/fsspec/core.py", line 633, in get_fs_token_paths
    paths = expand_paths_if_needed(paths, mode, num, fs, name_function)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Projects/my-project/.venv/lib/python3.11/site-packages/fsspec/core.py", line 563, in expand_paths_if_needed
    expanded_paths.extend(fs.glob(curr_path))
                          ^^^^^^^^^^^^^^^^^^
  File "/Projects/my-project/.venv/lib/python3.11/site-packages/fsspec/asyn.py", line 118, in wrapper
    return sync(self.loop, func, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Projects/my-project/.venv/lib/python3.11/site-packages/fsspec/asyn.py", line 103, in sync
    raise return_result
  File "/Projects/my-project/.venv/lib/python3.11/site-packages/fsspec/asyn.py", line 56, in _runner
    result[0] = await coro
                ^^^^^^^^^^
  File "/Projects/my-project/.venv/lib/python3.11/site-packages/s3fs/core.py", line 785, in _glob
    return await super()._glob(path, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Projects/my-project/.venv/lib/python3.11/site-packages/fsspec/asyn.py", line 775, in _glob
    allpaths = await self._find(
               ^^^^^^^^^^^^^^^^^
  File "/Projects/my-project/.venv/lib/python3.11/site-packages/s3fs/core.py", line 813, in _find
    return await super()._find(
           ^^^^^^^^^^^^^^^^^^^^
  File "/Projects/my-project/.venv/lib/python3.11/site-packages/fsspec/asyn.py", line 841, in _find
    if withdirs and path != "" and await self._isdir(path):
                                   ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Projects/my-project/.venv/lib/python3.11/site-packages/s3fs/core.py", line 1411, in _isdir
    return bool(await self._lsdir(path))
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Projects/my-project/.venv/lib/python3.11/site-packages/s3fs/core.py", line 706, in _lsdir
    async for c in self._iterdir(
  File "/Projects/my-project/.venv/lib/python3.11/site-packages/s3fs/core.py", line 737, in _iterdir
    await self.set_session()
  File "/Projects/my-project/.venv/lib/python3.11/site-packages/s3fs/core.py", line 502, in set_session
    self.session = aiobotocore.session.AioSession(**self.kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: AioSession.__init__() got an unexpected keyword argument 'aws_access_key_id'

Issue description

When using use_pyarrow=True in read_parquet, reading from AWS S3 breaks because object_store and s3fs expect different key names. Specifically, polars passes the provided storage_options straight through as s3fs.S3FileSystem kwargs (1, 2), and those kwargs are named differently from the object_store keys.

These are the primary offending keys.

| object_store | s3fs |
| --- | --- |
| aws_access_key_id / access_key_id | key |
| aws_secret_access_key / secret_access_key | secret |
| aws_session_token / aws_token / session_token / token | token |
| aws_region / region | None |
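
The key mapping above can be written as a small translation helper. This is only a sketch: to_s3fs_options is a hypothetical name and is not part of polars, fsspec, or s3fs. Since s3fs has no top-level region argument, the sketch assumes the region belongs under client_kwargs["region_name"]. It also treats the boto3-style region_name from the reproducible example as a region alias.

```python
from typing import Any

# s3fs.S3FileSystem keyword -> object_store-style aliases (from the table above)
_S3FS_ALIASES = {
    "key": ("aws_access_key_id", "access_key_id"),
    "secret": ("aws_secret_access_key", "secret_access_key"),
    "token": ("aws_session_token", "aws_token", "session_token", "token"),
}
# "region_name" is an assumption: it is the key used in the reproducible example.
_REGION_ALIASES = ("aws_region", "region", "region_name")


def to_s3fs_options(storage_options: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: translate object_store-style keys to s3fs-style keys."""
    remaining = dict(storage_options)
    client_kwargs = dict(remaining.pop("client_kwargs", {}))
    translated: dict[str, Any] = {}

    for s3fs_key, aliases in _S3FS_ALIASES.items():
        # Remove every alias; the first one found wins.
        values = [remaining.pop(alias) for alias in aliases if alias in remaining]
        if values:
            translated[s3fs_key] = values[0]

    # No top-level s3fs equivalent, so the region goes into client_kwargs
    # unless the caller already set one there.
    regions = [remaining.pop(alias) for alias in _REGION_ALIASES if alias in remaining]
    if regions:
        client_kwargs.setdefault("region_name", regions[0])

    if client_kwargs:
        translated["client_kwargs"] = client_kwargs

    # Keys that are not in the table pass through unchanged.
    return {**remaining, **translated}
```

The result could then be passed along as fsspec.open(file, **to_s3fs_options(storage_options)). Whether polars should do this translation itself is a question for the maintainers.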

A temporary workaround is to wrap the intended storage_options with client_kwargs like so:

storage_options = {
    "client_kwargs": {
        "aws_access_key_id": credentials.access_key,
        "aws_secret_access_key": credentials.secret_key,
        "aws_session_token": credentials.token,
        "region_name": session.region_name,
    }
}

This keeps the intended storage_options (aws_*) from being loaded by s3fs as kwargs into aiobotocore.session.AioSession.

A more permanent fix that restores the expected behavior could be made at (1) and (2), like so:

# storage_options["encoding"] = encoding -- commenting out just to indicate deletion
return fsspec.open(file, encoding=encoding, client_kwargs=storage_options)

I don't know enough about fsspec to understand the broader impact beyond AWS S3. My gut tells me this could break other cloud providers, so an AWS check may be needed, with the S3 branch doing something like:

storage_options = {"encoding": encoding, "client_kwargs": storage_options}
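
The AWS check suggested above could look roughly like the following. This is a sketch under assumptions, not the actual polars code: build_fsspec_kwargs is a hypothetical helper, and it assumes S3 paths can be recognized by an s3:// or s3a:// prefix.

```python
from typing import Any, Optional


def build_fsspec_kwargs(
    file: str,
    encoding: Optional[str],
    storage_options: Optional[dict[str, Any]],
) -> dict[str, Any]:
    """Hypothetical helper: build the kwargs for fsspec.open(file, **kwargs)."""
    options = dict(storage_options or {})

    # Assumption: S3 targets are identified purely by the URL scheme.
    if file.startswith(("s3://", "s3a://")):
        # Wrap everything so s3fs does not forward aws_* keys to AioSession.
        return {"encoding": encoding, "client_kwargs": options}

    # Other backends keep today's behavior: options are passed through flat.
    return {**options, "encoding": encoding}
```

One limitation of this approach is that it moves every user-supplied option under client_kwargs for S3 paths, including any that were meant as top-level s3fs arguments.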

Expected behavior

The reproducible example runs without issue and loads the parquet from AWS S3.

Installed versions

--------Version info---------
Polars:              1.21.0
Index type:          UInt32
Platform:            macOS-15.3-arm64-arm-64bit
Python:              3.11.10 (main, Sep  7 2024, 01:03:31) [Clang 16.0.0 (clang-1600.0.26.3)]
LTS CPU:             False

----Optional dependencies----
Azure CLI            <not installed>
adbc_driver_manager  <not installed>
altair               <not installed>
azure.identity       <not installed>
boto3                1.28.64
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               2023.10.0
gevent               <not installed>
google.auth          <not installed>
great_tables         <not installed>
matplotlib           <not installed>
numpy                2.2.2
openpyxl             <not installed>
pandas               2.2.3
pyarrow              19.0.0
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>
@lcorcodilos lcorcodilos added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Feb 4, 2025
@lcorcodilos
Author

Created in response to discussion in #17864. Tagging @nameexhaustion for visibility. Thanks!
