S3 region not able to be automatically determined #13940

Closed
2 tasks done
bouianpe opened this issue Jan 24, 2024 · 5 comments
Labels
A-io-cloud (Area: reading/writing to cloud storage); bug (Something isn't working); needs triage (Awaiting prioritization by a maintainer); python (Related to Python Polars)

Comments

@bouianpe

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

df = pl.scan_parquet("s3://REDACTED/test_file.parquet")

Log output

REDACTED\tools\parquet\polars_s3_test.py:8: UserWarning: '(default_)region' not set; polars will try to get it from bucket

Set the region manually to silence this warning.
  df = pl.scan_parquet("s3://REDACTED/test_file.parquet")
Traceback (most recent call last):
  File "REDACTED\tools\parquet\polars_s3_test.py", line 8, in <module>
    df = pl.scan_parquet("s3://REDACTED/test_file.parquet")
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\_applications\Python311\Lib\site-packages\polars\utils\deprecation.py", line 133, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\_applications\Python311\Lib\site-packages\polars\utils\deprecation.py", line 133, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\_applications\Python311\Lib\site-packages\polars\io\parquet\functions.py", line 311, in scan_parquet
    return pl.LazyFrame._scan_parquet(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\_applications\Python311\Lib\site-packages\polars\lazyframe\frame.py", line 455, in _scan_parquet
    self._ldf = PyLazyFrame.new_from_parquet(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ComputeError: Generic S3 error: Missing bucket name

Issue description

S3 region is not being read correctly from .aws/config on our developer machines. All of the machines are set up as below:

.aws/credentials

[default]
aws_access_key_id = REDACTED
aws_secret_access_key = REDACTED

.aws/config

[default]
region=ca-central-1

Expected behavior

The default region should be read from .aws/config correctly.
The current workaround is to pass the region from a boto3 session, which reads it automatically from the same .aws/config:

import boto3
import polars as pl

# boto3 resolves the region from .aws/config; pass it to Polars explicitly.
df = pl.scan_parquet("s3://REDACTED/test_file.parquet", storage_options={"region": boto3.Session().region_name})
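
If boto3 is not available, the same workaround can be approximated with the standard library alone. The following is a minimal sketch, assuming the region lives in a plain AWS config file like the one shown above; `read_aws_region` and `scan_with_config_region` are hypothetical helper names, not part of Polars, and the sketch ignores environment variables and other lookup sources that boto3 also consults.

```python
import configparser
from pathlib import Path
from typing import Optional


def read_aws_region(config_path: Path, profile: str = "default") -> Optional[str]:
    """Return the region set for `profile` in an AWS config file, or None."""
    # interpolation=None keeps '%' characters in values from being misread.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(config_path)  # a missing file simply yields no sections
    # In the AWS config file, non-default profiles use "[profile <name>]".
    section = "default" if profile == "default" else f"profile {profile}"
    if parser.has_option(section, "region"):
        return parser.get(section, "region").strip()
    return None


def scan_with_config_region(uri: str):
    """Hypothetical wrapper: pass the config-file region to scan_parquet."""
    import polars as pl  # imported lazily so the helper above has no dependency

    region = read_aws_region(Path.home() / ".aws" / "config")
    if region is None:
        raise RuntimeError("no region found in ~/.aws/config")
    return pl.scan_parquet(uri, storage_options={"region": region})
```

Calling `scan_with_config_region("s3://REDACTED/test_file.parquet")` would then behave like the boto3 workaround, as long as the region is defined in the default profile.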

Installed versions

--------Version info---------
Polars:               0.20.5
Index type:           UInt32
Platform:             Windows-10-10.0.17763-SP0
Python:               3.11.5 (tags/v3.11.5:cce6ba9, Aug 24 2023, 14:38:34) [MSC v.1936 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fsspec:               2023.5.0
gevent:               <not installed>
hvplot:               0.9.1
matplotlib:           3.7.1
numpy:                1.24.3
openpyxl:             3.1.2
pandas:               1.5.3
pyarrow:              12.0.0
pydantic:             2.1.1
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           2.0.20
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@ritchie46
Member

Which is where we read it from?

Isn't it under ~/.aws?

@bouianpe
Author

Yup, the files are in the usual Windows location that AWS CLI uses (and the key/secret themselves are being picked up fine)
C:\Users\bouianpe\.aws

@lee170

lee170 commented Feb 14, 2024

I am encountering a similar error with read_parquet(), using 0.20.8. It occurs whether or not the region is specified with storage_options. The region is the same as the one under the default profile in ~/.aws.

import polars as pl
import argparse


def read_s3(s3uri):
    sdf = pl.read_parquet(s3uri, storage_options={"region": "us-east-1"})
    print("polars shape:", sdf.shape)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Read parquet file from s3")
    parser.add_argument("s3uri", type=str, help="s3 uri to parquet file")
    args = parser.parse_args()
    read_s3(args.s3uri)

(base) lee@lightsail:~/msp/mbs-parsers$ python polars_read_s3.py s3://machinelee-dev/test/mytest.parquet
/home/lee/msp/mbs-parsers/polars_read_s3.py:6: UserWarning: '(default_)region' not set; polars will try to get it from bucket

Set the region manually to silence this warning.
  sdf = pl.read_parquet(s3uri)
Traceback (most recent call last):
  File "/home/lee/msp/mbs-parsers/polars_read_s3.py", line 14, in <module>
    read_s3(args.s3uri)
  File "/home/lee/msp/mbs-parsers/polars_read_s3.py", line 6, in read_s3
    sdf = pl.read_parquet(s3uri)
          ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lee/miniconda3/lib/python3.11/site-packages/polars/utils/deprecation.py", line 136, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lee/miniconda3/lib/python3.11/site-packages/polars/utils/deprecation.py", line 136, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lee/miniconda3/lib/python3.11/site-packages/polars/io/parquet/functions.py", line 171, in read_parquet
    lf = scan_parquet(
         ^^^^^^^^^^^^^
  File "/home/lee/miniconda3/lib/python3.11/site-packages/polars/utils/deprecation.py", line 136, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lee/miniconda3/lib/python3.11/site-packages/polars/utils/deprecation.py", line 136, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lee/miniconda3/lib/python3.11/site-packages/polars/io/parquet/functions.py", line 311, in scan_parquet
    return pl.LazyFrame._scan_parquet(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lee/miniconda3/lib/python3.11/site-packages/polars/lazyframe/frame.py", line 464, in _scan_parquet
    self._ldf = PyLazyFrame.new_from_parquet(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ComputeError: Generic S3 error: Missing bucket name
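
Because the error text says "Missing bucket name" even though the URI visibly contains one, it can help to rule out a malformed URI before suspecting region or credential lookup. A minimal sketch using only the standard library; `split_s3_uri` is a hypothetical helper for this check and is not part of Polars:

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split an s3:// URI into (bucket, key), raising if either part is malformed."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"expected an s3:// URI, got scheme {parsed.scheme!r}")
    if not parsed.netloc:
        # e.g. "s3:///path/file.parquet" has an empty bucket component
        raise ValueError(f"no bucket name found in {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")
```

If this check passes for the URI given on the command line, the bucket name in the URI itself is well-formed, which points the investigation back at how the region and credentials are resolved.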

@babymastodon

In case it helps, I'm also seeing this issue on mac, polars==1.5.0.

I have my AWS credentials in ~/.aws/credentials, but polars does not find these credentials when I run pl.read_parquet().

pandas.read_parquet() works fine.

This issue seems similar to #15838
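
For anyone on an affected version, one possible stopgap is to read the static keys from the credentials file and hand them to Polars explicitly. This is a rough sketch under assumptions: `load_static_credentials` is a hypothetical helper, it only handles long-lived access keys in a named profile (no SSO, session tokens, or assumed roles), and it assumes `storage_options` accepts the `aws_access_key_id` and `aws_secret_access_key` keys.

```python
import configparser
from pathlib import Path
from typing import Dict, Optional


def load_static_credentials(
    credentials_path: Path, profile: str = "default"
) -> Optional[Dict[str, str]]:
    """Return storage_options-style keys for `profile`, or None if incomplete."""
    # interpolation=None so special characters in the secret are kept verbatim.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(credentials_path)  # a missing file simply yields no sections
    if not parser.has_section(profile):
        return None
    section = parser[profile]
    key_id = section.get("aws_access_key_id")
    secret = section.get("aws_secret_access_key")
    if not key_id or not secret:
        return None
    return {"aws_access_key_id": key_id, "aws_secret_access_key": secret}
```

The returned dictionary could then be merged with a region entry and passed along, for example `pl.read_parquet(uri, storage_options={**creds, "region": "us-east-1"})`, where `creds` is the result of `load_static_credentials(Path.home() / ".aws" / "credentials")`.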

@nameexhaustion
Collaborator

This should be fixed on main by #18259


7 participants