
BUG: read_parquet fails when passing a list of remote files (gcs) #62922

@ayushdg

Description

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

# df = pd.read_parquet("gs://cloud-samples-data/bigquery/us-states/us-states.parquet") # works

df = pd.read_parquet(["gs://cloud-samples-data/bigquery/us-states/us-states.parquet"]) # fails

df = pd.read_parquet(["gs://cloud-samples-data/bigquery/us-states/us-states.parquet"], storage_options={"token":"anon"}) # fails with a different error

Issue Description

When passing a list of (remote) files to read_parquet, it fails, whereas passing a single file path directly works.

When I don't pass storage options, here's the error:

File /.venv/lib/python3.11/site-packages/pyarrow/dataset.py:368, in <listcomp>(.0)
    361 is_local = (
    362     isinstance(filesystem, (LocalFileSystem, _MockFileSystem)) or
    363     (isinstance(filesystem, SubTreeFileSystem) and
    364      isinstance(filesystem.base_fs, LocalFileSystem))
    365 )
    367 # allow normalizing irregular paths such as Windows local paths
--> 368 paths = [filesystem.normalize_path(_stringify_path(p)) for p in paths]
    370 # validate that all of the paths are pointing to existing *files*
    371 # possible improvement is to group the file_infos by type and raise for
    372 # multiple paths per error category
    373 if is_local:

File /.venv/lib/python3.11/site-packages/pyarrow/_fs.pyx:1012, in pyarrow._fs.FileSystem.normalize_path()

File /.venv/lib/python3.11/site-packages/pyarrow/error.pxi:155, in pyarrow.lib.pyarrow_internal_check_status()

File /.venv/lib/python3.11/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()

ArrowInvalid: Expected a local filesystem path, got a URI: 'gs://cloud-samples-data/bigquery/us-states/us-states.parquet'

When I pass any storage options, here's the error:

File .venv/lib/python3.11/site-packages/pandas/io/parquet.py:258, in PyArrowImpl.read(self, path, columns, filters, use_nullable_dtypes, dtype_backend, storage_options, filesystem, **kwargs)
    256 if manager == "array":
    257     to_pandas_kwargs["split_blocks"] = True
--> 258 path_or_handle, handles, filesystem = _get_path_or_handle(
    259     path,
    260     filesystem,
    261     storage_options=storage_options,
    262     mode="rb",
    263 )
    264 try:
    265     pa_table = self.api.parquet.read_table(
    266         path_or_handle,
    267         columns=columns,
   (...)    270         **kwargs,
    271     )

File .venv/lib/python3.11/site-packages/pandas/io/parquet.py:129, in _get_path_or_handle(path, fs, storage_options, mode, is_dir)
    123         fs, path_or_handle = fsspec.core.url_to_fs(
    124             path_or_handle, **(storage_options or {})
    125         )
    126 elif storage_options and (not is_url(path_or_handle) or mode != "rb"):
    127     # can't write to a remote url
    128     # without making use of fsspec at the moment
--> 129     raise ValueError("storage_options passed with buffer, or non-supported URL")
    131 handles = None
    132 if (
    133     not fs
    134     and not is_dir
   (...)    139     # fsspec resources can also point to directories
    140     # this branch is used for example when reading from non-fsspec URLs

ValueError: storage_options passed with buffer, or non-supported URL

Expected Behavior

read_parquet should succeed whether we pass a single path to a directory/file or a list of files to read.

Workarounds:
Creating a filesystem object and passing it explicitly to read_parquet works; reading the files one by one and concatenating them is another option.
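A minimal sketch of both workarounds, assuming gcsfs is installed and the bucket allows anonymous access (token="anon"). The filesystem argument of read_parquet is only implemented for the pyarrow engine, and depending on the pandas/pyarrow versions the "gs://" prefix may need to be dropped when a filesystem is passed explicitly:

import gcsfs
import pandas as pd

paths = ["gs://cloud-samples-data/bigquery/us-states/us-states.parquet"]

# Workaround 1: build the GCS filesystem up front and hand it to read_parquet,
# so pandas/pyarrow no longer have to infer it from the list of paths.
fs = gcsfs.GCSFileSystem(token="anon")
df = pd.read_parquet(paths, filesystem=fs)

# Workaround 2: read the files one at a time (single-path reads work) and
# concatenate the results.
df = pd.concat(
    (pd.read_parquet(p, storage_options={"token": "anon"}) for p in paths),
    ignore_index=True,
)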

Installed Versions

INSTALLED VERSIONS
------------------
commit : 9c8bc3e
python : 3.11.2
python-bits : 64
OS : Linux
OS-release : 6.1.0-40-cloud-amd64
Version : #1 SMP PREEMPT_DYNAMIC Debian 6.1.153-1 (2025-09-20)
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.3.3
numpy : 1.26.4
pytz : 2025.2
dateutil : 2.9.0.post0
pip : None
Cython : None
sphinx : None
IPython : 9.6.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.14.2
blosc : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.12.0
html5lib : None
hypothesis : None
gcsfs : 2024.12.0
jinja2 : 3.1.6
lxml.etree : 5.4.0
matplotlib : None
numba : 0.61.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
psycopg2 : None
pymysql : None
pyarrow : 22.0.0
pyreadstat : None
pytest : 8.4.2
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.16.3
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
xlsxwriter : None
zstandard : None
tzdata : 2025.2
qtpy : None
pyqt5 : None

Labels

Bug, Error Reporting (Incorrect or improved errors from pandas), IO Parquet (parquet, feather)
