
[Python] read_row_group fails with Nested data conversions not implemented for chunked array outputs #21526

Open
asfimport opened this issue Mar 27, 2019 · 4 comments


asfimport commented Mar 27, 2019

Hey, I'm trying to concatenate two Parquet files, and to avoid reading everything into memory at once I wanted to use read_row_group for my solution, but it fails.

I think it's due to fields like these:

pyarrow.Field<to: list<item: string>>

But I'm not sure. Is this a duplicate? The issue linked in the code is resolved:

// ARROW-3762(wesm): If inout_array is a chunked array, we reject as this is

The stack trace is:

  File "/data/teftel/teftel-data/teftel_data/parquet_stream.py", line 163, in read_batches
    table = pf.read_row_group(ix, columns=self._columns)
  File "/home/kuba/.local/share/virtualenvs/teftel-o6G5iH_l/lib/python3.6/site-packages/pyarrow/parquet.py", line 186, in read_row_group
    use_threads=use_threads)
  File "pyarrow/_parquet.pyx", line 695, in pyarrow._parquet.ParquetReader.read_row_group
  File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Nested data conversions not implemented for chunked array outputs

Reporter: Jakub Okoński

Note: This issue was originally created as ARROW-5030. Please see the migration documentation for further details.


Wes McKinney / @wesm:
I fixed some cases where this occurs in ARROW-4688, but it is still possible to hit this error for very large row groups (> 2 GB of string data in a row group). I didn't see a follow-up JIRA to this or ARROW-3762, so we can use this one to track the issue:

https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.cc#L915


Judah:
[~wesm_impala_7e40] I'm also running into this issue. Is this likely to be fixed, or easy to fix? I'd be happy to give it a go, but I'm not really sure where to start.

Maxl94 commented May 22, 2024

In case someone wants to load a pandas DataFrame, I want to share my workaround.

For me, installing fastparquet and specifying the engine='fastparquet' argument in pd.read_parquet worked.

@yuxi-liu-wired:

> In case someone wants to load a pandas DataFrame, I want to share my workaround.
>
> For me, installing fastparquet and specifying the engine='fastparquet' argument in pd.read_parquet worked.

Concurring. If the parquet file contains a dictionary/list/struct, then the following

import pandas as pd
df = pd.read_parquet(parquet_path)

throws an error "ArrowNotImplementedError: Nested data conversions not implemented for chunked array outputs"

But only if the parquet file is over 1 GB. If it is under 1 GB, then it loads with no problems.
