-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[C++][Parquet] Support nested data conversions for chunked array #32723
Comments
Will Jones / @wjones127: It sounds like there are some code paths that do work and some that don't. |
Arthur Passos: std::shared_ptr<arrow::Table> table;
arrow::Status read_status = file_reader->ReadRowGroup(row_group_current, column_indices, &table);
if (!read_status.ok())
throw ParsingException{"Error while reading Parquet data: " + read_status.ToString(), ErrorCodes::CANNOT_READ_ALL_DATA};
++row_group_current;
Now it's the below: std::shared_ptr<arrow::Table> table;
std::unique_ptr<::arrow::RecordBatchReader> rbr;
std::vector<int> row_group_indices { row_group_current };
arrow::Status get_batch_reader_status = file_reader->GetRecordBatchReader(row_group_indices, column_indices, &rbr);
if (!get_batch_reader_status.ok())
throw ParsingException{"Error while reading Parquet data: " + get_batch_reader_status.ToString(), ErrorCodes::CANNOT_READ_ALL_DATA};
arrow::Status read_status = rbr->ReadAll(&table);
if (!read_status.ok())
throw ParsingException{"Error while reading Parquet data: " + read_status.ToString(), ErrorCodes::CANNOT_READ_ALL_DATA};
++row_group_current;
Question: Should I expect any regressions or different behaviour by changing the code path to the latter?
|
Arthur Passos:
All the files I generated contain the above mentioned schema. The differences are in the data length. Some had maps of 50
I'd be very thankful if someone could give me some tips on how to generate a file that will trigger the exception. |
Will Jones / @wjones127: Here's a simple repro I created in Python: import pyarrow as pa
import pyarrow.parquet as pq
arr = pa.array([[("a" * 2**30, 1)]], type = pa.map_(pa.string(), pa.int32()))
arr = pa.chunked_array([arr, arr])
tab = pa.table({ "arr": arr })
pq.write_table(tab, "test.parquet")
pq.read_table("test.parquet")
#Traceback (most recent call last):
# File "<stdin>", line 1, in <module>
# File "/Users/willjones/mambaforge/envs/notebooks/lib/python3.10/site-#packages/pyarrow/parquet/__init__.py", line 2827, in read_table
# return dataset.read(columns=columns, use_threads=use_threads,
# File "/Users/willjones/mambaforge/envs/notebooks/lib/python3.10/site-#packages/pyarrow/parquet/__init__.py", line 2473, in read
# table = self._dataset.to_table(
# File "pyarrow/_dataset.pyx", line 331, in pyarrow._dataset.Dataset.to_table
# File "pyarrow/_dataset.pyx", line 2577, in pyarrow._dataset.Scanner.to_table
# File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
# File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
#pyarrow.lib.ArrowNotImplementedError: Nested data conversions not implemented for chunked array outputs |
Arthur Passos:
While your |
Will Jones / @wjones127: |
Arthur Passos: // ARROW-3762(wesm): If item reader yields a chunked array, we reject as
// this is not yet implemented
return Status::NotImplemented(
"Nested data conversions not implemented for chunked array outputs"); I wonder why this wasn't implemented. Is there a techinical limitation or the approach wasn't very well defined? I am pretty new to Parquet and to
|
Micah Kornfield / @emkornfield:
One way forward here could be to add an option for reading back arrays to always use the Large* variant (or maybe on a per column basis) to avoid chunking. |
Arthur Passos: |
Micah Kornfield / @emkornfield: |
Arthur Passos: I am stating this based on my current understanding of the inner workings of
edit: I have just tested the snippet shared by Will Jones using
|
Micah Kornfield / @emkornfield: To answer your questions, I don't think the second case applies. As far as I know Parquet C++ does its own chunking and doesn't try to read back the exact chunking that the values are written with. |
Arthur Passos: |
Micah Kornfield / @emkornfield: |
Arthur Passos: |
Micah Kornfield / @emkornfield: I think the right way of approach this is to have an option users can set (maybe one for each type) that will work on two levels:
So we keep the assertion but if users run into this issue we can provide guidance on how to set this. |
Arthur Passos: |
Micah Kornfield / @emkornfield: suppose you have 100 rows that take have 3GB of string evenly distributed. If you try to read all 100 rows it will overflow and create a ChunkedArray. If you read 50 rows at a time it would be an issue because chunking wouldn't be necessary. |
Arthur Passos:
My question is pretty much: should there be difference in the output when using the two APIs? |
Hi @emkornfield. I am the OP of this issue and I came across this again, so I think it's time work on this. I have re-implemented your suggestion to use LARGE* variants of String and Binary types in order to avoid chunking, see https://github.com/arthurpassos/arrow/pull/1/files. Ofc this is just a hard-coded version to validate it fixes the issue, apparently it did. Based on that, I have a couple of questions:
|
|
Cool, thanks. I have updated the draft PR with some refactorings, but it's no longer working. I suspect it's related to the dictionary encondig / decoding classes, they seem to be hard-coded to |
Enconding / decoding code is huge & somewhat complex, it would be great if I could skip changing that. Tons of changes and I am kind of afraid of introducing bugs.. |
I guess you can split the patch into multiple parts, and support read from |
Experiencing this issue on datasets like https://huggingface.co/datasets/mlfoundations/MINT-1T-HTML that we can't read in python File "pyarrow/_parquet.pyx", line 1587, in iter_batches
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Nested data conversions not implemented for chunked array outputs |
FileReaderImpl::ReadRowGroup
fails with "Nested data conversions not implemented for chunked array outputs". It fails on ChunksToSingleData schema is:
Is there a way to work around this issue in the cpp lib?
In any case, I am willing to implement this, but I need some guidance. I am very new to parquet (as in started reading about it yesterday).
Probably related to: https://issues.apache.org/jira/browse/ARROW-10958
Reporter: Arthur Passos
Assignee: Arthur Passos
Related issues:
Note: This issue was originally created as ARROW-17459. Please see the migration documentation for further details.
The text was updated successfully, but these errors were encountered: