GenericColumnReader::read_records Yields Truncated Records #5150
I'm confused by this: read_records was added and read_batch deprecated precisely because read_batch didn't guarantee this, whereas read_records does.
You shouldn't be; I'm aware that I'm most likely the confused one here. We've been using the above code and it's been working fine for years. When we upgraded the parquet dep this code started crashing with the naive implementation, and I'm aware that the answer to this question might be to rewrite it. In this case we know that the input is well-formed, and we expected there to be something akin to
Perhaps you could share a reproducer of the crash, as making the "naive" version correct was the motivation for the change, and so it should work. Previously it would have only been working because the writer wasn't strict about torn records, which was fixed in #4327.
To clarify, I think your comment there, #4327 (comment), is relevant, as that is the error we are running into. And the parquet files that we wrote previously are read at scale globally by AWS Athena and Presto readers, so they are very much not invalid.
There is context on the linked ticket on how they are invalid: they will cause subtle bugs for systems performing predicate pushdown using the page index, something at least historically Athena/Presto/Trino has not supported. However, taking a step back here, the whole purpose of read_records is to avoid users needing to concern themselves with this and to make the naive loop "just work". I am therefore interested in what has gone wrong here.
Ok, when you say "invalid" my assumption was that you meant "broken in an unusable state", not that it may cause some issues in certain use cases. My mistake.
As are we. We'll try and get a minimal/sanitized repro.
Here is a link to a reproduction: https://github.com/ogrman/test-parquet-read-records
The output from this program is:
My expectation for this test program would be that GenericColumnReader::read_records only returns complete records.
Aah, I see what the issue here is; thank you for the reproducer. The problem is that read_records will currently return incomplete reads if there isn't sufficient buffer space to accommodate the requested number of records. This is fine for the arrow APIs, as RecordReader ensures that it then grows the buffers and reads out the remaining data. Unfortunately, RecordReader is currently crate-private, extremely specific to how the arrow decoding process works, and not really something I would want to expose. On the flip side, ColumnWriter needs to ensure it is given complete records. I will see if I can't make read_records behave the way it is documented to behave and never return truncated records 😅
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Prior to version 41 of the parquet crate we had access to the read_batch function, which was deprecated (and changed) in favor of read_records. What we are trying to do does not seem to be possible with the new API. We have a program that reads parquet files and concatenates them vertically, so that a number of parquet files with identical schemas become one file. We did this by running a simple read/write loop for each input file and column.
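A minimal sketch of such a loop, assuming an INT64 column and the pre-41 slice-based read_batch/write_batch APIs; the helper name copy_column and the batch size are illustrative, not taken from the project:

```rust
use parquet::column::reader::ColumnReaderImpl;
use parquet::column::writer::ColumnWriterImpl;
use parquet::data_type::Int64Type;
use parquet::errors::Result;

/// Copy one INT64 column chunk from `reader` to `writer` with the old
/// `read_batch` API, which filled caller-provided slices and made no promise
/// that a batch ended on a record boundary.
fn copy_column(
    reader: &mut ColumnReaderImpl<Int64Type>,
    writer: &mut ColumnWriterImpl<'_, Int64Type>,
) -> Result<()> {
    const BATCH: usize = 8192;
    let mut values = vec![0i64; BATCH];
    let mut def_levels = vec![0i16; BATCH];
    let mut rep_levels = vec![0i16; BATCH];

    loop {
        // (values read, levels read) for this batch.
        let (num_values, num_levels) = reader.read_batch(
            BATCH,
            Some(def_levels.as_mut_slice()),
            Some(rep_levels.as_mut_slice()),
            values.as_mut_slice(),
        )?;
        if num_values == 0 && num_levels == 0 {
            break;
        }
        // Write exactly what was read back out to the merged file.
        writer.write_batch(
            &values[..num_values],
            Some(&def_levels[..num_levels]),
            Some(&rep_levels[..num_levels]),
        )?;
    }
    Ok(())
}
```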
This simple loop turned many "small" files into one large file, with the same schema. After this change, when we replace the call to read_batch with a call to read_records, we no longer always get a complete batch, which means that sometimes we will start writing a new batch while rep_levels is still 1, i.e. in the middle of a record.
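As a toy illustration (the data here is invented for this example, not taken from the reproducer), consider a repeated INT64 column holding the two records [1, 2] and [3, 4, 5]:

```rust
fn main() {
    // rep_level 0 starts a record, rep_level 1 continues it:
    //   record 0: [1, 2]    -> rep_levels 0, 1
    //   record 1: [3, 4, 5] -> rep_levels 0, 1, 1
    let rep_levels: [i16; 5] = [0, 1, 0, 1, 1];

    // A read that stops after four levels ends in the middle of record 1.
    let (written, carried) = rep_levels.split_at(4);
    assert_eq!(written, &[0, 1, 0, 1][..]);

    // The next write_batch call would then start with rep_level 1, i.e. the
    // continuation of a record whose beginning was already written, which a
    // writer that enforces record boundaries rejects as a torn record.
    assert_eq!(carried, &[1][..]);
}
```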
Describe the solution you'd like

A way to simply read a complete batch without fiddling around with realigning our buffers between writes. I am also open to suggestions on how what we are doing could be better solved in a different way, but the code we have works great with previous versions of parquet and we are currently blocked from upgrading.
Describe alternatives you've considered
Manually finding the last element with rep_levels = 0 and stopping our reads there, doing some math, writing a batch excluding the end of the buffers, copying the end of the buffers to the start of our buffers, and reading fewer records according to how much space is already used in the buffers.
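A minimal sketch of that bookkeeping (the helper names below are mine, not code from the issue): find where the last, possibly incomplete, record starts in the level buffers, write everything before it, and carry the rest over into the next read.

```rust
/// Index of the first level of the last (possibly incomplete) record in a
/// batch: rep_level == 0 marks the start of a record.
fn last_record_start(rep_levels: &[i16]) -> usize {
    rep_levels.iter().rposition(|&r| r == 0).unwrap_or(0)
}

/// Split the level buffers at that boundary, returning the tail that has to be
/// prepended to the buffers before the next read. Values would be split
/// separately, by counting levels equal to the column's max definition level,
/// because nulls and empty lists produce levels but no values.
fn carry_over_tail(
    def_levels: &mut Vec<i16>,
    rep_levels: &mut Vec<i16>,
) -> (Vec<i16>, Vec<i16>) {
    let boundary = last_record_start(rep_levels);
    (def_levels.split_off(boundary), rep_levels.split_off(boundary))
}

fn main() {
    // Two complete records ([1, 2] and [3]) followed by the start of a third.
    let mut rep = vec![0i16, 1, 0, 0, 1];
    let mut def = vec![1i16, 1, 1, 1, 1];
    let (def_tail, rep_tail) = carry_over_tail(&mut def, &mut rep);
    assert_eq!(rep, [0, 1, 0]); // safe to write now
    assert_eq!(rep_tail, [0, 1]); // carried over to the next batch
    assert_eq!(def_tail, [1, 1]);
}
```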