-
Notifications
You must be signed in to change notification settings - Fork 824
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
IPC code writes files which do not include a valid stream #6311
Comments
I tried adding validation to arrow-rs as with apache/arrow#43834 with validate-stream.zip |
Ben and I spoke about this a bit externally. As I understand it the main benefit is that it makes it simpler to consume the IPC format (you can consume both file and streaming formats with a single streaming implementation if you don't need the random access). I don't believe there are any drawbacks to restricting the alignment of the bytes following the magic header. I also don't envision any backwards compatibility concerns (readers that are using the flatbuffers footers can continue to accept whatever header alignment the producer is creating, the change is just to stop writing certain alignments so that readers that don't use the footer can start working) |
Describe the bug
The format guarantees that each IPC file embeds a valid IPC stream in order to allow readers to ignore the Footer, skip the file's leading magic, and reuse a stream reader.
However, when writing IPC files arrow-rs aligns the encapsulated flatbuffers Messages to 64 byte boundaries instead of 8 bytes. This can leave gaps of padding bytes between the Messages which a stream reader would not know to skip.
To Reproduce
apache/arrow#43834 adds validation of the embedded stream to arrow-c++'s
arrow-json-integration-test
. Running this against IPC files written by arrow-rs raises an errorExpected behavior
A stream reader should be able to read an IPC file by skipping the first 8 bytes.
Additional context
This was originally introduced in eddef43 which added alignment to address new requirements around
i128
. However the alignment should not be applied to flatbuffers Messages; apart from the above issue I think there's no SIMD or other advantage to aligning those to more than 8 bytes. Body buffers can of course still be padded and aligned freely.This was discovered while adding IPC file reading to nanoarrow; we were trying to defer reading Footers for a follow up and discovered that the go, rust, and javascript implementations don't embed a valid stream. Most readers have not noticed because offsets and schemas are more efficiently read from a Footer, and once acquired obviate sequential stream-style reading.
The text was updated successfully, but these errors were encountered: