Skip to content

[Parquet] Use u64 for SerializedPageReaderState.offset & remaining_bytes, instead of usize #7918

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Jul 14, 2025

Conversation

JigaoLuo
Copy link
Contributor

@JigaoLuo JigaoLuo commented Jul 12, 2025

Which issue does this PR close?

Rationale for this change

There is a copy from my issue page:

/// The current byte offset in the reader
offset: usize,

My concern is about the type of offset in SerializedPageReaderState. Should it be u64 instead of usize? If I understand correctly, this offset represents a global position within a Parquet file, which can easily exceed 4 GB. On 32-bit environments (e.g., WebAssembly), usize is limited to u32's max, which could lead to problems with larger files.

What changes are included in this PR?

This PR does type changes only for SerializedPageReaderState.offset & remaining_bytes

Are these changes tested?

I can pass with local unit tests via cargo test -p parquet

Are there any user-facing changes?

No

JigaoLuo added 2 commits July 12, 2025 15:03
Signed-off-by: Jigao Luo <[email protected]>
Signed-off-by: Jigao Luo <[email protected]>
@github-actions github-actions bot added the parquet Changes to the parquet crate label Jul 12, 2025
@JigaoLuo JigaoLuo changed the title Use u64 for SerializedPageReaderState.offset & remaining_bytes, instead of usize [Parquet] Use u64 for SerializedPageReaderState.offset & remaining_bytes, instead of usize Jul 12, 2025
Signed-off-by: Jigao Luo <[email protected]>
Copy link
Contributor

@XiangpengHao XiangpengHao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @JigaoLuo this looks good to me! left minor optional doc change.

Signed-off-by: Jigao Luo <[email protected]>
@JigaoLuo
Copy link
Contributor Author

Thank you @XiangpengHao for reviewing. I have updated the doc.

@alamb alamb added the api-change Changes to the arrow API label Jul 12, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @JigaoLuo and @XiangpengHao -- this makes sense to me

@etseidl or @jhorstmann do you have any concerns about this change?

@jhorstmann
Copy link
Contributor

Looks good to me, thanks!

@alamb alamb merged commit 5555d30 into apache:main Jul 14, 2025
17 checks passed
@alamb
Copy link
Contributor

alamb commented Jul 14, 2025

Thanks again @JigaoLuo and thanks @mbrobbel and @jhorstmann

@JigaoLuo
Copy link
Contributor Author

Thank you to all the reviewers

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Parquet] [Question / Potential Bug Report] Should SerializedPageReaderState.offset & remaining_bytes be u64 instead of usize?
5 participants