Skip to content

Commit

Permalink
[BUG] Bump Parquet reader max_page_size to 256MB (#1553)
Browse files Browse the repository at this point in the history
Bumps the max_page_size to `usize::MAX`

Note that it appears parquet2 uses this argument for BOTH the max **page
header** thrift structure size, and also the **page data** size itself.
This is different from how other libraries do it.

Here we set it to 256MB:

1. This is a very large number that should capture most corner-cases
2. If anything is larger than this number there is most likely an issue
with the Parquet file itself that should be addressed
3. Reading in 256MB of page data at once is still unlikely to OOM a
system so we should be very safe there

Closes: #1551

---------

Co-authored-by: Jay Chia <[email protected]@users.noreply.github.com>
  • Loading branch information
jaychia and Jay Chia authored Oct 31, 2023
1 parent fa59676 commit 55c9a79
Showing 1 changed file with 8 additions and 2 deletions.
10 changes: 8 additions & 2 deletions src/daft-parquet/src/file.rs
Original file line number Diff line number Diff line change
Expand Up @@ -398,7 +398,10 @@ impl ParquetFileReader {
range_reader,
vec![],
Arc::new(|_, _| true),
4 * 1024 * 1024,
// Set to a very high number 256MB to guard against unbounded large
// downloads from remote storage, which likely indicates corrupted Parquet data
// See: https://github.com/Eventual-Inc/Daft/issues/1551
256 * 1024 * 1024,
)
.await
.with_context(|_| {
Expand Down Expand Up @@ -564,7 +567,10 @@ impl ParquetFileReader {
range_reader,
vec![],
Arc::new(|_, _| true),
4 * 1024 * 1024,
// Set to a very high number 256MB to guard against unbounded large
// downloads from remote storage, which likely indicates corrupted Parquet data
// See: https://github.com/Eventual-Inc/Daft/issues/1551
256 * 1024 * 1024,
)
.await
.with_context(|_| {
Expand Down

0 comments on commit 55c9a79

Please sign in to comment.