
Document parquet writer memory limiting (#5450) #5457

Merged: 3 commits, Mar 8, 2024
26 changes: 26 additions & 0 deletions parquet/src/arrow/arrow_writer/mod.rs
@@ -80,6 +80,32 @@ mod levels;
///
/// assert_eq!(to_write, read);
/// ```
///
/// ## Memory Limiting
///
/// The nature of parquet forces buffering of an entire row group before it can be flushed
Contributor:
Would it be worth suggesting to users that, if they want to minimize memory overages when writing such data, they can send in smaller RecordBatches (e.g. split up via `RecordBatch::slice`), which gives the parquet writer more chances to check and flush?
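
A minimal sketch of that suggestion, assuming a caller-chosen slice size and memory threshold; the helper name, the slice size, and the 1_000_000 byte limit are illustrative and not part of this PR:

```rust
use std::io::Write;

use arrow_array::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::errors::Result;

/// Hypothetical helper: write a potentially large batch in smaller slices,
/// checking the buffered size after each slice and flushing early if needed.
fn write_in_slices<W: Write + Send>(
    writer: &mut ArrowWriter<W>,
    batch: &RecordBatch,
    slice_rows: usize,   // e.g. 1024 rows per slice (illustrative)
    memory_limit: usize, // e.g. 1_000_000 bytes (illustrative)
) -> Result<()> {
    let slice_rows = slice_rows.max(1);
    let mut offset = 0;
    while offset < batch.num_rows() {
        let len = slice_rows.min(batch.num_rows() - offset);
        // `RecordBatch::slice` is zero-copy, so this does not duplicate data
        writer.write(&batch.slice(offset, len))?;
        if writer.in_progress_size() > memory_limit {
            // Flush the in-progress row group early to bound memory usage
            writer.flush()?;
        }
        offset += len;
    }
    Ok(())
}
```

Since the slices only adjust offsets into the underlying buffers, the extra write calls are cheap; the benefit is that `in_progress_size` is consulted more often than once per large batch.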

/// to the underlying writer. Data is buffered in its encoded form, to reduce memory usage,
/// but if writing rows containing large strings or very nested data, this may still result in
/// non-trivial memory usage.
///
/// [`ArrowWriter::in_progress_size`] can be used to track the size of the buffered row group,
/// and potentially trigger an early flush of a row group based on a memory threshold and/or
/// global memory pressure. However, users should be aware that smaller row groups will result
/// in higher metadata overheads, and may worsen compression ratios and query performance.
///
/// ```no_run
/// # use std::io::Write;
/// # use arrow_array::RecordBatch;
/// # use parquet::arrow::ArrowWriter;
/// # let mut writer: ArrowWriter<Vec<u8>> = todo!();
/// # let batch: RecordBatch = todo!();
/// writer.write(&batch).unwrap();
/// // Trigger an early flush if buffered size exceeds 1_000_000
/// if writer.in_progress_size() > 1_000_000 {
/// writer.flush().unwrap();
/// }
/// ```
///
pub struct ArrowWriter<W: Write> {
/// Underlying Parquet writer
writer: SerializedFileWriter<W>,
23 changes: 23 additions & 0 deletions parquet/src/arrow/async_writer/mod.rs
@@ -69,6 +69,29 @@ use tokio::io::{AsyncWrite, AsyncWriteExt};
/// It is implemented based on the sync writer [`ArrowWriter`] with an inner buffer.
/// The buffered data will be flushed to the writer provided by caller when the
/// buffer's threshold is exceeded.
///
/// ## Memory Limiting
///
/// The nature of parquet forces buffering of an entire row group before it can be flushed
/// to the underlying writer. This buffering may exceed the configured buffer size
/// of [`AsyncArrowWriter`]. Memory usage can be limited by prematurely flushing the row group,
/// although this will have implications for file size and query performance. See [`ArrowWriter`]
/// for more information.
Comment on lines +73 to +79
Contributor:
Great to have this documented! Thanks!

Should we refer to this in instantiation methods? (try_new(_with_options))

Contributor:
I agree it would help -- perhaps something like the following in try_new / try_new_with_options:

/// Please see the documentation on [`Self`] for details on memory usage.

///
/// ```no_run
/// # use tokio::fs::File;
/// # use arrow_array::RecordBatch;
/// # use parquet::arrow::AsyncArrowWriter;
/// # async fn test() {
/// let mut writer: AsyncArrowWriter<File> = todo!();
/// let batch: RecordBatch = todo!();
/// writer.write(&batch).await.unwrap();
/// // Trigger an early flush if buffered size exceeds 1_000_000
/// if writer.in_progress_size() > 1_000_000 {
/// writer.flush().await.unwrap()
/// }
/// # }
/// ```
pub struct AsyncArrowWriter<W> {
/// Underlying sync writer
sync_writer: ArrowWriter<SharedBuffer>,