Document parquet writer memory limiting (#5450) #5457
@@ -69,6 +69,29 @@ use tokio::io::{AsyncWrite, AsyncWriteExt};
/// It is implemented based on the sync writer [`ArrowWriter`] with an inner buffer.
/// The buffered data will be flushed to the writer provided by caller when the
/// buffer's threshold is exceeded.
///
/// ## Memory Limiting
///
/// The nature of parquet forces buffering of an entire row group before it can be flushed
/// to the underlying writer. This buffering may exceed the configured buffer size
/// of [`AsyncArrowWriter`]. Memory usage can be limited by prematurely flushing the row group,
/// although this will have implications for file size and query performance. See [`ArrowWriter`]
/// for more information.
Comment on lines +73 to +79

Great to have this documented! Thanks! Should we refer to this in instantiation methods?

I agree it would help -- perhaps something like:

/// Please see the documentation on [`Self`] for details on memory usage.
///
/// ```no_run
/// # use tokio::fs::File;
/// # use arrow_array::RecordBatch;
/// # use parquet::arrow::AsyncArrowWriter;
/// # async fn test() {
/// let mut writer: AsyncArrowWriter<File> = todo!();
/// let batch: RecordBatch = todo!();
/// writer.write(&batch).await.unwrap();
/// // Trigger an early flush if buffered size exceeds 1_000_000
/// if writer.in_progress_size() > 1_000_000 {
///     writer.flush().await.unwrap()
/// }
/// # }
/// ```
pub struct AsyncArrowWriter<W> {
    /// Underlying sync writer
    sync_writer: ArrowWriter<SharedBuffer>,
Would it be worth suggesting to users that if they want to minimize memory overages when writing such data, they can send in smaller RecordBatches (e.g. split up via RecordBatch::slice) which gives the parquet writer more chances to check / flush?
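To illustrate the suggestion above, here is a minimal sketch of writing a large batch in smaller slices while flushing early when the in-progress size grows too large. The function name `write_chunked` and the `chunk_rows` / `limit` parameters are illustrative, not part of the library API:

```rust
use arrow_array::RecordBatch;
use parquet::arrow::AsyncArrowWriter;
use parquet::errors::Result;
use tokio::fs::File;

/// Illustrative helper: write `batch` in chunks of `chunk_rows` rows,
/// flushing early whenever the writer's in-progress size exceeds `limit` bytes.
async fn write_chunked(
    writer: &mut AsyncArrowWriter<File>,
    batch: &RecordBatch,
    chunk_rows: usize,
    limit: usize,
) -> Result<()> {
    let mut offset = 0;
    while offset < batch.num_rows() {
        let len = chunk_rows.min(batch.num_rows() - offset);
        // `RecordBatch::slice` is zero-copy: it only adjusts offsets into the
        // underlying buffers, so slicing does not duplicate the data.
        writer.write(&batch.slice(offset, len)).await?;
        // Smaller writes give the writer more chances to check its buffered
        // size and flush the in-progress row group early, bounding memory at
        // the cost of potentially smaller row groups.
        if writer.in_progress_size() > limit {
            writer.flush().await?;
        }
        offset += len;
    }
    Ok(())
}
```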