-
Notifications
You must be signed in to change notification settings - Fork 839
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Document parquet writer memory limiting (#5450) #5457
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think would be nice to have an additional config(or method) to "enforce flush on buffer size". To be able to encapsulate this logic for user's code 🤔
Making it explicit in the client implementation at least makes it very obvious how it behaves.
To be honest, I'm not sure it's clearer. It's documented now - yes. But we have buffer config and in fact not clear what it does for user. And we can't enforce it.
Just for example, take a look at current documentation of buffer size:
///
buffer_size
determines the number of bytes to buffer before flushing
/// to the underlying [AsyncWrite
]
But it can be done/discussed separately ofc
/// ## Memory Limiting | ||
/// | ||
/// The nature of parquet forces buffering of an entire row group before it can be flushed | ||
/// to the underlying writer. This buffering may exceed the configured buffer size | ||
/// of [`AsyncArrowWriter`]. Memory usage can be limited by prematurely flushing the row group, | ||
/// although this will have implications for file size and query performance. See [ArrowWriter] | ||
/// for more information. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great to have this documented! Thanks!
Should we refer to this in instantiation methods? (try_new(_with_options)
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree it would help -- perhaps something like try_new
in try_new/try_new_with_options
/// Please see the documentation on [`Self`] for details on memory usage.
No objections from me, FWIW my conclusion from the investigation under #5458 is we should probably encourage people to move away from |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// ## Memory Limiting | ||
/// | ||
/// The nature of parquet forces buffering of an entire row group before it can be flushed | ||
/// to the underlying writer. This buffering may exceed the configured buffer size | ||
/// of [`AsyncArrowWriter`]. Memory usage can be limited by prematurely flushing the row group, | ||
/// although this will have implications for file size and query performance. See [ArrowWriter] | ||
/// for more information. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree it would help -- perhaps something like try_new
in try_new/try_new_with_options
/// Please see the documentation on [`Self`] for details on memory usage.
/// | ||
/// ## Memory Limiting | ||
/// | ||
/// The nature of parquet forces buffering of an entire row group before it can be flushed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be worth suggesting to users that if they want to minimize memory overages when writing such data, they can send in smaller RecordBatches
(e.g. split up via RecordBatch::slice
for example) which gives the parquet writer more chances to check / flush?
Hi, thanks for the document here, I think I can also copying this to C++ Parquet lib. By the way, do you think it's necessary to also noted that dictionary encoding might be memory consuming? (Since maybe a huge dict woule be allocated) |
We have a maximum dictionary page size that I think should bound this, I believe the C++ parquet lib does something similar |
Generally yeah, but in fact, when writing to a wide table with several hundreds of columns, it will still be a huge overhead...Velox will disable dictionary encoding when memory usage is high. I think we can also note that? |
Which issue does this PR close?
Closes #5450
Rationale for this change
What changes are included in this PR?
This adds an example showing how to limit the growth of the buffers within the arrow writers. I debated exposing a setting to control this, but the way it would only ever be a best effort limit felt a little off. Making it explicit in the client implementation at least makes it very obvious how it behaves.
Are there any user-facing changes?