AsyncArrowWriter doesn't limit underlying ArrowWriter to respect buffer-size #5450
The structure of parquet forces us to buffer an entire row group before we can flush it. The async writer should do a better job of calling this out
Something is wrong here; it should only consume up to 10 MB. Perhaps you could use a memory profiler to identify where the usage is coming from.
According to the docs, DEFAULT_MAX_ROW_GROUP_SIZE is a number of rows, not bytes.
Aah yes, I thought there was a mechanism to also limit the maximum size of row groups, but perhaps that is only for pages.
Yeah, that totally makes sense.
Well, it definitely was. Also, I tried not to use the arrow writer at all and instead wrote to disk directly in a streaming manner as-is; there were no issues. I think I can provide an MRE easily, which will make it easier to profile.
Yeah, that's totally okay. I mean, the default value is weird; it looks like it's in bytes.
Here is a simple MRE. Pay attention to lines 30-33: changing them changes everything, and the amount of consumed memory becomes stable (a ~40 GB file was generated before I interrupted it):
Currently this is expected behaviour; row groups are only automatically "closed" based on row count. I would suggest the following:
1. Unlike the underlying SerializedFileWriter, where the API is in terms of columns and therefore chunking is controlled by the caller, ArrowWriter could do a best-effort approach where it checks the in-progress size after writing each batch and determines whether to flush (see the sketch below).
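A minimal sketch of what that best-effort check could look like from the caller's side today, assuming ArrowWriter exposes the in_progress_size() and flush() methods discussed around #5251 (method names and signatures should be verified against the release you use):

```rust
use arrow_array::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::errors::Result;

/// Write a batch, then close the current row group early if the estimated
/// size of the buffered (un-flushed) data exceeds a byte budget.
fn write_with_byte_cap<W: std::io::Write + Send>(
    writer: &mut ArrowWriter<W>,
    batch: &RecordBatch,
    max_in_progress_bytes: usize,
) -> Result<()> {
    writer.write(batch)?;
    // in_progress_size() estimates the memory held by the row group
    // that is currently buffered in memory.
    if writer.in_progress_size() >= max_in_progress_bytes {
        // flush() closes the in-progress row group and writes it out,
        // at the cost of producing more, smaller row groups.
        writer.flush()?;
    }
    Ok(())
}
```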
This is not unexpected; certain information such as indexes and statistics needs to be retained until the footer is written, and smaller row groups will make these overheads worse (which is why very small row groups are generally not a fantastic idea).
Oh, right, yes, that makes sense.
Btw, can't we just enforce it explicitly? Because the main issue (if I'm not wrong) is that we never reach this condition, due to the large default:
arrow-rs/parquet/src/arrow/arrow_writer/mod.rs, lines 219 to 221 in 3015122
Yes, that is an option that is available to users, and with #5251 the necessary meta information is exposed to the clients to make this judgement for themselves. However, as this has come up a few times, providing a conservative default limit of say 1GB is probably a sane modification, users can then lower this if they're happy to accept the trade-off of smaller row groups. We don't want to use the buffer_size setting as this would then present an unfortunate trade-off where the limit would become the pre-allocation for the buffer, which we might never hit. |
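For completeness, the knob that is already available to users is the row-based limit on WriterProperties; a small sketch of lowering it (the specific value here is just an example):

```rust
use parquet::file::properties::WriterProperties;

fn main() {
    // Lower the per-row-group row limit so fewer rows are buffered before a
    // row group is closed; the trade-off is more row groups and more metadata.
    let props = WriterProperties::builder()
        .set_max_row_group_size(64 * 1024) // measured in rows, not bytes
        .build();
    // Pass props to ArrowWriter / AsyncArrowWriter at construction time.
    let _ = props;
}
```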
To be honest, my initial expectation was that I provide a buffer size (in bytes) and the writer flushes once it is reached. Subjectively, it looks like we should do that; it's typical behavior, at least for buffered writers, and I don't expect it to keep holding such a large amount of data. Moreover, we already have the buffer_capacity config.
I agree it is potentially confusing, but I think the solution is to better document what buffer size is and is not, and potentially add separate functionality to ArrowWriter / AsyncArrowWriter to constrain row group size. I am reticent to change this definition as the API breakage would be subtle and not immediately obvious, beyond larger files and worse query performance |
It's doable by creating a new writer type, or methods with changed behavior, and optionally deprecating the old one. Or something like that:
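The snippet that followed is not reproduced here; as a purely hypothetical illustration of the "new writer type" idea, the wrapper below flushes on a byte threshold (the SizeLimitedWriter name and the threshold are invented, and it relies on the same in_progress_size()/flush() methods assumed above):

```rust
use arrow_array::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::errors::Result;
use std::io::Write;

/// Hypothetical wrapper whose write() closes a row group once the buffered
/// data exceeds a byte budget, instead of relying only on the row count.
struct SizeLimitedWriter<W: Write + Send> {
    inner: ArrowWriter<W>,
    max_row_group_bytes: usize,
}

impl<W: Write + Send> SizeLimitedWriter<W> {
    fn write(&mut self, batch: &RecordBatch) -> Result<()> {
        self.inner.write(batch)?;
        if self.inner.in_progress_size() >= self.max_row_group_bytes {
            self.inner.flush()?;
        }
        Ok(())
    }

    fn close(self) -> Result<()> {
        // Flushes any remaining buffered rows and writes the parquet footer.
        self.inner.close().map(|_| ())
    }
}
```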
I think we should start by documenting the current state of play and go from there, I'll try to get something up later today. It may be we can get by with just an example showing how to limit memory usage. |
FWIW #5458 tracks moving ObjectStore away from the somewhat problematic AsyncWrite abstraction |
AsyncArrowWriter created with default WriterProperties will have a config of DEFAULT_MAX_ROW_GROUP_SIZE = 1024 * 1024. It means that the underlying ArrowWriter won't flush to disk until that limit is reached. This leads to incredible memory consumption, because the writer will cache up to DEFAULT_MAX_ROW_GROUP_SIZE rows (1048576 by default) and will ignore the buffer_capacity config entirely, because the flushing condition of the sync writer is:
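(The embedded snippet is not reproduced here; paraphrased from memory, the check at the referenced lines is row-count based only, roughly as follows, so no byte-based limit participates in the decision. Exact field and method names may differ.)

```rust
// Approximate paraphrase of the flush check in ArrowWriter::write at the
// referenced revision, not a verbatim quote:
if in_progress.buffered_rows >= self.max_row_group_size {
    self.flush()?;
}
```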
To Reproduce
Try to write many large rows to parquet with AsyncArrowWriter; you will see that memory consumption doesn't respect the buffer size.
UPD: MRE was created: #5450 (comment)
Expected behavior
Ideally, it should respect the buffer config, i.e. flush when either the buffer size or the max row group size is reached.
But even if the current behaviour is expected for some reason, the documentation should clearly highlight that.
Additional context
Btw, why is the default 1024 * 1024? It looks like a byte unit.