
AsyncArrowWriter doesn't limit underlying ArrowWriter to respect buffer-size #5450

Closed
DDtKey opened this issue Mar 1, 2024 · 18 comments · Fixed by #5457
Labels: documentation, enhancement, good first issue, help wanted, parquet

Comments

DDtKey commented Mar 1, 2024

An AsyncArrowWriter created with the default WriterProperties uses DEFAULT_MAX_ROW_GROUP_SIZE = 1024 * 1024.

This means the underlying ArrowWriter won't flush to disk until that limit is reached, which leads to enormous memory consumption: it buffers up to DEFAULT_MAX_ROW_GROUP_SIZE (1,048,576) rows and ignores the buffer_capacity setting entirely.

This is because the flushing condition of the sync writer is:

if in_progress.buffered_rows >= self.max_row_group_size {
    self.flush()?
}

To Reproduce
Try writing many large rows to Parquet with AsyncArrowWriter; you will see that memory consumption does not respect the buffer size.

Update: an MRE was created; see #5450 (comment).
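
For context, a minimal sketch of this kind of reproduction (not the linked gist; crate APIs and the constructor's buffer-size argument are assumptions based on the parquet version discussed here and may differ in other releases). It keeps writing batches of large string values through AsyncArrowWriter with the default WriterProperties, so no row group is flushed until 1024 * 1024 rows have been buffered:

use std::sync::Arc;

use arrow_array::{ArrayRef, RecordBatch, StringArray};
use parquet::arrow::AsyncArrowWriter;
use tokio::fs::File;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::create("large.parquet").await?;

    // 1,000 rows of ~10 KB strings per batch, so buffered rows are expensive.
    let values: ArrayRef =
        Arc::new(StringArray::from_iter_values((0..1_000).map(|_| "x".repeat(10_000))));
    let batch = RecordBatch::try_from_iter([("payload", values)])?;

    // Default WriterProperties => max_row_group_size = 1024 * 1024 rows.
    // The 10 MiB buffer size below does not bound memory: rows stay buffered
    // in the underlying ArrowWriter until the row-group row limit is reached.
    let mut writer = AsyncArrowWriter::try_new(file, batch.schema(), 10 * 1024 * 1024, None)?;

    for _ in 0..10_000 {
        writer.write(&batch).await?;
    }
    writer.close().await?;
    Ok(())
}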

Expected behavior
Ideally, it should respect the buffer config, i.e. flush when either the buffer size or the max row group size is reached.

But even if the current behavior is intended for some reason, the documentation should clearly highlight it.

Additional context

By the way, why is the default 1024 * 1024? It looks as if the unit were bytes.

tustvold commented Mar 1, 2024

The structure of parquet forces us to buffer an entire row group before we can flush it. The async writer should do a better job of calling this out

It consumed 10gb of memory accordingly.

Something is wrong here; it should only consume up to 10 MB. Perhaps you could use a memory profiler to identify where the usage is coming from.

alamb commented Mar 1, 2024

By the way, why is the default 1024 * 1024? It looks as if the unit were bytes.

According to the docs, DEFAULT_MAX_ROW_GROUP_SIZE is a number of rows, not bytes:
https://docs.rs/parquet/latest/parquet/file/properties/constant.DEFAULT_MAX_ROW_GROUP_SIZE.html
https://docs.rs/parquet/latest/parquet/file/properties/struct.WriterProperties.html#method.max_row_group_size

Returns maximum number of rows in a row group.
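
For illustration, a minimal sketch (values arbitrary) of setting that limit via WriterProperties; note it is a row count, not a byte size:

use parquet::file::properties::WriterProperties;

fn main() {
    // 100,000 rows per row group instead of the default 1024 * 1024 rows.
    let props = WriterProperties::builder()
        .set_max_row_group_size(100_000)
        .build();
    assert_eq!(props.max_row_group_size(), 100_000);
    // Pass `Some(props)` when constructing ArrowWriter / AsyncArrowWriter.
}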

tustvold commented Mar 1, 2024

Aah yes, I thought there was a mechanism to also limit the maximum size of row groups, but perhaps that is only for pages

DDtKey commented Mar 1, 2024

The structure of parquet forces us to buffer an entire row group before we can flush it.

Yeah, that totally makes sense.

Something is wrong here, it should only consume up to 10Mb, perhaps you could use a memory profiler to identify where the usage is coming from

Well, it was definitely AsyncArrowWriter. Once I decreased the max row group size, memory usage became normal, but it still caches rows up to the row-count limit, ignoring any buffer limits.

I also tried not using the Arrow writer at all and instead writing to disk directly in a streaming manner; there were no issues.

I think I can provide an MRE easily; that will make it easier to profile.

DDtKey commented Mar 1, 2024

According to the docs, DEFAULT_MAX_ROW_GROUP_SIZE is a number of rows, not bytes

Yeah, that part is totally fine. I mean the default value looks odd, as if it were in bytes.

DDtKey commented Mar 1, 2024

[screenshot]

Here is a simple MRE:
https://gist.github.com/DDtKey/706930c78dbb296899c2ef1bbf86459a
Memory keeps increasing over time, and after interruption the target file is 0 bytes.

Pay attention to lines 30-33 of the gist; changing them changes everything: the amount of consumed memory becomes stable (a ~40 GB file was generated before I interrupted):

[screenshot]

tustvold commented Mar 1, 2024

Currently this is expected behaviour: row groups are only automatically "closed" based on row count.

I would suggest the following:

  • Document that AsyncArrowWriter's buffer size is not authoritative, and is bounded by the size of the row groups produced
  • Add the ability to limit the maximum size of a row group before ArrowWriter creates a new row group [1]

[1] Unlike the underlying SerializedFileWriter, where the API is expressed in terms of columns and chunking is therefore controlled by the caller, ArrowWriter could take a best-effort approach: check the in-progress size after writing each batch and decide whether to flush.
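
As a rough sketch of that best-effort idea, done from the caller's side with the sync ArrowWriter: after each batch, check the in-progress row group size and flush once it crosses a byte threshold. This assumes an in_progress_size() accessor of the kind referred to around #5251; the threshold and the rest of the code are illustrative, not an existing guarantee of the API:

use std::fs::File;
use std::sync::Arc;

use arrow_array::{ArrayRef, Int64Array, RecordBatch};
use parquet::arrow::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::create("bounded.parquet")?;

    let values: ArrayRef = Arc::new(Int64Array::from_iter_values(0..10_000));
    let batch = RecordBatch::try_from_iter([("v", values)])?;

    let mut writer = ArrowWriter::try_new(file, batch.schema(), None)?;

    // Illustrative limit: close the current row group once roughly 64 MiB is
    // buffered, even though the default row-count limit is nowhere near.
    const MAX_IN_PROGRESS_BYTES: usize = 64 * 1024 * 1024;

    for _ in 0..1_000 {
        writer.write(&batch)?;
        if writer.in_progress_size() >= MAX_IN_PROGRESS_BYTES {
            // Starts a new row group; smaller groups trade memory for metadata overhead.
            writer.flush()?;
        }
    }
    writer.close()?;
    Ok(())
}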

DDtKey commented Mar 1, 2024

Interestingly, with a decreased max_row_group_size it also keeps increasing over time, but much more slowly.
[screenshot]

And this one is with the default value (with delays enabled in between, to slow the process down):
[screenshot]

tustvold commented Mar 1, 2024

Interestingly, with a decreased max_row_group_size it also keeps increasing over time, but much more slowly.

This is not unexpected: certain information, such as indexes and statistics, needs to be retained until the footer is written. Smaller row groups make this overhead worse (which is why very small row groups are generally not a great idea).

DDtKey commented Mar 1, 2024

Oh, right, yes - that makes sense

tustvold added the documentation, good first issue, enhancement, and help wanted labels and removed the bug label on Mar 1, 2024
DDtKey commented Mar 1, 2024

By the way, can't we just explicitly force the ArrowWriter to flush and start a new row group from AsyncArrowWriter's try_flush? 🤔

Because the main issue (if I'm not wrong) is that we never reach this condition, due to the large max_row_group_size:

if in_progress.buffered_rows >= self.max_row_group_size {
    self.flush()?
}

tustvold commented Mar 1, 2024

By the way, can't we just explicitly force the ArrowWriter to flush and start a new row group from AsyncArrowWriter's try_flush? 🤔

Yes, that is an option that is available to users, and with #5251 the necessary meta information is exposed to the clients to make this judgement for themselves.

However, as this has come up a few times, providing a conservative default limit of, say, 1 GB is probably a sane modification; users can then lower it if they're happy to accept the trade-off of smaller row groups.

We don't want to use the buffer_size setting as this would then present an unfortunate trade-off where the limit would become the pre-allocation for the buffer, which we might never hit.

DDtKey commented Mar 1, 2024

To be honest, my initial expectation was that the buffer size I provide (in AsyncArrowWriter) ensures it flushes once that amount of data has accumulated. In fact it doesn't at all; only the number of rows matters.

Subjectively, it looks like we should flush the underlying ArrowWriter (and start a new row group) each time the buffer reaches its capacity.

That is typical behavior for buffered writers, at least; I don't expect one to keep holding such a large amount of data.

Moreover, we have max_row_group_size, not min_row_group_size, so it's currently confusing.

tustvold commented Mar 1, 2024

I agree it is potentially confusing, but I think the solution is to better document what buffer size is and is not, and potentially add separate functionality to ArrowWriter / AsyncArrowWriter to constrain row group size.

I am reticent to change this definition as the API breakage would be subtle and not immediately obvious, beyond larger files and worse query performance

DDtKey commented Mar 1, 2024

I am reticent to change this definition as the API breakage would be subtle and not immediately obvious, beyond larger files and worse query performance

It's doable by creating a new writer type or methods with the changed behavior, and optionally deprecating the old one.

Or something like AsyncArrowWriter::with_strict_buffer(..) (there could be a better name, this is just an example) with an internal flag to switch the behavior.
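
Purely as a hypothetical sketch of that API shape (none of these names exist in the parquet crate; the internals are invented here for illustration): an opt-in flag that makes the writer close the in-progress row group whenever its buffer reaches capacity, instead of waiting for max_row_group_size:

#![allow(dead_code)]

// Hypothetical, simplified shape; not a real parquet-crate API.
struct StrictBufferAsyncWriter {
    buffer_capacity: usize,
    // When true, flush (close the row group) as soon as the buffer is full.
    strict_buffer: bool,
}

impl StrictBufferAsyncWriter {
    // Analogous to the `with_strict_buffer(..)` idea suggested above.
    fn with_strict_buffer(mut self, strict: bool) -> Self {
        self.strict_buffer = strict;
        self
    }
}

fn main() {}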

tustvold commented Mar 1, 2024

I think we should start by documenting the current state of play and go from there; I'll try to get something up later today. It may be that we can get by with just an example showing how to limit memory usage.

tustvold commented Mar 3, 2024

FWIW #5458 tracks moving ObjectStore away from the somewhat problematic AsyncWrite abstraction

tustvold added a commit that referenced this issue Mar 8, 2024
* Document parquet writer memory limiting (#5450)

* Review feedback

* Review feedback
tustvold added the parquet label on Mar 15, 2024
label_issue.py automatically added labels {'parquet'} from #5471
