
High Memory Usage and Long GC Times When Writing Parquet Files #3102

Open
ccl125 opened this issue Dec 10, 2024 · 1 comment

Comments

ccl125 commented Dec 10, 2024

Describe the usage question you have. Please include as many useful details as possible.

In my project, I am using the following code to write Parquet files to the server:

ParquetWriter<Group> parquetWriter = ExampleParquetWriter.builder(new Path(filePath))
        .withConf(new Configuration())
        .withType(messageType)   // messageType: the 30,000-column schema
        .build();

Each Parquet file contains 30,000 columns. This code is executed by multiple threads simultaneously, which leads to long GC times. Memory analysis shows that the main memory consumers lie on the following chain:

InternalParquetRecordWriter -> ColumnWriterV1 -> FallbackValuesWriter -> PlainDoubleDictionaryValuesWriter -> IntList

As far as I can tell, each open writer keeps one ColumnWriterV1 (and therefore one dictionary IntList) per leaf column, so these buffers multiply across the 30,000 columns and across the concurrent writers. Each thread writes to a file with the same table schema (header); only the filePath differs.

I initially suspected that the memory usage was caused by the file buffer not being flushed in time. To address this, I tried configuring the writer with the following parameters:

parquetWriter = ExampleParquetWriter.builder(new Path(filePath))
        .withConf(new Configuration())
        .withType(messageType)
        .withMinRowCountForPageSizeCheck(SpringContextUtils.getApplicationContext()
                .getBean(EtlTaskProperties.class).getMinRowCountForPageSizeCheck())
        .withMaxRowCountForPageSizeCheck(SpringContextUtils.getApplicationContext()
                .getBean(EtlTaskProperties.class).getMaxRowCountForPageSizeCheck())
        .withRowGroupSize(SpringContextUtils.getApplicationContext()
                .getBean(EtlTaskProperties.class).getRowGroupSize())
        .build();

However, these adjustments did not solve the issue. The program still experiences long GC pauses and excessive memory usage.
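
For completeness, I also intend to try bounding the per-column buffers directly. The sketch below is untested on my side; filePath and messageType are the same as above, and I am assuming the page-size and dictionary-page-size settings limit the in-memory buffers the way their names suggest:

// Smaller page and dictionary-page limits should force earlier flushes, keeping the
// per-column in-memory buffers (including the dictionary id IntList) smaller.
ParquetWriter<Group> writer = ExampleParquetWriter.builder(new Path(filePath))
        .withConf(new Configuration())
        .withType(messageType)
        .withPageSize(64 * 1024)            // default is 1 MB per column page buffer
        .withDictionaryPageSize(64 * 1024)  // caps each column's dictionary before falling back to plain
        .withRowGroupSize(8 * 1024 * 1024)  // smaller row groups flush column data to disk sooner
        .build();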

Expected Behavior

Efficient Parquet file writing with reduced GC time and optimized memory usage when multiple threads are writing files simultaneously.

Observed Behavior
• Increased GC time and excessive memory usage.
• Memory analysis indicates that the IntList instances under PlainDoubleDictionaryValuesWriter are the primary memory consumers.

Request

What are the recommended strategies to mitigate excessive memory usage in this scenario?
Is there a way to share table schema-related objects across threads, or other optimizations to reduce memory overhead?

Please let me know if additional information is needed!


ccl125 closed this as completed Dec 10, 2024
ccl125 reopened this Dec 10, 2024

ccl125 commented Dec 12, 2024

I noticed that when I set withDictionaryEncoding(false), the writer switches from using FallbackValuesWriter to PlainValuesWriter. These two have significantly different memory usage. It seems that using PlainValuesWriter might address my issue.
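
For reference, this is the builder change I tested (sketch; filePath and messageType as in my first comment):

// With dictionary encoding disabled, values go through PlainValuesWriter directly,
// so no per-column dictionary ids (IntList) are buffered in memory.
ParquetWriter<Group> writer = ExampleParquetWriter.builder(new Path(filePath))
        .withConf(new Configuration())
        .withType(messageType)
        .withDictionaryEncoding(false)
        .build();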

Here is the context:
• Each file has a fixed 500 rows.
• The number of columns varies, ranging from approximately 1 to 30,000.

I would like to know:
1. Can I directly solve the problem by setting withDictionaryEncoding(false)?
2. How will this impact file size, write efficiency, and read performance?
