PARQUET-869: Configurable record counts for block size checks #470
base: master
Conversation
Force-pushed from de6c312 to 4d5dd4e.
My only findings are the annoying backward-compatibility ones.
```
@@ -45,8 +45,11 @@
  public static final boolean DEFAULT_IS_DICTIONARY_ENABLED = true;
  public static final WriterVersion DEFAULT_WRITER_VERSION = WriterVersion.PARQUET_1_0;
  public static final boolean DEFAULT_ESTIMATE_ROW_COUNT_FOR_PAGE_SIZE_CHECK = true;
  public static final int DEFAULT_MINIMUM_RECORD_COUNT_FOR_CHECK = 100;
```
Not sure if anyone would use such constants, but it is a breaking change to remove them. It might be a good idea to deprecate them instead and use the new ones internally.
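A minimal sketch of the deprecation approach suggested here; the replacement constant name below is an assumption for illustration, not the name used in this PR:

```java
// Keep the old constant as a deprecated alias so existing callers still compile.
/** @deprecated use {@link #DEFAULT_MINIMUM_ROW_COUNT_FOR_SIZE_CHECK} instead */
@Deprecated
public static final int DEFAULT_MINIMUM_RECORD_COUNT_FOR_CHECK = 100;

// Hypothetical new name; internal code would reference this one.
public static final int DEFAULT_MINIMUM_ROW_COUNT_FOR_SIZE_CHECK =
    DEFAULT_MINIMUM_RECORD_COUNT_FOR_CHECK;
```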
```
public ValuesWriterFactory getValuesWriterFactory() {
  return valuesWriterFactory;
}

public boolean estimateNextSizeCheck() {
```
I suggest deprecating instead of removing.
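For example, the removed method could be kept as a thin deprecated wrapper; `estimateRowCountForNextSizeCheck()` is a hypothetical name for its replacement:

```java
/**
 * @deprecated kept for binary compatibility;
 *             use {@link #estimateRowCountForNextSizeCheck()} instead.
 */
@Deprecated
public boolean estimateNextSizeCheck() {
  return estimateRowCountForNextSizeCheck(); // hypothetical new method
}
```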
@gszadovszky, I can accept that, but how would the API client know that? We already know of some modifications to the "internal API" that caused problems for our clients.
None of the org.apache.parquet.column classes are public (see https://github.com/apache/parquet-mr/blob/master/pom.xml#L250). I know it is annoying not to have a public API, but I think it is much worse to slow development by maintaining compatibility on internal classes than to break the few people who were, for some unknown reason, using an internal API with little use outside of the project.
In my opinion, precedents of earlier breaking changes do not justify adding more of them. Parquet is a leading file format for big data applications and as such should fully respect the semantic versioning rules for backwards compatibility. We would like breaking changes in Parquet to be taken more seriously, and we advocate no longer following the bad example that was set earlier.
Your point that the compatibility of the leaked parts of the API is a pain is true, but I think that we (= Parquet developers) should be the ones who feel this pain and deal with these issues. We should not push this burden onto the developers who consume our library and do not know which parts of the API were intended to be public and which parts were just leaked by accident, because we did not communicate this distinction properly. If anything, the burden of maintaining compatibility should serve as a motivation to define our API more clearly.
We all agreed that the lack of a well-defined API is a problem that can only be fixed properly in the next major release. However, I don't think that means that we should neglect compatibility in the meantime. On the contrary, now that we have set a goal for a proper API, we should limit our breaking changes to a minimum until we get there.
There may be a level of required effort at which the cost of maintaining compatibility outweighs its advantages, but in simple cases like this, where remaining backwards compatible takes literally zero effort, I see no reason for introducing a breaking change.
I don't agree with a requirement for full binary compatibility across the entire codebase because we lack a public API. We already have significant drag from maintaining compatibility in the classes that are public, and I think it's a bad idea to introduce that problem everywhere. Let's work on a public API if not having one is going to prevent us from making reasonable changes to internal classes. @julienledem, any thoughts on this?
@zivanfi Will there be any progress on this fix? Note the disclosure at https://eng.uber.com/petastorm/ ... it sounds like people are forking Parquet in order to get around this bug.
We have a large number of images that need to be stored in Parquet, but we ran into the situation described above. We hope this optimization can be moved forward as soon as possible; it would be very helpful for our work. Thanks!
Same here: we have one or two columns that can vary widely in size (a few KB up to 10 MB), and we often hit an OutOfMemory error because the writer didn't check the buffered rows in time. Being able to adjust the check frequency would be a huge help 👍 I have a branch rebased against master if anyone is interested.
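For comparison, recent parquet-mr versions already expose similar knobs for the page size check through `ParquetProperties`; whether these also govern the row-group check depends on the version, so treat this as a sketch:

```java
import org.apache.parquet.column.ParquetProperties;

ParquetProperties props = ParquetProperties.builder()
    // Check buffered sizes after at least 10 and at most 1000 records,
    // instead of the defaults (100 and 10000).
    .withMinRowCountForPageSizeCheck(10)
    .withMaxRowCountForPageSizeCheck(1000)
    .build();
```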
Well, I'm glad I found this bug before I started saving images into Parquet files.
@livelace If at some point you consider using Parquet with binary files (or any big columns), know that increasing the frequency of the checks may not be enough (it was not for me). I had to fork Parquet to be able to opt out of computing statistics for some of my columns; see https://issues.apache.org/jira/browse/PARQUET-1911. Since then I have never had any OOM issue.
This PR builds on #447 and updates the properties to use "row group" instead of "block", because "block" is confusing. It also addresses the outstanding review comments so this can be merged.
Closes #447.
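Assuming the rename described above, the Hadoop-side configuration might look as follows; the exact keys are hypothetical (derived from this PR's description, not confirmed against the merged code):

```java
import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// Hypothetical keys reflecting the "block" -> "row group" rename; verify
// against ParquetOutputFormat's constants before relying on them.
conf.setInt("parquet.row-group.size.row.check.min", 10);
conf.setInt("parquet.row-group.size.row.check.max", 1000);
```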