PARQUET-869: Configurable record counts for block size checks #470
base: master
Conversation
Force-pushed from de6c312 to 4d5dd4e.
My only findings are the annoying backward-compatibility ones.
```
@@ -45,8 +45,11 @@
  public static final boolean DEFAULT_IS_DICTIONARY_ENABLED = true;
  public static final WriterVersion DEFAULT_WRITER_VERSION = WriterVersion.PARQUET_1_0;
  public static final boolean DEFAULT_ESTIMATE_ROW_COUNT_FOR_PAGE_SIZE_CHECK = true;
  public static final int DEFAULT_MINIMUM_RECORD_COUNT_FOR_CHECK = 100;
```
Not sure if anyone would use such constants, but it is a breaking change to remove them. It might be a good idea to deprecate them instead and use the new ones internally.
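A minimal sketch of the deprecation approach suggested here; the replacement constant name below is an assumption for illustration, not the name used in this PR:

```java
// Keep the old constant as a deprecated alias so existing callers still compile.
/** @deprecated use {@link #DEFAULT_MINIMUM_ROW_COUNT_FOR_SIZE_CHECK} instead */
@Deprecated
public static final int DEFAULT_MINIMUM_RECORD_COUNT_FOR_CHECK = 100;

// Hypothetical new name; internal code would reference this one.
public static final int DEFAULT_MINIMUM_ROW_COUNT_FOR_SIZE_CHECK =
    DEFAULT_MINIMUM_RECORD_COUNT_FOR_CHECK;
```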
```
public ValuesWriterFactory getValuesWriterFactory() {
  return valuesWriterFactory;
}

public boolean estimateNextSizeCheck() {
```
I suggest deprecating instead of removing.
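For example, the removed method could be kept as a thin deprecated wrapper; `estimateRowCountForNextSizeCheck()` is a hypothetical name for its replacement:

```java
/**
 * @deprecated kept for binary compatibility;
 *             use {@link #estimateRowCountForNextSizeCheck()} instead.
 */
@Deprecated
public boolean estimateNextSizeCheck() {
  return estimateRowCountForNextSizeCheck(); // hypothetical new method
}
```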
@gszadovszky, I can accept that, but how would the API client know that? We already know of some modifications to the "internal API" that caused problems for our clients.
None of the org.apache.parquet.column classes are public (see https://github.com/apache/parquet-mr/blob/master/pom.xml#L250). I know it is annoying not to have a public API, but I think it is much worse to slow development by maintaining compatibility on internal classes than to break the few people who were, for some unknown reason, using an internal API with little use outside of the project.
In my opinion, precedents of earlier breaking changes do not justify adding more of them. Parquet is a leading file format for big data applications and as such should fully respect the semantic versioning rules for backwards compatibility. We would like breaking changes in Parquet to be taken more seriously, and we advocate no longer following the bad example that was set earlier.
Your point that the compatibility of the leaked parts of the API is a pain is true, but I think that we (= Parquet developers) should be the ones who feel this pain and deal with these issues. We should not push this burden onto the developers who consume our library and do not know which parts of the API were intended to be public and which parts were just leaked by accident, because we did not communicate this distinction properly. If anything, the burden of maintaining compatibility should serve as a motivation to define our API more clearly.
We all agreed that the lack of a well-defined API is a problem that can only be fixed properly in the next major release. However, I don't think that means that we should neglect compatibility in the meantime. On the contrary, now that we have set a goal for a proper API, we should limit our breaking changes to a minimum until we get there.
There may be a level of required effort at which the cost of maintaining compatibility outweighs its advantages, but in simple cases like this, where remaining backwards compatible takes literally zero effort, I see no reason for introducing a breaking change.
I don't agree with a requirement for full binary compatibility across the entire codebase because we lack a public API. We already have significant drag from maintaining compatibility in the classes that are public, and I think it's a bad idea to introduce that problem everywhere. Let's work on a public API if not having one is going to prevent us from making reasonable changes to internal classes. @julienledem, any thoughts on this?
@zivanfi Will there be any progress on this fix? Note the disclosure at https://eng.uber.com/petastorm/ ... it sounds like people are forking Parquet in order to get around this bug.
We have a large number of images that need to be stored in Parquet, but we ran into the situation described above. We hope this optimization can be moved forward as soon as possible; it would be very helpful for our work. Thanks!
Same here: we have one or two columns that can vary widely in size (a few KB up to 10 MB), and we often hit an OutOfMemory error because the writer didn't check the buffered rows in time. Being able to adjust the check frequency would be a huge help 👍 I have a branch rebased against master if anyone is interested.
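For comparison, recent parquet-mr versions already expose similar knobs for the page size check through `ParquetProperties`; whether these also govern the row-group check depends on the version, so treat this as a sketch:

```java
import org.apache.parquet.column.ParquetProperties;

ParquetProperties props = ParquetProperties.builder()
    // Check buffered sizes after at least 10 and at most 1000 records,
    // instead of the defaults (100 and 10000).
    .withMinRowCountForPageSizeCheck(10)
    .withMaxRowCountForPageSizeCheck(1000)
    .build();
```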
Well, I'm glad I found this bug before I started saving images into Parquet files.
@livelace If at some point you consider using Parquet with binary files (or any big columns), know that increasing the frequency of the checks may not be enough (it was not for me). I had to fork Parquet to be able to opt out of computing statistics for some of my columns; see https://issues.apache.org/jira/browse/PARQUET-1911. Since then I have never had any OOM issue.
This PR builds on #447 and updates the properties to use "row group" instead of "block", because "block" is confusing. It also addresses the outstanding review comments so this can be merged.
Closes #447.
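Assuming the rename described above, the Hadoop-side configuration might look as follows; the exact keys are hypothetical (derived from this PR's description, not confirmed against the merged code):

```java
import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// Hypothetical keys reflecting the "block" -> "row group" rename; verify
// against ParquetOutputFormat's constants before relying on them.
conf.setInt("parquet.row-group.size.row.check.min", 10);
conf.setInt("parquet.row-group.size.row.check.max", 1000);
```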