Commit a parquet-mr patch that enables writing out row-group sizes smaller than 100 #90

Open
selitvin opened this issue Aug 23, 2018 · 2 comments
Labels
enhancement New feature or request

Comments

@selitvin
Collaborator

The parquet-hadoop library does not support row groups smaller than 100 rows (PARQUET-409).
Until the Parquet project resolves this, we should add a patch (or a reference to a pull request) plus build instructions, to make it easier for our users to generate parquet files with row groups smaller than 100 rows.
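
To illustrate why such a limit matters, here is a toy Python model. It is a sketch, not the parquet-mr code: it assumes the writer compares the buffered size against the target row-group size only once at least 100 records are buffered. The function name `simulate_row_groups`, the constant `MIN_RECORDS_BEFORE_CHECK`, and the simplified check are all hypothetical and exist only for this illustration.

```python
# Toy model of a row-group writer that only checks the buffered size after a
# minimum number of records. This is NOT the parquet-mr implementation; the
# names and the simplified check are assumptions made for illustration.

MIN_RECORDS_BEFORE_CHECK = 100  # assumed floor, matching the limit described above


def simulate_row_groups(record_sizes, target_bytes,
                        min_records=MIN_RECORDS_BEFORE_CHECK):
    """Return the number of records in each row group the toy writer emits."""
    groups = []
    count = 0      # records buffered in the current row group
    buffered = 0   # bytes buffered in the current row group
    for size in record_sizes:
        count += 1
        buffered += size
        # The size check is skipped until at least `min_records` are buffered.
        if count >= min_records and buffered >= target_bytes:
            groups.append(count)
            count = 0
            buffered = 0
    if count:
        groups.append(count)  # flush the remainder when the writer closes
    return groups


if __name__ == "__main__":
    one_mb = 1024 * 1024
    records = [one_mb] * 250  # 250 records of 1 MB each

    # With the floor of 100 records, a 10 MB target is overshot tenfold.
    print(simulate_row_groups(records, target_bytes=10 * one_mb))

    # Without the floor, every row group closes at the 10 MB target.
    print(simulate_row_groups(records, target_bytes=10 * one_mb, min_records=1))
```

With 1 MB records and a 10 MB target, the model with the floor emits row groups of 100 records (about 100 MB each), while the model without it emits groups of 10 records. Large records, such as images, are the case where the floor dominates the requested size.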

@selitvin selitvin added the enhancement New feature or request label Aug 23, 2018
@selitvin selitvin changed the title from "Add a parquet we use to control row-group sizes better" to "Commit a parquet-mr patch that enables writing out row-group sizes smaller than 100" Aug 27, 2018
@ingolfured
Contributor

Is there a link to a patch (or even better, reference to a pull request) we can have a look at?

@selitvin
Collaborator Author

I don't remember the details, since it was a long time ago. See whether either of these references helps:
apache/parquet-java#470
https://issues.apache.org/jira/browse/PARQUET-409

In my experience, it's typically not a good idea to have parquet stores with small row groups. Doing so violates a number of assumptions about the structure of a parquet store and makes you "fight" the parquet library implementation a lot. In some scenarios this shows up as poor performance and a large memory footprint.
