-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New parquet tools commands #132
base: master
Are you sure you want to change the base?
Conversation
|
||
@Override | ||
public void execute(CommandLine options) throws Exception { | ||
super.execute(options); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: It looks like you're mixing spaces and tabs. The rest of the project uses 2-space indentation, which would really help the readability of this code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All tabs are removed.
Initially, I thought that these should work on a single file the other commands, but it sounds like you have a use case I'm not thinking about and intended for the commands to work that way. I think I can see the value of getting the total row count for a directory, since it would require adding up all of the individual counts from meta or dump. What I'm not sure is useful is the size command -- why is that needed? |
@rdblue , As you said, I built these two commands considering, getting row counts & size of directories/globs containing parquet data assets. We have partitioned data in parquet for hive tables. It will be helpful if I can see total row count & size of complete data with it's breakdown in partitions. I can easily see if my parquet data asset evenly distribution. |
Conflicts: parquet-column/pom.xml parquet-tools/pom.xml
@swapnilushinde, that use case sounds reasonable so let's add them back. Are there other commands that make sense to have a glob also? Someone is adding it to the schema command, see #136. |
Conflicts: parquet-tools/pom.xml
@swapnilushinde thanks! I'll take a look soon-ish. |
Hey @swapnilushinde, @rdblue: A few comments about this approach - a) I really like the -o [<**c**ompressed>|<**u**ncompressed>],[...] e) In fact with the approach mentioned above, we could simplify the implementation a bit - both What do you think? I'm happy to help with the work in implementing ideas we deem useful in a follow up PR if we don't do it all here. |
@prateek Thank for your reply. I agree with you. Overall, I am thinking of opening another PR to work on it. Let's keep this PR as it is so we can get it merged. We could open another PR to implement all above features with some more commands after brainstorming. |
@swapnilushinde Ideally, I'd say we don't commit any of the stuff where I can pick up the parts we leave off here in this PR: 1 On 10 Apr 2015, at 17:57, Swapnil wrote:
|
Why hasn't this been merged? |
@rdblue - It's been long time I worked on this. Let me know if you need any further changes or can be merged directly. |
Hi. |
@rdblue this looks good to go. Any other comments? |
Looks fine to me. |
@rdblue Thank you.. Please let me know if I need to do something. Wanting to get it merged for long time.. |
@Swapnil: please create a PARQUET jira for this and prefix the description with the id: PARQUET-X: ... |
@swapnilushinde gentle nagging on PRs is always fine :). Sometimes if your comment shows up at a busy time it falls through the cracks. Thank you for your contribution. |
@rdblue @julienledem I have created another PR with rebase. |
@swapnilushinde sorry your new PR is on the old repo. use apache/parquet-mr not Parquet/parquet-mr. |
@julienledem Sorry about that. Please find this PR based on apache/parquet-mr repo. |
@julienledem @rdblue : can you please take a look at above PR? |
This is a rebase on already existing PR- #132 Author: Swapnil Shinde <[email protected]> Closes #406 from swapnilushinde/master and squashes the following commits: 59a8980 [Swapnil Shinde] Spacing to conform java style (if/for) is fixed 5fd0279 [Swapnil Shinde] Parquet-196: parquet-tools command for row count & size
this is useful, can we rebase and merge this in? |
This is a rebase on already existing PR- apache/parquet-java#132 Author: Swapnil Shinde <[email protected]> Closes #406 from swapnilushinde/master and squashes the following commits: 59a8980 [Swapnil Shinde] Spacing to conform java style (if/for) is fixed 5fd0279 [Swapnil Shinde] Parquet-196: parquet-tools command for row count & size (cherry picked from commit fd7cfed) Change-Id: I5bf7a27ea1bafa4145fdf1fb25610ded0308ac42
Looks useful, why not merge after resolving conflicts? |
Parquet files contain metadata about rowcount & file size. We should have new commands to get rows count & size.
These command helps us to avoid parsing job logs or loading data once again just to find number of rows in data. This comes very handy in complex process chaining, post processes like stats generation, QA etc.
These command can be added in parquet-tools:
Examples with possible combinations-
Jira ticket-
https://issues.apache.org/jira/browse/PARQUET-196