New parquet tools commands #132

Open
wants to merge 8 commits into
base: master

swapnilushinde
Contributor

Parquet files contain metadata about row count and file size. We should have new commands to get the row count and size.
These commands help us avoid parsing job logs or loading the data again just to find the number of rows. This comes in very handy in complex process chaining and in post-processing steps such as stats generation, QA, etc.

These commands can be added to parquet-tools:

  • rowcount : This command gives the row count of the Parquet input. It adds up the row counts of all files matching the Hadoop glob pattern. Use it with option 'd' to get a detailed row count for each file matching the input pattern.
    Examples with possible combinations:
-- Row count without pattern
 hadoop jar ./parquet-tools-1.6.0rc4.jar rowcount /abc/xyz/week_id_partition=748 
Total RowCount: 2425763803

-- Row count with pattern (non detailed)
 hadoop jar ./parquet-tools-1.6.0rc4.jar rowcount /abc/xyz/week_id_partition=74*
Total RowCount: 4781060690

-- Row count detailed for pattern
 hadoop jar ./parquet-tools-1.6.0rc4.jar rowcount -d /abc/xyz/week_id_partition=74*
week_id_partition=748 row count: 2425763803
week_id_partition=749 row count: 2355296887
Total RowCount: 4781060690
  • size : This command gives the size of Parquet data, with multiple options:
      • pretty, p : Human-readable size
      • uncompressed, u : Uncompressed size
      • detailed, d : Detailed sizes for each matching Parquet file, with summary
-- compressed bytes without pattern
hadoop jar ./parquet-tools-1.6.0rc4.jar size /abc/xyz/week_id_partition=748
Total Size: 18452348360 bytes

-- compressed human readable size without pattern
hadoop jar ./parquet-tools-1.6.0rc4.jar size -p /abc/xyz/week_id_partition=748
Total Size: 17.355 GB

-- uncompressed human readable size without pattern
 hadoop jar ./parquet-tools-1.6.0rc4.jar size -uncompressed -pretty /abc/xyz/week_id_partition=748
Total Size: 102.505 GB



-- uncompressed detailed human readable size
hadoop jar ./parquet-tools-1.6.0rc4.jar size -uncompressed -pretty -d /abc/xyz/week_id_partition=74*
week_id_partition=748: 102.505 GB
week_id_partition=749: 99.167 GB
Total Size: 201.671 GB

-- compressed human readable size summary
hadoop jar ./parquet-tools-1.6.0rc4.jar size -pretty /abc/xyz/week_id_partition=74*
Total Size: 34.169 GB

-- uncompressed bytes in detailed
hadoop jar ./parquet-tools-1.6.0rc4.jar size -uncompressed -d  /abc/xyz/week_id_partition=74*
week_id_partition=748: 108988759585 bytes
week_id_partition=749: 105439653433 bytes
Total Size: 214428413018 bytes
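
As an illustration of the aggregation described above, here is a minimal sketch in Java. It assumes the per-path row counts have already been read from the Parquet footers and are passed in as a map; the class and method names (`RowCountSummary`, `report`) are hypothetical and are not the code in this PR. It only shows how the summary and the detailed (`-d`) output relate to each other.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: not part of parquet-tools.
class RowCountSummary {

  // Adds up the per-path counts. When detailed is true, one line per
  // matching path is emitted before the total, mirroring the -d output.
  static String report(Map<String, Long> countsByPath, boolean detailed) {
    StringBuilder out = new StringBuilder();
    long total = 0;
    for (Map.Entry<String, Long> entry : countsByPath.entrySet()) {
      if (detailed) {
        out.append(entry.getKey())
            .append(" row count: ")
            .append(entry.getValue())
            .append('\n');
      }
      total += entry.getValue();
    }
    out.append("Total RowCount: ").append(total);
    return out.toString();
  }

  public static void main(String[] args) {
    // Counts taken from the detailed example above; in the real command
    // they would come from each file's footer metadata.
    Map<String, Long> counts = new LinkedHashMap<>();
    counts.put("week_id_partition=748", 2425763803L);
    counts.put("week_id_partition=749", 2355296887L);
    System.out.println(report(counts, true));
  }
}
```

The same loop would apply to the size command, with byte sizes in place of row counts.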

Jira ticket-
https://issues.apache.org/jira/browse/PARQUET-196


@Override
public void execute(CommandLine options) throws Exception {
super.execute(options);
Contributor

Nit: It looks like you're mixing spaces and tabs. The rest of the project uses 2-space indentation, which would really help the readability of this code.

Contributor Author

All tabs are removed.

@rdblue
Contributor

rdblue commented Mar 6, 2015

size command is still expecting glob path. I feel it is helpful but let me know if you find it otherwise.

Initially, I thought that these should work on a single file like the other commands, but it sounds like you have a use case I'm not thinking about and intended for the commands to work that way. I think I can see the value of getting the total row count for a directory, since it would otherwise require adding up all of the individual counts from meta or dump. What I'm not sure is useful is the size command -- why is that needed?

@swapnilushinde
Contributor Author

@rdblue, as you said, I built these two commands with row counts & sizes of directories/globs containing Parquet data assets in mind. We have partitioned data in Parquet for Hive tables. It will be helpful if I can see the total row count & size of the complete data with its breakdown by partition. I can easily see whether my Parquet data asset is evenly distributed.
Having the option to get detailed row count & size with a glob helps in QA steps, exporting data to other DBs, etc.
I wanted to build commands that give all of the above but by default work like the existing commands (on a single Parquet file). Row count & size differ from the existing commands in that they can be expected to work on more than one file.
Let me know if this aligns with the project goals. I will rewrite based on your comments.

unknown added 2 commits March 7, 2015 11:20
Conflicts:
	parquet-column/pom.xml
	parquet-tools/pom.xml
@rdblue
Contributor

rdblue commented Mar 11, 2015

@swapnilushinde, that use case sounds reasonable so let's add them back.

Are there other commands that make sense to have a glob also? Someone is adding it to the schema command, see #136.

@swapnilushinde
Contributor Author

@rdblue I have added back those changes. I think the other commands don't need a glob, except for #136, which is already done.

@rdblue
Contributor

rdblue commented Mar 23, 2015

@swapnilushinde thanks! I'll take a look soon-ish.

@prateek

prateek commented Apr 10, 2015

Hey @swapnilushinde, @rdblue: A few comments about this approach -

a) I really like the glob idea. I think it solves a definite use-case and should stay in there.
b) For both rowcount & size, one additional feature I'd like is the ability to specify the entity for which to list stats. Note that -d doesn't list the summary per file; it lists details per directory or file that matches the glob pattern. The default behavior makes sense to me (without -d); with -d, I'd like the ability to specify whether we want the summary per glob pattern, per file in the glob pattern, or, going further, per row group within each file in the glob pattern. The intended use case for the per-row-group stats would be to understand how the data is aligned vs. the HDFS block size. Essentially, diagnosing the issues mentioned here - 1. As for implementation, we could stick to -d [<detail-depth>] or go -d vs -dd vs -ddd. Either works and feels unix-y enough for my taste.
c) How about adding a flag to compute a statistical summary of the raw results displayed by the commands, operating at the same level of detail as specified by the -d proposed above? Even a simple min/max/avg/std.dev would go a long way toward understanding the distribution.
d) I really like the -u flag you have in size. I'd propose offering the same functionality slightly differently. Instead of forcing the user to pick between displaying compressed and uncompressed sizes, we should take a list of args as input and display each requested stat. I'm thinking of an analog of ps -o, where you can specify which columns you'd like to output per pid. In our case, I imagine it to be something like:

-o [<**c**ompressed>|<**u**ncompressed>],[...]

e) In fact, with the approach mentioned above, we could simplify the implementation a bit: both rowcount & size could be aliases for an alternative command, let's say stats. This would take the analogous -o flag, along with another input type to indicate rowcount.
f) A minor nit: I found this implementation for making a byte size human readable on SO - 2; it's nifty.

What do you think? I'm happy to help with the work in implementing ideas we deem useful in a follow up PR if we don't do it all here.
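
To make the summary proposed in (c) concrete, here is a rough Java sketch of min/max/avg/std.dev over a list of per-entity values (row counts or sizes). Everything in it is an assumption for illustration: the class name `SizeStats`, the `summarize` method, and the choice of the population standard deviation are hypothetical and do not exist in parquet-tools.

```java
// Hypothetical helper: not part of parquet-tools.
class SizeStats {

  // Returns {min, max, mean, population standard deviation} for the
  // given values, e.g. one value per file or per row group.
  static double[] summarize(long[] values) {
    if (values.length == 0) {
      throw new IllegalArgumentException("at least one value is required");
    }
    long min = values[0];
    long max = values[0];
    double sum = 0;
    for (long v : values) {
      min = Math.min(min, v);
      max = Math.max(max, v);
      sum += v;
    }
    double mean = sum / values.length;

    // Second pass: average squared deviation from the mean.
    double squaredDeviations = 0;
    for (long v : values) {
      double d = v - mean;
      squaredDeviations += d * d;
    }
    double stdDev = Math.sqrt(squaredDeviations / values.length);
    return new double[] {min, max, mean, stdDev};
  }

  public static void main(String[] args) {
    // Per-partition row counts from the rowcount -d example in this PR.
    long[] rowCounts = {2425763803L, 2355296887L};
    double[] s = summarize(rowCounts);
    System.out.printf("min=%.0f max=%.0f avg=%.0f stddev=%.0f%n", s[0], s[1], s[2], s[3]);
  }
}
```

A small standard deviation relative to the mean would suggest the data is evenly spread across the listed entities.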

@swapnilushinde
Contributor Author

@prateek Thanks for your reply. I agree with you.
Please find my comments below-
b) We can add options or extra arguments to get rowcount and summary on file level and/or row group level. I am assuming you just want to see row counts or size per file or row group.
c) I can see advantages of having statistical summary of numerical columns. Not sure how it can be done with parquet meta data but will be interesting :)
d) Yes. Current size command gives either compressed or uncompressed size.
f) We can change size implementation to use log and exponent instead of if else clauses. No performance gain but looks nifty !!
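
For reference, the idea in (f) could look roughly like the Java sketch below. It is not the code from this PR or from the linked answer: the class name `HumanReadableSize` and the `pretty` method are hypothetical, binary (1024-based) units and three decimals are assumptions, and the exponent is derived from the value's bit length (an integer base-2 logarithm) rather than `Math.log`, to avoid floating-point rounding at exact powers of 1024.

```java
import java.util.Locale;

// Hypothetical helper: not part of parquet-tools.
class HumanReadableSize {
  private static final String[] UNITS = {"bytes", "KB", "MB", "GB", "TB", "PB", "EB"};

  // Picks the unit from the exponent instead of an if/else chain.
  static String pretty(long bytes) {
    if (bytes < 1024) {
      return bytes + " bytes";
    }
    // floor(log2(bytes)) / 10 == floor(log1024(bytes)) for bytes >= 1024.
    int exponent = (63 - Long.numberOfLeadingZeros(bytes)) / 10;
    double value = bytes / (double) (1L << (10 * exponent));
    return String.format(Locale.ROOT, "%.3f %s", value, UNITS[exponent]);
  }

  public static void main(String[] args) {
    System.out.println(pretty(512));
    System.out.println(pretty(1536));
    System.out.println(pretty(5L * 1024 * 1024 * 1024));
  }
}
```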

Overall, I am thinking of opening another PR to work on it. Let's keep this PR as it is so we can get it merged. We could open another PR to implement all above features with some more commands after brainstorming.
@rdblue What are your thoughts on it? I prefer to get this PR merged and work on above features and few more in new PR.

@prateek

prateek commented Apr 13, 2015

@swapnilushinde Ideally, I'd say we don't commit any of the stuff where
we know we are going to change the interface on the cli, I think that
means we would make the changes for (d) and then do the rest in another
ticket. That said, I don't know how big a deal it is to change the
interface. @rdblue your call.

I can pick up the parts we leave off here in this PR: 1


@swapnilushinde
Contributor Author

@prateek @rdblue
Hello guys, sorry, I was busy with other stuff for the last few weeks, so I couldn't pay attention to this. @rdblue, could you please let us know what new changes/features you want in #132 so we can get it committed sooner? As prateek mentioned, we can work on more advanced features in a separate PR.

@kadwanev

Why hasn't this been merged?

@swapnilushinde
Contributor Author

@rdblue - It's been a long time since I worked on this. Let me know if you need any further changes or if it can be merged directly.

@Lucas-C

Lucas-C commented Feb 16, 2017

Hi.
This looks really useful !
Could it be merged please ?

@julienledem
Member

@rdblue this looks good to go. Any other comments?

@rdblue
Contributor

rdblue commented Feb 16, 2017

Looks fine to me.

@swapnilushinde
Contributor Author

@rdblue Thank you. Please let me know if I need to do something. I've been wanting to get this merged for a long time.

@julienledem
Member

@Swapnil: please create a PARQUET jira for this and prefix the description with the id: PARQUET-X: ...
Also rebase your branch.
Thank you.
When this is done, I'll merge

@julienledem
Member

@swapnilushinde gentle nagging on PRs is always fine :). Sometimes if your comment shows up at a busy time it falls through the cracks. Thank you for your contribution.

@swapnilushinde
Contributor Author

@rdblue @julienledem I have created another PR with rebase.
Here is the jira ticket-
https://issues.apache.org/jira/browse/PARQUET-196
PR-
https://github.com/Parquet/parquet-mr/pull/460

@julienledem
Member

@swapnilushinde sorry your new PR is on the old repo. use apache/parquet-mr not Parquet/parquet-mr.
(merging master in your branch is fine too since we'll squash in the end)

@swapnilushinde
Contributor Author

@julienledem Sorry about that. Please find this PR based on apache/parquet-mr repo.
PR: #406
Here is the jira ticket-
https://issues.apache.org/jira/browse/PARQUET-196

@swapnilushinde
Contributor Author

@julienledem @rdblue : can you please take a look at above PR?

asfgit pushed a commit that referenced this pull request May 12, 2017
This is a rebase on already existing PR-
#132

Author: Swapnil Shinde <[email protected]>

Closes #406 from swapnilushinde/master and squashes the following commits:

59a8980 [Swapnil Shinde] Spacing to conform java style (if/for) is fixed
5fd0279 [Swapnil Shinde] Parquet-196: parquet-tools command for row count & size
@ghost

ghost commented Apr 25, 2018

this is useful, can we rebase and merge this in?

@swapnilushinde
Contributor Author

@Jokomo This has been rebased and merged in a different PR -
#406

cloudera-hudson pushed a commit to cloudera/parquet-mr that referenced this pull request Mar 29, 2019
This is a rebase on already existing PR-
apache/parquet-java#132

Author: Swapnil Shinde <[email protected]>

Closes #406 from swapnilushinde/master and squashes the following commits:

59a8980 [Swapnil Shinde] Spacing to conform java style (if/for) is fixed
5fd0279 [Swapnil Shinde] Parquet-196: parquet-tools command for row count & size

(cherry picked from commit fd7cfed)

Change-Id: I5bf7a27ea1bafa4145fdf1fb25610ded0308ac42
@meetchandan

Looks useful, why not merge after resolving conflicts?
