New parquet tools commands #132

swapnilushinde · 2015-03-04T17:34:15Z

Parquet files contain metadata about row count and file size. We should have new commands to get the row count and size.
These commands help us avoid parsing job logs or loading the data again just to find the number of rows. This comes in very handy in complex process chaining and in post-processing steps such as stats generation, QA, etc.

These commands can be added to parquet-tools:

rowcount : This command gives the row count of the Parquet input. It adds up the row counts of all files matching the Hadoop glob pattern. Use it with the -d option to get a detailed row count for each file matching the input pattern.
Examples with possible combinations:

-- Row count without pattern
 hadoop jar ./parquet-tools-1.6.0rc4.jar rowcount /abc/xyz/week_id_partition=748 
Total RowCount: 2425763803

-- Row count with pattern (non detailed)
 hadoop jar ./parquet-tools-1.6.0rc4.jar rowcount /abc/xyz/week_id_partition=74*
Total RowCount: 4781060690

-- Row count detailed for pattern
 hadoop jar ./parquet-tools-1.6.0rc4.jar rowcount -d /abc/xyz/week_id_partition=74*
week_id_partition=748 row count: 2425763803
week_id_partition=749 row count: 2355296887
Total RowCount: 4781060690
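
For readers who want to see the aggregation spelled out, here is a minimal sketch of what the `rowcount` examples above do with the per-path counts. It assumes the counts have already been read from the Parquet footers; the class `RowCountReport` and its `format` method are hypothetical names for illustration and are not the code in this PR. The output layout simply mirrors the sample output above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RowCountReport {
  // Builds the report text: optional per-path lines followed by the grand total.
  // 'counts' maps each matched path name to its row count (assumed to come from footers).
  public static String format(Map<String, Long> counts, boolean detailed) {
    StringBuilder out = new StringBuilder();
    long total = 0L;
    for (Map.Entry<String, Long> e : counts.entrySet()) {
      if (detailed) {
        out.append(e.getKey()).append(" row count: ").append(e.getValue()).append('\n');
      }
      total += e.getValue();
    }
    out.append("Total RowCount: ").append(total);
    return out.toString();
  }

  public static void main(String[] args) {
    // Values taken from the detailed example above; LinkedHashMap keeps insertion order.
    Map<String, Long> counts = new LinkedHashMap<>();
    counts.put("week_id_partition=748", 2425763803L);
    counts.put("week_id_partition=749", 2355296887L);
    System.out.println(format(counts, true));
  }
}
```

Passing `false` for `detailed` yields only the `Total RowCount` line, which corresponds to the non-detailed examples.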

size : This command gives the size of the Parquet data, with multiple options
pretty, p : Human readable size
uncompressed, u : Get uncompressed size
detailed, d : Detailed sizes for each matching parquet file with summary

-- compressed bytes without pattern
hadoop jar ./parquet-tools-1.6.0rc4.jar size /abc/xyz/week_id_partition=748
Total Size: 18452348360 bytes

-- compressed human readable size without pattern
hadoop jar ./parquet-tools-1.6.0rc4.jar size -p /abc/xyz/week_id_partition=748
Total Size: 17.355 GB

-- uncompressed human readable size without pattern
 hadoop jar ./parquet-tools-1.6.0rc4.jar size -uncompressed -pretty /abc/xyz/week_id_partition=748
Total Size: 102.505 GB



-- uncompressed detailed human readable size
hadoop jar ./parquet-tools-1.6.0rc4.jar size -uncompressed -pretty -d /abc/xyz/week_id_partition=74*
week_id_partition=748: 102.505 GB
week_id_partition=749: 99.167 GB
Total Size: 201.671 GB

-- compressed human readable size summary
hadoop jar ./parquet-tools-1.6.0rc4.jar size -pretty /abc/xyz/week_id_partition=74*
Total Size: 34.169 GB

-- uncompressed bytes in detailed
hadoop jar ./parquet-tools-1.6.0rc4.jar size -uncompressed -d  /abc/xyz/week_id_partition=74*
week_id_partition=748: 108988759585 bytes
week_id_partition=749: 105439653433 bytes
Total Size: 214428413018 bytes

Jira ticket-
https://issues.apache.org/jira/browse/PARQUET-196

rdblue · 2015-03-05T01:26:05Z

parquet-tools/src/main/java/parquet/tools/command/RowCountCommand.java

+
+  @Override
+  public void execute(CommandLine options) throws Exception {
+	super.execute(options);


Nit: It looks like you're mixing spaces and tabs. The rest of the project uses 2-space indentation, which would really help the readability of this code.

All tabs are removed.

rdblue · 2015-03-06T16:53:23Z

> size command is still expecting glob path. I feel it is helpful but let me know if you find it otherwise.

Initially, I thought that these should work on a single file like the other commands, but it sounds like you have a use case I'm not thinking about and intended for the commands to work that way. I think I can see the value of getting the total row count for a directory, since it would otherwise require adding up all of the individual counts from meta or dump. What I'm not sure is useful is the size command -- why is that needed?

swapnilushinde · 2015-03-06T17:46:53Z

@rdblue, as you said, I built these two commands to get the row count and size of directories/globs containing Parquet data. We have partitioned Parquet data for Hive tables. It is helpful to see the total row count and size of the complete data along with its breakdown by partition, so I can easily see whether my Parquet data is evenly distributed.
Having the option to get detailed row counts and sizes with a glob helps in QA steps, exporting data to other DBs, etc.
I wanted to build commands that provide all of the above but by default work like the existing commands (on a single Parquet file). Row count and size differ from the existing commands in that they can be expected to work on more than one file.
Let me know if this aligns with the project goals. I will rewrite based on your comments.

rdblue · 2015-03-11T22:30:33Z

@swapnilushinde, that use case sounds reasonable so let's add them back.

Are there other commands that make sense to have a glob also? Someone is adding it to the schema command, see #136.

swapnilushinde · 2015-03-16T03:46:03Z

@rdblue I have added those changes back. I don't think other commands need a glob, except #136, which is already done.

rdblue · 2015-03-23T23:56:14Z

@swapnilushinde thanks! I'll take a look soon-ish.

prateek · 2015-04-10T14:26:21Z

Hey @swapnilushinde, @rdblue: A few comments about this approach -

a) I really like the glob idea. I think it solves a definite use-case and should stay in there.
b) For both rowcount & size, one additional feature I'd like is the ability to specify the entity for which to list stats. Note that -d doesn't list the summary per file; it lists details per directory or file that matches the glob pattern. The default behavior makes sense to me (without -d); with -d, I'd like the ability to specify whether we want the summary per glob pattern, per file in the glob pattern, or, furthermore, per row group within each file in the glob pattern. The intended use case for the stats within the different row groups would be to understand how the data is being aligned vs. the HDFS block size. Essentially, diagnosing the issues mentioned here - 1. As for implementation, we could stick to -d [<detail-depth>] or go -d vs -dd vs -ddd. Either works and feels unix-y enough for my taste.
c) How about adding a flag to compute a statistical summary of the raw results displayed by the commands, operating on the same level of detail as specified by the -d proposed above? Even a simple min/max/avg/std.dev would go a long way toward understanding the distribution.
d) I really like the -u flag you have in size. I'd propose to offer the same functionality slightly differently. Instead of forcing the user to pick between displaying compressed and uncompressed sizes, we should take a list of args as input and display each requested stat. I'm thinking of the analog of ps -o, where you can specify which columns you'd like to output per pid. In our case, I imagine it to be something like:

-o [<**c**ompressed>|<**u**ncompressed>],[...]

e) In fact, with the approach mentioned above, we could simplify the implementation a bit: both rowcount & size could be aliases for an alternative command, let's say stats. This would take the analogous -o flag, along with another input type to indicate rowcount.
f) A minor nit, I found this implementation of making a byte size human readable on SO - 2, it's nifty.
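
To make the summary proposed in (c) concrete, here is a rough sketch of a min/max/avg/std.dev computation over per-file (or per-row-group) values. Everything in it is an assumption for illustration: the class name `SummaryStats`, the choice of population standard deviation, and the output format are not part of this PR or of parquet-tools.

```java
import java.util.Locale;

public class SummaryStats {
  public final long min;
  public final long max;
  public final double mean;
  public final double stdDev; // population standard deviation (an assumption)

  // 'values' would be e.g. row counts or sizes, one entry per file or row group.
  public SummaryStats(long[] values) {
    if (values.length == 0) {
      throw new IllegalArgumentException("need at least one value");
    }
    long lo = values[0];
    long hi = values[0];
    double sum = 0.0;
    for (long v : values) {
      lo = Math.min(lo, v);
      hi = Math.max(hi, v);
      sum += v;
    }
    double avg = sum / values.length;
    // Second pass: accumulate squared deviations from the mean.
    double squared = 0.0;
    for (long v : values) {
      squared += (v - avg) * (v - avg);
    }
    this.min = lo;
    this.max = hi;
    this.mean = avg;
    this.stdDev = Math.sqrt(squared / values.length);
  }

  @Override
  public String toString() {
    return String.format(Locale.ROOT, "min=%d max=%d avg=%.1f stddev=%.1f", min, max, mean, stdDev);
  }

  public static void main(String[] args) {
    long[] rowsPerFile = {100L, 200L, 300L}; // made-up sample values
    System.out.println(new SummaryStats(rowsPerFile));
  }
}
```

A large standard deviation relative to the mean would be a quick hint that files or row groups are unevenly sized.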

What do you think? I'm happy to help with the work in implementing ideas we deem useful in a follow up PR if we don't do it all here.

swapnilushinde · 2015-04-10T21:57:39Z

@prateek Thanks for your reply. I agree with you.
Please find my comments below:
b) We can add options or extra arguments to get the row count and summary at the file level and/or row group level. I am assuming you just want to see row counts or sizes per file or row group.
c) I can see the advantages of having a statistical summary of numerical columns. Not sure how it can be done with Parquet metadata, but it will be interesting :)
d) Yes. The current size command gives either the compressed or the uncompressed size.
f) We can change the size implementation to use log and exponent instead of if/else clauses. No performance gain but looks nifty !!
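
A possible shape for the log-and-exponent approach in (f), as a hedged sketch rather than the implementation in this PR or the one linked above. It assumes 1024-based units labelled KB/MB/GB and three decimals, to resemble the sample output in the description; the class name `HumanSize` is hypothetical. The exponent is taken from the bit length of the value, which gives an exact floor(log2) and avoids floating-point rounding in `Math.log`.

```java
import java.util.Locale;

public class HumanSize {
  private static final String[] UNITS = {"bytes", "KB", "MB", "GB", "TB", "PB", "EB"};

  public static String format(long bytes) {
    if (bytes < 0) {
      throw new IllegalArgumentException("negative size");
    }
    if (bytes < 1024) {
      return bytes + " bytes";
    }
    // floor(log2(bytes)) from the bit length; dividing by 10 gives floor(log1024(bytes)).
    int exp = (63 - Long.numberOfLeadingZeros(bytes)) / 10;
    // Math.pow(1024, exp) is an exact power of two for exp in 1..6.
    double scaled = bytes / Math.pow(1024, exp);
    return String.format(Locale.ROOT, "%.3f %s", scaled, UNITS[exp]);
  }

  public static void main(String[] args) {
    System.out.println(format(1536L)); // prints "1.500 KB"
  }
}
```

One known wrinkle of this style: a value just below a unit boundary can round up in the formatted string (for example to "1024.000" of the smaller unit), which an if/else chain would need to handle as well.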

Overall, I am thinking of opening another PR to work on it. Let's keep this PR as it is so we can get it merged. We could open another PR to implement all of the above features, plus some more commands, after brainstorming.
@rdblue What are your thoughts on it? I prefer to get this PR merged and work on the above features and a few more in a new PR.

prateek · 2015-04-13T19:07:26Z

@swapnilushinde Ideally, I'd say we don't commit any of the stuff where we know we are going to change the interface on the cli, I think that means we would make the changes for (d) and then do the rest in another ticket. That said, I don't know how big a deal it is to change the interface. @rdblue your call.

I can pick up the parts we leave off here in this PR: 1

swapnilushinde · 2015-04-28T14:38:22Z

@prateek @rdblue
Hello guys, sorry, I was busy with other stuff for the last few weeks, so I couldn't pay attention to this. @rdblue, could you please let us know what new changes/features you want in #132 so we can get it committed sooner? As prateek mentioned, we can work on more advanced features in a separate PR.

kadwanev · 2016-04-28T14:25:44Z

Why hasn't this been merged?

swapnilushinde · 2016-04-28T18:26:46Z

@rdblue - It's been a long time since I worked on this. Let me know if you need any further changes or if it can be merged directly.

Lucas-C · 2017-02-16T15:57:09Z

Hi.
This looks really useful !
Could it be merged please ?

julienledem · 2017-02-16T16:44:52Z

@rdblue this looks good to go. Any other comments?

rdblue · 2017-02-16T16:51:31Z

Looks fine to me.

swapnilushinde · 2017-02-16T19:03:54Z

@rdblue Thank you. Please let me know if I need to do something. I have been wanting to get it merged for a long time.

julienledem · 2017-02-16T19:28:17Z

@Swapnil: please create a PARQUET jira for this and prefix the description with the id: PARQUET-X: ...
Also rebase your branch.
Thank you.
When this is done, I'll merge

julienledem · 2017-02-16T19:30:07Z

@swapnilushinde gentle nagging on PRs is always fine :). Sometimes if your comment shows up at a busy time it falls through the cracks. Thank you for your contribution.

swapnilushinde · 2017-02-17T16:51:36Z

@rdblue @julienledem I have created another PR with rebase.
Here is the jira ticket-
https://issues.apache.org/jira/browse/PARQUET-196
PR-
https://github.com/Parquet/parquet-mr/pull/460

julienledem · 2017-02-18T01:41:18Z

@swapnilushinde sorry, your new PR is on the old repo. Use apache/parquet-mr, not Parquet/parquet-mr.
(merging master in your branch is fine too since we'll squash in the end)

swapnilushinde · 2017-02-23T19:50:07Z

@julienledem Sorry about that. Please find this PR based on apache/parquet-mr repo.
PR: #406
Here is the jira ticket-
https://issues.apache.org/jira/browse/PARQUET-196

swapnilushinde · 2017-03-01T21:44:34Z

@julienledem @rdblue : can you please take a look at above PR?

This is a rebase on already existing PR- #132
Author: Swapnil Shinde
Closes #406 from swapnilushinde/master and squashes the following commits:
59a8980 [Swapnil Shinde] Spacing to conform java style (if/for) is fixed
5fd0279 [Swapnil Shinde] Parquet-196: parquet-tools command for row count & size

ghost · 2018-04-25T17:14:42Z

this is useful, can we rebase and merge this in?

swapnilushinde · 2018-04-29T17:00:21Z

@Jokomo This has been rebased and merged in a different PR -
#406

This is a rebase on already existing PR- apache/parquet-java#132
Author: Swapnil Shinde
Closes #406 from swapnilushinde/master and squashes the following commits:
59a8980 [Swapnil Shinde] Spacing to conform java style (if/for) is fixed
5fd0279 [Swapnil Shinde] Parquet-196: parquet-tools command for row count & size
(cherry picked from commit fd7cfed)
Change-Id: I5bf7a27ea1bafa4145fdf1fb25610ded0308ac42

meetchandan · 2019-03-29T19:42:05Z

Looks useful, why not merge after resolving conflicts?

unknown added 2 commits March 4, 2015 01:03

New commands to get row count & file sizes matching glob pattern.

f910f6f

Scripts for rowcount & size commands

75e18f1

rdblue reviewed Mar 5, 2015

Changes asked by committer.

26a8f88

unknown added 2 commits March 7, 2015 11:20

commit before merge--dont push

6e2bb54

Merge remote-tracking branch 'upstream/master'

972f9ac

Conflicts: parquet-column/pom.xml parquet-tools/pom.xml

unknown added 3 commits March 15, 2015 00:08

commit before pull

b795463

Merge branch 'master' of https://github.com/apache/incubator-parquet-mr

0feb74a

Conflicts: parquet-tools/pom.xml

row count command with globs

e7e96cf

swapnilushinde closed this Mar 16, 2015

swapnilushinde reopened this Mar 16, 2015

rdblue mentioned this pull request Apr 9, 2015

PARQUET-196: parquet-tools command to get rowcount & size #172

Open

swapnilushinde mentioned this pull request Feb 22, 2017

PARQUET-196: parquet-tools command for row count & size #406

Closed
