Sort input by count, printing totals and percentages.
Think of it as sort | uniq -c | sort -nr on steroids ;)
Sample output:
$ topuniq --min-count=100 examples/2-icon-types.txt
39564 100.0% Total (8)
25373 64.1% png
12128 30.7% svg
1290 3.3% xpm
685 1.7% icon
88 0.2% Other (4)
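The totals and percentages shown above can be approximated with the classic pipeline plus a little awk. This is only a minimal sketch, not topuniq's actual code: the function name count_with_perc is made up for illustration, and it implements none of the filtering options.

```shell
# Hypothetical helper (not part of topuniq): count unique lines, then print
# a total line followed by each unique line with its count and percentage.
count_with_perc() {
    LC_ALL=C sort | uniq -c | LC_ALL=C sort -nr |
    LC_ALL=C awk '
        {
            count[NR] = $1              # uniq -c puts the count in field 1
            sub(/^ *[0-9]+ /, "")       # strip the count, keep the original line
            text[NR] = $0
            total += count[NR]
        }
        END {
            if (NR == 0) exit           # no input, nothing to print
            printf "%d %.1f%% Total (%d)\n", total, 100, NR
            for (i = 1; i <= NR; i++)
                printf "%d %.1f%% %s\n", count[i], 100 * count[i] / total, text[i]
        }'
}

printf '%s\n' png svg png png | count_with_perc
# 4 100.0% Total (2)
# 3 75.0% png
# 1 25.0% svg
```

All lines have to be read before anything is printed, because each percentage depends on the grand total.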
A more complex example:
$ topuniq --min-perc=1 examples/3-shebangs.txt \
--total-last --label-total="TOTAL: %d unique shebangs" \
--sort-other --label-other="(other %d unique shebangs)"
330 26.7% #!/bin/sh
148 12.0% #!/usr/bin/perl -w
145 11.7% #!/usr/bin/python
143 11.6% #!/usr/bin/perl
117 9.5% (other 35 unique shebangs)
90 7.3% #! /bin/sh
80 6.5% #!/bin/bash
42 3.4% #!/usr/bin/env python
39 3.2% #! /usr/bin/perl -w
25 2.0% #! /usr/bin/python
22 1.8% #! /usr/bin/perl
21 1.7% #! /bin/bash
20 1.6% #!/bin/sh -e
14 1.1% #! /usr/bin/env perl
1236 100.0% TOTAL: 48 unique shebangs
As a drop-in replacement for cmd | sort | uniq -c | sort -nr
(using cat just to show pipeline usage, I know it is redundant)
$ cat examples/2-icon-types.txt | topuniq --no-total --no-perc
25373 png
12128 svg
1290 xpm
685 icon
53 theme
33 cache
1 txt
1 svgz
"Enhancing" previously saved data generated by cmd | sort | uniq -c | sort -nr
(yes, lame and cheesy option name, but I could not think of a better one...)
$ topuniq --enhance-uniq --top=10 examples/4-shebangs-preprocessed.txt
1236 100.0% Total (53)
328 26.5% #!/bin/sh
146 11.8% #!/usr/bin/perl -w
145 11.7% #!/usr/bin/python
141 11.4% #!/usr/bin/perl
90 7.3% #! /bin/sh
80 6.5% #!/bin/bash
42 3.4% #!/usr/bin/env python
39 3.2% #! /usr/bin/perl -w
25 2.0% #! /usr/bin/python
21 1.7% #! /usr/bin/perl
179 14.5% Other (43)
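The idea of starting from already-counted data and folding everything past the top N into a single "Other" line can be sketched as follows. This is an illustration under assumptions, not the real implementation: enhance_counts is a hypothetical name, and the input is assumed to be in uniq -c format (count, one space, text).

```shell
# Hypothetical helper (not part of topuniq): read "COUNT TEXT" lines as
# produced by uniq -c, sort them by count, print the first TOP lines and
# group the remaining ones into a single "Other" line.
enhance_counts() {
    LC_ALL=C sort -nr |
    LC_ALL=C awk -v top="$1" '
        {
            count[NR] = $1
            sub(/^ *[0-9]+ /, "")
            text[NR] = $0
            total += count[NR]
        }
        END {
            if (NR == 0) exit
            printf "%d %.1f%% Total (%d)\n", total, 100, NR
            for (i = 1; i <= NR; i++) {
                if (i <= top)
                    printf "%d %.1f%% %s\n", count[i], 100 * count[i] / total, text[i]
                else {
                    other += count[i]   # accumulate count of the grouped lines
                    nother++            # and how many unique lines they were
                }
            }
            if (nother)
                printf "%d %.1f%% Other (%d)\n", other, 100 * other / total, nother
        }'
}

printf '%s\n' '  5 a' '  3 b' '  1 c' '  1 d' | enhance_counts 2
# 10 100.0% Total (4)
# 5 50.0% a
# 3 30.0% b
# 2 20.0% Other (2)
```

Note that the grouped lines still count towards the total, so the percentages of the printed lines do not change with the value of TOP.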
Performance comparisons with sort | uniq -c | sort -nr
(always using examples/1-man-bash-words.txt, 41277 lines and 235KB; each figure
is the average of 3 runs of 'time' over a 100-iteration loop)
Reference:
sort | uniq -c | sort -nr: real 0m10.042s
Worst case scenario - no min-* or top-* filter
topuniq real 0m14.360s (gawk)
real 0m13.294s (mawk)
Direct comparison - no-op same output as reference
(no, I didn't optimize for that... yet ;)
topuniq --no-total --no-perc real 0m14.201s (gawk)
real 0m13.252s (mawk)
Best case scenario - using min-count > total
(not cheating with --stop-after-*, of course)
topuniq --min-count=3000 real 0m11.797s (gawk)
real 0m11.739s (mawk)
Not bad, not bad at all ;)
... and soon to be hugely improved.
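The exact benchmark loop is not shown in this README, so the following is only a guess at how such a measurement could be reproduced. The function name bench_reference is hypothetical, and timing a shell function assumes a shell where 'time' is a keyword, such as bash.

```shell
# Hypothetical benchmark helper: run the reference pipeline over FILE
# for a given number of iterations (default 100), discarding the output.
bench_reference() {
    file=$1
    iterations=${2:-100}
    i=0
    while [ "$i" -lt "$iterations" ]; do
        sort "$file" | uniq -c | sort -nr > /dev/null
        i=$((i + 1))
    done
}

# Usage, in bash:
#   time bench_reference examples/1-man-bash-words.txt 100
# Repeat 3 times and average the "real" values. To time topuniq instead,
# replace the pipeline inside the loop with: topuniq "$file" > /dev/null
```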
Wishlist:
(A.K.A. "Things I would add if I did not fear bloat and feature creep")
- Optimize for some common option combinations:
--no-perc + no --min-perc : do not calculate percentages at all
--no-other: do not update *['other'] arrays
--no-total + --no-perc + no filters: skip awk entirely ;)
--enhance-uniq: skip last sort -nr
- Add position column, and --no-pos option. Very useful for long lists, but
nothing grep -n or pasting to an editor can't do. Position would be blank
for <other>, even if sorted.
- Add yet another percentage: position %, the same value --top-perc uses to
filter. It answers the question "what does being #15 in this list mean?".
Besides, I already calculate it, so why not show it? ;)
Options: --no-pos-perc / --show-pos-perc
- Add 2 more percentages: cumulative % of lines above (Up) and below (Down).
Useful for analyzing thresholds. --no-perc-up and --no-perc-down to disable
(maybe --no-percsum-*? Anyway, --show-* to enable if not default)
% down would of course also count lines filtered in <other> and not printed.
Example: 40: 145 0.4% 56.2% 43.4% bash
- This is starting to look like a spreadsheet, so I'd better add headers.
Optional (--show-header) and customizable, of course.
- Request this sweet, useful tool to be included in Debian?
Do you think any of these features are worth having? Leave a comment, or ask
for them in "Issues". I would gladly add them in the next release!
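To make the cumulative percentage idea from the wishlist more concrete, here is a rough sketch of how the Up and Down columns could be computed. None of this exists in topuniq: the function name cumulative_perc and the output layout (position, count, %, % up, % down, text) are assumptions modelled on the example line in the wishlist.

```shell
# Hypothetical sketch of the wishlist columns: position, count, percentage,
# cumulative % of the lines above (up) and below (down), then the text.
cumulative_perc() {
    LC_ALL=C sort | uniq -c | LC_ALL=C sort -nr |
    LC_ALL=C awk '
        {
            count[NR] = $1
            sub(/^ *[0-9]+ /, "")
            text[NR] = $0
            total += count[NR]
        }
        END {
            for (i = 1; i <= NR; i++) {
                down = total - up - count[i]    # everything below this line
                printf "%d: %d %.1f%% %.1f%% %.1f%% %s\n", i, count[i],
                    100 * count[i] / total, 100 * up / total,
                    100 * down / total, text[i]
                up += count[i]                  # this line is "above" the next
            }
        }'
}

printf '%s\n' a a a a a b b b c c | cumulative_perc
# 1: 5 50.0% 0.0% 50.0% a
# 2: 3 30.0% 50.0% 20.0% b
# 3: 2 20.0% 80.0% 0.0% c
```

On each line the three percentages add up to 100%, as in the wishlist example.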
Full manual, from --help:
Usage: topuniq [options] [FILE...]
If FILE is not given, read from standard input. For numeric input
options, NUM must be a positive integer (digits only). All options
requiring arguments accept both --option=ARG or --option ARG forms
Options not listed here, if any, are appended to uniq -c
Options:
-h|--help show this page.
--min-count=NUM only print lines with count >= NUM
--min-perc=NUM only print lines with count percent >= NUM%
--top=NUM only print the top NUM lines. 0 = all lines
--top-perc=NUM only print the top NUM% lines
All lines with count less than any of the above options will be
grouped together as a single <other> line, printed last by default.
Setting a minimum higher than total, either count or percentage,
will effectively disable printing the <total> line. For --top-*
options, NUM does not include the total.
--stop-after-top=NUM stop reading after NUM top unique lines
--stop-after-count=NUM stop reading after lines with count < NUM
Unlike --min-* and --top-* options, the above will discard lines,
thus affecting <total>, <other> and all percentages.
--stop-after-top is equivalent to 'head -nNUM' after sort -nr and
before topuniq's enhancements. For both, NUM=0 disables the option
--precision=NUM use NUM decimal digits for the percentages,
default 1
--no-perc do not print percentages
--no-total do not print <total> line
--no-other do not print <other> line
--total-last print <total> line last instead of first
--sort-other print <other> line in sorted position
--label-total=LABEL use LABEL for <total> line, default "Total (%d)"
--label-other=LABEL use LABEL for <other> line, default "Other (%d)"
For the --label-* options, optional "%d" prints the number of unique
lines that <total> or <other> represents
--enhance-uniq consider input as already processed by
sort | uniq -c, skip it and process from there.
Useful for enhancing previously saved data
Environment Variables:
topuniq uses sort and uniq, so the user locale, particularly
LC_COLLATE, affects ordering and unique matching, as well as sort
performance. LC_NUMERIC affects decimal separator when printing
percentages. Use LC_ALL=C for the fastest and locale-independent
results.
Examples:
# Ignore lines with count < 10%, using case-insensitive uniq
topuniq --min-perc=10 --no-other --ignore-case
# Top 20, sorting <others> within the list, and customizing its label
topuniq --top=20 --sort-other --label-other="Other %d unique lines"
# Enhance an existing input, discarding lines with count < 10
topuniq my_uniq_data.txt --enhance-uniq --stop-after-count=10
# Behaves exactly like sort | uniq -c | sort -nr
topuniq --no-total --no-perc
For input data, some examples you may pipe directly to topuniq:
# Words in Bash's manual page
man bash | tr '[:punct:][:blank:]' '\n' | sed '/^$/d'
# Icon types in /usr/share/icons
find /usr/share/icons -type f -name "*.*" | awk -F. '{print $NF}'
# Shebangs from /usr/bin scripts
for f in /usr/bin/*; do [ -f "$f" ] && head -n1 "$f" | grep '^#!'; done
Copyright (C) 2012 Rodrigo Silva (MestreLion) <[email protected]>
License: GPLv3 or later. See <http://www.gnu.org/licenses/gpl.html>