README

Sort input by count, printing totals and percentages.

Think of it as sort | uniq -c | sort -nr on steroids ;)

Sample output:

$ topuniq --min-count=100 examples/2-icon-types.txt
  39564 100.0% Total (8)
  25373  64.1% png
  12128  30.7% svg
   1290   3.3% xpm
    685   1.7% icon
     88   0.2% Other (4)
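
For the curious, the core idea can be sketched in a few lines of shell and awk.
This is only an illustration under simple assumptions, not topuniq's actual
code: the function name summarize and its single MIN_COUNT argument are made
up for this example.

```shell
# summarize: hypothetical helper, NOT part of topuniq.
# Usage: summarize [MIN_COUNT] < lines
# Counts unique lines, prints a total line, each line with count >= MIN_COUNT
# and its percentage, then folds the remaining lines into a single "Other".
summarize() {
    sort | uniq -c | sort -nr |
    awk -v min="${1:-0}" '
        {
            count = $1 + 0                 # first field is the count from uniq -c
            sub(/^ *[0-9]+ /, "")          # strip the count, keep the original line
            total += count
            unique++
            if (count >= min + 0) {
                n++
                counts[n] = count
                texts[n]  = $0
            } else {
                other_count += count       # sum of the counts being folded
                other_lines++              # number of unique lines folded
            }
        }
        END {
            if (!total) exit               # empty input: print nothing
            printf "%7d %5.1f%% Total (%d)\n", total, 100, unique
            for (i = 1; i <= n; i++)
                printf "%7d %5.1f%% %s\n", counts[i], 100 * counts[i] / total, texts[i]
            if (other_lines)
                printf "%7d %5.1f%% Other (%d)\n", other_count, 100 * other_count / total, other_lines
        }'
}
```

Fed the same data, summarize 100 should print the same shape of report as the
sample above. Everything else topuniq does (labels, sorting <other>,
--top-*, --stop-after-*) is left out of this sketch.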


A more complex example:

$ topuniq --min-perc=1 examples/3-shebangs.txt \
          --total-last --label-total="TOTAL: %d unique shebangs" \
          --sort-other --label-other="(other %d unique shebangs)"
    330  26.7% #!/bin/sh
    148  12.0% #!/usr/bin/perl -w
    145  11.7% #!/usr/bin/python
    143  11.6% #!/usr/bin/perl
    117   9.5% (other 35 unique shebangs)
     90   7.3% #! /bin/sh
     80   6.5% #!/bin/bash
     42   3.4% #!/usr/bin/env python
     39   3.2% #! /usr/bin/perl -w
     25   2.0% #! /usr/bin/python
     22   1.8% #! /usr/bin/perl
     21   1.7% #! /bin/bash
     20   1.6% #!/bin/sh -e
     14   1.1% #! /usr/bin/env perl
   1236 100.0% TOTAL: 48 unique shebangs


As a drop-in replacement for cmd | sort | uniq -c | sort -nr
(using cat just to show pipeline usage, I know it is redundant)

$ cat examples/2-icon-types.txt | topuniq --no-total --no-perc
  25373 png
  12128 svg
   1290 xpm
    685 icon
     53 theme
     33 cache
      1 txt
      1 svgz


"Enhancing" previously saved data generated by cmd | sort | uniq -c | sort -nr
(yes, lame and cheesy option name, but I could not think of a better one...)

$ topuniq --enhance-uniq --top=10 examples/4-shebangs-preprocessed.txt
   1236 100.0% Total (53)
    328  26.5% #!/bin/sh
    146  11.8% #!/usr/bin/perl -w
    145  11.7% #!/usr/bin/python
    141  11.4% #!/usr/bin/perl
     90   7.3% #! /bin/sh
     80   6.5% #!/bin/bash
     42   3.4% #!/usr/bin/env python
     39   3.2% #! /usr/bin/perl -w
     25   2.0% #! /usr/bin/python
     21   1.7% #! /usr/bin/perl
    179  14.5% Other (43)
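
What "enhancing" means for the counting step can be sketched as below. The
function count_lines and its 0/1 argument are hypothetical, used only to
illustrate the idea that pre-counted input skips sort | uniq -c; this is not
how topuniq is written internally.

```shell
# count_lines: hypothetical helper, NOT part of topuniq.
# Usage: count_lines ENHANCE < input
# ENHANCE=1: input is assumed to be "COUNT TEXT" lines already produced by
#            uniq -c, so only the final numeric sort is applied.
# ENHANCE=0: input is raw lines, so the whole pipeline runs.
count_lines() {
    if [ "${1:-0}" -eq 1 ]; then
        sort -nr
    else
        sort | uniq -c | sort -nr
    fi
}
```

Either way the next stage receives the same "COUNT TEXT" lines, so totals and
percentages can be computed without caring where the counts came from.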


Performance comparisons with sort | uniq -c | sort -nr
(always using the 41277-line, 235 KB examples/1-man-bash-words.txt; average of
3 runs of 'time', each over a 100-iteration loop)

Reference:
sort | uniq -c | sort -nr:                         real	0m10.042s

Worst case scenario - no min-* or top-* filter
topuniq                                            real	0m14.360s (gawk)
                                                   real	0m13.294s (mawk)

Direct comparison - no enhancements, same output as reference
(no, I didn't optimize for that... yet ;)
topuniq --no-total --no-perc                       real	0m14.201s (gawk)
                                                   real	0m13.252s (mawk)

Best case scenario - using min-count > total
(not cheating with --stop-after-*, of course)
topuniq --min-count=3000                           real	0m11.797s (gawk)
                                                   real	0m11.739s (mawk)

Not bad, not bad at all ;)
... and soon to be hugely improved.
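
To reproduce numbers like the ones above, a loop along these lines should do.
It is a sketch, not the exact script used for the figures: the helper names
bench and reference are made up here, and 'time' is assumed to be the bash
keyword (so it can time a shell function).

```shell
# bench: hypothetical helper, NOT shipped with topuniq.
# Usage: bench ITERATIONS FILE COMMAND [ARGS...]
# Runs COMMAND with FILE on stdin ITERATIONS times, discarding its output.
# Stops and returns 1 as soon as COMMAND fails.
bench() {
    iterations=$1 file=$2
    shift 2
    i=0
    while [ "$i" -lt "$iterations" ]; do
        "$@" < "$file" > /dev/null || return 1
        i=$((i + 1))
    done
}

# The reference pipeline, wrapped in a function so it can be passed to bench
reference() { sort | uniq -c | sort -nr; }

# Run each of these 3 times and average the "real" figure:
# time bench 100 examples/1-man-bash-words.txt reference
# time bench 100 examples/1-man-bash-words.txt topuniq
# time bench 100 examples/1-man-bash-words.txt topuniq --min-count=3000
```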

Wishlist:
(A.K.A. "Things I would add if I did not fear bloat and feature-creep")

- Optimize for some common option combinations:
	--no-perc + no --min-perc : do not calculate percentages at all
	--no-other: do not update *['other'] arrays
	--no-total + --no-perc + no filters: skip awk entirely ;)
	--enhance-uniq: skip last sort -nr

- Add position column, and --no-pos option. Very useful for long lists, but
  nothing grep -n or pasting to an editor can't do. Position would be blank
  for <other>, even if sorted.

- Add yet another percentage: position %, the same value --top-perc uses to
  filter. It answers the question "what does being #15 in this list mean?".
  Besides, I already calculate it, so why not show it? ;)
  Options: --no-pos-perc / --show-pos-perc

- Add 2 more percentages: cumulative % of lines above (Up) and below (Down).
  Useful for analyzing thresholds. --no-perc-up and --no-perc-down to disable
  (maybe --no-percsum-*? Anyway, --show-* to enable if not default)
  % down would of course also count lines filtered in <other> and not printed.
  Example:  40:    145   0.4%  56.2%  43.4% bash

- This is starting to look like a spreadsheet, so I'd better add headers.
  Optional (--show-header) and customizable, of course.

- Request this sweet, useful tool to be included in Debian?

Think any of these features are worth having? Leave a comment, or ask for
them in "Issues". I would gladly add them in the next release!
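
To make the cumulative Up/Down idea from the wishlist concrete, here is one
way it could be computed. Nothing like this exists in topuniq yet: the
function name, the column layout and the reading of "Up" and "Down" as the
share of lines strictly above and below the current one are all assumptions
taken from the example line in the wishlist.

```shell
# cumulative_perc: hypothetical helper illustrating the wishlist idea only.
# Reads "COUNT TEXT" lines sorted by count, descending (uniq -c | sort -nr).
# Prints: POSITION: COUNT PERC% UP% DOWN% TEXT
cumulative_perc() {
    awk '
        {
            counts[NR] = $1 + 0            # count column from uniq -c
            sub(/^ *[0-9]+ /, "")          # keep only the text
            texts[NR] = $0
            total += counts[NR]
        }
        END {
            above = 0                      # sum of counts in earlier positions
            for (i = 1; i <= NR; i++) {
                below = total - above - counts[i]
                printf "%3d: %6d %5.1f%% %5.1f%% %5.1f%% %s\n", i, counts[i],
                       100 * counts[i] / total,
                       100 * above / total,
                       100 * below / total,
                       texts[i]
                above += counts[i]
            }
        }'
}
```

For each line the three percentages add up to 100, as in the wishlist
example (0.4 + 56.2 + 43.4). This sketch does no filtering, so the handling
of lines folded into <other> is not shown.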


Full manual, from --help:

Usage: topuniq [options] [FILE...]

If FILE is not given, read from standard input. For numeric input
options, NUM must be a positive integer (digits only). All options
requiring arguments accept both --option=ARG and --option ARG forms.
Options not listed here, if any, are passed on to uniq -c.

Options:
  -h|--help              show this page.

  --min-count=NUM        only print lines with count >= NUM
  --min-perc=NUM         only print lines with count percent >= NUM%
  --top=NUM              only print the top NUM lines. 0 = all lines
  --top-perc=NUM         only print the top NUM% lines

  All lines with count less than any of the above options will be
  grouped together as a single <other> line, printed last by default.
  Setting a minimum higher than total, either count or percentage,
  will effectively disable printing the <total> line. For --top-*
  options, NUM does not include the total.

  --stop-after-top=NUM   stop reading after NUM top unique lines
  --stop-after-count=NUM stop reading after lines with count < NUM

  Unlike --min-* and --top-* options, the above will discard lines,
  thus affecting <total>, <other> and all percentages.
  --stop-after-top is equivalent to 'head -nNUM' after sort -nr and
  before topuniq's enhancements. For both, NUM=0 disables the option

  --precision=NUM        use NUM decimal digits for the percentages,
                         default 1

  --no-perc              do not print percentages
  --no-total             do not print <total> line
  --no-other             do not print <other> line

  --total-last           print <total> line last instead of first
  --sort-other           print <other> line in sorted position

  --label-total=LABEL    use LABEL for <total> line, default "Total (%d)"
  --label-other=LABEL    use LABEL for <other> line, default "Other (%d)"

  For the --label-* options, optional "%d" prints the number of unique
  lines that <total> or <other> represents

  --enhance-uniq         consider input as already processed by
                         sort | uniq -c, skip it and process from there.
                         Useful for enhancing previously saved data

Environment Variables:

  topuniq uses sort and uniq, so the user locale, particularly
  LC_COLLATE, affects ordering and unique matching, as well as sort
  performance. LC_NUMERIC affects decimal separator when printing
  percentages. Use LC_ALL=C for the fastest and locale-independent
  results.

Examples:

# Ignore lines with count < 10%, using case-insensitive uniq
topuniq --min-perc=10 --no-other --ignore-case

# Top 20, sorting <other> within the list, and customizing its label
topuniq --top=20 --sort-other --label-other="Other %d unique lines"

# Enhance an existing input, discarding lines with count < 10
topuniq my_uniq_data.txt --enhance-uniq --stop-after-count=10

# Behaves exactly like sort | uniq -c | sort -nr
topuniq --no-total --no-perc

For input data, some examples you may pipe directly to topuniq:

# Words in Bash's manual page
man bash | tr '[:punct:][:blank:]' '\n' | sed '/^$/d'

# Icon types in /usr/share/icons
find /usr/share/icons -type f -name "*.*" | awk -F. '{print $NF}'

# Shebangs from /usr/bin scripts
for f in /usr/bin/*; do [ -f "$f" ] && head -n1 "$f" | grep '^#!'; done

Copyright (C) 2012 Rodrigo Silva (MestreLion) <linux@rodrigosilva.com>
License: GPLv3 or later. See <http://www.gnu.org/licenses/gpl.html>