Sometimes it can be very informative to extract the most often seen items from one or more columns of a very large, possibly compressed-on-storage input stream - essentially an incomplete histogram. These could be log error or warning messages with time stamps stripped, trending posts on some social media site, or any other situation where a huge fraction of activity is dominated by a few hundred or thousand popular items (sometimes called "heavy hitters"). This often shows up in anything with a Zipfian distribution.
Full hash table based histograms can require a great deal of memory. If you can be sure your deployment has enough uncontended, low-latency DIMM memory to support the full histogram operation, they are usually still the fastest way. An only slightly slower way taking dramatically less space is the Count-Min Sketch, which is what oft uses via the adix/amoft (approximately most oft) module.
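To make the idea concrete, here is a minimal toy sketch of the count-min core in Nim. All names (Cms, initCms, put, count) are made up for illustration; this is NOT the actual adix/amoft code, which also maintains the top-N bookkeeping and its own sizing/salting. The structure is d salted rows of w counters: an update increments one counter per row; a query takes the min across rows.

import std/[hashes, strutils, tables]

type Cms = object                        # d rows x w counters
  w: int
  rows: seq[seq[int]]
  salts: seq[int]

proc initCms(w, d: int): Cms =
  result.w = w
  for s in 1 .. d:                       # arbitrary distinct salts
    result.rows.add newSeq[int](w)
    result.salts.add s * 0x9E3779B9

proc ix(c: Cms, row: int, key: string): int =
  let h = !$(hash(key) !& c.salts[row])  # re-mix the hash per row
  (h and int.high) mod c.w               # clear sign bit, then fold

proc put(c: var Cms, key: string) =      # one increment per row
  for r in 0 ..< c.rows.len: c.rows[r][c.ix(r, key)] += 1

proc count(c: Cms, key: string): int =   # min across rows
  result = int.high
  for r in 0 ..< c.rows.len:
    result = min(result, c.rows[r][c.ix(r, key)])

when isMainModule:                       # compare with an exact count
  var cms = initCms(w = 512, d = 4)
  var exact = initCountTable[string]()
  for word in "the quick fox and the lazy dog and the end".splitWhitespace:
    cms.put word
    exact.inc word
  for k, n in exact.pairs:
    assert cms.count(k) >= n             # error is strictly one-sided

oft wraps this idea (plus most-often tracking) behind a small command-line interface: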
oft [optional-params] [specs: string...]
Write most often seen N keys in various columns to outFile's. Specs are
<n>[,<keyCol>(0)[,outFile(stdout)]]. ColNos are Py-like 0-origin,signed.
Algorithm is approximate fast one-pass over mmap|stream input. E.g., to write
most frequent final column to stdout do: oft 10,-1. (For exact, see lfreq,
possibly with column splitting to FIFOs).
-i=, --input= string "/dev/stdin" input data path
-d=, --delim= string " " delimiting (repeats=>any num; "white")
-m=, --mxCol= int 0 max columns in input to parse
-e=, --errate= float 0.005 size tables to make err nSamp*this
-c=, --cover= float 0.98 enough tables to make coverage this
-s=, --salts= ints {} override random salts
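For intuition about -e/--errate and -c/--cover, the textbook Count-Min sizing rule is a useful model (treat it as hypothetical - adix's actual sizing may differ): with epsilon = errate and delta = 1 - cover, a per-table width of ceil(e/epsilon) counters keeps over-counts below errate*nSamp, and ceil(ln(1/delta)) tables make that bound hold with probability >= cover.

import std/math                 # textbook CMS sizing; hypothetical model
let errate = 0.005              # --errate: over-count <= errate*nSamp
let cover  = 0.98               # --cover : probability the bound holds
let width  = int(ceil(math.E / errate))          # = 544 counters/table
let depth  = int(ceil(ln(1.0 / (1.0 - cover))))  # = 4 tables
echo width, " x ", depth, "; ~", width*depth*8, " bytes at 8 B/counter"

At roughly 17 KiB under this model, the whole sketch is cache-resident, which foreshadows the timing notes below.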
A classic example along these lines is word histogramming a la Knuth-McIlroy { whose good-natured but competitive framing has spawned many follow-on caffeine-fueled rage-optimization comparisons, even though Knuth was clearly "set up", at least in how "history"/culture has "taken" the contest, given the literal in-text caveats McIlroy made at the time }. Someone should actually do a meta-page linking to the many solutions, discussions, and various competitions. (Well, someone probably has.. it is a real sub-culture. Feel free to PR a link.)
$ curl https://www.gutenberg.org/files/98/98-0.txt > to2c.txt
$ tr A-Z a-z < to2c.txt | tr -sc a-z \\n >/dev/shm/to2c.wrd
$ oft 10 </dev/shm/to2c.wrd
1957 that
1990 i
2011 his
2082 it
2664 in
3016 a
3653 to
4142 of
5067 and
8242 the
$ lfreq -n10 -d98304 </dev/shm/to2c.wrd # from adix/util
1956 that
1990 i
2011 his
2082 it
2664 in
3016 a
3653 to
4142 of
5067 and
8242 the
Notice how the results are nearly the same - only an over-count by 1 on the 10th most seen word, "that". Such over-counting is the only way this particular approximation fails: hash collisions can only inflate a counter, never deflate it, so the min across tables can never fall below the true count. (There are optional parameters to tune the accuracy of the count-min sketch.)
It is a bit slower if you don't really need all the memory savings, though:
$ tim 'lfreq -n10 -d98304 </dev/shm/to2c.wrd>/dev/null' 'oft 10 </dev/shm/to2c.wrd>/dev/null'
(6.189 +- 0.019)e-03 lfreq -n10 -d98304 </dev/shm/to2c.wrd>/dev/null
0.014735 +- 0.000012 oft 10 </dev/shm/to2c.wrd>/dev/null
So, ~2.4x slower with the default accuracy settings. Hash tables lean hard on random access (or on branch prediction / speculative execution / prefetches leading to later fetches). Sometimes a "cache cliff" { a large jump in a latency diagram, as generated by memlat } can be >>2.4x, though. E.g., it is often 8X from CPU cache to DIMMs. In this case, seeing that benefit would need a larger vocabulary than Dickens uses in one book. So, this can be an interesting data structure even in settings that do not exhaust DIMM memory. E.g., it may help a working set fit in per-core L2 cache in parallel settings, since shared L3 caches face more contention.[1]
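To put hypothetical numbers on that: under the textbook sizing sketched above, 4 tables of 544 8-byte counters is only ~17 KiB, comfortably L2-resident on most CPUs, while an exact table like lfreq's -d98304, at (say) 16 bytes per slot plus key storage, runs to a few MiB - L3 or DIMM territory once vocabularies grow. At an 8X cache cliff, the ~2.4x deficit measured here could well invert on much-bigger-than-Dickens key sets.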
Footnotes

[1] Like their cousin Bloom filters before them { saving O(~2..4X) space vs. fingerprint tables at a cost of O(6..12X) latency, aka "must-have predictability" }, count-min sketches are not as compelling as they may sound at first. See https://github.com/c-blake/adix/blob/master/tests/bl.nim