A Clojure implementation of a rolling Rabin hash and a content-defined chunker (CDC).
Bring up a REPL:
lein with-profile +dev repl
Use the CDC notebook to try chunking a dataset:
(comment
  (def chunks
    (atom nil))

  (load-dataset! "data/natural_images" chunks)

  (let [rows->long     (fn [r] (into {} (map (fn [[k v]] [k (long v)])) r))
        stats-all      (chunk-ds->agg-stats @chunks)
        stats-cdc      (-> @chunks (ds/unique-by-column :sha256) chunk-ds->agg-stats)
        stats-per-file (->> @chunks
                            (rd/group-by-column-agg :file {:block-count (rd/count-distinct :sha256)})
                            (rd/aggregate {:avg-blocks-per-file (rd/mean :block-count)}))
        ;; maps containing aggregate vals
        all            (first (map rows->long (ds/rows stats-all)))
        cdc            (first (map rows->long (ds/rows stats-cdc)))
        blocks         (first (map rows->long (ds/rows stats-per-file)))
        reduced-bytes  (- (:total-bytes all)
                          (:total-bytes cdc))]
    {:all    all
     :cdc    cdc
     :blocks blocks
     :diff   {:reduced-bytes   reduced-bytes
              :reduced-percent (->> (:total-bytes all)
                                    (/ reduced-bytes)
                                    (* 100)
                                    double)}}))
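@chunks is a tech.ml.dataset holding all the chunks from the loaded data. Each chunk also gets a SHA-256 hash, so that duplicate blocks can be identified reliably. The let form above returns the aggregate statistics, before (:all) and after (:cdc) removing duplicate blocks: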
{:all {:total-bytes 359403192, :total-blocks 36628, :avg-block-size-bytes 9812},
:cdc {:total-bytes 179670613, :total-blocks 18311, :avg-block-size-bytes 9812},
:blocks {:avg-blocks-per-file 2},
:diff {:reduced-bytes 179732579, :reduced-percent 50.00862068025261}}
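To illustrate how hashing each chunk lets duplicates be dropped, here is a minimal sketch in plain Clojure. It does not use tech.ml.dataset, and `sha256-hex` and `dedup-summary` are hypothetical names for this example only, not functions from this project.

```clojure
(import 'java.security.MessageDigest)

(defn sha256-hex
  "Hex-encoded SHA-256 digest of a byte array."
  [^bytes bs]
  (let [digest (.digest (MessageDigest/getInstance "SHA-256") bs)]
    (apply str (map #(format "%02x" (bit-and % 0xff)) digest))))

(defn dedup-summary
  "Hypothetical helper: compares all chunks against one chunk per
  distinct SHA-256, mirroring the :all and :cdc maps shown above."
  [chunks]
  (let [rows     (map (fn [^bytes c] {:sha256 (sha256-hex c) :size (alength c)})
                      chunks)
        ;; keep a single row per distinct hash
        unique   (vals (into {} (map (juxt :sha256 identity)) rows))
        bytes-of (fn [rs] (reduce + (map :size rs)))]
    {:all {:total-bytes (bytes-of rows)   :total-blocks (count rows)}
     :cdc {:total-bytes (bytes-of unique) :total-blocks (count unique)}}))

;; Three 4-byte chunks, two of them identical:
(dedup-summary [(.getBytes "aaaa") (.getBytes "bbbb") (.getBytes "aaaa")])
;; => {:all {:total-bytes 12, :total-blocks 3},
;;     :cdc {:total-bytes 8, :total-blocks 2}}
```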
Where Rabin parameters are overridden, the overrides are shown; otherwise assume the defaults from clj-rabin.hash/default-ctx.
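For readers unfamiliar with what such parameters control, the sketch below shows content-defined chunking in general terms. It is not this library's implementation: it swaps in a simple polynomial (Rabin-Karp style) rolling hash instead of a true Rabin fingerprint, and every key in `sketch-ctx` is a made-up name for this example, not a key of clj-rabin.hash/default-ctx.

```clojure
(def sketch-ctx
  "Hypothetical parameters for this sketch only."
  {:window   16          ; bytes covered by the rolling hash
   :prime    31          ; multiplier of the polynomial hash
   :modulus  1000000007  ; keeps the hash within long range
   :mask     0x3f        ; cut when (hash & mask) == 0
   :min-size 16          ; never cut before this many bytes
   :max-size 256})       ; always cut at this many bytes

(defn chunk-boundaries
  "Returns a vector of exclusive end offsets, one per chunk of `data`."
  [^bytes data {:keys [window prime modulus mask min-size max-size]}]
  (let [n       (alength data)
        ;; prime^(window-1) mod modulus, used to remove the outgoing byte
        top-pow (reduce (fn [acc _] (mod (* acc prime) modulus))
                        1
                        (range (dec window)))
        ub      (fn [i] (bit-and (aget data i) 0xff))]
    (loop [i 0, h 0, start 0, cuts []]
      (if (= i n)
        ;; emit the trailing partial chunk, if any
        (if (< start n) (conj cuts n) cuts)
        (let [out  (if (>= i window) (ub (- i window)) 0)
              ;; slide the window: drop the oldest byte, append the newest
              h    (mod (+ (* (mod (- h (* out top-pow)) modulus) prime)
                           (ub i))
                        modulus)
              len  (- (inc i) start)
              cut? (or (>= len max-size)
                       (and (>= len min-size)
                            (>= (inc i) window)
                            (zero? (bit-and h mask))))]
          (if cut?
            (recur (inc i) h (inc i) (conj cuts (inc i)))
            (recur (inc i) h start cuts)))))))

(defn chunk-sizes
  "Chunk lengths derived from the boundaries."
  [data ctx]
  (let [cuts (chunk-boundaries data ctx)]
    (mapv - cuts (cons 0 cuts))))
```

Because a boundary depends only on the bytes inside the window (subject to the size limits), identical content tends to produce identical chunks even when it appears at a different offset, which is what makes deduplication by chunk hash possible.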
Overall: terrible ratios on audio (less than 0.2% of bytes saved in both datasets below)
Audio codec info:
Input #0, mp3, from 'MP3-Example/Blues/Blues-TRADWSG128F4259317.mp3':
Duration: 00:00:30.04, start: 0.025057, bitrate: 96 kb/s
Stream #0:0: Audio: mp3, 44100 Hz, stereo, fltp, 96 kb/s
Metadata:
encoder : LAME3.99r
Side data:
replaygain: track gain - -5.900000, track peak - unknown, album gain - unknown, album peak - unknown,
{:all {:total-bytes 544824272, :total-blocks 977662, :avg-block-size-bytes 557},
:cdc {:total-bytes 543861899, :total-blocks 43428, :avg-block-size-bytes 12523},
:blocks {:avg-blocks-per-file 651},
:diff {:reduced-bytes 962373, :reduced-percent 0.1766391567811795}}
Audio codec info:
Input #0, wav, from 'raga/asavari02.wav':
Duration: 00:03:46.82, bitrate: 705 kb/s
Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 44100 Hz, 1 channels, s16, 705 kb/s
{:all {:total-bytes 1105687312, :total-blocks 73031, :avg-block-size-bytes 15139},
:cdc {:total-bytes 1105687295, :total-blocks 73014, :avg-block-size-bytes 15143},
:blocks {:avg-blocks-per-file 890},
:diff {:reduced-bytes 17, :reduced-percent 1.537505207439696E-6}}
Overall: awesome ratios on images (roughly 50% to 58% of bytes saved)
Natural images (the data/natural_images dataset from the notebook example above)
{:all {:total-bytes 359403192, :total-blocks 36628, :avg-block-size-bytes 9812},
:cdc {:total-bytes 179670613, :total-blocks 18311, :avg-block-size-bytes 9812},
:blocks {:avg-blocks-per-file 2},
:diff {:reduced-bytes 179732579, :reduced-percent 50.00862068025261}}
Ripe and unripe tomatoes (jpeg)
{:all {:total-bytes 122864035, :total-blocks 151613, :avg-block-size-bytes 810},
:cdc {:total-bytes 52075246, :total-blocks 29899, :avg-block-size-bytes 1741},
:blocks {:avg-blocks-per-file 428},
:diff {:reduced-bytes 70788789, :reduced-percent 57.6155495788495}}
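As a sanity check, the :diff values can be recomputed from the reported :total-bytes figures alone. The sketch below uses the same arithmetic as the notebook form; `diff-stats` and `reported` are hypothetical names for this example.

```clojure
(defn diff-stats
  "Recomputes :reduced-bytes and :reduced-percent from two stats maps."
  [all cdc]
  (let [total   (:total-bytes all)
        reduced (- total (:total-bytes cdc))]
    {:reduced-bytes   reduced
     :reduced-percent (double (* 100 (/ reduced total)))}))

(def reported
  "Total bytes [before after] deduplication, copied from the results above."
  {:mp3      [544824272 543861899]
   :wav      [1105687312 1105687295]
   :images   [359403192 179670613]
   :tomatoes [122864035 52075246]})

(defn recompute-all
  "Applies diff-stats to every reported dataset."
  []
  (into {}
        (map (fn [[k [before after]]]
               [k (diff-stats {:total-bytes before} {:total-bytes after})]))
        reported))

;; (:reduced-bytes (:wav (recompute-all))) is 17, matching the WAV result.
```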