This is a pure Nim implementation of the paper:

> H. Inoue, "Multi-step LRU: SIMD-based Cache Replacement for Lower Overhead and Higher Precision," 2021 IEEE International Conference on Big Data (Big Data), Orlando, FL, USA, 2021, pp. 174-180, doi: 10.1109/BigData52589.2021.9671363.
- This implementation requires an x64 CPU with AVX2 support. Support for other SIMD variants (AArch64, RISC-V, AVX-512) may follow once I find a suitable dev and test environment, so you may decide to do it yourself :) There is not much SIMD-intrinsics code in here; see the intrinsic operations at the end of the paper and `src/util_simd_avx2.nim`.
- At the moment, the supported `Cache[K,V]` key/value types are `int64` or `uint64` only.
- A zero key = `0` represents a deleted key, a.k.a. an empty slot. Thus a zero key is not a valid cache key (see the usage sketch after this list).
- The concurrent version `-d:multi` uses a naive spin-lock. This could be done more efficiently; aligning the lock byte might prevent false sharing on certain platforms in multithreaded mode (see the aligned-lock sketch after this list).
- `-d:multi --mm:atomicArc` enables threading support. `atomicArc` must be set.
- `-d:increaseCacheUsage` activates a usage optimization not found in H. Inoue's paper. It does not allow 'gaps' to appear inside vector segments. Gaps might occur when a member is deleted from the cache via `unset( key )`. See the remarks and the example in `./src/sizeOptimization.include.nim` to understand the idea behind it.
- `-d:debug` produces step-by-step debug info.
- `-d:profile` produces a detailed measurement of operations on the `Cache[K,V]`.
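A minimal usage sketch of the points above. Apart from `unset( key )`, every name here (`multiStepLRU`, `newCache`, `put`, `get`) is an assumption for illustration; check the tests under `./test` for the actual API:

```nim
import multiStepLRU   # module name assumed from the repository layout

var cache = newCache[int64, int64]( 1024 )   # hypothetical constructor and capacity

cache.put( 42'i64, 4711'i64 )   # any non-zero int64/uint64 is a valid key
discard cache.get( 42'i64 )     # a hit promotes the entry towards the MRU position

cache.unset( 42'i64 )           # deletion writes the zero-key into the slot
# cache.put( 0'i64, 1'i64 )     # invalid: the zero-key 0 marks a deleted/empty slot
```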
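Regarding the false-sharing remark: below is a sketch (not this repository's code) of a cache-line-aligned spin-lock using Nim's `std/atomics`; the `PaddedLock` type and proc names are illustrative. Aligning the flag to 64 bytes gives it a cache line of its own, so spinning threads do not invalidate neighbouring data:

```nim
import std/atomics

type
  PaddedLock = object
    flag {.align(64).}: AtomicFlag   # 64-byte alignment puts the lock byte on its own cache line

proc acquire( lck: var PaddedLock ) =
  while lck.flag.testAndSet( moAcquire ):   # naive test-and-set spin, as -d:multi does
    cpuRelax()                              # CPU pause-hint while waiting

proc release( lck: var PaddedLock ) =
  lck.flag.clear( moRelease )
```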
```sh
git clone https://github.com/Asc2011/Multi-step-LRU.git
cd ./Multi-step-LRU
nim c -d:release ./test/test_multiStepLRU.nim
nim c -d:release -d:multi --mm:atomicArc ./test/test_multiStepLRU.nim
nim c -d:release -d:increaseCacheUsage -d:multi --mm:atomicArc ./test/test_multiStepLRU.nim
```
I cannot yet reproduce the excellent throughput/hit-ratio numbers from H. Inoue's paper. Since Multi-step LRU requires neither a lock-free hash map nor a lock-free doubly-linked list, a 'fair' comparison would have to stay in single-threaded mode.
From a quick-and-dirty comparison against LRUCache from github.com/jackhftang/lrucache I can confirm a throughput gain of around 3-to-5x in single-threaded mode (see `test/test_lrucache.nim`; a measurement sketch follows below).
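Such a measurement boils down to timing a tight loop of cache operations. This is only a sketch and the `newCache`/`put` names are assumptions; the real benchmark lives in `test/test_lrucache.nim`:

```nim
import std/[times, strformat]
# import multiStepLRU             # module name assumed

const Ops = 1_000_000

var cache = newCache[int64, int64]( 4096 )   # hypothetical constructor
let start = epochTime()
for i in 1 .. Ops:
  let key = int64( i mod 8191 + 1 )   # keys start at 1; the zero-key 0 is reserved
  cache.put( key, int64(i) )          # hypothetical proc name
let secs = epochTime() - start
echo &"{Ops.float / secs / 1e6:.2f} M ops/sec"
```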
- Wait for github.com/guzba/nimsimd version 1.2.11 to accept my pull request, or parts of it.
- Find a cache trace and test/compare against the trace data.
- Generate a Zipfian distribution and use it during testing (see the sketch after this list).
- Bench against a classic single-threaded LRU implementation -> github.com/jackhftang/lrucache. Partly done, see `test/test_lrucache.nim`.
- Maybe create some Markdeep schema.
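For the Zipfian TODO item, here is a self-contained sketch (not part of this repository) that draws keys via the inverse-CDF method; `zipfKeys` and its parameters are illustrative:

```nim
import std/[random, math, algorithm]

proc zipfKeys( n, count: int; s = 1.0 ): seq[int64] =
  ## Draws `count` keys from {1..n} with P(k) proportional to 1/k^s.
  var cdf = newSeq[float]( n )
  var acc = 0.0
  for k in 1 .. n:
    acc += 1.0 / pow( float(k), s )
    cdf[k-1] = acc
  for _ in 1 .. count:
    let u   = rand( acc )           # uniform draw in [0, acc]
    let idx = cdf.lowerBound( u )   # first index with cdf[idx] >= u
    result.add int64( idx + 1 )     # keys start at 1; the zero-key 0 is reserved

randomize()
echo zipfKeys( 100, 10 )   # ten keys, heavily skewed towards the small ranks
```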