Jemalloc Mempool and Adaptation for CPU HASHTABLE #4154
1 Current Status
Code: https://github.com/pytorch/FBGEMM/tree/main/fbgemm_gpu/src/dram_kv_embedding_cache
In the current implementation, the value structure of the CPU hashtable uses std::vector, which directly requests memory from the system via std::allocator. Without memory pool management, frequent allocations incur significant system call overhead.
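As a simplified illustration of that pattern (not the actual FBGEMM code), each inserted row owns its own std::vector, so every insert goes through std::allocator and may reach the system allocator:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

using EmbeddingRow = std::vector<float>;

int main() {
  std::unordered_map<int64_t, EmbeddingRow> shard;
  for (int64_t id = 0; id < 1000; ++id) {
    // every inserted row triggers a fresh heap allocation through std::allocator
    shard.emplace(id, EmbeddingRow(128, 0.0f));
  }
}
```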
2 Proposed Solution
2.1 Overview
Given that the memory requirements for embedding allocations are known upfront, we can leverage this information to customize bin sizes more appropriately.
Our scenario (embedding hashtable) does not involve frequent memory deallocations (only during ID eviction) but requires frequent allocations. Freed memory can be directly reused for new allocations, minimizing fragmentation concerns.
Compared to jemalloc, our approach avoids complex Buddy/Slob algorithms, memory block merging, and multi-scale slot designs. The core idea is:
Implement dedicated memory pool management for each shard of the fixed-emb_dim hashtables (including merged tables).
Reuse the existing lock mechanisms from SynchronizedShardedMap for concurrency control.
Create a "lock-free per-table memory pool" design that minimizes code invasiveness.
Design features:
Leverages existing SharedMutexWritePriority from SynchronizedShardedMap
Memory pool operations share the same critical section as hashtable insertions (see the sketch below)
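A minimal sketch of that integration, assuming the per-shard pool sits next to the shard's map and is only touched while the shard's write lock is held; class and member names are illustrative, not the actual FBGEMM code:

```cpp
#include <algorithm>
#include <cstdint>
#include <memory_resource>
#include <shared_mutex>
#include <unordered_map>
#include <vector>

// Illustrative shard: the pool lives next to the map and is only used inside
// the shard's write-locked critical section, so it needs no locking of its own.
struct Shard {
  std::shared_mutex mutex;                      // stand-in for SharedMutexWritePriority
  std::pmr::unsynchronized_pool_resource pool;  // stand-in for the per-table memory pool
  std::unordered_map<int64_t, float*> rows;

  void insert(int64_t id, const std::vector<float>& value) {
    std::unique_lock lock(mutex);  // same critical section as the hashtable insert
    auto* block = static_cast<float*>(
        pool.allocate(value.size() * sizeof(float), alignof(float)));
    std::copy(value.begin(), value.end(), block);
    rows[id] = block;  // eviction / old-block release omitted in this sketch
  }
};
```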
2.2 FixedBlockPool Design
Three-Level Structure Model
Core Data Structures
Workflow
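A hedged sketch of what the core data structures and the allocate/deallocate workflow could look like for one fixed block size: chunks are carved into equal blocks, and freed blocks are threaded onto an intrusive free list for direct reuse. All names and layout choices here are assumptions, not the actual FixedBlockPool implementation.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Fixed-size block pool: chunks hold equally sized blocks; freed blocks are
// pushed onto an intrusive free list and reused before any new chunk is added.
class FixedBlockPool {
 public:
  FixedBlockPool(std::size_t block_size, std::size_t blocks_per_chunk)
      : block_size_(block_size < sizeof(void*) ? sizeof(void*) : block_size),
        blocks_per_chunk_(blocks_per_chunk) {}

  ~FixedBlockPool() {
    for (void* chunk : chunks_) std::free(chunk);
  }

  void* allocate() {
    if (free_list_ == nullptr) add_chunk();    // grow only when no free block remains
    void* block = free_list_;
    free_list_ = *static_cast<void**>(block);  // pop the intrusive free list
    return block;
  }

  void deallocate(void* block) {
    *static_cast<void**>(block) = free_list_;  // push back for immediate reuse
    free_list_ = block;
  }

 private:
  void add_chunk() {
    char* chunk = static_cast<char*>(std::malloc(block_size_ * blocks_per_chunk_));
    if (chunk == nullptr) throw std::bad_alloc{};
    chunks_.push_back(chunk);
    for (std::size_t i = 0; i < blocks_per_chunk_; ++i) {
      deallocate(chunk + i * block_size_);     // thread every new block onto the free list
    }
  }

  std::size_t block_size_;
  std::size_t blocks_per_chunk_;
  void* free_list_ = nullptr;
  std::vector<void*> chunks_;
};
```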
Alignment
Ensure block addresses meet alignment requirements (power-of-2 alignment, block size multiples)
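One possible way to satisfy this, assuming the per-row byte size is rounded up to a multiple of a power-of-two alignment (a sketch, not the actual implementation):

```cpp
#include <cstddef>

// Round the per-row byte size up to a multiple of the (power-of-two) alignment,
// so consecutive blocks inside a chunk all start at aligned addresses.
constexpr std::size_t aligned_block_size(std::size_t bytes, std::size_t alignment) {
  return (bytes + alignment - 1) & ~(alignment - 1);
}

static_assert(aligned_block_size(130, 8) == 136);
static_assert(aligned_block_size(128, 64) == 128);
```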
2.3 Implementation Details
Chunk Handling
Custom memory_resource Class
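A minimal sketch of such a class, assuming it derives from std::pmr::memory_resource and overrides the standard do_allocate/do_deallocate/do_is_equal hooks. The body below only forwards to an upstream resource; the real class would serve requests from the chunk/free-list machinery described above. The class name is illustrative.

```cpp
#include <cstddef>
#include <memory_resource>

// Illustrative std::pmr adapter: every request is served at one fixed block
// size, which works here because each pool only serves a single emb_dim.
class FixedBlockResource : public std::pmr::memory_resource {
 public:
  explicit FixedBlockResource(std::size_t block_size)
      : block_size_(block_size), upstream_(std::pmr::new_delete_resource()) {}

 private:
  void* do_allocate(std::size_t /*bytes*/, std::size_t alignment) override {
    // All rows in this table share one size, so the fixed block size suffices.
    return upstream_->allocate(block_size_, alignment);  // real impl: pop the free list
  }

  void do_deallocate(void* p, std::size_t /*bytes*/, std::size_t alignment) override {
    upstream_->deallocate(p, block_size_, alignment);     // real impl: push the free list
  }

  bool do_is_equal(const std::pmr::memory_resource& other) const noexcept override {
    return this == &other;
  }

  std::size_t block_size_;
  std::pmr::memory_resource* upstream_;
};
```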
Special Case Handling
For allocations ≤8 bytes (sizeof(void*)): additional handling is required to prevent metadata overwrite
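For illustration, one common guard is to clamp the block size to at least sizeof(void*) so the intrusive free-list pointer always fits inside a freed block; this is an assumption about the handling, not the actual code:

```cpp
#include <algorithm>
#include <cstddef>

// The free-list "next" pointer is stored inside each freed block, so a block
// must be at least sizeof(void*) bytes; smaller requests are rounded up.
constexpr std::size_t clamp_block_size(std::size_t requested) {
  return std::max(requested, sizeof(void*));
}

static_assert(clamp_block_size(4) == sizeof(void*));
static_assert(clamp_block_size(64) == 64);
```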
3 Benefit Analysis
Comparative advantages vs. the baseline and jemalloc:
vs Jemalloc: avoids the general-purpose Buddy/Slob algorithms, block merging, and multi-scale slot designs; each table uses a single fixed block size.
SynchronizedShardedMap: concurrency is handled through the lock-free per-table memory pool design, reusing the existing shard locks instead of adding new ones.
Free Block Management: freed blocks are reused directly for subsequent allocations, so no merging or compaction is needed.
std::pmr Advantages: the pool exposes the standard memory_resource interface, so existing containers can adopt it with minimal code changes (see the usage sketch below).
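As a usage sketch (the resource type here is illustrative), a pooled memory_resource can be handed to std::pmr containers without changing the surrounding container code:

```cpp
#include <cstdint>
#include <memory_resource>
#include <unordered_map>
#include <vector>

int main() {
  // Any std::pmr::memory_resource (e.g. a custom fixed-block resource) can be
  // dropped in here; the container and value types keep their interfaces.
  std::pmr::unsynchronized_pool_resource pool;
  std::unordered_map<int64_t, std::pmr::vector<float>> shard;
  shard.emplace(42, std::pmr::vector<float>(128, 0.0f, &pool));
}
```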