
Jemalloc Mempool and Adaptation for CPU HASHTABLE #4154


Open · wants to merge 6 commits into base: main

Conversation

@ArronHZG commented May 20, 2025

Jemalloc Mempool and Adaptation for CPU HASHTABLE

1 Current Status

Code: https://github.com/pytorch/FBGEMM/tree/main/fbgemm_gpu/src/dram_kv_embedding_cache

In the current implementation, the value structure of the CPU hashtable uses std::vector, which directly requests memory from the system via std::allocator. Without memory pool management, frequent allocations incur significant system call overhead.

2 Proposed Solution

2.1 Overview

Given that the memory requirements for embedding allocations are known upfront, we can leverage this information to customize bin sizes more appropriately.

Our scenario (embedding hashtable) does not involve frequent memory deallocations (only during ID eviction) but requires frequent allocations. Freed memory can be directly reused for new allocations, minimizing fragmentation concerns.

Compared to jemalloc, our approach avoids complex Buddy/Slob algorithms, memory block merging, and multi-scale slot designs. The core idea is:

  • Implement dedicated memory pool management for each shard of fixed emb_dim hashtables (including merged tables).
  • Reuse the existing lock mechanisms from SynchronizedShardedMap for concurrency control.
  • Create a "lock-free per-table memory pool" design that minimizes code invasiveness.

[Figure: mempool design]

Design features:
  • Leverages the existing SharedMutexWritePriority from SynchronizedShardedMap.
  • Memory pool operations share the same critical section as hashtable insertions.

template <typename K, typename V, typename M = folly::SharedMutexWritePriority>
class SynchronizedShardedMap {
 public:
   ...
 private:
  std::vector<folly::Synchronized<folly::F14FastMap<K, V>, M>> shards_;
  // New: one memory pool per shard, same shard count as shards_
  std::vector<FixedBlockPool> mempool_;
};
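
For illustration, a hypothetical insertion path might look like the sketch below; the by()/pool() shard accessors, the pointer-valued map, and the FixedBlockPool::allocate() call are assumptions made for this sketch, not the existing FBGEMM API.

// Sketch only: pool allocation and map insertion share one critical section.
template <typename K>
void insert_row(SynchronizedShardedMap<K, float*>& map,
                std::size_t shard_id, K key, const float* src, std::size_t dim) {
  auto wlock = map.by(shard_id).wlock();  // existing shard write lock, no extra mutex
  auto* block = static_cast<float*>(
      map.pool(shard_id).allocate(dim * sizeof(float)));  // pool touched only under this lock
  for (std::size_t i = 0; i < dim; ++i) {
    block[i] = src[i];  // copy the embedding row into the pool-owned block
  }
  (*wlock)[key] = block;  // the map value is the pool-owned block pointer
}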

2.2 FixedBlockPool Design

[Figure: FixedBlockPool structure]

Three-Level Structure Model

  • Pool: Manages multiple chunks, maintains global free list, interfaces with PMR
  • Chunk: Preallocated contiguous memory (e.g., 1024 blocks), divided into equal-sized blocks with alignment
  • Block: Minimum allocation unit. Stores next-block pointer in first sizeof(void*) bytes when free

Core Data Structures

  • Free List: Singly-linked list using head-of-block pointers (O(1) alloc/dealloc)
  • Chunk List: Stores pointers to allocated chunks for destruction-time cleanup

Workflow

  • Allocate: Take the head block from the free list; allocate a new chunk when the list is empty.
  • Deallocate: Return the block to the head of the free list (see the sketch below).
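
A minimal sketch of these two operations, assuming illustrative member names (free_list_, allocate_chunk()) and blocks of at least sizeof(void*) bytes:

void* FixedBlockPool::allocate_block() {
  if (free_list_ == nullptr) {
    allocate_chunk();  // grow by one preallocated chunk when the free list is empty
  }
  void* block = free_list_;
  free_list_ = *static_cast<void**>(block);  // pop: the next-pointer lives in the block's first bytes
  return block;
}

void FixedBlockPool::deallocate_block(void* block) {
  *static_cast<void**>(block) = free_list_;  // push: store the old head inside the freed block
  free_list_ = block;
}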

Alignment

Block addresses must satisfy the alignment requirements: the alignment is a power of two, and the block size is rounded up to a multiple of it.
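
For example, assuming a power-of-two alignment, the block size can be rounded up with a helper like this (hypothetical, for illustration only):

#include <cstddef>

// Round size up to the next multiple of a power-of-two alignment.
inline std::size_t round_up_to_alignment(std::size_t size, std::size_t alignment) {
  return (size + alignment - 1) & ~(alignment - 1);
}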

2.3 Implementation Details

Chunk Handling

  • Minimum block size: sizeof(void*), so each free block can store its link pointer
  • Chunk splitting: Fixed-size blocks with zero metadata overhead
  • Memory alignment: Enforced during chunk allocation

Custom memory_resource Class

  • Inherits from std::pmr::memory_resource
  • Parameters: block_size, blocks_per_chunk, alignment
  • Maintains the free list and chunk tracking (see the sketch below)
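
A minimal sketch of such a class, consolidating the free-list, chunk, and alignment handling described above (all member and parameter names are illustrative assumptions, not the actual FBGEMM implementation):

#include <cstddef>
#include <memory_resource>
#include <new>
#include <vector>

class FixedBlockPool : public std::pmr::memory_resource {
 public:
  FixedBlockPool(std::size_t block_size,
                 std::size_t blocks_per_chunk,
                 std::size_t alignment)
      // Round the block size up to a multiple of the (power-of-two) alignment.
      : block_size_((block_size + alignment - 1) & ~(alignment - 1)),
        blocks_per_chunk_(blocks_per_chunk),
        alignment_(alignment) {}

  ~FixedBlockPool() override {
    for (void* chunk : chunks_) {
      ::operator delete(chunk, std::align_val_t(alignment_));
    }
  }

 protected:
  void* do_allocate(std::size_t /*bytes*/, std::size_t /*align*/) override {
    if (free_list_ == nullptr) {
      allocate_chunk();
    }
    void* block = free_list_;
    free_list_ = *static_cast<void**>(block);  // pop the free-list head
    return block;
  }

  void do_deallocate(void* p, std::size_t, std::size_t) override {
    *static_cast<void**>(p) = free_list_;  // push the block back onto the free list
    free_list_ = p;
  }

  bool do_is_equal(const std::pmr::memory_resource& other) const noexcept override {
    return this == &other;
  }

 private:
  void allocate_chunk() {
    // One aligned allocation per chunk, then split it into fixed-size blocks.
    auto* chunk = static_cast<std::byte*>(::operator new(
        block_size_ * blocks_per_chunk_, std::align_val_t(alignment_)));
    chunks_.push_back(chunk);
    for (std::size_t i = 0; i < blocks_per_chunk_; ++i) {
      do_deallocate(chunk + i * block_size_, block_size_, alignment_);
    }
  }

  std::size_t block_size_;        // expected to be at least sizeof(void*) and a multiple of alignment_
  std::size_t blocks_per_chunk_;  // e.g. 1024
  std::size_t alignment_;         // power of two
  void* free_list_ = nullptr;     // threaded through the free blocks themselves
  std::vector<void*> chunks_;     // retained so the destructor can release them
};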

Special Case Handling

For allocations of at most sizeof(void*) bytes (8 bytes on 64-bit platforms): requires additional handling to prevent metadata overwrite, since each block must still be able to hold the free-list pointer.
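
One possible guard, shown here as a hypothetical helper rather than the actual code:

#include <algorithm>
#include <cstddef>

// A free block must be able to hold the embedded free-list pointer,
// so the effective block size is never smaller than sizeof(void*).
inline std::size_t effective_block_size(std::size_t value_bytes) {
  return std::max(value_bytes, sizeof(void*));
}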

3 Benefit Analysis

Comparative advantages vs baseline & jemalloc:

  1. vs Jemalloc

    • Tighter integration with SynchronizedShardedMap through lock-free per-table memory pool design
    • Avoids multi-threaded resource contention overhead via shared lock mechanism
  2. Free Block Management

    • O(1) allocation/deallocation via embedded pointer single-linked list
    • Uses block's own memory for free list pointers
  3. std::pmr Advantages

    • Enforces chunk alignment (cache-line friendly)
    • Reduces CPU cache-miss penalties
    • Built-in memory resource chaining support


netlify bot commented May 20, 2025

Deploy Preview for pytorch-fbgemm-docs ready!

Name Link
🔨 Latest commit 463c152
🔍 Latest deploy log https://app.netlify.com/projects/pytorch-fbgemm-docs/deploys/682d99bec39bec0008ad198c
😎 Deploy Preview https://deploy-preview-4154--pytorch-fbgemm-docs.netlify.app

@facebook-github-bot (Contributor)

Hi @ArronHZG!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g., your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@ArronHZG changed the title from DramFixedBlockPool to jemalloc Mempool and Adaptation for CPU HASHTABLE (May 20, 2025)
@ArronHZG changed the title from jemalloc Mempool and Adaptation for CPU HASHTABLE to Jemalloc Mempool and Adaptation for CPU HASHTABLE (May 20, 2025)