
Memory Hierarchy Description and Rationale


This wiki page explains the design decisions of the hardware components in the FPGA related to the memory hierarchy.

Goal. Increase system performance by using lower-latency resources for the FPGA's pipeline: SRAM (BRAM/URAM) and DRAM. We operate in a context of two semi-independent systems, a full-system emulator and an FPGA soft-core, which makes the system design more challenging given the synchronisation requirements and the multiple layers of address translation: physical addresses (PA) and virtual addresses (VA) for the host, the emulator, and the FPGA.

Challenges.

  • Initially, the emulator has the complete state and the FPGA has no state available
  • The emulator only intervenes to execute an instruction in rare cases
  • The pipeline can only see the virtual address space
  • The emulator and the pipeline are able to execute instructions accessing the same virtual memory location in different physical address spaces (host/FPGA), requiring some form of synchronisation
  • The FPGA's physical address scheme is independent of the host's or emulator's physical address scheme
  • We must rely on the emulator’s internal functions to help us interact with the guest’s physical space
  • The system running multiple processor contexts can have multiple virtual address spaces:
    -> homonyms, identical VAs pointing to different PAs
    -> synonyms, multiple VAs pointing to the same PA
  • The FPGA's 64GB of DRAM can require up to 2^24 page translations (64GB / 4KB pages = 2^24)
  • Memory operations to devices/IO address space

Requirements.

  1. FPGA's physical address space requires a structure to store virtual address (VA->PA) mappings
  2. FPGA's physical address space requires a mechanism to keep track of which page locations are free and which are used
  3. Two identical virtual addresses backed by different guest physical addresses should point to two different FPGA physical addresses (homonyms)
  4. Two different virtual addresses that point to the same guest physical address should also point to the same FPGA physical address (synonyms)
  5. We must be able to extract the guest physical address from its virtual address independent of QEMU execution flow
  6. When QEMU executes a memory instruction, it should detect whether that page is present in the FPGA and synchronise the page before executing the instruction

Supporting Structures and Behaviour.

  1. A new independent Page Table (PT) (guest VA -> FPGA PA)
  2. A stack with all the free physical addresses
  3. The virtually addressed translations must use the address space id (asid)
  4. A Reverse Map from guest PA -> VA to detect synonyms
  5. Hijack the emulator's internal translation system
  6. Memory instructions in the emulator must be instrumented to check whether the page is in the FPGA and, if so, synchronise it before executing

Design

// Insert image here with high level overview

Each element can be managed either by the emulator/host or by the hardware/FPGA. The rationale behind this decision is usually dictated by latency and bandwidth requirements, or by the fact that the element must rely on emulator-internal mechanisms to complete its work.

FPGA Physical Address Space Management. In this section we describe all the mechanisms that allow the virtual addressing of the FPGA's physical memory.

Page Table

The Page Table stores mappings from VA to FPGA PA; it is located and managed in the FPGA.

The Page Table cannot be located or managed by the host for the following reasons:

  1. It should be able to hold up to 2^24 translations -> the PT is too big to fit in BRAM and must be located in DRAM
  2. TLB misses result in a Page Table walk and are critical for performance -> the PT must be accessible from the FPGA, i.e. the Page Table must be in FPGA DRAM
  3. Page Table Entries (PTEs) can be updated when a page is modified, or to keep history for the eviction policy -> the PT must be modifiable by the FPGA; we can't have "a PT managed by the emulator but with a shadow copy in the FPGA"

Given that the Page Table is located in the FPGA's DRAM, it requires a hardware Page Walker to walk, insert, modify and evict entries.

We choose a set-associative (16-way) hash table to store the Page Table:

  1. Low latency for operations: one or two DRAM operations to walk, insert, modify or evict
  2. Easy to implement the Page Walker in hardware
  3. DRAM space available in the FPGA allows for over-provisioning to improve performance

Note: The eviction policy for the table is PseudoLRU, given that not every memory access results in a Page Walk.
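
The sketch below, written in C as a software model rather than the actual hardware, shows one way such a 16-way set-associative Page Table and its walk could be organised. The set count, entry layout and hash function are illustrative assumptions, not the project's real parameters.

```c
/* Software model of a 16-way set-associative Page Table lookup.
 * Field widths, set count and hash function are illustrative assumptions. */
#include <stdbool.h>
#include <stdint.h>

#define PT_WAYS 16               /* associativity chosen above                 */
#define PT_SETS (1u << 21)       /* assumption: ~2x over-provisioned vs 2^24   */

typedef struct {
    bool     valid;
    bool     dirty;
    uint16_t asid;               /* address space id, distinguishes homonyms   */
    uint64_t vpn;                /* guest virtual page number                  */
    uint64_t fpga_ppn;           /* FPGA physical page number                  */
} pte_t;

typedef struct {
    pte_t    way[PT_WAYS];       /* one DRAM burst can fetch the whole set     */
    uint16_t plru;               /* PseudoLRU state used for evictions         */
} pt_set_t;

static inline uint32_t pt_hash(uint16_t asid, uint64_t vpn)
{
    /* Illustrative hash: mix the asid into the VPN to pick a set. */
    return (uint32_t)((vpn ^ ((uint64_t)asid << 8)) & (PT_SETS - 1));
}

/* Page walk: fetch the set, then compare all ways against (asid, vpn). */
static pte_t *pt_walk(pt_set_t *table, uint16_t asid, uint64_t vpn)
{
    pt_set_t *set = &table[pt_hash(asid, vpn)];
    for (int w = 0; w < PT_WAYS; w++) {
        pte_t *e = &set->way[w];
        if (e->valid && e->asid == asid && e->vpn == vpn)
            return e;            /* hit: VA -> FPGA PA translation found       */
    }
    return NULL;                 /* miss: raises a page fault towards the host */
}
```

Reading the whole set in a single DRAM burst is what keeps a walk, insert, modify or evict down to the one or two DRAM operations mentioned above.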

Free physical address stack

To assign a physical page location on a page fault, we keep track of all the free physical addresses available using a stack. Three kinds of transactions can pop or push an entry on the stack:

  1. Page faults from the FPGA following a page walk miss
  2. Page evictions from the FPGA due to a PT set running out of associativity
  3. Page eviction requests from the emulator due to executing an emulator memory instruction to an address resident in the FPGA

On a page fault, the stack pops a new physical address if no synonym was detected. On an eviction, the physical address is pushed back onto the stack only once no synonym still points to that PA.

This structure should be located in the host:

  1. The transactions listed above already require intervention from the host and emulator
  2. Deciding whether to pop or push a free PA requires detecting synonyms, which is done in the host
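
A minimal sketch of such a free physical address stack follows, assuming 4KB pages and the 64GB FPGA DRAM mentioned earlier (2^24 pages); the type and function names are illustrative, not the project's actual code.

```c
/* Host-side stack of free FPGA physical page numbers (illustrative sketch). */
#include <assert.h>
#include <stdint.h>

#define FPGA_PAGES (1u << 24)    /* 64GB of DRAM / 4KB pages = 2^24 pages */

typedef uint32_t fpga_ppn_t;     /* FPGA physical page number */

typedef struct {
    fpga_ppn_t entries[FPGA_PAGES];
    uint32_t   top;              /* number of free pages currently on the stack */
} free_stack_t;

static void free_stack_init(free_stack_t *s)
{
    /* Initially the FPGA holds no state, so every physical page is free. */
    for (uint32_t i = 0; i < FPGA_PAGES; i++)
        s->entries[i] = i;
    s->top = FPGA_PAGES;
}

/* Page fault with no synonym detected: hand out a fresh physical page. */
static fpga_ppn_t free_stack_pop(free_stack_t *s)
{
    assert(s->top > 0 && "FPGA DRAM exhausted: an eviction is required first");
    return s->entries[--s->top];
}

/* Eviction with no remaining synonym: return the physical page to the pool. */
static void free_stack_push(free_stack_t *s, fpga_ppn_t ppn)
{
    assert(s->top < FPGA_PAGES);
    s->entries[s->top++] = ppn;
}
```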

Inverted Page Table (Reverse Map in Linux)

Virtual memory can have multiple virtual addresses pointing to the same physical address. In classical systems, the OS keeps and updates a structure named the Reverse Map; this structure is used when a physical page is deallocated to destroy all the corresponding virtual address mappings. Here, we use QEMU's translation system to detect synonyms and store in software a Reverse Map of the FPGA Page Table; this is completely transparent both to the OS running inside QEMU and to QEMU itself.

Actions that require updating the Reverse Map are:

  1. On a page fault completion, add the new translation to the Reverse Map
  2. On a page translation eviction, remove the translation from the Reverse Map
  3. On a physical page eviction, use the Reverse Map to evict all other translations mapping to the same page

Given that these three operations already require the intervention of the host and QEMU, this structure is located and managed by the host.
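
Below is a sketch of how a host-side Reverse Map keyed by guest physical page number could look. The chained hash table, names and bucket count are assumptions for illustration; it covers operations 1 and 3 above (operation 2, removing a single translation, is analogous to the eviction loop).

```c
/* Illustrative host-side Reverse Map: guest physical page -> list of
 * (asid, guest VA) mappings currently installed in the FPGA Page Table. */
#include <stdint.h>
#include <stdlib.h>

#define RMAP_BUCKETS (1u << 16)

typedef struct rmap_entry {
    uint64_t           guest_ppn; /* guest physical page number      */
    uint16_t           asid;      /* address space id of the mapping */
    uint64_t           vpn;       /* guest virtual page number       */
    struct rmap_entry *next;
} rmap_entry_t;

typedef struct {
    rmap_entry_t *bucket[RMAP_BUCKETS];
} rmap_t;

static inline uint32_t rmap_hash(uint64_t guest_ppn)
{
    return (uint32_t)(guest_ppn & (RMAP_BUCKETS - 1));
}

/* 1) Page fault completion: record that (asid, vpn) now maps to guest_ppn. */
static void rmap_add(rmap_t *rm, uint64_t guest_ppn, uint16_t asid, uint64_t vpn)
{
    rmap_entry_t **head = &rm->bucket[rmap_hash(guest_ppn)];
    rmap_entry_t *e = malloc(sizeof(*e));
    e->guest_ppn = guest_ppn;
    e->asid      = asid;
    e->vpn       = vpn;
    e->next      = *head;
    *head        = e;
}

/* Synonym check: is this guest physical page already mapped in the FPGA? */
static rmap_entry_t *rmap_lookup(rmap_t *rm, uint64_t guest_ppn)
{
    for (rmap_entry_t *e = rm->bucket[rmap_hash(guest_ppn)]; e; e = e->next)
        if (e->guest_ppn == guest_ppn)
            return e;
    return NULL;
}

/* 3) Physical page eviction: drop every translation mapping to this page
 * (the corresponding FPGA Page Table entries would be invalidated as well). */
static void rmap_evict_all(rmap_t *rm, uint64_t guest_ppn)
{
    rmap_entry_t **pp = &rm->bucket[rmap_hash(guest_ppn)];
    while (*pp) {
        if ((*pp)->guest_ppn == guest_ppn) {
            rmap_entry_t *dead = *pp;
            *pp = dead->next;
            free(dead);
        } else {
            pp = &(*pp)->next;
        }
    }
}
```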

Emulator Translation System

Memory Management Unit

The memory management unit (MMU) performs all the operations related to address translation and works in collaboration with the emulator. Initially the FPGA has zero state, so memory state is brought in lazily: as the pipeline tries to access pages, the MMU requests the pages from the emulator.

Three structures are required for the MMU to fulfil its role: the Page Table (PT), the Free Physical Address stack, and the Inverted Page Table (IPT). Two modifications are made to the emulator: 1) to interact with the emulator's address space, we hijack its translation functions; 2) to ensure correct synchronisation, memory instructions are instrumented to perform the needed checks.
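
As an illustration of the second modification, the sketch below shows one possible shape of the check that an instrumented memory instruction could perform before QEMU executes it. Every function name here is a hypothetical placeholder rather than a real QEMU or project API, and the write-back/evict policy shown is just one option.

```c
/* Illustrative check run before QEMU executes a guest memory instruction.
 * fpga_owns_page, fpga_writeback_page and fpga_evict_page are hypothetical
 * placeholders for the host-side bookkeeping and the FPGA transport. */
#include <stdbool.h>
#include <stdint.h>

bool fpga_owns_page(uint64_t guest_ppn);       /* Reverse Map lookup             */
void fpga_writeback_page(uint64_t guest_ppn);  /* copy page contents to the host */
void fpga_evict_page(uint64_t guest_ppn);      /* drop FPGA translations + page  */

/* Called with the guest physical page obtained via QEMU's own translation. */
void emulator_mem_check(uint64_t guest_ppn, bool is_store)
{
    if (!fpga_owns_page(guest_ppn))
        return;                  /* page lives only on the host: fast path */

    /* The page is resident in FPGA DRAM: bring its contents back so the
     * emulator sees up-to-date data before executing the instruction. */
    fpga_writeback_page(guest_ppn);

    /* On a store, also evict the FPGA copy so the pipeline refetches the
     * updated page through a later page fault (policy is an assumption). */
    if (is_store)
        fpga_evict_page(guest_ppn);
}
```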

Caches Hierarchy and TLBs

State Synchronisation

Emulator Memory Instruction Instrumentation


FAQ: