Lit Review: Programmable Packet Scheduling on SmartNICs #11
anshumanmohan started this conversation in Ideas
Background
A data center acts like a multiplexer, matching tasks ("packets") with CPUs that can accomplish those tasks ("cores"). The tasks may be latency-sensitive, or they may be "batch mode" work that is less time-sensitive. Doing this multiplexing at the NIC makes sense, since every packet must go through the NIC anyway. SmartNICs to the rescue! That said, a SmartNIC cannot do this scheduling in a vacuum; it needs to coordinate with the CPU cores, which have additional information.
Elastic RSS
Rucker et al at APNet '19 | Paper
Major goals:
Minor goals:
Citeable facts:
Further reading:
Questions:
This paper argues that packets and cores should be co-scheduled at the NIC. They want to use Taurus, a programmable NIC, which has a map-reduce abstraction. Map: for each packet, compute the weighted consistent-hashing distance to each core. Reduce: for each packet, pick the closest allocated core.
Scheduling happens at two timescales: fine-grained, per-packet processing at the NIC, and coarse-grained state management at the CPU.
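To make the map-reduce framing concrete, here is my own toy sketch in plain Python. It assumes weighted rendezvous (highest-random-weight) hashing as the "distance" metric, and all names are invented; the paper targets hardware, so this only shows the per-packet decision, not their implementation.

```python
import hashlib
import math

def _hash01(packet_key: bytes, core_id: int) -> float:
    """Map step: hash the (packet, core) pair into (0, 1)."""
    h = hashlib.sha256(packet_key + core_id.to_bytes(4, "little")).digest()
    return (int.from_bytes(h[:8], "little") + 1) / (2**64 + 1)

def erss_pick_core(packet_key: bytes, allocated_cores: dict[int, float]) -> int:
    """Reduce step: pick the "closest" allocated core under weighted
    rendezvous hashing. allocated_cores maps core id -> weight."""
    def score(core_id: int) -> float:
        w = allocated_cores[core_id]
        # Larger weight and a luckier hash => higher score (smaller "distance").
        return -w / math.log(_hash01(packet_key, core_id))
    return max(allocated_cores, key=score)

# Example: a flow's five-tuple, three allocated cores with unequal weights.
flow = b"10.0.0.1:1234->10.0.0.2:80/tcp"
print(erss_pick_core(flow, {0: 1.0, 1: 2.0, 2: 1.0}))
```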
PANIC
Lin et al at OSDI '20 | Paper
Major goals:
Citeable facts:
This paper contributes a new NIC design that combines a variety of offloads into chains that can then be consumed efficiently by a pipelined core.
They want the design to be in 4 parts:
Packets are given a "PANIC descriptor", then pushed to the scheduler, which then buffers the packet until the first destination core is idle. It then pushes the packet to that idle core. A core may further push the packet directly to another core, without going back to the scheduler. If a core receives a packet while not idle, it may push the packet to the scheduler's buffer and then pull it later on. You more or less reverse this for transmission.
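A rough software model of that push/pull flow, just to pin down my understanding. Everything here (class names, fields) is invented for illustration; PANIC does this in hardware with a real descriptor format and a central buffer.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Descriptor:
    """Stand-in for a "PANIC descriptor": the packet plus its offload chain."""
    packet: bytes
    chain: list          # ordered ids of the compute units still to visit

@dataclass
class ComputeUnit:
    busy: bool = False
    inbox: deque = field(default_factory=deque)

class Scheduler:
    """Central buffer: holds a descriptor until the next unit in its chain is idle."""
    def __init__(self, units):
        self.units = units                  # dict: unit id -> ComputeUnit
        self.buffer = deque()

    def push(self, d: Descriptor) -> None:
        nxt = self.units[d.chain[0]]
        if nxt.busy:
            self.buffer.append(d)           # busy unit will pull it later
        else:
            nxt.inbox.append(d)             # push straight to the idle unit

    def forward(self, d: Descriptor) -> None:
        """A unit finished its step; hand off to the next unit in the chain.
        (In hardware this can bypass the scheduler and go unit-to-unit.)"""
        d.chain.pop(0)
        if d.chain:
            self.push(d)
```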
AlNiCo
Li et al at USENIX ATC '22 | Paper
Major goals:
Citeable facts:
Questions:
This paper highlights two challenges with transaction scheduling:
They have a new way of compactly representing the state of contention (a function of the "request state" of the packet, the "worker state" of each CPU, and the "global state" of the whole server), which lets them estimate the likelihood of contention in hardware.
Contention (where two transactions access the same record, and at least one of those transactions is a write) is bad because it leads to an abort. Aborts can cascade. That said, we cannot aim for perfect contention-awareness all the time, since that would slow us down too much. This paper seeks to minimize this contention without slowing things down.
Clients of AlNiCo must tag packets with a fixed-form header called the "request feature vector". The data plus the header is sent to the scheduler. The scheduler looks at these, plus the state of the worker threads, and notifies the worker threads of their next tasks. This notification is just an address for the data, not the data itself. When the thread is ready, it pulls the data from a buffer, does its work, and transmits the answer to its client.
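My mental model of the scoring step, as a toy NumPy sketch. The actual scoring function and feature-vector encoding live on the NIC and are more involved; the shapes, weights, and names below are made up.

```python
import numpy as np

def pick_worker(request_vec, worker_vecs, worker_load):
    """Toy feature-vector scheduling: score each worker by how much its
    in-flight work overlaps with the incoming request (potential contention)
    plus its current load, then notify the lowest-scoring worker."""
    contention = worker_vecs @ request_vec    # overlap with each worker's state
    score = contention + worker_load          # weighting omitted for brevity
    return int(np.argmin(score))

# Example: a request touching records {2, 5}, three workers.
req = np.zeros(8); req[[2, 5]] = 1.0
workers = np.random.rand(3, 8)                # per-worker record-access state
load = np.array([0.2, 0.9, 0.4])
print(pick_worker(req, workers, load))
```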
FlowValve: Packet Scheduling Offloaded on NP-based SmartNICs
Xi et al at ICDCS '22 | Paper
Major goals:
Citeable facts:
Further reading:
Questions:
They abstract over the existing queues on a NIC to create a logical FIFO. They "perform specialized tail drops to mix the FIFO queue with expected flow proportions". I'm not totally sure what they mean, but the line is: "Unlike common tail drop, FlowValve prejudges which packet would cause buffer overflow to its belonged traffic class. Then it explicitly drops this packet in advance. In this way, FlowValve assigns buffers conceptually."
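My best guess at what that means, as a toy model: one physical FIFO, but each traffic class is only allowed its configured share of the buffer, and a packet is dropped on arrival if admitting it would push its class over that share, even when the FIFO as a whole still has room. All names and numbers below are illustrative, not FlowValve's.

```python
class SharedFifo:
    def __init__(self, capacity, class_shares):
        self.limits = {c: share * capacity for c, share in class_shares.items()}
        self.used = {c: 0 for c in class_shares}
        self.fifo = []

    def enqueue(self, cls, size):
        """Drop in advance if this packet would overflow its class's share."""
        if self.used[cls] + size > self.limits[cls]:
            return False                      # the "prejudged" tail drop
        self.used[cls] += size
        self.fifo.append((cls, size))
        return True

    def dequeue(self):
        if not self.fifo:
            return None
        cls, size = self.fifo.pop(0)
        self.used[cls] -= size
        return cls, size

# Example: a 100 KB buffer split 60/40 between two classes.
q = SharedFifo(100_000, {"latency": 0.6, "bulk": 0.4})
print(q.enqueue("bulk", 30_000), q.enqueue("bulk", 15_000))   # True, then False
```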
Shinjuku: Preemptive Scheduling for μsecond-scale Tail Latency
Kaffes et al at NSDI '19 | Paper
Major goals:
Citeable facts:
Further reading:
Questions:
RackSched: A Microsecond-Scale Scheduler for Rack-Scale Computers
Zhu et al at OSDI '20 | Paper
Major goals:
Citeable facts:
The Shinjuku paper showed that cFCFS and PS are ideal policies in many circumstances, and the Shinjuku system approximates these policies at the intra-server level. This paper finds that running JSQ (join the shortest queue) at the inter-server level, on top of intra-server Shinjuku, approximates cFCFS/PS at the rack level. That is, the entire rack appears to run Shinjuku's cFCFS/PS.
Pretty cool! But one wonders what the limitations of this observation are... we first commit to a server and then let the server run its buffering and scheduling routines using Shinjuku. Surely there is a flexibility cost to this? Shinjuku's secret sauce was preemption, and that is not possible at the inter-server level?
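For reference, JSQ itself is tiny; the hard part RackSched solves is tracking per-server load at line rate in the switch. A plain-Python sketch of the policy only, not their implementation:

```python
def jsq_pick(queue_lengths):
    """Join the shortest queue: send the next request to the server with the
    fewest outstanding requests (ties broken by lowest index)."""
    return min(range(len(queue_lengths)), key=lambda s: queue_lengths[s])

print(jsq_pick([3, 0, 7, 2]))   # -> 1
```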
TODO
Shenango: Achieving High CPU Efficiency for Latency-sensitive Datacenter Workloads
Ousterhout et al at NSDI '19 | Paper
Loom: Flexible and Efficient NIC Packet Scheduling
Stephens et al at NSDI '19 | Paper
Others cite this paper as the SOTA on programmable packet scheduling on NICs. Note that it is not a SmartNIC design, so realizing it would require taping out new ASICs, which takes years.
SENIC: Scalable NIC for End-Host Rate Limiting
Radhakrishnan et al at NSDI '14 | Paper
Eiffel: Efficient and Flexible Software Packet Scheduling
Saeed et al at NSDI '19 | Paper
A large-scale deployment of DCTCP
Dhamija et al at NSDI '24 | Paper
See §4.4
OS Scheduling: Better scheduling policies for modern computing systems
Kaffes in CACM | Paper
A review of the SOTA in OS scheduling by the Shinjuku lead author!