Lit Review: Programmable Packet Scheduling on SmartNICs #11
anshumanmohan started this conversation in Ideas
Background
A data center acts like a multiplexer, matching tasks ("packets") with CPUs that can accomplish those tasks ("cores"). The tasks may be latency-sensitive, or they may be "batch mode" work that is less time-sensitive. Doing this multiplexing at the NIC makes sense, since every packet must go through the NIC anyway. SmartNICs to the rescue! That said, a SmartNIC cannot do this scheduling in a vacuum; it needs to coordinate with the CPU cores, which have additional information.
Elastic RSS
Rucker et al at APNet '19 | Paper
Major goals:
Minor goals:
Citeable facts:
Further reading:
Questions:
This paper argues that packets and cores should be co-scheduled at the NIC. They want to use Taurus, a programmable NIC, which has a map-reduce abstraction. Map: for each packet, compute the weighted consistent-hashing distance to each core. Reduce: for each packet, pick the closest allocated core.
Scheduling happens at two timescales: fine-grained, per-packet processing at the NIC, and coarse-grained state management at the CPU.
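To make the map-reduce framing concrete, here is my own toy sketch in plain Python. It assumes weighted rendezvous (highest-random-weight) hashing as the "distance" metric, and all names are invented; the paper targets hardware, so this only shows the per-packet decision, not their implementation.

```python
import hashlib
import math

def _hash01(packet_key: bytes, core_id: int) -> float:
    """Map step: hash the (packet, core) pair into (0, 1)."""
    h = hashlib.sha256(packet_key + core_id.to_bytes(4, "little")).digest()
    return (int.from_bytes(h[:8], "little") + 1) / (2**64 + 1)

def erss_pick_core(packet_key: bytes, allocated_cores: dict[int, float]) -> int:
    """Reduce step: pick the "closest" allocated core under weighted
    rendezvous hashing. allocated_cores maps core id -> weight."""
    def score(core_id: int) -> float:
        w = allocated_cores[core_id]
        # Larger weight and a luckier hash => higher score (smaller "distance").
        return -w / math.log(_hash01(packet_key, core_id))
    return max(allocated_cores, key=score)

# Example: a flow's five-tuple, three allocated cores with unequal weights.
flow = b"10.0.0.1:1234->10.0.0.2:80/tcp"
print(erss_pick_core(flow, {0: 1.0, 1: 2.0, 2: 1.0}))
```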
PANIC
Lin et al at OSDI '20 | Paper
Major goals:
Citeable facts:
This paper contributes a new NIC design that combines a variety of offloads into chains that can then be consumed efficiently by a pipelined core.
They want the design to be in 4 parts:
Packets are given a "PANIC descriptor", then pushed to the scheduler, which then buffers the packet until the first destination core is idle. It then pushes the packet to that idle core. A core may further push the packet directly to another core, without going back to the scheduler. If a core receives a packet while not idle, it may push the packet to the scheduler's buffer and then pull it later on. You more or less reverse this for transmission.
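A rough software model of that push/pull flow, just to pin down my understanding. Everything here (class names, fields) is invented for illustration; PANIC does this in hardware with a real descriptor format and a central buffer.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Descriptor:
    """Stand-in for a "PANIC descriptor": the packet plus its offload chain."""
    packet: bytes
    chain: list          # ordered ids of the compute units still to visit

@dataclass
class ComputeUnit:
    busy: bool = False
    inbox: deque = field(default_factory=deque)

class Scheduler:
    """Central buffer: holds a descriptor until the next unit in its chain is idle."""
    def __init__(self, units):
        self.units = units                  # dict: unit id -> ComputeUnit
        self.buffer = deque()

    def push(self, d: Descriptor) -> None:
        nxt = self.units[d.chain[0]]
        if nxt.busy:
            self.buffer.append(d)           # busy unit will pull it later
        else:
            nxt.inbox.append(d)             # push straight to the idle unit

    def forward(self, d: Descriptor) -> None:
        """A unit finished its step; hand off to the next unit in the chain.
        (In hardware this can bypass the scheduler and go unit-to-unit.)"""
        d.chain.pop(0)
        if d.chain:
            self.push(d)
```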
AlNiCo
Li et al at USENIX ATC '22 | Paper
Major goals:
Citeable facts:
Questions:
This paper highlights two challenges with transaction scheduling:
They have a new way of compactly representing the state of contention (a function of the "request state" of the packet, the "worker state" of each CPU, and the "global state" of the whole server), which lets them estimate the likelihood of contention in hardware.
Contention (where two transactions access the same record, and at least one of those transactions is a write) is bad because it leads to an abort. Aborts can cascade. That said, we cannot aim for perfect contention-awareness all the time, since that would slow us down too much. This paper seeks to minimize this contention without slowing things down.
Clients of AlNiCo must tag packets with a fixed-form header called the "request feature vector". The data plus the header is sent to the scheduler. The scheduler looks at these, plus the state of the worker threads, and notifies the worker threads of their next tasks. This notification is just an address for the data, not the data itself. When the thread is ready, it pulls the data from a buffer, does its work, and transmits the answer to its client.
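My mental model of the scoring step, as a toy NumPy sketch. The actual scoring function and feature-vector encoding live on the NIC and are more involved; the shapes, weights, and names below are made up.

```python
import numpy as np

def pick_worker(request_vec, worker_vecs, worker_load):
    """Toy feature-vector scheduling: score each worker by how much its
    in-flight work overlaps with the incoming request (potential contention)
    plus its current load, then notify the lowest-scoring worker."""
    contention = worker_vecs @ request_vec    # overlap with each worker's state
    score = contention + worker_load          # weighting omitted for brevity
    return int(np.argmin(score))

# Example: a request touching records {2, 5}, three workers.
req = np.zeros(8); req[[2, 5]] = 1.0
workers = np.random.rand(3, 8)                # per-worker record-access state
load = np.array([0.2, 0.9, 0.4])
print(pick_worker(req, workers, load))
```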
FlowValve: Packet Scheduling Offloaded on NP-based SmartNICs
Xi et al at ICDCS '22 | Paper
Major goals:
Citeable facts:
Further reading:
Questions:
They abstract over the existing queues on a NIC to create a logical FIFO. They "perform specialized tail drops to mix the FIFO queue with expected flow proportions". I'm not totally sure what they mean, but the line is: "Unlike common tail drop, FlowValve prejudges which packet would cause buffer overflow to its belonged traffic class. Then it explicitly drops this packet in advance. In this way, FlowValve assigns buffers conceptually."
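My best guess at what that means, as a toy model: one physical FIFO, but each traffic class is only allowed its configured share of the buffer, and a packet is dropped on arrival if admitting it would push its class over that share, even when the FIFO as a whole still has room. All names and numbers below are illustrative, not FlowValve's.

```python
class SharedFifo:
    def __init__(self, capacity, class_shares):
        self.limits = {c: share * capacity for c, share in class_shares.items()}
        self.used = {c: 0 for c in class_shares}
        self.fifo = []

    def enqueue(self, cls, size):
        """Drop in advance if this packet would overflow its class's share."""
        if self.used[cls] + size > self.limits[cls]:
            return False                      # the "prejudged" tail drop
        self.used[cls] += size
        self.fifo.append((cls, size))
        return True

    def dequeue(self):
        if not self.fifo:
            return None
        cls, size = self.fifo.pop(0)
        self.used[cls] -= size
        return cls, size

# Example: a 100 KB buffer split 60/40 between two classes.
q = SharedFifo(100_000, {"latency": 0.6, "bulk": 0.4})
print(q.enqueue("bulk", 30_000), q.enqueue("bulk", 15_000))   # True, then False
```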
Shinjuku: Preemptive Scheduling for μsecond-scale Tail Latency
Kaffes et al at NSDI '19 | Paper
Major goals:
Citeable facts:
Further reading:
Questions:
RackSched: A Microsecond-Scale Scheduler for Rack-Scale Computers
Zhu et al at OSDI '20 | Paper
Major goals:
Citeable facts:
The Shinjuku paper showed that cFCFS and PS are ideal policies in many circumstances, and the Shinjuku system approximates these policies at the intra-server level. This paper finds that running JSQ (join the shortest queue) at the inter-server level, on top of intra-server Shinjuku, approximates cFCFS/PS at the rack level. That is, the entire rack appears to run Shinjuku's cFCFS/PS.
Pretty cool! But one wonders what the limitations of this observation are... we first commit to a server and then let the server run its buffering and scheduling routines using Shinjuku. Surely there is a flexibility cost to this? Shinjuku's secret sauce was preemption, and that is not possible at the inter-server level?
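For reference, JSQ itself is tiny; the hard part RackSched solves is tracking per-server load at line rate in the switch. A plain-Python sketch of the policy only, not their implementation:

```python
def jsq_pick(queue_lengths):
    """Join the shortest queue: send the next request to the server with the
    fewest outstanding requests (ties broken by lowest index)."""
    return min(range(len(queue_lengths)), key=lambda s: queue_lengths[s])

print(jsq_pick([3, 0, 7, 2]))   # -> 1
```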
TODO
Shenango: Achieving High CPU Efficiency for Latency-sensitive Datacenter Workloads
Ousterhout et al at NSDI '19 | Paper
Loom: Flexible and Efficient NIC Packet Scheduling
Stephens et al at NSDI '19 | Paper
Others cite this paper as the SOTA on programmable packet scheduling on NICs. Note that it is not a SmartNIC design, so realizing it would require taping out new ASICs, which takes years.
SENIC: Scalable NIC for End-Host Rate Limiting
Radhakrishnan et al at NSDI '14 | Paper
Eiffel: Efficient and Flexible Software Packet Scheduling
Saeed et al at NSDI '19 | Paper
A large-scale deployment of DCTCP
Dhamija et al at NSDI '24 | Paper
See §4.4
OS Scheduling: Better scheduling policies for modern computing systems
Kaffes in CACM | Paper
A review of the SOTA in OS scheduling by the Shinjuku lead author!