---
title: Memory Manager
owning-sig: sig-node
editor: TBD
creation-date: 2019-07-18
last-updated: 2019-07-18
status: implementable
---
Authors:
- @bg-chun - Byonggon Chun <[email protected]>
- @ohsewon - Sewon Oh <[email protected]>
- @admanV - Hyunsung Cho <[email protected]>
NUMA awareness is a well-known way to boost performance for diverse use cases, including DPDK and databases. For those use cases, Kubernetes supports NUMA-related features for containers, such as CPU Manager, Device Manager, and Topology Manager, which handle CPUs, PCIe I/O devices, and NUMA topology, respectively. However, there is no such feature for memory and hugepages. This can cause cross-NUMA-node memory access, which increases memory latency and degrades the performance of the entire system.
To resolve this problem, Memory Manager is proposed as a solution for deploying pods and containers of Guaranteed QoS class with NUMA-aware allocation of memory and hugepages.
- Node Topology Manager is a feature that collects topology hints from various hint providers to calculate socket affinity for a container. Topology Manager decides whether a container can be admitted under the pre-configured NUMA policy and the computed socket affinity.
- CPU Manager is a feature that provides CPU pinning based on the cgroups cpuset subsystem. It also offers CPU-related topology hints to Topology Manager.
- Device Manager is a feature that provides topology hints for PCIe I/O devices to Topology Manager. Its main objective is to allow I/O device vendors to advertise their resources to the kubelet so they can be consumed by pods and containers.
- Hugepages is a feature that allows a container to consume pre-allocated hugepages, with hugepages isolated per pod.
- Node Allocatable is a feature that helps reserve compute resources to prevent resource starvation. The kube-reserved and system-reserved flags reserve resources for the kubelet and for the system (OS, system daemons, etc.). Currently, the following resources can be reserved: cpu, memory, and ephemeral-storage.
- Guarantee alignment of resources so that a container's memory and hugepages are isolated along with its CPUs and I/O devices.
- Provide topology hints for memory and hugepages to Topology Manager.
- Updating the scheduler and the pod spec is out of scope at this point.
- This proposal focuses only on Linux-based systems.
- DPDK (Data Plane Development Kit) is a set of libraries that accelerate packet processing in user space. DPDK requires dedicated resources and NUMA-aware resource alignment. There should be a way to reserve memory and hugepages together with other compute resources on a specific socket for containerized VNFs that use DPDK.
- Databases (e.g., Oracle, PostgreSQL, MySQL, MongoDB, SAP) consume massive amounts of memory and hugepages in order to reduce memory access latency and improve performance dramatically. For this improvement, resources (i.e. CPUs, memory, hugepages, and I/O devices) should be aligned.
A new kubelet component, Memory Manager, enables NUMA awareness for memory and hugepages. Its main roles are listed below.
NUMA node affinity indicates which NUMA nodes have enough memory and/or hugepage capacity for a container.
To calculate affinity, Memory Manager gathers the capacity of memory and pre-allocated hugepages per NUMA node, excluding the system- and Kubernetes-reserved capacity from the Node Allocatable feature. Then, when pod admission is requested, Memory Manager checks the resource availability of each NUMA node and, if possible, reserves the resources internally.
- Node affinity for memory can be calculated with the formulas below.
  - Allocatable memory of a NUMA node = total memory of the NUMA node - hugepages - system-reserved - kube-reserved.
  - Allocatable memory of the NUMA node >= guaranteed memory for the container.
- Node affinity for hugepages can be calculated with the formulas below.
  - Available hugepages of a NUMA node = total hugepages of the NUMA node - hugepages reserved by the system.
  - Available hugepages of the NUMA node >= guaranteed hugepages for the container.
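Under the stated assumptions, the formulas above can be sketched as follows. The type and method names are illustrative, not the actual kubelet implementation, and the per-node reserved shares are assumed to be given:

```go
package main

import "fmt"

// NUMANodeMemory is a hypothetical view of one NUMA node's capacity.
type NUMANodeMemory struct {
	TotalMemory       uint64 // total memory of the NUMA node, in bytes
	HugepagesBytes    uint64 // memory pre-allocated as hugepages, in bytes
	SystemReserved    uint64 // this node's share of system-reserved, in bytes
	KubeReserved      uint64 // this node's share of kube-reserved, in bytes
	TotalHugepages    uint64 // pre-allocated hugepages on this node, count
	ReservedHugepages uint64 // hugepages reserved by the system, count
}

// AllocatableMemory = total - hugepages - system-reserved - kube-reserved.
func (n NUMANodeMemory) AllocatableMemory() uint64 {
	return n.TotalMemory - n.HugepagesBytes - n.SystemReserved - n.KubeReserved
}

// HasMemoryAffinity reports whether this node can satisfy the container's
// guaranteed memory request.
func (n NUMANodeMemory) HasMemoryAffinity(guaranteedBytes uint64) bool {
	return n.AllocatableMemory() >= guaranteedBytes
}

// HasHugepageAffinity reports whether this node can satisfy the container's
// guaranteed hugepages request.
func (n NUMANodeMemory) HasHugepageAffinity(guaranteedPages uint64) bool {
	return n.TotalHugepages-n.ReservedHugepages >= guaranteedPages
}

func main() {
	const gi = uint64(1) << 30
	node := NUMANodeMemory{
		TotalMemory:    64 * gi,
		HugepagesBytes: 8 * gi,
		SystemReserved: 1 * gi,
		KubeReserved:   1 * gi,
		TotalHugepages: 4096, // 4096 x 2Mi pages = 8Gi
	}
	fmt.Println(node.AllocatableMemory() / gi) // 54 (Gi): 64 - 8 - 1 - 1
	fmt.Println(node.HasMemoryAffinity(32*gi), node.HasHugepageAffinity(2048))
}
```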
- The Node Allocatable feature offers the "system-reserved" flag to reserve CPU, memory, and ephemeral-storage for system services and the kernel. Currently, kernel memory usage of containers is accounted against system-reserved.
- The Node Allocatable feature offers the "kube-reserved" flag to reserve resources for the kubelet and other Kubernetes system daemons.
- The Node Allocatable feature enforces reservations using cgroups, which limit resource usage at the node level rather than per NUMA node.
- For this reason, it is hard to calculate allocatable memory per NUMA node.
- In Alpha, Memory Manager reserves the same amount of system-reserved and kube-reserved memory on every NUMA node.
Topology Manager defines the HintProvider interface to collect node affinity of resources from resource managers (i.e. CPU Manager and Device Manager). Memory Manager implements this interface to calculate the affinity of memory and hugepages and then provides the resulting hints to Topology Manager.
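As a rough sketch of what such a hint provider might compute: the real TopologyHint and bitmask types live in the kubelet's topologymanager package, so the simplified types below are assumptions for illustration only.

```go
package main

import "fmt"

// TopologyHint is a simplified stand-in for the kubelet's hint type:
// a NUMA affinity bitmask plus a Preferred flag.
type TopologyHint struct {
	NUMANodeAffinity uint64 // bitmask: bit i set => NUMA node i
	Preferred        bool
}

// memoryHints emits one single-node hint for every NUMA node whose
// allocatable memory can hold the container's guaranteed request.
func memoryHints(allocatablePerNode []uint64, guaranteed uint64) []TopologyHint {
	var hints []TopologyHint
	for node, alloc := range allocatablePerNode {
		if alloc >= guaranteed {
			hints = append(hints, TopologyHint{
				NUMANodeAffinity: 1 << uint(node),
				Preferred:        true, // single-node hints are preferred
			})
		}
	}
	return hints
}

func main() {
	const gi = uint64(1) << 30
	// Node 0 has 54Gi allocatable, node 1 has 16Gi; request is 32Gi,
	// so only node 0 produces a hint.
	hints := memoryHints([]uint64{54 * gi, 16 * gi}, 32*gi)
	fmt.Println(len(hints), hints[0].NUMANodeAffinity)
}
```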
The cgroups cpuset subsystem is used to isolate memory and hugepages. When a pod of Guaranteed QoS class is admitted, Memory Manager restricts its containers to CPUs and memory on the same NUMA node. In other cases, Memory Manager still restricts memory access to a single NUMA node, regardless of CPU affinity.
Consequently, Memory Manager guarantees that a container's memory and hugepages are isolated to a single NUMA node.
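For illustration, a hypothetical helper that renders a set of NUMA node IDs in the format accepted by cpuset.mems might look like the following. This is not the kubelet's actual code, and it does not collapse runs into ranges like "0-3":

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// memsValue renders a set of NUMA node IDs as a cpuset "mems" value,
// e.g. {0} -> "0" and {1,0} -> "0,1".
func memsValue(nodes []int) string {
	sorted := append([]int(nil), nodes...)
	sort.Ints(sorted)
	parts := make([]string, 0, len(sorted))
	for _, n := range sorted {
		parts = append(parts, strconv.Itoa(n))
	}
	return strings.Join(parts, ",")
}

func main() {
	// To pin a guaranteed container's memory to NUMA node 0, the kubelet
	// (via the runtime) would write a value like this into the container's
	// cpuset cgroup (the path below is illustrative):
	// /sys/fs/cgroup/cpuset/<pod>/<container>/cpuset.mems
	fmt.Println(memsValue([]int{0}))    // prints 0
	fmt.Println(memsValue([]int{1, 0})) // prints 0,1
}
```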
```go
package memorymanager

type State interface {
	GetAllocatableMemory() uint64
	GetMemoryAssignments(containerID string) (memorytopology.MemoryTopology, bool)
	GetMachineMemory() memorytopology.MemoryTopology
	GetDefaultMemoryAssignments() ContainerMemoryAssignments
	SetAllocatableMemory(allocatableMemory uint64)
	SetMemoryAssignments(containerID string, mt memorytopology.MemoryTopology)
	SetMachineMemory(mt memorytopology.MemoryTopology)
	SetDefaultMemoryAssignments(as ContainerMemoryAssignments)
	Delete(containerID string)
	ClearState()
}

type Manager interface {
	Start(ActivePodsFunc, status.PodStatusProvider, runtimeService)
	AddContainer(p *v1.Pod, c *v1.Container, containerID string) error
	RemoveContainer(containerID string) error
	State() state.Reader
	GetTopologyHints(pod v1.Pod, container v1.Container) map[string][]topologymanager.TopologyHint
}

type Policy interface {
	Name() string
	Start(s state.State)
	AddContainer(s state.State, pod *v1.Pod, container *v1.Container, containerID string) error
	RemoveContainer(s state.State, containerID string) error
}
```
Listing: Memory Manager and related interfaces (sketch).
Figure: Memory Manager components.
Topology Manager takes topology hints (bitmasks representing NUMA nodes) from resource managers such as CPU Manager and Device Manager. It then calculates the best affinity under the given policy (i.e. preferred or strict) to determine pod admission. Like the other resource managers, Memory Manager provides hints to Topology Manager.
Memory Manager implements the 'Manager' interface, so Topology Manager can add Memory Manager as a hint provider during its initialization sequence.
```go
package cm

func NewContainerManager(...) (ContainerManager, error) {
	...
	cm.topologyManager.AddHintProvider(cm.memoryManager)
	...
}
```
Figure: Interfaces with Topology Manager.
The InternalContainerLifecycle interface defines the following hooks to manage container resources: PreStartContainer, PreStopContainer, and PostStopContainer. Memory Manager has to take part in these hooks to manage memory and hugepages when containers start and stop.
Memory Manager tracks memory per pod and container ID, so these hooks need to pass that information down as well.
```go
package cm

func (i *internalContainerLifecycleImpl) PreStartContainer(...) {
	...
	err := i.memoryManager.AddContainer(pod, container, containerID)
	...
}

func (i *internalContainerLifecycleImpl) PreStopContainer(...) {
	...
	err := i.memoryManager.RemoveContainer(containerID)
	...
}

func (i *internalContainerLifecycleImpl) PostStopContainer(...) {
	...
	err := i.memoryManager.RemoveContainer(containerID)
	...
}
```
A new feature gate will be added to enable the Memory Manager feature. The gate is consumed by the kubelet and will be disabled by default in the Alpha release.
- Proposed feature gate: `--feature-gates=MemoryManager=true`
This will also be followed by a kubelet flag for the Memory Manager policy, which is described above. The `none` policy will be the default policy.
- Proposed policy flag: `--memory-manager-policy=none|preferred|singleNUMA`
- Feature gate is disabled by default.
- Alpha implementation of Memory Manager based on the SingleNUMA policy of Topology Manager
- TBD
- TBD