Merge branch 'huge_alloc' into master
mm: page allocator for huge pages

As memory sizes continue to outgrow TLB sizes, huge pages are shifting
from being a nice-to-have optimization for HPC workloads to becoming a
necessity. On Meta's 64G webservers - far from an exotic memory size -
4k pages result in 20% of total CPU cycles being spent on TLB misses.

However, in trying to deploy THP more universally, we observe a
fragmentation problem in the page allocator that routinely prevents
higher order requests from being met quickly, or met at all.

Despite existing defrag efforts in the allocator, such as mobility
grouping and watermark boosting, pages of different migratetypes are
commonly found to be sharing pageblocks. This results in inefficient
or altogether ineffective reclaim/compaction of larger pages.

We also found that this effect isn't necessarily tied to long
uptimes. As an example, only 20min of build load under moderate memory
pressure already results in a significant number of type-mixed blocks:

  total blocks: 900
  unmovable 50
  movable 701
  reclaimable 149
  unmovable blocks with slab/lru pages: 13 ({'slab': 17, 'lru': 19} pages)
  movable blocks with non-LRU pages: 77 ({'slab': 4257, 'kmem': 77, 'other': 2} pages)
  reclaimable blocks with non-slab pages: 16 ({'lru': 37, 'kmem': 311, 'other': 26} pages)
  blocks with nonmovable: 313

For comparison, with this series applied:

  total blocks: 900
  unmovable 65
  movable 457
  reclaimable 159
  free 219
  unmovable blocks with slab/lru pages: 22 ({'slab': 0, 'lru': 38} pages)
  movable blocks with non-LRU pages: 0 ({'slab': 0, 'kmem': 0, 'other': 0} pages)
  reclaimable blocks with non-slab pages: 3 ({'lru': 36, 'kmem': 0, 'other': 23} pages)
  blocks with nonmovable: 266

(The remaining "mixed blocks" in the patched kernel are false
negatives - LRU pages without migrate callbacks (empty_aops e.g.) and
i915 shmem that are pinned until reclaimed through shrinkers.)
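As a quick sanity check on the two censuses above, the counts can be tallied with a few lines of Python. This is only an illustrative sketch: the dictionary layout and the `summarize` helper are invented here, and treating the sum of the three "blocks with ... pages" lines as the number of type-mixed blocks is an assumption about how to read the figures, not a description of the tooling that produced them.

```python
# Figures copied from the two block censuses quoted above.
baseline = {
    "total": 900,
    "by_type": {"unmovable": 50, "movable": 701, "reclaimable": 149},
    # Blocks of each type that contain pages of a foreign kind.
    "mixed": {"unmovable": 13, "movable": 77, "reclaimable": 16},
}
patched = {
    "total": 900,
    "by_type": {"unmovable": 65, "movable": 457, "reclaimable": 159, "free": 219},
    "mixed": {"unmovable": 22, "movable": 0, "reclaimable": 3},
}


def summarize(census):
    """Return (type-mixed block count, share of all blocks) for one census."""
    # The per-type counts should account for every block in the zone.
    assert sum(census["by_type"].values()) == census["total"]
    mixed = sum(census["mixed"].values())
    return mixed, mixed / census["total"]


for name, census in (("baseline", baseline), ("patched", patched)):
    mixed, share = summarize(census)
    print(f"{name}: {mixed} type-mixed blocks ({share:.1%} of {census['total']})")
```

Under that reading, the type-mixed count drops from 106 to 25 blocks out of 900, and the per-type counts of both censuses add up to the stated total.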
<insert some data from the fleet here>

One of the behaviors that sabotage the page allocator's mobility
grouping is the fact that requests of one migratetype are allowed to
fall back into blocks of another type before reclaim and compaction
occur. This is a design decision to prioritize memory utilization over
avoiding block fragmentation - especially considering the history of
lumpy reclaim and its tendency to drastically overreclaim in its
pursuit of contiguity. However, with compaction available, these two
goals are no longer in conflict: the scratch space of free pages that
compaction needs to work is only twice the size of the allocation
request; in most cases, only small amounts of proactive, coordinated
reclaim and compaction are required to prevent a fallback which may
fragment a pageblock indefinitely.

Another problem lies in how the page allocator drives reclaim and
compaction when it does invoke them. While the page allocator targets
migratetype grouping at the pageblock level, it calls reclaim and
compaction with the order of the allocation request; for order-0
requests, compaction isn't invoked at all. Since many allocations are
smaller than a pageblock, this results in partial block freeing and
subsequent fallbacks and type mixing. By the time a hugepage request
finally does invoke reclaim/compaction for a whole pageblock, the
address space is frequently already fragmented beyond repair.

Note that in combination, these two design decisions have a
self-reinforcing effect on fragmentation:

1. Partially used unmovable blocks are filled up with fallback movable
   pages.

2. A subsequent unmovable allocation, instead of grouping up, will
   then need to enter reclaim, which most likely results in a
   partially freed movable block that it falls back into.

Over time, unmovable allocations are sparsely scattered throughout the
address space and poison most pageblocks.
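The feedback loop in steps 1 and 2 can be illustrated with a toy model. The sketch below is not kernel code: `Block`, `alloc_with_fallback`, the 4-page block size and the hand-rolled "reclaim" step are all hypothetical simplifications, chosen only to show how fallback plus partial block freeing leaves both blocks type-mixed.

```python
from dataclasses import dataclass, field

PAGES_PER_BLOCK = 4  # toy size; real pageblocks hold far more pages


@dataclass
class Block:
    mtype: str                                  # type the block is grouped under
    pages: list = field(default_factory=list)   # migratetype of each used page

    def has_room(self):
        return len(self.pages) < PAGES_PER_BLOCK

    def is_mixed(self):
        return any(p != self.mtype for p in self.pages)


def alloc_with_fallback(blocks, mtype):
    """Prefer a block of the requested type, else fall back to any block with room."""
    same = [b for b in blocks if b.mtype == mtype and b.has_room()]
    other = [b for b in blocks if b.mtype != mtype and b.has_room()]
    candidates = same + other
    if not candidates:
        return None  # nothing free: a real allocator would enter reclaim here
    candidates[0].pages.append(mtype)
    return candidates[0]


blocks = [
    Block("unmovable", ["unmovable"]),               # partially used
    Block("movable", ["movable"] * PAGES_PER_BLOCK),  # full
]

# Step 1: the movable block is full, so movable requests fall back and
# fill up the partially used unmovable block.
for _ in range(3):
    alloc_with_fallback(blocks, "movable")

# Step 2: the next unmovable request finds no room at all ...
blocked = alloc_with_fallback(blocks, "unmovable")  # None
# ... so "reclaim" frees part of the movable block, and the request
# falls back into it instead of grouping with other unmovable pages.
del blocks[1].pages[:2]
alloc_with_fallback(blocks, "unmovable")

mixed = sum(b.is_mixed() for b in blocks)
print(f"type-mixed blocks: {mixed} of {len(blocks)}")
```

In this toy run both blocks end up holding pages of a foreign type, even though each request on its own was served in the most utilization-friendly way.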
Reclaim based on request size also means that block fragmentation is
driven by the presence of lower order requests. It is not reliably
mitigated by the mere presence of higher-order requests.

This series proposes to fix the fragmentation issue by aligning the
allocator and reclaim on a common defragmentation block size, and
making pageblocks the base unit for managing free memory.

A neutral pageblock type is introduced, MIGRATE_FREE. The first
allocation to be placed into such a block claims it exclusively for
the allocation's migratetype. Fallbacks from a different type are no
longer allowed, and the block is "kept open" for more allocations of
the same type to ensure tight grouping. A pageblock becomes neutral
again only once all its pages have been freed.

Reclaim and compaction are changed from partial block reclaim to
producing whole neutral page blocks. The watermark logic is adjusted
to apply to neutral blocks, ensuring that background and direct
reclaim always maintain a readily-available reserve of them.

The defragmentation effort changes from reactive to proactive. In
turn, this makes defragmentation actually more efficient: compaction
only has to scan movable blocks and can skip other types entirely;
since movable blocks aren't poisoned by unmovable pages, the chances
of successful compaction in each block are greatly improved as well.

Defragmentation becomes an ongoing responsibility of all allocations,
rather than being the burden of only higher-order asks. This prevents
sub-block allocations - which cause block fragmentation in the first
place - from starving the increasingly important larger requests.

There is a slight increase in worst-case memory overhead by requiring
the watermarks to be met against neutral blocks even when there might
be free pages in typed blocks. However, the high watermarks are less
than 1% of the zone, so the increase is relatively small.

These changes only apply to CONFIG_COMPACTION kernels.
Without compaction, fallbacks and partial block reclaim remain the
best trade-off between utilization and fragmentation.

 Documentation/admin-guide/sysctl/vm.rst |  21 -
 block/bdev.c                            |   2 +-
 include/linux/compaction.h              |   8 +-
 include/linux/gfp.h                     |   2 -
 include/linux/mm.h                      |   1 -
 include/linux/mmzone.h                  |  30 +-
 include/linux/page-isolation.h          |  28 +-
 include/linux/pageblock-flags.h         |   4 +-
 include/linux/vmstat.h                  |   8 -
 kernel/sysctl.c                         |   8 -
 mm/compaction.c                         | 407 ++++++--------
 mm/internal.h                           |  14 +-
 mm/memory_hotplug.c                     |   4 +-
 mm/page_alloc.c                         | 866 +++++++++++++----------------
 mm/page_isolation.c                     |  42 +-
 mm/vmscan.c                             | 251 +++------
 mm/vmstat.c                             |   6 +-
 17 files changed, 679 insertions(+), 1023 deletions(-)
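The neutral-block lifecycle described in the message (a MIGRATE_FREE block is claimed by its first allocation, cross-type fallbacks are refused, and the block turns neutral again only once it is completely free) can be sketched as a small state model. This is a hedged illustration rather than the series' implementation: `Zone`, `Pageblock`, the 4-page block size and the method names are hypothetical, and only the MIGRATE_FREE name comes from the message itself.

```python
from dataclasses import dataclass

PAGES_PER_BLOCK = 4    # toy size, not a real pageblock order
MIGRATE_FREE = "free"  # neutral type named in the commit message


@dataclass
class Pageblock:
    mtype: str = MIGRATE_FREE
    used: int = 0


class Zone:
    def __init__(self, nr_blocks):
        self.blocks = [Pageblock() for _ in range(nr_blocks)]

    def alloc(self, mtype):
        """Place one page; return the block index, or None if reclaim would be needed."""
        # Prefer an open block already claimed by this type (tight grouping).
        for idx, blk in enumerate(self.blocks):
            if blk.mtype == mtype and blk.used < PAGES_PER_BLOCK:
                blk.used += 1
                return idx
        # Otherwise claim a neutral block; blocks of other types are never used.
        for idx, blk in enumerate(self.blocks):
            if blk.mtype == MIGRATE_FREE:
                blk.mtype = mtype
                blk.used = 1
                return idx
        return None

    def free(self, idx):
        blk = self.blocks[idx]
        blk.used -= 1
        if blk.used == 0:
            blk.mtype = MIGRATE_FREE  # neutral again only when fully freed

    def neutral_blocks(self):
        """Count of neutral blocks; the quantity a watermark check would look at."""
        return sum(blk.mtype == MIGRATE_FREE for blk in self.blocks)


zone = Zone(nr_blocks=2)
first = zone.alloc("unmovable")      # claims block 0
second = zone.alloc("movable")       # claims block 1, no fallback into block 0
third = zone.alloc("movable")        # joins the open movable block
refused = zone.alloc("reclaimable")  # None: no neutral block left, no fallback

zone.free(second)                    # block 1 still holds one movable page
still_claimed = zone.neutral_blocks()
zone.free(third)                     # block 1 is now empty and neutral again
print(first, second, third, refused, still_claimed, zone.neutral_blocks())
```

The refused request is the point where, per the message, reclaim and compaction would be asked to produce a whole neutral block instead of letting the request fall back into a block of another type.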