Merge branch 'huge_alloc' into master
mm: page allocator for huge pages

As memory sizes continue to outgrow TLB sizes, huge pages are shifting
from being a nice-to-have optimization for HPC workloads to becoming a
necessity. On Meta's 64G webservers - far from an exotic memory size -
4k pages result in 20% of total CPU cycles being spent on TLB misses.

However, in trying to deploy THP more universally, we observe a
fragmentation problem in the page allocator that routinely prevents
higher order requests from being met quickly, or met at all.

Despite existing defrag efforts in the allocator, such as mobility
grouping and watermark boosting, pages of different migratetypes are
commonly found to be sharing pageblocks. This results in inefficient
or altogether ineffective reclaim/compaction of larger pages.

We also found that this effect isn't necessarily tied to long
uptimes. As an example, only 20min of build load under moderate memory
pressure already results in a significant number of type-mixed blocks:

total blocks: 900
unmovable 50
movable 701
reclaimable 149
unmovable blocks with slab/lru pages: 13 ({'slab': 17, 'lru': 19} pages)
movable blocks with non-LRU pages: 77 ({'slab': 4257, 'kmem': 77, 'other': 2} pages)
reclaimable blocks with non-slab pages: 16 ({'lru': 37, 'kmem': 311, 'other': 26} pages)
blocks with nonmovable: 313

For comparison, with this series applied:

total blocks: 900
unmovable 65
movable 457
reclaimable 159
free 219
unmovable blocks with slab/lru pages: 22 ({'slab': 0, 'lru': 38} pages)
movable blocks with non-LRU pages: 0 ({'slab': 0, 'kmem': 0, 'other': 0} pages)
reclaimable blocks with non-slab pages: 3 ({'lru': 36, 'kmem': 0, 'other': 23} pages)
blocks with nonmovable: 266

(The remaining "mixed blocks" in the patched kernel are false
negatives - LRU pages without migrate callbacks (e.g. empty_aops) and
i915 shmem that are pinned until reclaimed through shrinkers.)

<insert some data from the fleet here>

One of the behaviors that sabotage the page allocator's mobility
grouping is the fact that requests of one migratetype are allowed to
fall back into blocks of another type before reclaim and compaction
occur. This is a design decision to prioritize memory utilization over
avoiding block fragmentation - especially considering the history of
lumpy reclaim and its tendency to drastically overreclaim in its
pursuit of contiguity. However, with compaction available, these two
goals are no longer in conflict: the scratch space of free pages for
compaction to work is only twice the size of the allocation request;
in most cases, only a small amount of proactive, coordinated reclaim
and compaction is required to prevent a fallback that may fragment a
pageblock indefinitely.
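
For reference, a sketch of the "scratch space" sizing: the kernel
expresses it via the compact_gap() helper in mm/internal.h, shown here
in simplified form and assuming it is unchanged by this series:

    static inline unsigned long compact_gap(unsigned int order)
    {
            /* free-page headroom needed to compact a request of this order */
            return 2UL << order;
    }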

Another problem lies in how the page allocator drives reclaim and
compaction when it does invoke it. While the page allocator targets
migratetype grouping at the pageblock level, it calls reclaim and
compaction with the order of the allocation request; for order-0
requests, compaction isn't invoked at all. Since many allocations are
smaller than a pageblock, this results in partial block freeing and
subsequent fallbacks and type mixing. By the time a hugepage request
finally does invoke reclaim/compaction for a whole pageblock, the
address space is frequently already fragmented beyond repair.
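
To illustrate the direction (a sketch only; defrag_order() is a
hypothetical name, not a function from this series): reclaim and
compaction are pointed at whole pageblocks even when the triggering
request is smaller.

    /* hypothetical helper: never target less than a whole pageblock */
    static unsigned int defrag_order(unsigned int request_order)
    {
            return max_t(unsigned int, request_order, pageblock_order);
    }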

Note that in combination, these two design decisions have a
self-reinforcing effect on fragmentation: 1. Partially used unmovable
blocks are filled up with fallback movable pages. 2. A subsequent
unmovable allocation, instead of grouping up, will then need to enter
reclaim, which most likely results in a partially freed movable block
that it falls back into. Over time, unmovable allocations are sparsely
scattered throughout the address space and poison most pageblocks.

Reclaim based on request size also means that block fragmentation is
driven by the presence of lower order requests. It is not reliably
mitigated by the mere presence of higher-order requests.

This series proposes to fix the fragmentation issue by aligning the
allocator and reclaim on a common defragmentation block size, and
making pageblocks the base unit for managing free memory.

A neutral pageblock type is introduced, MIGRATE_FREE. The first
allocation to be placed into such a block claims it exclusively for
the allocation's migratetype. Fallbacks from a different type are no
longer allowed, and the block is "kept open" for more allocations of
the same type to ensure tight grouping. A pageblock becomes neutral
again only once all its pages have been freed.
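
A rough sketch of the claiming step (claim_free_block() is a made-up
name for illustration; the real hooks live in mm/page_alloc.c), using
the move_freepages_block() signature as updated by this series:

    static void claim_free_block(struct zone *zone, struct page *page,
                                 int migratetype)
    {
            /* only neutral blocks can be claimed; typed blocks stay exclusive */
            if (get_pageblock_migratetype(page) != MIGRATE_FREE)
                    return;

            set_pageblock_migratetype(page, migratetype);
            move_freepages_block(zone, page, MIGRATE_FREE, migratetype, NULL);
    }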

Reclaim and compaction are changed from partial block reclaim to
producing whole neutral page blocks. The watermark logic is adjusted
to apply to neutral blocks, ensuring that background and direct
reclaim always maintain a readily-available reserve of them.
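
Sketched in terms of the NR_FREE_FREE zone counter this series adds;
the helper name is invented for illustration:

    /* illustrative only: check a watermark against neutral blocks */
    static bool free_blocks_above(struct zone *zone, unsigned long mark)
    {
            return zone_page_state(zone, NR_FREE_FREE) > mark;
    }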

The defragmentation effort changes from reactive to proactive. In
turn, this makes defragmentation actually more efficient: compaction
only has to scan movable blocks and can skip other types entirely;
since movable blocks aren't poisoned by unmovable pages, the chances
of successful compaction in each block are greatly improved as well.
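
As an example of the scanner-side simplification (an illustrative
sketch, not the code from mm/compaction.c):

    static bool suitable_migration_block(struct page *page)
    {
            /* with strict grouping, movable pages only live in movable blocks */
            return get_pageblock_migratetype(page) == MIGRATE_MOVABLE;
    }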

Defragmentation becomes an ongoing responsibility of all allocations,
rather than being the burden of only higher-order asks. This prevents
sub-block allocations - which cause block fragmentation in the first
place - from starving the increasingly important larger requests.

There is a slight increase in worst-case memory overhead by requiring
the watermarks to be met against neutral blocks even when there might
be free pages in typed blocks. However, the high watermarks are less
than 1% of the zone, so the increase is relatively small.

These changes only apply to CONFIG_COMPACTION kernels. Without
compaction, fallbacks and partial block reclaim remain the best
trade-off between utilization and fragmentation.

 Documentation/admin-guide/sysctl/vm.rst |  21 -
 arch/x86/kernel/setup.c                 |   2 ++
 block/bdev.c                            |   2 +-
 include/linux/compaction.h              |   8 +-
 include/linux/gfp.h                     |   2 -
 include/linux/mm.h                      |   1 -
 include/linux/mmzone.h                  |  30 +-
 include/linux/page-isolation.h          |  28 +-
 include/linux/pageblock-flags.h         |   4 +-
 include/linux/vmstat.h                  |   8 -
 kernel/sysctl.c                         |   8 -
 mm/compaction.c                         | 407 ++++++--------
 mm/internal.h                           |  14 +-
 mm/memory_hotplug.c                     |   4 +-
 mm/page_alloc.c                         | 866 +++++++++++++-----------------
 mm/page_isolation.c                     |  42 +-
 mm/vmscan.c                             | 251 +++------
 mm/vmstat.c                             |   6 +-
 18 files changed, 681 insertions(+), 1023 deletions(-)
hnaz committed Mar 9, 2023
2 parents c9c3395 + 8e1c71f commit 57615ce
21 changes: 0 additions & 21 deletions Documentation/admin-guide/sysctl/vm.rst
@@ -72,7 +72,6 @@ Currently, these files are in /proc/sys/vm:
- unprivileged_userfaultfd
- user_reserve_kbytes
- vfs_cache_pressure
- watermark_boost_factor
- watermark_scale_factor
- zone_reclaim_mode

@@ -968,26 +967,6 @@ directory and inode objects. With vfs_cache_pressure=1000, it will look for
ten times more freeable objects than there are.


watermark_boost_factor
======================

This factor controls the level of reclaim when memory is being fragmented.
It defines the percentage of the high watermark of a zone that will be
reclaimed if pages of different mobility are being mixed within pageblocks.
The intent is that compaction has less work to do in the future and to
increase the success rate of future high-order allocations such as SLUB
allocations, THP and hugetlbfs pages.

To make it sensible with respect to the watermark_scale_factor
parameter, the unit is in fractions of 10,000. The default value of
15,000 means that up to 150% of the high watermark will be reclaimed in the
event of a pageblock being mixed due to fragmentation. The level of reclaim
is determined by the number of fragmentation events that occurred in the
recent past. If this value is smaller than a pageblock then a pageblocks
worth of pages will be reclaimed (e.g. 2MB on 64-bit x86). A boost factor
of 0 will disable the feature.


watermark_scale_factor
======================

2 changes: 2 additions & 0 deletions arch/x86/kernel/setup.c
@@ -1229,6 +1229,8 @@ void __init setup_arch(char **cmdline_p)

if (boot_cpu_has(X86_FEATURE_GBPAGES))
hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
else
hugetlb_cma_reserve(PMD_SHIFT - PAGE_SHIFT);

/*
* Reserve memory for crash kernel after SRAT is parsed so that it
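
(For context, assuming x86-64 with 4K base pages: PUD_SHIFT - PAGE_SHIFT
= 30 - 12 = 18, so the CMA area is reserved in 1G units when gigantic
pages are supported, while the new fallback uses PMD_SHIFT - PAGE_SHIFT
= 21 - 12 = 9, i.e. 2M units.)
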
2 changes: 1 addition & 1 deletion block/bdev.c
@@ -488,7 +488,7 @@ struct block_device *bdev_alloc(struct gendisk *disk, u8 partno)
inode->i_mode = S_IFBLK;
inode->i_rdev = 0;
inode->i_data.a_ops = &def_blk_aops;
mapping_set_gfp_mask(&inode->i_data, GFP_USER);
mapping_set_gfp_mask(&inode->i_data, GFP_USER|__GFP_MOVABLE);

bdev = I_BDEV(inode);
mutex_init(&bdev->bd_fsfreeze_mutex);
8 changes: 4 additions & 4 deletions include/linux/compaction.h
@@ -10,7 +10,6 @@ enum compact_priority {
COMPACT_PRIO_SYNC_FULL,
MIN_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_FULL,
COMPACT_PRIO_SYNC_LIGHT,
MIN_COMPACT_COSTLY_PRIORITY = COMPACT_PRIO_SYNC_LIGHT,
DEF_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_LIGHT,
COMPACT_PRIO_ASYNC,
INIT_COMPACT_PRIORITY = COMPACT_PRIO_ASYNC
@@ -56,6 +55,7 @@ enum compact_result {
};

struct alloc_context; /* in mm/internal.h */
struct capture_control; /* in mm/internal.h */

/*
* Number of free order-0 pages that should be available above given watermark
@@ -94,10 +94,10 @@ extern int fragmentation_index(struct zone *zone, unsigned int order);
extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
unsigned int order, unsigned int alloc_flags,
const struct alloc_context *ac, enum compact_priority prio,
struct page **page);
struct capture_control *capc);
extern void reset_isolation_suitable(pg_data_t *pgdat);
extern enum compact_result compaction_suitable(struct zone *zone, int order,
unsigned int alloc_flags, int highest_zoneidx);
int highest_zoneidx);

extern void compaction_defer_reset(struct zone *zone, int order,
bool alloc_success);
@@ -187,7 +187,7 @@ static inline void reset_isolation_suitable(pg_data_t *pgdat)
}

static inline enum compact_result compaction_suitable(struct zone *zone, int order,
int alloc_flags, int highest_zoneidx)
int highest_zoneidx)
{
return COMPACT_SKIPPED;
}
2 changes: 0 additions & 2 deletions include/linux/gfp.h
@@ -19,8 +19,6 @@ static inline int gfp_migratetype(const gfp_t gfp_flags)
BUILD_BUG_ON((1UL << GFP_MOVABLE_SHIFT) != ___GFP_MOVABLE);
BUILD_BUG_ON((___GFP_MOVABLE >> GFP_MOVABLE_SHIFT) != MIGRATE_MOVABLE);
BUILD_BUG_ON((___GFP_RECLAIMABLE >> GFP_MOVABLE_SHIFT) != MIGRATE_RECLAIMABLE);
BUILD_BUG_ON(((___GFP_MOVABLE | ___GFP_RECLAIMABLE) >>
GFP_MOVABLE_SHIFT) != MIGRATE_HIGHATOMIC);

if (unlikely(page_group_by_mobility_disabled))
return MIGRATE_UNMOVABLE;
1 change: 0 additions & 1 deletion include/linux/mm.h
@@ -2746,7 +2746,6 @@ extern void setup_per_cpu_pageset(void);

/* page_alloc.c */
extern int min_free_kbytes;
extern int watermark_boost_factor;
extern int watermark_scale_factor;
extern bool arch_has_descending_max_zone_pfns(void);

30 changes: 13 additions & 17 deletions include/linux/mmzone.h
@@ -44,7 +44,7 @@ enum migratetype {
MIGRATE_MOVABLE,
MIGRATE_RECLAIMABLE,
MIGRATE_PCPTYPES, /* the number of types on the pcp lists */
MIGRATE_HIGHATOMIC = MIGRATE_PCPTYPES,
MIGRATE_FREE = MIGRATE_PCPTYPES,
#ifdef CONFIG_CMA
/*
* MIGRATE_CMA migration type is designed to mimic the way
@@ -88,7 +88,7 @@ static inline bool is_migrate_movable(int mt)
*/
static inline bool migratetype_is_mergeable(int mt)
{
return mt < MIGRATE_PCPTYPES;
return mt < MIGRATE_PCPTYPES || mt == MIGRATE_FREE;
}

#define for_each_migratetype_order(order, type) \
@@ -138,6 +138,10 @@ enum numa_stat_item {
enum zone_stat_item {
/* First 128 byte cacheline (assuming 64 bit words) */
NR_FREE_PAGES,
NR_FREE_UNMOVABLE,
NR_FREE_MOVABLE,
NR_FREE_RECLAIMABLE,
NR_FREE_FREE,
NR_ZONE_LRU_BASE, /* Used only for compaction and reclaim retry */
NR_ZONE_INACTIVE_ANON = NR_ZONE_LRU_BASE,
NR_ZONE_ACTIVE_ANON,
@@ -552,23 +556,21 @@ enum zone_watermarks {
};

/*
* One per migratetype for each PAGE_ALLOC_COSTLY_ORDER. One additional list
* for THP which will usually be GFP_MOVABLE. Even if it is another type,
* it should not contribute to serious fragmentation causing THP allocation
* failures.
* One per migratetype for each PAGE_ALLOC_COSTLY_ORDER. One additional set
* for THP (usually GFP_MOVABLE, but with exception of the huge zero page.)
*/
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
#define NR_PCP_THP 1
#define NR_PCP_THP MIGRATE_PCPTYPES
#else
#define NR_PCP_THP 0
#endif
#define NR_LOWORDER_PCP_LISTS (MIGRATE_PCPTYPES * (PAGE_ALLOC_COSTLY_ORDER + 1))
#define NR_PCP_LISTS (NR_LOWORDER_PCP_LISTS + NR_PCP_THP)

#define min_wmark_pages(z) (z->_watermark[WMARK_MIN] + z->watermark_boost)
#define low_wmark_pages(z) (z->_watermark[WMARK_LOW] + z->watermark_boost)
#define high_wmark_pages(z) (z->_watermark[WMARK_HIGH] + z->watermark_boost)
#define wmark_pages(z, i) (z->_watermark[i] + z->watermark_boost)
#define min_wmark_pages(z) (z->_watermark[WMARK_MIN])
#define low_wmark_pages(z) (z->_watermark[WMARK_LOW])
#define high_wmark_pages(z) (z->_watermark[WMARK_HIGH])
#define wmark_pages(z, i) (z->_watermark[i])

/* Fields and list protected by pagesets local_lock in page_alloc.c */
struct per_cpu_pages {
@@ -707,9 +709,6 @@ struct zone {

/* zone watermarks, access with *_wmark_pages(zone) macros */
unsigned long _watermark[NR_WMARK];
unsigned long watermark_boost;

unsigned long nr_reserved_highatomic;

/*
* We don't know if the memory that we're going to allocate will be
@@ -884,9 +883,6 @@ enum pgdat_flags {
};

enum zone_flags {
ZONE_BOOSTED_WATERMARK, /* zone recently boosted watermarks.
* Cleared when kswapd is woken.
*/
ZONE_RECLAIM_ACTIVE, /* kswapd may be scanning the zone. */
};

28 changes: 8 additions & 20 deletions include/linux/page-isolation.h
@@ -35,26 +35,14 @@ static inline bool is_migrate_isolate(int migratetype)

void set_pageblock_migratetype(struct page *page, int migratetype);
int move_freepages_block(struct zone *zone, struct page *page,
int migratetype, int *num_movable);

/*
* Changes migrate type in [start_pfn, end_pfn) to be MIGRATE_ISOLATE.
*/
int
start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
int migratetype, int flags, gfp_t gfp_flags);

/*
* Changes MIGRATE_ISOLATE to MIGRATE_MOVABLE.
* target range is [start_pfn, end_pfn)
*/
void
undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
int migratetype);

/*
* Test all pages in [start_pfn, end_pfn) are isolated or not.
*/
int old_mt, int new_mt, int *num_movable);

int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
int migratetype, int flags, gfp_t gfp_flags);

void undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
int migratetype);

int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn,
int isol_flags);

4 changes: 2 additions & 2 deletions include/linux/pageblock-flags.h
@@ -47,8 +47,8 @@ extern unsigned int pageblock_order;

#else /* CONFIG_HUGETLB_PAGE */

/* If huge pages are not used, group by MAX_ORDER_NR_PAGES */
#define pageblock_order (MAX_ORDER-1)
/* Manage fragmentation at the 2M level */
#define pageblock_order ilog2(2U << (20 - PAGE_SHIFT))

#endif /* CONFIG_HUGETLB_PAGE */

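
(Worked example, assuming 4K base pages: PAGE_SHIFT = 12, so
2U << (20 - 12) = 512 and ilog2(512) = 9, i.e. a 2M pageblock; with 64K
pages the expression yields order 5, which is still 2M. The previous
MAX_ORDER-1 definition tied the grouping size to the buddy allocator's
maximum order instead.)
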
8 changes: 0 additions & 8 deletions include/linux/vmstat.h
@@ -481,14 +481,6 @@ static inline void node_stat_sub_folio(struct folio *folio,
mod_node_page_state(folio_pgdat(folio), item, -folio_nr_pages(folio));
}

static inline void __mod_zone_freepage_state(struct zone *zone, int nr_pages,
int migratetype)
{
__mod_zone_page_state(zone, NR_FREE_PAGES, nr_pages);
if (is_migrate_cma(migratetype))
__mod_zone_page_state(zone, NR_FREE_CMA_PAGES, nr_pages);
}

extern const char * const vmstat_text[];

static inline const char *zone_stat_name(enum zone_stat_item item)
8 changes: 0 additions & 8 deletions kernel/sysctl.c
@@ -2229,14 +2229,6 @@ static struct ctl_table vm_table[] = {
.proc_handler = min_free_kbytes_sysctl_handler,
.extra1 = SYSCTL_ZERO,
},
{
.procname = "watermark_boost_factor",
.data = &watermark_boost_factor,
.maxlen = sizeof(watermark_boost_factor),
.mode = 0644,
.proc_handler = proc_dointvec_minmax,
.extra1 = SYSCTL_ZERO,
},
{
.procname = "watermark_scale_factor",
.data = &watermark_scale_factor,
