Merge branch 'huge_alloc' into master
mm: page allocator for huge pages

As memory sizes continue to outgrow TLB sizes, huge pages are shifting
from being a nice-to-have optimization for HPC workloads to becoming a
necessity. On Meta's 64G webservers - far from an exotic memory size -
4k pages result in 20% of total CPU cycles being spent on TLB misses.

However, in trying to deploy THP more universally, we observe a
fragmentation problem in the page allocator that routinely prevents
higher order requests from being met quickly, or met at all.

Despite existing defrag efforts in the allocator, such as mobility
grouping and watermark boosting, pages of different migratetypes are
commonly found to be sharing pageblocks. This results in inefficient
or altogether ineffective reclaim/compaction of larger pages.

We also found that this effect isn't necessarily tied to long
uptimes. As an example, only 20min of build load under moderate memory
pressure already results in a significant number of type-mixed blocks:

total blocks: 900
unmovable 50
movable 701
reclaimable 149
unmovable blocks with slab/lru pages: 13 ({'slab': 17, 'lru': 19} pages)
movable blocks with non-LRU pages: 77 ({'slab': 4257, 'kmem': 77, 'other': 2} pages)
reclaimable blocks with non-slab pages: 16 ({'lru': 37, 'kmem': 311, 'other': 26} pages)
blocks with nonmovable: 313

For comparison, with this series applied:

total blocks: 900
unmovable 65
movable 457
reclaimable 159
free 219
unmovable blocks with slab/lru pages: 22 ({'slab': 0, 'lru': 38} pages)
movable blocks with non-LRU pages: 0 ({'slab': 0, 'kmem': 0, 'other': 0} pages)
reclaimable blocks with non-slab pages: 3 ({'lru': 36, 'kmem': 0, 'other': 23} pages)
blocks with nonmovable: 266

(The remaining "mixed blocks" in the patched kernel are false
negatives - LRU pages without migrate callbacks (e.g. empty_aops) and
i915 shmem that are pinned until reclaimed through shrinkers.)

<insert some data from the fleet here>

One of the behaviors that sabotage the page allocator's mobility
grouping is the fact that requests of one migratetype are allowed to
fall back into blocks of another type before reclaim and compaction
occur. This is a design decision to prioritize memory utilization over
avoiding block fragmentation - especially considering the history of
lumpy reclaim and its tendency to drastically overreclaim in its
pursuit of contiguity. However, with compaction available, these two
goals are no longer in conflict: the scratch space of free pages for
compaction to work is only twice the size of the allocation request;
in most cases, only a small amount of proactive, coordinated reclaim
and compaction is required to prevent a fallback that may fragment a
pageblock indefinitely.
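
For reference, a sketch of the "scratch space" sizing: the kernel
expresses it via the compact_gap() helper in mm/internal.h, shown here
in simplified form and assuming it is unchanged by this series:

    static inline unsigned long compact_gap(unsigned int order)
    {
            /* free-page headroom needed to compact a request of this order */
            return 2UL << order;
    }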

Another problem lies in how the page allocator drives reclaim and
compaction when it does invoke it. While the page allocator targets
migratetype grouping at the pageblock level, it calls reclaim and
compaction with the order of the allocation request; for order-0
requests, compaction isn't invoked at all. Since many allocations are
smaller than a pageblock, this results in partial block freeing and
subsequent fallbacks and type mixing. By the time a hugepage request
finally does invoke reclaim/compaction for a whole pageblock, the
address space is frequently already fragmented beyond repair.
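
To illustrate the direction (a sketch only; defrag_order() is a
hypothetical name, not a function from this series): reclaim and
compaction are pointed at whole pageblocks even when the triggering
request is smaller.

    /* hypothetical helper: never target less than a whole pageblock */
    static unsigned int defrag_order(unsigned int request_order)
    {
            return max_t(unsigned int, request_order, pageblock_order);
    }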

Note that in combination, these two design decisions have a
self-reinforcing effect on fragmentation: 1. Partially used unmovable
blocks are filled up with fallback movable pages. 2. A subsequent
unmovable allocation, instead of grouping up, will then need to enter
reclaim, which most likely results in a partially freed movable block
that it falls back into. Over time, unmovable allocations are sparsely
scattered throughout the address space and poison most pageblocks.

Reclaim based on request size also means that block fragmentation is
driven by the presence of lower order requests. It is not reliably
mitigated by the mere presence of higher-order requests.

This series proposes to fix the fragmentation issue by aligning the
allocator and reclaim on a common defragmentation block size, and
making pageblocks the base unit for managing free memory.

A neutral pageblock type is introduced, MIGRATE_FREE. The first
allocation to be placed into such a block claims it exclusively for
the allocation's migratetype. Fallbacks from a different type are no
longer allowed, and the block is "kept open" for more allocations of
the same type to ensure tight grouping. A pageblock becomes neutral
again only once all its pages have been freed.
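
A rough sketch of the claiming step (claim_free_block() is a made-up
name for illustration; the real hooks live in mm/page_alloc.c), using
the move_freepages_block() signature as updated by this series:

    static void claim_free_block(struct zone *zone, struct page *page,
                                 int migratetype)
    {
            /* only neutral blocks can be claimed; typed blocks stay exclusive */
            if (get_pageblock_migratetype(page) != MIGRATE_FREE)
                    return;

            set_pageblock_migratetype(page, migratetype);
            move_freepages_block(zone, page, MIGRATE_FREE, migratetype, NULL);
    }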

Reclaim and compaction are changed from partial block reclaim to
producing whole neutral page blocks. The watermark logic is adjusted
to apply to neutral blocks, ensuring that background and direct
reclaim always maintain a readily-available reserve of them.
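
Sketched in terms of the NR_FREE_FREE zone counter this series adds;
the helper name is invented for illustration:

    /* illustrative only: check a watermark against neutral blocks */
    static bool free_blocks_above(struct zone *zone, unsigned long mark)
    {
            return zone_page_state(zone, NR_FREE_FREE) > mark;
    }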

The defragmentation effort changes from reactive to proactive. In
turn, this makes defragmentation actually more efficient: compaction
only has to scan movable blocks and can skip other types entirely;
since movable blocks aren't poisoned by unmovable pages, the chances
of successful compaction in each block are greatly improved as well.
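
As an example of the scanner-side simplification (an illustrative
sketch, not the code from mm/compaction.c):

    static bool suitable_migration_block(struct page *page)
    {
            /* with strict grouping, movable pages only live in movable blocks */
            return get_pageblock_migratetype(page) == MIGRATE_MOVABLE;
    }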

Defragmentation becomes an ongoing responsibility of all allocations,
rather than being the burden of only higher-order asks. This prevents
sub-block allocations - which cause block fragmentation in the first
place - from starving the increasingly important larger requests.

There is a slight increase in worst-case memory overhead by requiring
the watermarks to be met against neutral blocks even when there might
be free pages in typed blocks. However, the high watermarks are less
than 1% of the zone, so the increase is relatively small.

These changes only apply to CONFIG_COMPACTION kernels. Without
compaction, fallbacks and partial block reclaim remain the best
trade-off between utilization and fragmentation.

 Documentation/admin-guide/sysctl/vm.rst |  21 -
 arch/x86/kernel/setup.c                 |   2 ++
 block/bdev.c                            |   2 +-
 include/linux/compaction.h              |   8 +-
 include/linux/gfp.h                     |   2 -
 include/linux/mm.h                      |   1 -
 include/linux/mmzone.h                  |  30 +-
 include/linux/page-isolation.h          |  28 +-
 include/linux/pageblock-flags.h         |   4 +-
 include/linux/vmstat.h                  |   8 -
 kernel/sysctl.c                         |   8 -
 mm/compaction.c                         | 407 ++++++--------
 mm/internal.h                           |  14 +-
 mm/memory_hotplug.c                     |   4 +-
 mm/page_alloc.c                         | 866 +++++++++++++-----------------
 mm/page_isolation.c                     |  42 +-
 mm/vmscan.c                             | 251 +++------
 mm/vmstat.c                             |   6 +-
 18 files changed, 681 insertions(+), 1023 deletions(-)
hnaz committed Mar 9, 2023
2 parents c9c3395 + 8e1c71f commit 57615ce
21 changes: 0 additions & 21 deletions Documentation/admin-guide/sysctl/vm.rst
@@ -72,7 +72,6 @@ Currently, these files are in /proc/sys/vm:
- unprivileged_userfaultfd
- user_reserve_kbytes
- vfs_cache_pressure
- watermark_boost_factor
- watermark_scale_factor
- zone_reclaim_mode

@@ -968,26 +967,6 @@ directory and inode objects. With vfs_cache_pressure=1000, it will look for
ten times more freeable objects than there are.


watermark_boost_factor
======================

This factor controls the level of reclaim when memory is being fragmented.
It defines the percentage of the high watermark of a zone that will be
reclaimed if pages of different mobility are being mixed within pageblocks.
The intent is that compaction has less work to do in the future and to
increase the success rate of future high-order allocations such as SLUB
allocations, THP and hugetlbfs pages.

To make it sensible with respect to the watermark_scale_factor
parameter, the unit is in fractions of 10,000. The default value of
15,000 means that up to 150% of the high watermark will be reclaimed in the
event of a pageblock being mixed due to fragmentation. The level of reclaim
is determined by the number of fragmentation events that occurred in the
recent past. If this value is smaller than a pageblock then a pageblocks
worth of pages will be reclaimed (e.g. 2MB on 64-bit x86). A boost factor
of 0 will disable the feature.


watermark_scale_factor
======================

2 changes: 2 additions & 0 deletions arch/x86/kernel/setup.c
@@ -1229,6 +1229,8 @@ void __init setup_arch(char **cmdline_p)

if (boot_cpu_has(X86_FEATURE_GBPAGES))
hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
else
hugetlb_cma_reserve(PMD_SHIFT - PAGE_SHIFT);

/*
* Reserve memory for crash kernel after SRAT is parsed so that it
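
(For context, assuming x86-64 with 4K base pages: PUD_SHIFT - PAGE_SHIFT
= 30 - 12 = 18, so the CMA area is reserved in 1G units when gigantic
pages are supported, while the new fallback uses PMD_SHIFT - PAGE_SHIFT
= 21 - 12 = 9, i.e. 2M units.)
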
2 changes: 1 addition & 1 deletion block/bdev.c
@@ -488,7 +488,7 @@ struct block_device *bdev_alloc(struct gendisk *disk, u8 partno)
inode->i_mode = S_IFBLK;
inode->i_rdev = 0;
inode->i_data.a_ops = &def_blk_aops;
mapping_set_gfp_mask(&inode->i_data, GFP_USER);
mapping_set_gfp_mask(&inode->i_data, GFP_USER|__GFP_MOVABLE);

bdev = I_BDEV(inode);
mutex_init(&bdev->bd_fsfreeze_mutex);
8 changes: 4 additions & 4 deletions include/linux/compaction.h
@@ -10,7 +10,6 @@ enum compact_priority {
COMPACT_PRIO_SYNC_FULL,
MIN_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_FULL,
COMPACT_PRIO_SYNC_LIGHT,
MIN_COMPACT_COSTLY_PRIORITY = COMPACT_PRIO_SYNC_LIGHT,
DEF_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_LIGHT,
COMPACT_PRIO_ASYNC,
INIT_COMPACT_PRIORITY = COMPACT_PRIO_ASYNC
@@ -56,6 +55,7 @@ enum compact_result {
};

struct alloc_context; /* in mm/internal.h */
struct capture_control; /* in mm/internal.h */

/*
* Number of free order-0 pages that should be available above given watermark
@@ -94,10 +94,10 @@ extern int fragmentation_index(struct zone *zone, unsigned int order);
extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
unsigned int order, unsigned int alloc_flags,
const struct alloc_context *ac, enum compact_priority prio,
struct page **page);
struct capture_control *capc);
extern void reset_isolation_suitable(pg_data_t *pgdat);
extern enum compact_result compaction_suitable(struct zone *zone, int order,
unsigned int alloc_flags, int highest_zoneidx);
int highest_zoneidx);

extern void compaction_defer_reset(struct zone *zone, int order,
bool alloc_success);
@@ -187,7 +187,7 @@ static inline void reset_isolation_suitable(pg_data_t *pgdat)
}

static inline enum compact_result compaction_suitable(struct zone *zone, int order,
int alloc_flags, int highest_zoneidx)
int highest_zoneidx)
{
return COMPACT_SKIPPED;
}
2 changes: 0 additions & 2 deletions include/linux/gfp.h
@@ -19,8 +19,6 @@ static inline int gfp_migratetype(const gfp_t gfp_flags)
BUILD_BUG_ON((1UL << GFP_MOVABLE_SHIFT) != ___GFP_MOVABLE);
BUILD_BUG_ON((___GFP_MOVABLE >> GFP_MOVABLE_SHIFT) != MIGRATE_MOVABLE);
BUILD_BUG_ON((___GFP_RECLAIMABLE >> GFP_MOVABLE_SHIFT) != MIGRATE_RECLAIMABLE);
BUILD_BUG_ON(((___GFP_MOVABLE | ___GFP_RECLAIMABLE) >>
GFP_MOVABLE_SHIFT) != MIGRATE_HIGHATOMIC);

if (unlikely(page_group_by_mobility_disabled))
return MIGRATE_UNMOVABLE;
1 change: 0 additions & 1 deletion include/linux/mm.h
@@ -2746,7 +2746,6 @@ extern void setup_per_cpu_pageset(void);

/* page_alloc.c */
extern int min_free_kbytes;
extern int watermark_boost_factor;
extern int watermark_scale_factor;
extern bool arch_has_descending_max_zone_pfns(void);

30 changes: 13 additions & 17 deletions include/linux/mmzone.h
@@ -44,7 +44,7 @@ enum migratetype {
MIGRATE_MOVABLE,
MIGRATE_RECLAIMABLE,
MIGRATE_PCPTYPES, /* the number of types on the pcp lists */
MIGRATE_HIGHATOMIC = MIGRATE_PCPTYPES,
MIGRATE_FREE = MIGRATE_PCPTYPES,
#ifdef CONFIG_CMA
/*
* MIGRATE_CMA migration type is designed to mimic the way
@@ -88,7 +88,7 @@ static inline bool is_migrate_movable(int mt)
*/
static inline bool migratetype_is_mergeable(int mt)
{
return mt < MIGRATE_PCPTYPES;
return mt < MIGRATE_PCPTYPES || mt == MIGRATE_FREE;
}

#define for_each_migratetype_order(order, type) \
@@ -138,6 +138,10 @@ enum numa_stat_item {
enum zone_stat_item {
/* First 128 byte cacheline (assuming 64 bit words) */
NR_FREE_PAGES,
NR_FREE_UNMOVABLE,
NR_FREE_MOVABLE,
NR_FREE_RECLAIMABLE,
NR_FREE_FREE,
NR_ZONE_LRU_BASE, /* Used only for compaction and reclaim retry */
NR_ZONE_INACTIVE_ANON = NR_ZONE_LRU_BASE,
NR_ZONE_ACTIVE_ANON,
@@ -552,23 +556,21 @@ enum zone_watermarks {
};

/*
* One per migratetype for each PAGE_ALLOC_COSTLY_ORDER. One additional list
* for THP which will usually be GFP_MOVABLE. Even if it is another type,
* it should not contribute to serious fragmentation causing THP allocation
* failures.
* One per migratetype for each PAGE_ALLOC_COSTLY_ORDER. One additional set
* for THP (usually GFP_MOVABLE, but with exception of the huge zero page.)
*/
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
#define NR_PCP_THP 1
#define NR_PCP_THP MIGRATE_PCPTYPES
#else
#define NR_PCP_THP 0
#endif
#define NR_LOWORDER_PCP_LISTS (MIGRATE_PCPTYPES * (PAGE_ALLOC_COSTLY_ORDER + 1))
#define NR_PCP_LISTS (NR_LOWORDER_PCP_LISTS + NR_PCP_THP)

#define min_wmark_pages(z) (z->_watermark[WMARK_MIN] + z->watermark_boost)
#define low_wmark_pages(z) (z->_watermark[WMARK_LOW] + z->watermark_boost)
#define high_wmark_pages(z) (z->_watermark[WMARK_HIGH] + z->watermark_boost)
#define wmark_pages(z, i) (z->_watermark[i] + z->watermark_boost)
#define min_wmark_pages(z) (z->_watermark[WMARK_MIN])
#define low_wmark_pages(z) (z->_watermark[WMARK_LOW])
#define high_wmark_pages(z) (z->_watermark[WMARK_HIGH])
#define wmark_pages(z, i) (z->_watermark[i])

/* Fields and list protected by pagesets local_lock in page_alloc.c */
struct per_cpu_pages {
@@ -707,9 +709,6 @@ struct zone {

/* zone watermarks, access with *_wmark_pages(zone) macros */
unsigned long _watermark[NR_WMARK];
unsigned long watermark_boost;

unsigned long nr_reserved_highatomic;

/*
* We don't know if the memory that we're going to allocate will be
@@ -884,9 +883,6 @@ enum pgdat_flags {
};

enum zone_flags {
ZONE_BOOSTED_WATERMARK, /* zone recently boosted watermarks.
* Cleared when kswapd is woken.
*/
ZONE_RECLAIM_ACTIVE, /* kswapd may be scanning the zone. */
};

28 changes: 8 additions & 20 deletions include/linux/page-isolation.h
@@ -35,26 +35,14 @@ static inline bool is_migrate_isolate(int migratetype)

void set_pageblock_migratetype(struct page *page, int migratetype);
int move_freepages_block(struct zone *zone, struct page *page,
int migratetype, int *num_movable);

/*
* Changes migrate type in [start_pfn, end_pfn) to be MIGRATE_ISOLATE.
*/
int
start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
int migratetype, int flags, gfp_t gfp_flags);

/*
* Changes MIGRATE_ISOLATE to MIGRATE_MOVABLE.
* target range is [start_pfn, end_pfn)
*/
void
undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
int migratetype);

/*
* Test all pages in [start_pfn, end_pfn) are isolated or not.
*/
int old_mt, int new_mt, int *num_movable);

int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
int migratetype, int flags, gfp_t gfp_flags);

void undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
int migratetype);

int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn,
int isol_flags);

4 changes: 2 additions & 2 deletions include/linux/pageblock-flags.h
@@ -47,8 +47,8 @@ extern unsigned int pageblock_order;

#else /* CONFIG_HUGETLB_PAGE */

/* If huge pages are not used, group by MAX_ORDER_NR_PAGES */
#define pageblock_order (MAX_ORDER-1)
/* Manage fragmentation at the 2M level */
#define pageblock_order ilog2(2U << (20 - PAGE_SHIFT))

#endif /* CONFIG_HUGETLB_PAGE */

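
(Worked example, assuming 4K base pages: PAGE_SHIFT = 12, so
2U << (20 - 12) = 512 and ilog2(512) = 9, i.e. a 2M pageblock; with 64K
pages the expression yields order 5, which is still 2M. The previous
MAX_ORDER-1 definition tied the grouping size to the buddy allocator's
maximum order instead.)
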
8 changes: 0 additions & 8 deletions include/linux/vmstat.h
@@ -481,14 +481,6 @@ static inline void node_stat_sub_folio(struct folio *folio,
mod_node_page_state(folio_pgdat(folio), item, -folio_nr_pages(folio));
}

static inline void __mod_zone_freepage_state(struct zone *zone, int nr_pages,
int migratetype)
{
__mod_zone_page_state(zone, NR_FREE_PAGES, nr_pages);
if (is_migrate_cma(migratetype))
__mod_zone_page_state(zone, NR_FREE_CMA_PAGES, nr_pages);
}

extern const char * const vmstat_text[];

static inline const char *zone_stat_name(enum zone_stat_item item)
8 changes: 0 additions & 8 deletions kernel/sysctl.c
@@ -2229,14 +2229,6 @@ static struct ctl_table vm_table[] = {
.proc_handler = min_free_kbytes_sysctl_handler,
.extra1 = SYSCTL_ZERO,
},
{
.procname = "watermark_boost_factor",
.data = &watermark_boost_factor,
.maxlen = sizeof(watermark_boost_factor),
.mode = 0644,
.proc_handler = proc_dointvec_minmax,
.extra1 = SYSCTL_ZERO,
},
{
.procname = "watermark_scale_factor",
.data = &watermark_scale_factor,
