L0 flush: opt-in mechanism to bypass PageCache reads and writes (#8190)
part of #7418

# Motivation

(reproducing #7418)

When we do an `InMemoryLayer::write_to_disk`, there is a tremendous amount of random read I/O, as deltas from the ephemeral file (written in LSN order) are written out to the delta layer in key order.

In benchmarks (#7409) we can see that this delta layer writing phase is substantially more expensive than the initial ingest of data, and that within the delta layer write a significant amount of the CPU time is spent traversing the page cache.

# High-Level Changes

Add a new mode for L0 flush that works as follows:

* Read the full ephemeral file into memory -- layers are much smaller than total memory, so this is affordable.
* Do all the random reads directly from this in-memory buffer instead of using blob IO/page cache/disk reads.
* Add a semaphore to limit how many timelines may concurrently do this (limit peak memory).
* Make the semaphore configurable via PS config.

# Implementation Details

The new `BlobReaderRef::Slice` is a temporary hack until we can ditch `blob_io` for `InMemoryLayer` => Plan for this is laid out in #8183

# Correctness

The correctness of this change is quite obvious to me: we do what we did before (`blob_io`) but read from memory instead of going to disk.

The highest bug potential is in doing owned-buffers IO. I refactored the API a bit in preliminary PR #8186 to make it less error-prone, but still, careful review is requested.

# Performance

I manually measured single-client ingest performance from `pgbench -i ...`.

Full report: https://neondatabase.notion.site/2024-06-28-benchmarking-l0-flush-performance-e98cff3807f94cb38f2054d8c818fe84?pvs=4

tl;dr:

* no speed improvements during ingest, but
* significantly lower pressure on PS PageCache (eviction rate drops to 1/3)
  * (that's why I'm working on this)
* noticeable but modestly lower CPU time

This is good enough for merging this PR because the changes require opt-in.

We'll do more testing in staging & pre-prod.
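The direct mode described under High-Level Changes can be sketched with the standard library alone. This is a minimal illustration, not the pageserver implementation: `EphemeralBuf`, `put`, and `flush_in_key_order` are hypothetical names, keys and LSNs are simplified to integers, and a `Vec<u8>` stands in for the ephemeral file on disk.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the ephemeral file: values are appended in
/// arrival (LSN) order, so entries for one key are scattered across `bytes`.
struct EphemeralBuf {
    bytes: Vec<u8>,
    /// (key, lsn) -> (offset, len); a BTreeMap iterates in key order.
    index: BTreeMap<(u32, u64), (usize, usize)>,
}

impl EphemeralBuf {
    fn new() -> Self {
        Self { bytes: Vec::new(), index: BTreeMap::new() }
    }

    /// Append a value at the end of the buffer and remember where it landed.
    fn put(&mut self, key: u32, lsn: u64, value: &[u8]) {
        let offset = self.bytes.len();
        self.bytes.extend_from_slice(value);
        self.index.insert((key, lsn), (offset, value.len()));
    }
}

/// Emit all values in key order, reading from a single in-memory snapshot.
fn flush_in_key_order(file: &EphemeralBuf) -> Vec<((u32, u64), Vec<u8>)> {
    // Step 1: one sequential read of the whole file (a clone stands in for it).
    let snapshot: Vec<u8> = file.bytes.clone();
    // Step 2: every "random read" is now a slice access into the snapshot.
    file.index
        .iter()
        .map(|(&key, &(offset, len))| (key, snapshot[offset..offset + len].to_vec()))
        .collect()
}

fn main() {
    let mut file = EphemeralBuf::new();
    // Writes arrive in LSN order, interleaving keys 7 and 3.
    file.put(7, 10, b"c");
    file.put(3, 11, b"a");
    file.put(7, 12, b"d");
    file.put(3, 13, b"b");

    for ((key, lsn), value) in flush_in_key_order(&file) {
        println!("key={key} lsn={lsn} value={}", String::from_utf8_lossy(&value));
    }
}
```

The trade-off is the one the commit message names: the snapshot costs memory proportional to the layer size, which is why the number of concurrent flushes is bounded.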
# Stability / Monitoring

**memory consumption**: there's no _hard_ limit on max `InMemoryLayer` size (aka "checkpoint distance"), hence there's no hard limit on the memory allocation we do for flushing. In practice, a) we [log a warning](https://github.com/neondatabase/neon/blob/23827c6b0d400cbb9a972d4d05d49834816c40d1/pageserver/src/tenant/timeline.rs#L5741-L5743) when we flush oversized layers, so we'd know which tenant is to blame, and b) if we were to put a hard limit in place, we would have to decide what to do if there is an `InMemoryLayer` that exceeds the limit.

It seems like a better option to guarantee a max size for frozen layers, dependent on `checkpoint_distance`, and then limit concurrency based on that.

**metrics**: we do have the [flush_time_histo](https://github.com/neondatabase/neon/blob/23827c6b0d400cbb9a972d4d05d49834816c40d1/pageserver/src/tenant/timeline.rs#L3725-L3726), but that includes the wait time for the semaphore. We could add a separate metric for the time spent after acquiring the semaphore, so one can infer the wait time. Seems unnecessary at this point, though.
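The idea of limiting concurrency based on a guaranteed maximum frozen-layer size comes down to simple arithmetic. The sketch below assumes a memory budget for flushing and a per-layer size cap; `max_concurrency_for` and both parameters are hypothetical and not part of this commit, and the numbers in `main` are illustrative only.

```rust
use std::num::NonZeroUsize;

/// Hypothetical helper: how many flushes fit into `budget_bytes` if each
/// flush may buffer up to `max_layer_bytes` in memory.
/// Returns `None` if not even one flush fits, or if the cap is zero.
fn max_concurrency_for(budget_bytes: u64, max_layer_bytes: u64) -> Option<NonZeroUsize> {
    if max_layer_bytes == 0 {
        return None;
    }
    let fits = budget_bytes / max_layer_bytes;
    NonZeroUsize::new(usize::try_from(fits).ok()?)
}

fn main() {
    // Illustrative numbers only: 2 GiB budget, 256 MiB per frozen layer.
    let budget: u64 = 2 * 1024 * 1024 * 1024;
    let max_layer: u64 = 256 * 1024 * 1024;
    match max_concurrency_for(budget, max_layer) {
        Some(n) => println!("max_concurrency = {n}"),
        None => println!("budget too small for direct flush"),
    }
}
```

The result is a `NonZeroUsize` so that it matches the type of `max_concurrency` in `L0FlushConfig::Direct`.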
Showing 12 changed files with 323 additions and 58 deletions.
@@ -0,0 +1,46 @@

```rust
use std::{num::NonZeroUsize, sync::Arc};

use crate::tenant::ephemeral_file;

#[derive(Default, Debug, PartialEq, Eq, Clone, serde::Deserialize)]
#[serde(tag = "mode", rename_all = "kebab-case", deny_unknown_fields)]
pub enum L0FlushConfig {
    #[default]
    PageCached,
    #[serde(rename_all = "snake_case")]
    Direct { max_concurrency: NonZeroUsize },
}

#[derive(Clone)]
pub struct L0FlushGlobalState(Arc<Inner>);

pub(crate) enum Inner {
    PageCached,
    Direct { semaphore: tokio::sync::Semaphore },
}

impl L0FlushGlobalState {
    pub fn new(config: L0FlushConfig) -> Self {
        match config {
            L0FlushConfig::PageCached => Self(Arc::new(Inner::PageCached)),
            L0FlushConfig::Direct { max_concurrency } => {
                let semaphore = tokio::sync::Semaphore::new(max_concurrency.get());
                Self(Arc::new(Inner::Direct { semaphore }))
            }
        }
    }

    pub(crate) fn inner(&self) -> &Arc<Inner> {
        &self.0
    }
}

impl L0FlushConfig {
    pub(crate) fn prewarm_on_write(&self) -> ephemeral_file::PrewarmPageCacheOnWrite {
        use L0FlushConfig::*;
        match self {
            PageCached => ephemeral_file::PrewarmPageCacheOnWrite::Yes,
            Direct { .. } => ephemeral_file::PrewarmPageCacheOnWrite::No,
        }
    }
}
```
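How a flush path might consume this global state is not shown in this file, so the following is only a sketch under stated assumptions. It uses a small blocking semaphore built from `std::sync::{Mutex, Condvar}` as a stand-in for `tokio::sync::Semaphore`, and OS threads instead of async tasks. `Semaphore`, `Permit`, `flush`, and `peak_concurrency` are hypothetical names; only the two-variant `Inner` shape mirrors the code above.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Minimal blocking counting semaphore (stand-in for tokio::sync::Semaphore).
struct Semaphore {
    permits: Mutex<usize>,
    available: Condvar,
}

/// RAII guard: the permit goes back to the semaphore when this is dropped.
struct Permit<'a>(&'a Semaphore);

impl Semaphore {
    fn new(permits: usize) -> Self {
        Self { permits: Mutex::new(permits), available: Condvar::new() }
    }

    fn acquire(&self) -> Permit<'_> {
        let mut permits = self.permits.lock().unwrap();
        while *permits == 0 {
            permits = self.available.wait(permits).unwrap();
        }
        *permits -= 1;
        Permit(self)
    }
}

impl Drop for Permit<'_> {
    fn drop(&mut self) {
        *self.0.permits.lock().unwrap() += 1;
        self.0.available.notify_one();
    }
}

enum Inner {
    PageCached,
    Direct { semaphore: Semaphore },
}

/// Hypothetical flush entry point: only the direct mode takes a permit,
/// because only it buffers the whole layer in memory.
fn flush<T>(state: &Inner, work: impl FnOnce() -> T) -> T {
    match state {
        Inner::PageCached => work(),
        Inner::Direct { semaphore } => {
            let _permit = semaphore.acquire(); // held until `work` returns
            work()
        }
    }
}

/// Run `flushes` concurrent flushes and report the highest number observed
/// inside `work` at the same time.
fn peak_concurrency(state: Arc<Inner>, flushes: usize) -> usize {
    let active = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..flushes)
        .map(|_| {
            let (state, active, peak) = (state.clone(), active.clone(), peak.clone());
            thread::spawn(move || {
                flush(&state, || {
                    let now = active.fetch_add(1, Ordering::SeqCst) + 1;
                    peak.fetch_max(now, Ordering::SeqCst);
                    thread::sleep(Duration::from_millis(5)); // simulated flush work
                    active.fetch_sub(1, Ordering::SeqCst);
                })
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    peak.load(Ordering::SeqCst)
}

fn main() {
    let direct = Arc::new(Inner::Direct { semaphore: Semaphore::new(2) });
    println!("direct mode peak: {}", peak_concurrency(direct, 8));
    println!("page-cached peak: {}", peak_concurrency(Arc::new(Inner::PageCached), 8));
}
```

With two permits, the direct-mode peak cannot exceed 2 no matter how many flushes are started, while the page-cached path is unbounded by this mechanism.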
3088 tests run: 2964 passed, 0 failed, 124 skipped (full report)

Code coverage* (full report)

functions: 32.7% (6934 of 21213 functions)
lines: 50.0% (54333 of 108576 lines)

\* collected from Rust tests only

5de896e at 2024-07-02T16:03:47.315Z