bypass PageCache for L0 flush #7418

Closed · 18 tasks done · Tracked by #8452 ...
jcsp opened this issue Apr 18, 2024 · 10 comments
Labels: a/tech_debt (Area: related to tech debt), c/storage/pageserver (Component: storage: pageserver)

@jcsp (Collaborator) commented Apr 18, 2024

Currently, when we do an InMemoryLayer::write_to_disk, there is a tremendous amount of random read I/O, as deltas from the ephemeral file (written in LSN order) are written out to the delta layer in key order.

In benchmarks (#7409) we can see that this delta layer writing phase is substantially more expensive than the initial ingest of data, and that within the delta layer write a significant amount of the CPU time is spent traversing the page cache.

It's really slow: like tens of megabytes per second on a fast desktop CPU.

Since this is a background task whose concurrency we can limit, we can simplify and accelerate this by doing the whole thing in memory (sketched in the code after this list):

  • Read the full ephemeral file into memory -- layers are much smaller than total memory, so this is affordable.
  • Do all the random reads directly from this in-memory buffer instead of using blob IO/page cache/disk reads.
  • Add a semaphore to limit how many timelines may concurrently do this (limit peak memory). Set it to roughly the number of cores, or some factor of system memory / layer size, whichever is lower.
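
A minimal sketch of that flow, assuming a hypothetical `flush_in_memory` helper and an illustrative index layout (this is the shape of the idea, not the actual pageserver code):

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

/// Hypothetical flush routine: pull the whole ephemeral file into memory,
/// then serve all key-ordered random reads from that buffer while building
/// the delta layer, with a semaphore bounding peak memory across timelines.
async fn flush_in_memory(
    semaphore: Arc<Semaphore>,
    ephemeral_file: &[u8],                   // stand-in for the on-disk ephemeral file
    index: &[(u64, std::ops::Range<usize>)], // (key, byte range in the file), in LSN order
) -> Vec<u8> {
    // Only N timelines may hold their ephemeral file in RAM at once.
    let _permit = semaphore.acquire().await.expect("semaphore not closed");

    // 1. Read the full ephemeral file into memory (one sequential read in the real code).
    let buf: Vec<u8> = ephemeral_file.to_vec();

    // 2. Do the random reads from the in-memory buffer, in key order,
    //    instead of going through blob IO / page cache / disk.
    let mut sorted = index.to_vec();
    sorted.sort_by_key(|(key, _)| *key);
    let mut delta_layer = Vec::new();
    for (_key, range) in sorted {
        delta_layer.extend_from_slice(&buf[range]);
    }
    delta_layer
}

#[tokio::main]
async fn main() {
    // 3. Illustrative permit count; the real value would be ~cores or memory/layer-size.
    let semaphore = Arc::new(Semaphore::new(4));
    let file = b"valueAvalueB".to_vec();
    let index = vec![(2u64, 6..12), (1u64, 0..6)]; // stored in LSN order
    let delta = flush_in_memory(semaphore, &file, &index).await;
    assert_eq!(&delta[..], &b"valueAvalueB"[..]); // emitted in key order
}
```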

Impl

Follow-ups:

@jcsp added the c/storage/pageserver (Component: storage: pageserver) and a/tech_debt (Area: related to tech debt) labels on Apr 18, 2024
problame added a commit that referenced this issue Apr 26, 2024
part of #7124

# Problem

(Re-stating the problem from #7124 for posterity)

The `test_bulk_ingest` benchmark shows about 2x lower throughput with
`tokio-epoll-uring` compared to `std-fs`.
That's why we temporarily disabled it in #7238.

The reason for this regression is that the benchmark runs on a system
without memory pressure and thus std-fs writes don't block on disk IO
but only copy the data into the kernel page cache.
`tokio-epoll-uring` cannot beat that at this time, and possibly never will.
(However, under memory pressure, std-fs would stall the executor thread
on kernel page cache writeback disk IO. That's why we want to use
`tokio-epoll-uring`. And we likely want to use O_DIRECT in the future,
at which point std-fs becomes an absolute show-stopper.)

More elaborate analysis:
https://neondatabase.notion.site/Why-test_bulk_ingest-is-slower-with-tokio-epoll-uring-918c5e619df045a7bd7b5f806cfbd53f?pvs=4

# Changes

This PR increases the buffer size of `blob_io` and `EphemeralFile` from
PAGE_SZ=8k to 64k.

Longer-term, we probably want to do double-buffering / pipelined IO.
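
As a toy illustration of why buffer capacity matters (a hypothetical `ChunkedWriter`, not the real `EphemeralFile`/`blob_io` code): each buffer fill amortizes one write submission over many small blob appends, so going from 8 KiB to 64 KiB means roughly 8x fewer submissions for the same amount of data.

```rust
use std::io::{self, Write};

/// Illustrative stand-in for the buffered write path: small values accumulate
/// in an owned buffer and only reach the underlying writer in capacity-sized
/// chunks (8 KiB before this PR, 64 KiB after).
struct ChunkedWriter<W: Write> {
    inner: W,
    buf: Vec<u8>,
    capacity: usize,
    flushes: usize,
}

impl<W: Write> ChunkedWriter<W> {
    fn new(inner: W, capacity: usize) -> Self {
        Self { inner, buf: Vec::with_capacity(capacity), capacity, flushes: 0 }
    }

    /// Append a small blob; write the buffer out once it would overflow.
    fn write_blob(&mut self, blob: &[u8]) -> io::Result<()> {
        for chunk in blob.chunks(self.capacity) {
            if self.buf.len() + chunk.len() > self.capacity {
                self.flush_buf()?;
            }
            self.buf.extend_from_slice(chunk);
        }
        Ok(())
    }

    fn flush_buf(&mut self) -> io::Result<()> {
        if !self.buf.is_empty() {
            self.inner.write_all(&self.buf)?;
            self.buf.clear();
            self.flushes += 1;
        }
        Ok(())
    }
}

fn main() -> io::Result<()> {
    // ~100-byte values, as in the `100b seq` benchmark; the real code submits
    // the flush through VirtualFile / tokio-epoll-uring rather than std::io.
    let mut small = ChunkedWriter::new(Vec::new(), 8 * 1024);
    let mut large = ChunkedWriter::new(Vec::new(), 64 * 1024);
    for _ in 0..10_000 {
        small.write_blob(&[0u8; 100])?;
        large.write_blob(&[0u8; 100])?;
    }
    small.flush_buf()?;
    large.flush_buf()?;
    println!("8k buffer: {} writes, 64k buffer: {} writes", small.flushes, large.flushes);
    Ok(())
}
```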

# Resource Usage

We currently do not flush the buffer when freezing the InMemoryLayer.
That means a single Timeline can have multiple 64k buffers alive,
especially if flushing is slow.
This poses an OOM risk.

We should either bound the number of frozen layers
(#7317), or change the freezing code to flush the buffer and drop the
allocation.

However, that's future work.

# Performance

(Measurements done on i3en.3xlarge.)

The `test_bulk_insert.py` benchmark is too noisy, even with instance storage. It
varies by 30-40%; I suspect that's due to compaction. Raising the amount of
data by 10x doesn't help with the noisiness.

So, I used the `bench_ingest` benchmark from @jcsp's #7409.
Specifically, the `ingest-small-values/ingest 128MB/100b seq` and
`ingest-small-values/ingest 128MB/100b seq, no delta` benchmarks.

| buffer | engine            | seq | seq, no delta |
|--------|-------------------|-----|---------------|
| 8k  | std-fs            | 55  | 165           |
| 8k  | tokio-epoll-uring | 37  | 107           |
| 64k | std-fs            | 55  | 180           |
| 64k | tokio-epoll-uring | 48  | 164           |

The `8k` rows are from before this PR; the `64k` rows are with this PR.
The values are the throughput reported by the benchmark (MiB/s).

We see that this PR gets `tokio-epoll-uring` from 67% to 87% of `std-fs`
performance in the `seq` benchmark. Notably, `seq` appears to hit some
other bottleneck at `55 MiB/s`. CC'ing #7418 due to the apparent
bottlenecks in writing delta layers.

For `seq, no delta`, this PR gets `tokio-epoll-uring` from 64% to 91% of
`std-fs` performance.
problame added a commit that referenced this issue Jun 25, 2024
…e is empty

Found this while doing research for #7418
problame added a commit that referenced this issue Jun 25, 2024
…8154)

We only use `keys` to check if it's empty so we can bail out early. No
need to collect the keys for that.

Found this while doing research for
#7418
conradludgate pushed a commit that referenced this issue Jun 27, 2024
@problame changed the title from "pageserver: fast delta layer writes" to "bypass PageCache for l0 flush" on Jun 27, 2024
@problame changed the title from "bypass PageCache for l0 flush" to "bypass PageCache for L0 flush" on Jun 27, 2024
@problame self-assigned this on Jun 27, 2024
problame added a commit that referenced this issue Jun 28, 2024
…at_n`, fix UB for engine `std-fs` (#8186)

part of #7418

I reviewed what the VirtualFile API's `read` methods look like and came
to the conclusion that we've been using `IoBufMut` / `BoundedBufMut` /
`Slice` wrong.

This patch rectifies the situation.

# Change 1: take `tokio_epoll_uring::Slice` in the read APIs

Before, we took an `IoBufMut`, which is too low-level a primitive. While
it _seems_ convenient to be able to pass in a `Vec<u8>` without any
fuss, it's actually very unclear at the call site that we're going to
fill that `Vec` up to its `capacity()`, because that's what
`IoBuf::bytes_total()` returns and that's what
`VirtualFile::read_exact_at` fills.

By passing a `Slice` instead, a caller that "just wants to read into a
`Vec`" is forced to be explicit about it, adding either `slice_full()`
or `slice(x..y)`, and these methods panic if the read is outside of the
bounds of the `Vec::capacity()`.

Lastly, passing slices is more similar to what the `std::io` APIs look
like.

# Change 2: fix UB in `virtual_file_io_engine=std-fs`

While reviewing call sites, I noticed that the
`io_engine::IoEngine::read_at` method for `StdFs` mode has been
constructing an `&mut[u8]` from raw parts that were uninitialized.

We then used `std::fs::File::read_exact` to initialize that memory but,
IIUC, we must not even construct an `&mut [u8]` where some of the
memory isn't initialized.

So, stop doing that and add a helper ext trait on `Slice` to do the
zero-initialization.
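
A sketch of the sound pattern, with hypothetical names (it is not the actual `io_engine`/`Slice` helper): zero-initialize the destination before reading into it, instead of conjuring an `&mut [u8]` over uninitialized capacity.

```rust
use std::io::{self, Read};

/// Hypothetical illustration of the fix: never materialize `&mut [u8]` over
/// uninitialized memory, even if it will only be written to.
fn read_exact_into_capacity<R: Read>(
    reader: &mut R,
    mut buf: Vec<u8>,
    nbytes: usize,
) -> io::Result<Vec<u8>> {
    assert!(nbytes <= buf.capacity() - buf.len());

    // UB variant (roughly what the old std-fs path did -- don't do this):
    // let dst: &mut [u8] = unsafe {
    //     std::slice::from_raw_parts_mut(buf.as_mut_ptr().add(buf.len()), nbytes)
    // };

    // Sound variant: zero-initialize the destination, then read into it.
    let start = buf.len();
    buf.resize(start + nbytes, 0);
    reader.read_exact(&mut buf[start..])?;
    Ok(buf)
}

fn main() -> io::Result<()> {
    let mut src: &[u8] = b"hello world"; // &[u8] implements Read
    let buf = read_exact_into_capacity(&mut src, Vec::with_capacity(16), 5)?;
    assert_eq!(&buf[..], &b"hello"[..]);
    Ok(())
}
```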

# Change 3: eliminate `read_exact_at_n`

The `read_exact_at_n` method doesn't make sense because the caller can just:

1. `slice = buf.slice()` the exact memory it wants to fill 
2. `slice = read_exact_at(slice)`
3. `buf = slice.into_inner()`

Again, the `std::io` APIs specify the length of the read via the Rust
slice length.
We should do the same for the owned buffers IO APIs, i.e., via
`Slice::bytes_total()`.
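
To make the three-step pattern concrete, here is a self-contained mock (hypothetical `Slice`/`MockFile` stand-ins; the real types are `tokio_epoll_uring::Slice` and `VirtualFile`, whose exact signatures differ):

```rust
use std::ops::Range;

/// Minimal stand-in for an owned, bounded slice over a `Vec<u8>`.
struct Slice {
    buf: Vec<u8>,
    range: Range<usize>,
}

trait SliceExt {
    fn slice(self, range: Range<usize>) -> Slice;
}

impl SliceExt for Vec<u8> {
    /// Step 1: pick the exact memory to fill; panics if outside `capacity()`.
    fn slice(mut self, range: Range<usize>) -> Slice {
        assert!(range.end <= self.capacity(), "read outside of Vec::capacity()");
        if self.len() < range.end {
            self.resize(range.end, 0); // zero-fill, see Change 2
        }
        Slice { buf: self, range }
    }
}

impl Slice {
    /// Step 3: hand the underlying `Vec` back to the caller.
    fn into_inner(self) -> Vec<u8> {
        self.buf
    }
}

/// Stand-in for the file being read; the "disk" is just a byte vector.
struct MockFile(Vec<u8>);

impl MockFile {
    /// Step 2: fill exactly `dst.range` starting at `offset`; the read length
    /// comes from the slice itself, so no separate `read_exact_at_n(len)` is needed.
    fn read_exact_at(&self, mut dst: Slice, offset: usize) -> Slice {
        let range = dst.range.clone();
        let n = range.len();
        dst.buf[range].copy_from_slice(&self.0[offset..offset + n]);
        dst
    }
}

fn main() {
    let file = MockFile(b"....payload....".to_vec());
    let buf = Vec::with_capacity(16);

    let slice = buf.slice(0..7);              // 1. exact memory to fill
    let slice = file.read_exact_at(slice, 4); // 2. read; length == slice length
    let buf = slice.into_inner();             // 3. recover the Vec

    assert_eq!(&buf[..7], &b"payload"[..]);
}
```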

# Change 4: simplify filling of `PageWriteGuard`

The `PageWriteGuardBuf::init_up_to` was never necessary.
Remove it. See changes to doc comment for more details.

---

Reviewers should probably look at the added test case first; it
illustrates my case a bit.
problame added a commit that referenced this issue Jul 2, 2024
part of #7418

# Motivation

(reproducing #7418)

When we do an `InMemoryLayer::write_to_disk`, there is a tremendous
amount of random read I/O, as deltas from the ephemeral file (written in
LSN order) are written out to the delta layer in key order.

In benchmarks (#7409) we can
see that this delta layer writing phase is substantially more expensive
than the initial ingest of data, and that within the delta layer write a
significant amount of the CPU time is spent traversing the page cache.

# High-Level Changes

Add a new mode for L0 flush that works as follows:

* Read the full ephemeral file into memory -- layers are much smaller
than total memory, so this is affordable
* Do all the random reads directly from this in-memory buffer instead of
using blob IO/page cache/disk reads.
* Add a semaphore to limit how many timelines may concurrently do this
(limit peak memory).
* Make the semaphore configurable via PS config.

# Implementation Details

The new `BlobReaderRef::Slice` is a temporary hack until we can ditch
`blob_io` for `InMemoryLayer`; the plan for this is laid out in
#8183.

# Correctness

The correctness of this change is quite obvious to me: we do what we did
before (`blob_io`) but read from memory instead of going to disk.

The highest bug potential is in doing owned-buffers IO. I refactored the
API a bit in preliminary PR
#8186 to make it less
error-prone, but still, careful review is requested.

# Performance

I manually measured single-client ingest performance from `pgbench -i
...`.

Full report:
https://neondatabase.notion.site/2024-06-28-benchmarking-l0-flush-performance-e98cff3807f94cb38f2054d8c818fe84?pvs=4

tl;dr:

* no speed improvements during ingest, but
* significantly lower pressure on PS PageCache (eviction rate drops to
1/3)
  * (that's why I'm working on this)
* noticeable but modest reduction in CPU time

This is good enough for merging this PR because the changes require
opt-in.

We'll do more testing in staging & pre-prod.

# Stability / Monitoring

**memory consumption**: there's no _hard_ limit on max `InMemoryLayer`
size (aka "checkpoint distance"), hence there's no hard limit on the
memory allocation we do for flushing. In practice, we a) [log a
warning](https://github.com/neondatabase/neon/blob/23827c6b0d400cbb9a972d4d05d49834816c40d1/pageserver/src/tenant/timeline.rs#L5741-L5743)
when we flush oversized layers, so we'd know which tenant is to blame,
and b) if we were to put a hard limit in place, we would have to decide
what to do if there is an InMemoryLayer that exceeds the limit.
It seems like a better option to guarantee a max size for frozen layers,
dependent on `checkpoint_distance`, then limit concurrency based on
that.
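
Back-of-the-envelope version of that sizing (illustrative numbers, not a committed policy): cap frozen layers at roughly `checkpoint_distance`, then derive the permit count from a memory budget.

```rust
/// Illustrative sizing only: if frozen layers are capped at ~checkpoint_distance,
/// the flush semaphore can be sized from a memory budget.
fn flush_concurrency(memory_budget: u64, checkpoint_distance: u64, num_cores: u64) -> u64 {
    (memory_budget / checkpoint_distance).clamp(1, num_cores)
}

fn main() {
    let permits = flush_concurrency(
        2 * 1024 * 1024 * 1024, // 2 GiB budget for in-memory L0 flushes (assumed)
        256 * 1024 * 1024,      // checkpoint_distance ~ max frozen layer size (assumed)
        12,                     // available cores (assumed)
    );
    assert_eq!(permits, 8);     // 2 GiB / 256 MiB = 8 concurrent flushes
}
```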

**metrics**: we do have the
[flush_time_histo](https://github.com/neondatabase/neon/blob/23827c6b0d400cbb9a972d4d05d49834816c40d1/pageserver/src/tenant/timeline.rs#L3725-L3726),
but that includes the wait time for the semaphore. We could add a
separate metric for the time spent after acquiring the semaphore, so one
can infer the wait time. Seems unnecessary at this point, though.
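
A sketch of what that split could look like (hypothetical metric names and helper; the real `flush_time_histo` currently spans both phases):

```rust
use std::sync::Arc;
use std::time::{Duration, Instant};
use tokio::sync::Semaphore;

/// Hypothetical instrumentation: time the semaphore wait and the post-acquire
/// flush work separately, so the wait component of the overall flush time
/// can be read off directly instead of being inferred.
async fn timed_flush<F>(semaphore: Arc<Semaphore>, do_flush: F)
where
    F: std::future::Future<Output = ()>,
{
    let wait_start = Instant::now();
    let _permit = semaphore.acquire().await.expect("semaphore not closed");
    let wait = wait_start.elapsed(); // would feed e.g. l0_flush_wait_seconds (hypothetical)

    let work_start = Instant::now();
    do_flush.await;
    let work = work_start.elapsed(); // would feed e.g. l0_flush_work_seconds (hypothetical)

    println!("semaphore wait: {wait:?}, flush work: {work:?}");
}

#[tokio::main]
async fn main() {
    let semaphore = Arc::new(Semaphore::new(1));
    timed_flush(semaphore, async {
        tokio::time::sleep(Duration::from_millis(10)).await; // stand-in for the actual flush
    })
    .await;
}
```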
VladLazar pushed commits that referenced this issue Jul 8, 2024
@problame (Contributor) commented:

This week: investigate staging OOMs

@problame (Contributor) commented:

Updated plan: don't spend much time investigating the OOMs this week; instead, make progress on the coding work for the parent epic.

So: this week, disable l0_flush.mode=direct in staging.
Then next week, see whether we had any more OOMs.
If not, that's another proof point that l0_flush.mode=direct is responsible for the OOMs.

@problame (Contributor) commented:

The OOMs were found not to be due to l0_flush.mode=direct. So, re-enabling it in staging & pre-prod this week.

@problame (Contributor) commented:

aws.git commit that enabled staging & pre-prod: merged Jul 22.

The first pre-prod prod-like cloudbench run that hit the new configuration was on the evening of Jul 23.

It behaved as expected, with no significant impact to max RSS.

problame added a commit that referenced this issue Jul 29, 2024
Testing in staging and pre-prod has been [going well](#7418 (comment)).

This PR enables mode=direct by default, thereby providing additional
coverage in the automated tests:
- Rust tests
- Integration tests
- Nightly pagebench (likely irrelevant because it's read-only)
problame added a commit that referenced this issue Jul 29, 2024
…8534)

Testing in staging and pre-prod has been [going well](#7418 (comment)).

This PR enables mode=direct by default, thereby providing additional
coverage in the automated tests:
- Rust tests
- Integration tests
- Nightly pagebench (likely irrelevant because it's read-only)

Production deployments continue to use `mode=page-cache` for the time
being: neondatabase/infra#1655

refs #7418
@problame (Contributor) commented:

Next week:

arpad-m pushed a commit that referenced this issue Aug 5, 2024
@problame (Contributor) commented Aug 16, 2024

Status update:

  • l0_flush.mode=direct rolled out everywhere; the last 3 regions happened yesterday

problame added a commit that referenced this issue Aug 19, 2024
It's been rolled out everywhere; no configs are referencing it.

All code that's made dead by the removal of the config option is removed
as part of this PR.

The `page_caching::PreWarmingWriter` in `::No` mode is equivalent to a
`size_tracking_writer`, so use that.

part of #7418
@problame (Contributor) commented Aug 19, 2024

To be determined before closing this issue:

  • Do we want to retain the configurability of the concurrency limit?
  • Do we want to invest in more "desired state" configurability, i.e., not just a concurrency limit but an "anticipated concurrent memory usage" limit?
  • If neither, let's remove the config option.

VladLazar pushed a commit that referenced this issue Aug 20, 2024
@problame (Contributor) commented:

Decision yesterday: leave the option until after the ARM transition is complete, then re-evaluate.

@problame (Contributor) commented Sep 2, 2024

> Decision yesterday: leave the option until after the ARM transition is complete, then re-evaluate.

This moves into a follow-up issue: #8894

@problame closed this as completed on Sep 2, 2024