bypass PageCache for `InMemoryLayer::get_values_reconstruct_data` #8183

problame · 2024-06-27T12:34:02Z

part of epic #7386

bit of prior discussion in https://neondb.slack.com/archives/C033RQ5SPDH/p1719411245662839

InMemoryLayer::get_values_reconstruct_data uses read_blob, which internally uses the PageCache for block access.

Switch it to vectored reads that bypass the PageCache.

However, we want to deliver performance equivalent to the current code in the case where the current code, in one call, reads multiple blobs from the same 8 KiB EphemeralFile page.

Strategy for this (planned together with @VladLazar ):

* Store the blob lengths in the in-memory btree.
  * Avoid consuming more memory by using u32 instead of u64 for the offset. u32 is enough if we cap EphemeralFile to 4GiB, which is way larger than we want it to go anyway.
* Get rid of the whole blob_io business for InMemoryLayer; we don't need it if we store offset and length in the in-memory index.
* For get_values_reconstruct_data, feed the (offset, length) pairs directly into the VectoredReadBuilder (after sorting them in offset order, so the builder can merge adjacent blob reads as needed).
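The strategy above can be sketched roughly as follows. This is a minimal illustration, not the pageserver's code: `BlobPos`, `MergedRead`, `plan_reads`, and `entry_fits_in_u64` are hypothetical names, and the merge rule here (only blobs that directly follow each other are coalesced) is an assumption; the real VectoredReadBuilder may apply different criteria, such as a maximum read size.

```rust
use std::mem::size_of;

/// Hypothetical index entry: where a blob lives in the ephemeral file.
/// Illustrative only; not the pageserver's actual type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct BlobPos {
    pub offset: u32, // enough if the file is capped at 4 GiB
    pub len: u32,
}

/// One contiguous read that covers one or more blobs.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct MergedRead {
    pub start: u64,
    pub end: u64, // exclusive
    pub blobs: Vec<BlobPos>,
}

/// Sort by offset, then merge blobs that directly follow each other
/// (assumed merge rule, see the lead-in).
pub fn plan_reads(mut blobs: Vec<BlobPos>) -> Vec<MergedRead> {
    blobs.sort_by_key(|b| b.offset);
    let mut reads: Vec<MergedRead> = Vec::new();
    for b in blobs {
        let start = u64::from(b.offset);
        let end = start + u64::from(b.len);
        if let Some(last) = reads.last_mut() {
            if last.end == start {
                // Adjacent to the previous read: extend it.
                last.end = end;
                last.blobs.push(b);
                continue;
            }
        }
        reads.push(MergedRead { start, end, blobs: vec![b] });
    }
    reads
}

/// Offset plus length together occupy as much space as one u64 offset.
pub fn entry_fits_in_u64() -> bool {
    size_of::<BlobPos>() == size_of::<u64>()
}
```

With this layout, two blobs at offsets 0 and 10 (lengths 10 and 5) collapse into one read of bytes 0..15, which is the "multiple blobs from the same page" case the issue wants to keep cheap.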

Tasks

refactor(write path): newtype to enforce use of fully initialized slices #8717
bypass PageCache for InMemoryLayer + avoid Value::deser on L0 flush #8537
extraordinary rollout to pre-prod & observe benchmark results
rollout to prod & observe page cache dashboard

part of #7418

# Motivation

(reproducing #7418)

When we do an `InMemoryLayer::write_to_disk`, there is a tremendous amount of random read I/O, as deltas from the ephemeral file (written in LSN order) are written out to the delta layer in key order.

In benchmarks (#7409) we can see that this delta layer writing phase is substantially more expensive than the initial ingest of data, and that within the delta layer write a significant amount of the CPU time is spent traversing the page cache.

# High-Level Changes

Add a new mode for L0 flush that works as follows:

* Read the full ephemeral file into memory -- layers are much smaller than total memory, so this is affordable.
* Do all the random reads directly from this in-memory buffer instead of using blob IO/page cache/disk reads.
* Add a semaphore to limit how many timelines may concurrently do this (limit peak memory).
* Make the semaphore configurable via PS config.

# Implementation Details

The new `BlobReaderRef::Slice` is a temporary hack until we can ditch `blob_io` for `InMemoryLayer` => Plan for this is laid out in #8183

# Correctness

The correctness of this change is quite obvious to me: we do what we did before (`blob_io`) but read from memory instead of going to disk.

The highest bug potential is in doing owned-buffers IO. I refactored the API a bit in preliminary PR #8186 to make it less error-prone, but still, careful review is requested.

# Performance

I manually measured single-client ingest performance from `pgbench -i ...`.

Full report: https://neondatabase.notion.site/2024-06-28-benchmarking-l0-flush-performance-e98cff3807f94cb38f2054d8c818fe84?pvs=4

tl;dr:

* no speed improvements during ingest, but
  * significantly lower pressure on PS PageCache (eviction rate drops to 1/3)
    * (that's why I'm working on this)
  * noticeable but modestly lower CPU time

This is good enough for merging this PR because the changes require opt-in.

We'll do more testing in staging & pre-prod.

# Stability / Monitoring

**memory consumption**: there's no _hard_ limit on max `InMemoryLayer` size (aka "checkpoint distance"), hence there's no hard limit on the memory allocation we do for flushing. In practice, we a) [log a warning](https://github.com/neondatabase/neon/blob/23827c6b0d400cbb9a972d4d05d49834816c40d1/pageserver/src/tenant/timeline.rs#L5741-L5743) when we flush oversized layers, so we'd know which tenant is to blame and b) if we were to put a hard limit in place, we would have to decide what to do if there is an InMemoryLayer that exceeds the limit. It seems like a better option to guarantee a max size for frozen layers, dependent on `checkpoint_distance`. Then limit concurrency based on that.

**metrics**: we do have the [flush_time_histo](https://github.com/neondatabase/neon/blob/23827c6b0d400cbb9a972d4d05d49834816c40d1/pageserver/src/tenant/timeline.rs#L3725-L3726), but that includes the wait time for the semaphore. We could add a separate metric for the time spent after acquiring the semaphore, so one can infer the wait time. Seems unnecessary at this point, though.
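The flush mode described above (load the whole ephemeral file, serve the key-ordered reads from that buffer, and bound how many flushes run at once) can be sketched as follows. This is only an illustration under stated assumptions: `FlushLimiter` and `flush_in_key_order` are hypothetical names, the semaphore is a hand-rolled blocking one built on the standard library rather than whatever the pageserver actually uses, and the sketch does not return the permit if the closure panics.

```rust
use std::sync::{Condvar, Mutex};

/// Minimal counting semaphore (std has none). Stands in for the real
/// concurrency limiter; hypothetical, not the pageserver's type.
pub struct FlushLimiter {
    permits: Mutex<usize>,
    cv: Condvar,
}

impl FlushLimiter {
    pub fn new(max_concurrent: usize) -> Self {
        Self { permits: Mutex::new(max_concurrent), cv: Condvar::new() }
    }

    /// Run `f` while holding one permit; blocks while none are free.
    pub fn with_permit<T>(&self, f: impl FnOnce() -> T) -> T {
        let mut free = self.permits.lock().unwrap();
        while *free == 0 {
            free = self.cv.wait(free).unwrap();
        }
        *free -= 1;
        drop(free); // do not hold the lock while flushing
        let out = f();
        *self.permits.lock().unwrap() += 1;
        self.cv.notify_one();
        out
    }
}

/// Serve (offset, len) reads from a fully loaded file image, in the
/// order given by the index (key order), without touching any cache.
pub fn flush_in_key_order(
    file_image: &[u8],
    index_in_key_order: &[(u32, u32)],
) -> Vec<Vec<u8>> {
    index_in_key_order
        .iter()
        .map(|&(off, len)| {
            let start = off as usize;
            file_image[start..start + len as usize].to_vec()
        })
        .collect()
}
```

The permit count plays the role of the configurable limit: peak memory for flushing is roughly the permit count times the largest file image loaded at once.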

problame · 2024-08-26T09:00:24Z

This week:

last touches on bypass PageCache for InMemoryLayer + avoid Value::deser on L0 flush #8537
second review from John
merge by Wednesday morning
soak in staging until next week

Next week:

perf qualification in pre-prod
- Can we do this earlier, e.g., by deploying a main build to pre-prod during this week?
global rollout

koivunej · 2024-08-26T13:59:37Z

Additional extraordinary deploy to eu-west-1 before global rollout.

problame · 2024-09-02T11:44:57Z

Results from pre-prod:

extraordinary deploy to eu-west-1 happened on Friday Aug 30 ~18:00 UTC, right before the prod-like cloudbench start
=> no WARN or ERR in the logs
=> getpage latencies unchanged
=> pageserver page cache records no more InMemoryLayer accesses (see screenshot below)
no measurable impact on overall page cache access or miss rate (expected)
WalReceiverConnectionHandler and MgmtRequest and InitialLogicalSizeCalculation benefitted but they're all low traffic

problame · 2024-09-05T10:54:57Z

production rollout => no more InMemoryLayer pages in PS PageCache (dashboard)

No significant impact on overall PS PageCache performance due to the small role that InMemoryLayer plays generally in terms of access rate.

The improvements in PS PageCache performance are due to #8184 which rolled out to ap-southeast-* this week.

See this thread for details: https://neondb.slack.com/archives/C033RQ5SPDH/p1725530975830579

problame mentioned this issue Jun 27, 2024

Epic: Bypass PageCache for user data blocks #7386

Open

5 tasks

problame self-assigned this Jun 27, 2024

problame changed the title ~~eliminate read-path PageCach'ing of InMemoryLayer blocks~~ bypass PageCache for InMemoryLayer::get_values_reconstruct_data Jun 27, 2024

problame mentioned this issue Jun 27, 2024

L0 flush: opt-in mechanism to bypass PageCache reads and writes #8190

Merged

This was referenced Aug 13, 2024

refactor(write path): newtype to enforce use of fully initialized slices #8717

Merged

bypass PageCache for InMemoryLayer + avoid Value::deser on L0 flush #8537

Merged

problame closed this as completed Sep 5, 2024
