
bypass PageCache for InMemoryLayer + avoid Value::deser on L0 flush #8537

Merged
merged 52 commits into main from problame/inmemory-layer-offset-u32 on Aug 28, 2024

Conversation

@problame (Contributor) commented Jul 29, 2024

Part of Epic: Bypass PageCache for user data blocks.

Problem

InMemoryLayer still uses the PageCache for all data stored in the VirtualFile that underlies the EphemeralFile.

Background

Before this PR, EphemeralFile is a fancy (and code-bloated) buffered writer around a VirtualFile that supports blob_io.

The InMemoryLayerInner::index stores offsets into the EphemeralFile.
At each such offset, we find a varint length followed by the serialized Value.

Vectored reads (get_values_reconstruct_data) are not in fact vectored - each Value that needs to be read is read sequentially.

The will_init bit of information, which we use to early-exit get_values_reconstruct_data for a given key, is stored in the serialized Value, meaning we have to read & deserialize the Value from the EphemeralFile just to get at it.

L0 flushing likewise needs to re-determine will_init, which it currently does by deserializing each Value during the flush.

Changes

  1. Store the value length and will_init information in the InMemoryLayer::index, so the EphemeralFile only needs to store the values themselves (see the bit-packing sketch after this list).
  2. For get_values_reconstruct_data:
  • Use the in-memory index to figure out which values need to be read. Having will_init stored in the index is what enables us to do that.
  • View the EphemeralFile as a byte array of "DIO chunks", each 512 bytes in size (adjustable constant). A "DIO chunk" is the minimal unit that we can read under direct IO.
  • Figure out which chunks need to be read to retrieve the serialized bytes for the values we need to read.
  • Coalesce chunk reads such that each DIO chunk is only read once to serve all value reads that need data from that chunk.
  • Merge adjacent chunk reads into larger EphemeralFile::read_exact_at_eof_ok calls of up to 128k (adjustable constant). A sketch of this chunk-read planning follows further below.
  3. The new EphemeralFile::read_exact_at_eof_ok fills the IO buffer from the underlying VirtualFile and/or its in-memory buffer (see the read sketch after this list).
  4. The L0 flush code is changed to use the index directly instead of going through blob_io and Value deserialization.
  5. We can remove the ephemeral_file::page_caching construct now.
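Item 1 is the heart of the change: once length and will_init live in the in-memory index, neither get_values_reconstruct_data nor L0 flush has to deserialize a Value just to learn will_init. The snippet below is only a minimal sketch of the bit-packing idea; the field widths and names are made up for illustration and the actual bit-packed InMemoryLayerIndexValue in this PR may use a different layout.

// Illustrative sketch of packing (pos, len, will_init) into a u64.
// Not the actual InMemoryLayerIndexValue definition.
#[derive(Clone, Copy)]
struct PackedIndexEntry(u64);

impl PackedIndexEntry {
    const LEN_BITS: u32 = 31; // illustrative split: 32 bits pos, 31 bits len, 1 bit will_init

    fn new(pos: u32, len: u32, will_init: bool) -> Self {
        assert!((len as u64) < (1u64 << Self::LEN_BITS));
        let mut v = (pos as u64) << (Self::LEN_BITS + 1);
        v |= (len as u64) << 1;
        v |= will_init as u64;
        PackedIndexEntry(v)
    }

    fn pos(&self) -> u32 {
        (self.0 >> (Self::LEN_BITS + 1)) as u32
    }
    fn len(&self) -> u32 {
        ((self.0 >> 1) & ((1u64 << Self::LEN_BITS) - 1)) as u32
    }
    fn will_init(&self) -> bool {
        (self.0 & 1) == 1
    }
}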

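For item 3: the EphemeralFile is a buffered writer, so a read may cover bytes that have already been written to the VirtualFile as well as bytes still sitting in the in-memory write buffer. The following is a minimal self-contained sketch of that idea; plain byte vectors stand in for the VirtualFile and the buffer here, whereas the real read_exact_at_eof_ok is async and works on owned IO buffers.

// Sketch only: illustrative stand-ins, not the pageserver API.
struct SketchEphemeralFile {
    flushed: Vec<u8>, // stands in for bytes already written to the VirtualFile
    tail: Vec<u8>,    // stands in for the in-memory write buffer (not yet flushed)
}

impl SketchEphemeralFile {
    /// Read up to `dst.len()` bytes at `offset` in the logical file
    /// (= flushed bytes followed by the in-memory tail). Reads running past
    /// the end of the file are truncated ("EOF is ok"); returns bytes filled.
    fn read_exact_at_eof_ok(&self, offset: usize, dst: &mut [u8]) -> usize {
        let mut filled = 0;
        // Serve the prefix that already lives in the VirtualFile.
        if offset < self.flushed.len() {
            let n = dst.len().min(self.flushed.len() - offset);
            dst[..n].copy_from_slice(&self.flushed[offset..offset + n]);
            filled = n;
        }
        // Serve the remainder from the in-memory buffer.
        let tail_offset = (offset + filled).saturating_sub(self.flushed.len());
        if filled < dst.len() && tail_offset < self.tail.len() {
            let n = (dst.len() - filled).min(self.tail.len() - tail_offset);
            dst[filled..filled + n].copy_from_slice(&self.tail[tail_offset..tail_offset + n]);
            filled += n;
        }
        filled
    }
}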
The get_values_reconstruct_data changes may seem like overkill, but they are necessary so that we issue an equivalent number of read system calls as before this PR, where it was highly likely that, even if the first PageCache access was a miss, the remaining reads within the same get_values_reconstruct_data call against the same EphemeralFile page were hits.

The "DIO chunk" stuff is truly unnecessary for page cache bypass, but, since we're working on direct IO and #8719 specifically, we need to do something like this anyways in the near future.

Alternative Design

The original plan was to reuse the vectored_blob_io code, but it relies on the invariant of Delta & Image layers that index order == values order.

Further, the vectored_blob_io code's strategy for merging IOs is limited to adjacent reads. However, with direct IO, there is another level of merging that should be done: if multiple reads map to the same "DIO chunk" (= an alignment-requirement-sized and -aligned region of the file), then it's "free" to read the chunk into an IO buffer once and serve both reads from that buffer.
=> #8719

Testing / Performance

Correctness of the IO merging code is ensured by unit tests.

Additionally, minimal tests are added for the EphemeralFile implementation and the bit-packed InMemoryLayerIndexValue.

Performance testing results are presented below.
All perf testing was done on my M2 MacBook Pro, running a Linux VM.
It's a release build without --features testing.

We see a definitive improvement in the ingest-performance microbenchmark and in an ad-hoc getpage microbenchmark against InMemoryLayer.

baseline: commit 7c74112b2a6e23c07bfd9cc62c240cd6bbdd3bd9 origin/main
HEAD: ef1c55c52e0c313be4d302794d29534591f9cdc5
cargo bench --bench bench_ingest -- 'ingest 128MB/100b seq, no delta'

baseline

ingest-small-values/ingest 128MB/100b seq, no delta
                        time:   [483.50 ms 498.73 ms 522.53 ms]
                        thrpt:  [244.96 MiB/s 256.65 MiB/s 264.73 MiB/s]

HEAD

ingest-small-values/ingest 128MB/100b seq, no delta
                        time:   [479.22 ms 482.92 ms 487.35 ms]
                        thrpt:  [262.64 MiB/s 265.06 MiB/s 267.10 MiB/s]

We don't have a getpage micro-benchmark for InMemoryLayer, and it's quite cumbersome to add one, so I did manual testing in neon_local.


  ./target/release/neon_local stop
  rm -rf .neon
  ./target/release/neon_local init
  ./target/release/neon_local start
  ./target/release/neon_local tenant create --set-default
  ./target/release/neon_local endpoint create foo
  ./target/release/neon_local endpoint start foo
  psql 'postgresql://[email protected]:55432/postgres'
psql (13.16 (Debian 13.16-0+deb11u1), server 15.7)

CREATE TABLE wal_test (
    id SERIAL PRIMARY KEY,
    data TEXT
);

DO $$
DECLARE
    i INTEGER := 1;
BEGIN
    WHILE i <= 500000 LOOP
        INSERT INTO wal_test (data) VALUES ('data');
        i := i + 1;
    END LOOP;
END $$;

-- => result is one L0 from initdb and one 137M-sized ephemeral-2

DO $$
DECLARE
    i INTEGER := 1;
    random_id INTEGER;
    random_record wal_test%ROWTYPE;
    start_time TIMESTAMP := clock_timestamp();
    selects_completed INTEGER := 0;
    min_id INTEGER := 1;  -- Minimum ID value
    max_id INTEGER := 100000;  -- Maximum ID value, based on your insert range
    iters INTEGER := 100000000;  -- Number of iterations to run
BEGIN
    WHILE i <= iters LOOP
        -- Generate a random ID within the known range
        random_id := min_id + floor(random() * (max_id - min_id + 1))::int;

        -- Select the row with the generated random ID
        SELECT * INTO random_record
        FROM wal_test
        WHERE id = random_id;

        -- Increment the select counter
        selects_completed := selects_completed + 1;

        -- Check if a second has passed
        IF EXTRACT(EPOCH FROM clock_timestamp() - start_time) >= 1 THEN
            -- Print the number of selects completed in the last second
            RAISE NOTICE 'Selects completed in last second: %', selects_completed;

            -- Reset counters for the next second
            selects_completed := 0;
            start_time := clock_timestamp();
        END IF;

        -- Increment the loop counter
        i := i + 1;
    END LOOP;
END $$;

./target/release/neon_local stop

baseline: commit 7c74112b2a6e23c07bfd9cc62c240cd6bbdd3bd9 origin/main

NOTICE:  Selects completed in last second: 1864
NOTICE:  Selects completed in last second: 1850
NOTICE:  Selects completed in last second: 1851
NOTICE:  Selects completed in last second: 1918
NOTICE:  Selects completed in last second: 1911
NOTICE:  Selects completed in last second: 1879
NOTICE:  Selects completed in last second: 1858
NOTICE:  Selects completed in last second: 1827
NOTICE:  Selects completed in last second: 1933

ours

NOTICE:  Selects completed in last second: 1915
NOTICE:  Selects completed in last second: 1928
NOTICE:  Selects completed in last second: 1913
NOTICE:  Selects completed in last second: 1932
NOTICE:  Selects completed in last second: 1846
NOTICE:  Selects completed in last second: 1955
NOTICE:  Selects completed in last second: 1991
NOTICE:  Selects completed in last second: 1973

NB: the ephemeral file sizes differ by ca. 1 MiB, ours being the smaller one.

Rollout

This PR changes the code in-place and is not gated by a feature flag.

@problame problame changed the title PAUSE: this is a dead-end at this time WIP: bypass PageCache for InMemoryLayer::get_values_reconstruct_data Jul 29, 2024
github-actions bot commented Jul 29, 2024

3807 tests run: 3700 passed, 1 failed, 106 skipped (full report)


Failures on Postgres 14

# Run all failed tests locally:
scripts/pytest -vv -n $(nproc) -k "test_physical_replication[release-pg14]"
Flaky tests (10)

Postgres 16

Postgres 15

Postgres 14

Code coverage* (full report)

  • functions: 32.6% (7401 of 22730 functions)
  • lines: 50.8% (60037 of 118295 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
b184f77 at 2024-08-28T18:40:06.504Z :recycle:

@problame problame force-pushed the problame/inmemory-layer-offset-u32 branch from 2b98201 to e0b40fa Compare August 13, 2024 17:55
@problame problame changed the base branch from main to problame/refactor-write-path-take-slice August 13, 2024 20:25
@problame problame force-pushed the problame/inmemory-layer-offset-u32 branch 2 times, most recently from 59a0df0 to e408cba Compare August 14, 2024 14:37
problame added a commit that referenced this pull request Aug 14, 2024
…ces (#8717)

The `tokio_epoll_uring::Slice` / `tokio_uring::Slice` type is weird.
The new `FullSlice` newtype is better. See the doc comment for details.

The naming is not ideal, but we'll clean that up in a future refactoring
where we move the `FullSlice` into `tokio_epoll_uring`. Then, we'll do
the following:
* tokio_epoll_uring::Slice is removed
* `FullSlice` becomes `tokio_epoll_uring::IoBufView`
* new type `tokio_epoll_uring::IoBufMutView` for the current
`tokio_epoll_uring::Slice<IoBufMut>`

Context
-------

I did this work in preparation for
#8537.
There, I'm changing the type that the `inmemory_layer.rs` passes to
`DeltaLayerWriter::put_value_bytes` and thus it seemed like a good
opportunity to make this cleanup first.
Base automatically changed from problame/refactor-write-path-take-slice to main August 14, 2024 19:57
@problame problame force-pushed the problame/inmemory-layer-offset-u32 branch from e013f76 to b580a44 Compare August 14, 2024 19:58
@problame problame force-pushed the problame/inmemory-layer-offset-u32 branch 2 times, most recently from 6b27371 to c7e2846 Compare August 14, 2024 20:07
@problame problame force-pushed the problame/inmemory-layer-offset-u32 branch from c7e2846 to fb78185 Compare August 15, 2024 11:31
cargo bench --bench bench_ingest -- 'ingest 128MB/100b seq, no delta'

baseline: commit d9a57ae (origin/main)

ingest-small-values/ingest 128MB/100b seq, no delta
                        time:   [527.44 ms 543.79 ms 562.97 ms]
                        thrpt:  [227.36 MiB/s 235.39 MiB/s 242.68 MiB/s]

HEAD~1

ingest-small-values/ingest 128MB/100b seq, no delta
                        time:   [491.37 ms 494.69 ms 498.30 ms]
                        thrpt:  [256.87 MiB/s 258.75 MiB/s 260.49 MiB/s]
on my M2 MacBook Pro, running a Linux VM

(Same neon_local setup and SQL workload as shown in the PR description above.)

baseline: commit d9a57ae (origin/main)

NOTICE:  Selects completed in last second: 1286
NOTICE:  Selects completed in last second: 1352
NOTICE:  Selects completed in last second: 1365
NOTICE:  Selects completed in last second: 1399
NOTICE:  Selects completed in last second: 1410
NOTICE:  Selects completed in last second: 1393
NOTICE:  Selects completed in last second: 1316

ours

NOTICE:  Selects completed in last second: 1541
NOTICE:  Selects completed in last second: 1536
NOTICE:  Selects completed in last second: 1493
NOTICE:  Selects completed in last second: 1379
NOTICE:  Selects completed in last second: 1519
NOTICE:  Selects completed in last second: 1546
NOTICE:  Selects completed in last second: 1489
NOTICE:  Selects completed in last second: 1578
NOTICE:  Selects completed in last second: 1508
@problame problame changed the title WIP: bypass PageCache for InMemoryLayer::get_values_reconstruct_data bypass PageCache for InMemoryLayer + avoid Value::deser on L0 flush Aug 15, 2024
@problame problame marked this pull request as ready for review August 15, 2024 16:39
@problame problame requested a review from a team as a code owner August 15, 2024 16:39
@problame (Contributor, Author) commented:

@jcsp @VladLazar I think this is ready for a first review.

Wanted to get @jcsp 's eyes on it because of sharded ingest.

And @VladLazar 's eyes because of vectored get / direct IO.

@yliang412, you might also want to take another look at get_values_reconstruct_data, it evolved a bit since you last checked. But, no need to do a full review.

@VladLazar (Contributor) left a comment:

I focused on the write path and the new implementation of get_values_reconstruct_data. I need another pass for the rest.

Review threads (resolved): pageserver/src/tenant/ephemeral_file.rs, pageserver/src/tenant/storage_layer/inmemory_layer.rs
@problame (Contributor, Author) commented:

Reran benchmarks, still a slight improvement (see "edits" on PR description for prior benchmark results)

@VladLazar (Contributor) left a comment:

Went through all of it again minus vectored dio. Looks good; nice job!

Review threads (resolved): pageserver/src/tenant/storage_layer/inmemory_layer.rs, pageserver/src/tenant/ephemeral_file.rs, pageserver/src/tenant/timeline.rs
@jcsp (Collaborator) left a comment:

ok in principle, couple of suggestions on naming, and would like to confirm whether the TODO about strictly enforcing checkpoint/layer sizes is a risk to deploying this

@problame problame enabled auto-merge (squash) August 28, 2024 18:10
@problame problame merged commit 9627747 into main Aug 28, 2024
67 checks passed
@problame problame deleted the problame/inmemory-layer-offset-u32 branch August 28, 2024 18:31