
bypass PageCache for InMemoryLayer + avoid Value::deser on L0 flush #8537

Merged
merged 52 commits into main from problame/inmemory-layer-offset-u32 on Aug 28, 2024

Conversation

@problame (Contributor) commented Jul 29, 2024

Part of Epic: Bypass PageCache for user data blocks.

Problem

InMemoryLayer still uses the PageCache for all data stored in the VirtualFile that underlies the EphemeralFile.

Background

Before this PR, EphemeralFile is a fancy (and code-bloated) buffered writer around a VirtualFile that supports blob_io.

The InMemoryLayerInner::index stores offsets into the EphemeralFile.
At each such offset, we find a varint length followed by the serialized Value.

Vectored reads (get_values_reconstruct_data) are not in fact vectored - each Value that needs to be read is read sequentially.

The will_init bit of information, which we use to early-exit get_values_reconstruct_data for a given key, is stored in the serialized Value, meaning we have to read & deserialize the Value from the EphemeralFile just to get at it.

L0 flushing likewise needs to re-determine will_init, which it currently does by deserializing each Value during the flush.

Changes

  1. Store the value length and will_init information in the InMemoryLayer::index, so the EphemeralFile only needs to store the values themselves (see the bit-packing sketch after this list).
  2. For get_values_reconstruct_data:
  • Use the in-memory index to figure out which values need to be read. Having will_init stored in the index is what enables us to do that.
  • View the EphemeralFile as a byte array of "DIO chunks", each 512 bytes in size (adjustable constant). A "DIO chunk" is the minimal unit that we can read under direct IO.
  • Figure out which chunks need to be read to retrieve the serialized bytes for the values we need to read.
  • Coalesce chunk reads such that each DIO chunk is only read once to serve all value reads that need data from that chunk.
  • Merge adjacent chunk reads into larger EphemeralFile::read_exact_at_eof_ok calls of up to 128k (adjustable constant). A sketch of this chunk-read planning follows further below.
  3. The new EphemeralFile::read_exact_at_eof_ok fills the IO buffer from the underlying VirtualFile and/or its in-memory buffer (see the read sketch after this list).
  4. The L0 flush code is changed to use the index directly instead of going through blob_io and Value deserialization.
  5. We can remove the ephemeral_file::page_caching construct now.
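Item 1 is the heart of the change: once length and will_init live in the in-memory index, neither get_values_reconstruct_data nor L0 flush has to deserialize a Value just to learn will_init. The snippet below is only a minimal sketch of the bit-packing idea; the field widths and names are made up for illustration and the actual bit-packed InMemoryLayerIndexValue in this PR may use a different layout.

// Illustrative sketch of packing (pos, len, will_init) into a u64.
// Not the actual InMemoryLayerIndexValue definition.
#[derive(Clone, Copy)]
struct PackedIndexEntry(u64);

impl PackedIndexEntry {
    const LEN_BITS: u32 = 31; // illustrative split: 32 bits pos, 31 bits len, 1 bit will_init

    fn new(pos: u32, len: u32, will_init: bool) -> Self {
        assert!((len as u64) < (1u64 << Self::LEN_BITS));
        let mut v = (pos as u64) << (Self::LEN_BITS + 1);
        v |= (len as u64) << 1;
        v |= will_init as u64;
        PackedIndexEntry(v)
    }

    fn pos(&self) -> u32 {
        (self.0 >> (Self::LEN_BITS + 1)) as u32
    }
    fn len(&self) -> u32 {
        ((self.0 >> 1) & ((1u64 << Self::LEN_BITS) - 1)) as u32
    }
    fn will_init(&self) -> bool {
        (self.0 & 1) == 1
    }
}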

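For item 3: the EphemeralFile is a buffered writer, so a read may cover bytes that have already been written to the VirtualFile as well as bytes still sitting in the in-memory write buffer. The following is a minimal self-contained sketch of that idea; plain byte vectors stand in for the VirtualFile and the buffer here, whereas the real read_exact_at_eof_ok is async and works on owned IO buffers.

// Sketch only: illustrative stand-ins, not the pageserver API.
struct SketchEphemeralFile {
    flushed: Vec<u8>, // stands in for bytes already written to the VirtualFile
    tail: Vec<u8>,    // stands in for the in-memory write buffer (not yet flushed)
}

impl SketchEphemeralFile {
    /// Read up to `dst.len()` bytes at `offset` in the logical file
    /// (= flushed bytes followed by the in-memory tail). Reads running past
    /// the end of the file are truncated ("EOF is ok"); returns bytes filled.
    fn read_exact_at_eof_ok(&self, offset: usize, dst: &mut [u8]) -> usize {
        let mut filled = 0;
        // Serve the prefix that already lives in the VirtualFile.
        if offset < self.flushed.len() {
            let n = dst.len().min(self.flushed.len() - offset);
            dst[..n].copy_from_slice(&self.flushed[offset..offset + n]);
            filled = n;
        }
        // Serve the remainder from the in-memory buffer.
        let tail_offset = (offset + filled).saturating_sub(self.flushed.len());
        if filled < dst.len() && tail_offset < self.tail.len() {
            let n = (dst.len() - filled).min(self.tail.len() - tail_offset);
            dst[filled..filled + n].copy_from_slice(&self.tail[tail_offset..tail_offset + n]);
            filled += n;
        }
        filled
    }
}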
The get_values_reconstruct_data changes may seem like overkill, but they are necessary so that we issue an equivalent number of read system calls as before this PR, where it was highly likely that, even if the first PageCache access was a miss, the remaining reads within the same get_values_reconstruct_data call against the same EphemeralFile page were hits.

The "DIO chunk" stuff is truly unnecessary for page cache bypass, but, since we're working on direct IO and #8719 specifically, we need to do something like this anyways in the near future.

Alternative Design

The original plan was to reuse the vectored_blob_io code, but it relies on the invariant of Delta & Image layers that index order == values order.

Further, the vectored_blob_io code's strategy for merging IOs is limited to adjacent reads. However, with direct IO, there is another level of merging that should be done: if multiple reads map to the same "DIO chunk" (= an alignment-requirement-sized and -aligned region of the file), then it's "free" to read the chunk into an IO buffer once and serve both reads from that buffer.
=> #8719

Testing / Performance

Correctness of the IO merging code is ensured by unit tests.

Additionally, minimal tests are added for the EphemeralFile implementation and the bit-packed InMemoryLayerIndexValue.

Performance testing results are presented below.
All perf testing was done on my M2 MacBook Pro, running a Linux VM.
It's a release build without --features testing.

We see a definitive improvement in the ingest-performance microbenchmark and in an ad-hoc getpage microbenchmark against InMemoryLayer.

baseline: commit 7c74112b2a6e23c07bfd9cc62c240cd6bbdd3bd9 origin/main
HEAD: ef1c55c52e0c313be4d302794d29534591f9cdc5
cargo bench --bench bench_ingest -- 'ingest 128MB/100b seq, no delta'

baseline

ingest-small-values/ingest 128MB/100b seq, no delta
                        time:   [483.50 ms 498.73 ms 522.53 ms]
                        thrpt:  [244.96 MiB/s 256.65 MiB/s 264.73 MiB/s]

HEAD

ingest-small-values/ingest 128MB/100b seq, no delta
                        time:   [479.22 ms 482.92 ms 487.35 ms]
                        thrpt:  [262.64 MiB/s 265.06 MiB/s 267.10 MiB/s]

We don't have a getpage micro-benchmark for InMemoryLayer, and it's quite cumbersome to add one, so I did manual testing in neon_local.


  ./target/release/neon_local stop
  rm -rf .neon
  ./target/release/neon_local init
  ./target/release/neon_local start
  ./target/release/neon_local tenant create --set-default
  ./target/release/neon_local endpoint create foo
  ./target/release/neon_local endpoint start foo
  psql 'postgresql://[email protected]:55432/postgres'
psql (13.16 (Debian 13.16-0+deb11u1), server 15.7)

CREATE TABLE wal_test (
    id SERIAL PRIMARY KEY,
    data TEXT
);

DO $$
DECLARE
    i INTEGER := 1;
BEGIN
    WHILE i <= 500000 LOOP
        INSERT INTO wal_test (data) VALUES ('data');
        i := i + 1;
    END LOOP;
END $$;

-- => result is one L0 from initdb and one 137M-sized ephemeral-2

DO $$
DECLARE
    i INTEGER := 1;
    random_id INTEGER;
    random_record wal_test%ROWTYPE;
    start_time TIMESTAMP := clock_timestamp();
    selects_completed INTEGER := 0;
    min_id INTEGER := 1;  -- Minimum ID value
    max_id INTEGER := 100000;  -- Maximum ID value, based on your insert range
    iters INTEGER := 100000000;  -- Number of iterations to run
BEGIN
    WHILE i <= iters LOOP
        -- Generate a random ID within the known range
        random_id := min_id + floor(random() * (max_id - min_id + 1))::int;

        -- Select the row with the generated random ID
        SELECT * INTO random_record
        FROM wal_test
        WHERE id = random_id;

        -- Increment the select counter
        selects_completed := selects_completed + 1;

        -- Check if a second has passed
        IF EXTRACT(EPOCH FROM clock_timestamp() - start_time) >= 1 THEN
            -- Print the number of selects completed in the last second
            RAISE NOTICE 'Selects completed in last second: %', selects_completed;

            -- Reset counters for the next second
            selects_completed := 0;
            start_time := clock_timestamp();
        END IF;

        -- Increment the loop counter
        i := i + 1;
    END LOOP;
END $$;

./target/release/neon_local stop

baseline: commit 7c74112b2a6e23c07bfd9cc62c240cd6bbdd3bd9 origin/main

NOTICE:  Selects completed in last second: 1864
NOTICE:  Selects completed in last second: 1850
NOTICE:  Selects completed in last second: 1851
NOTICE:  Selects completed in last second: 1918
NOTICE:  Selects completed in last second: 1911
NOTICE:  Selects completed in last second: 1879
NOTICE:  Selects completed in last second: 1858
NOTICE:  Selects completed in last second: 1827
NOTICE:  Selects completed in last second: 1933

ours

NOTICE:  Selects completed in last second: 1915
NOTICE:  Selects completed in last second: 1928
NOTICE:  Selects completed in last second: 1913
NOTICE:  Selects completed in last second: 1932
NOTICE:  Selects completed in last second: 1846
NOTICE:  Selects completed in last second: 1955
NOTICE:  Selects completed in last second: 1991
NOTICE:  Selects completed in last second: 1973

NB: the ephemeral file sizes differ by ca. 1 MiB, ours being the smaller one.

Rollout

This PR changes the code in-place and is not gated by a feature flag.

@problame problame changed the title PAUSE: this is a dead-end at this time WIP: bypass PageCache for InMemoryLayer::get_values_reconstruct_data Jul 29, 2024
github-actions bot commented Jul 29, 2024

3807 tests run: 3700 passed, 1 failed, 106 skipped (full report)


Failures on Postgres 14

# Run all failed tests locally:
scripts/pytest -vv -n $(nproc) -k "test_physical_replication[release-pg14]"
Flaky tests (10)

Postgres 16

Postgres 15

Postgres 14

Code coverage* (full report)

  • functions: 32.6% (7401 of 22730 functions)
  • lines: 50.8% (60037 of 118295 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
b184f77 at 2024-08-28T18:40:06.504Z :recycle:

@problame problame force-pushed the problame/inmemory-layer-offset-u32 branch from 2b98201 to e0b40fa Compare August 13, 2024 17:55
@problame problame changed the base branch from main to problame/refactor-write-path-take-slice August 13, 2024 20:25
@problame problame force-pushed the problame/inmemory-layer-offset-u32 branch 2 times, most recently from 59a0df0 to e408cba Compare August 14, 2024 14:37
problame added a commit that referenced this pull request Aug 14, 2024
…ces (#8717)

The `tokio_epoll_uring::Slice` / `tokio_uring::Slice` type is weird.
The new `FullSlice` newtype is better. See the doc comment for details.

The naming is not ideal, but we'll clean that up in a future refactoring
where we move the `FullSlice` into `tokio_epoll_uring`. Then, we'll do
the following:
* tokio_epoll_uring::Slice is removed
* `FullSlice` becomes `tokio_epoll_uring::IoBufView`
* new type `tokio_epoll_uring::IoBufMutView` for the current
`tokio_epoll_uring::Slice<IoBufMut>`

Context
-------

I did this work in preparation for
#8537.
There, I'm changing the type that the `inmemory_layer.rs` passes to
`DeltaLayerWriter::put_value_bytes` and thus it seemed like a good
opportunity to make this cleanup first.
Base automatically changed from problame/refactor-write-path-take-slice to main August 14, 2024 19:57
@problame problame force-pushed the problame/inmemory-layer-offset-u32 branch from e013f76 to b580a44 Compare August 14, 2024 19:58
@problame problame force-pushed the problame/inmemory-layer-offset-u32 branch 2 times, most recently from 6b27371 to c7e2846 Compare August 14, 2024 20:07
@problame problame force-pushed the problame/inmemory-layer-offset-u32 branch from c7e2846 to fb78185 Compare August 15, 2024 11:31
cargo bench --bench bench_ingest -- 'ingest 128MB/100b seq, no delta'

baseline: commit d9a57ae (origin/main)

ingest-small-values/ingest 128MB/100b seq, no delta
                        time:   [527.44 ms 543.79 ms 562.97 ms]
                        thrpt:  [227.36 MiB/s 235.39 MiB/s 242.68 MiB/s]

HEAD~1

ingest-small-values/ingest 128MB/100b seq, no delta
                        time:   [491.37 ms 494.69 ms 498.30 ms]
                        thrpt:  [256.87 MiB/s 258.75 MiB/s 260.49 MiB/s]
on my M2 MacBook Pro, running a Linux VM

(Same neon_local setup and SQL workload as shown in the PR description above.)

baseline: commit d9a57ae (origin/main)

NOTICE:  Selects completed in last second: 1286
NOTICE:  Selects completed in last second: 1352
NOTICE:  Selects completed in last second: 1365
NOTICE:  Selects completed in last second: 1399
NOTICE:  Selects completed in last second: 1410
NOTICE:  Selects completed in last second: 1393
NOTICE:  Selects completed in last second: 1316

ours

NOTICE:  Selects completed in last second: 1541
NOTICE:  Selects completed in last second: 1536
NOTICE:  Selects completed in last second: 1493
NOTICE:  Selects completed in last second: 1379
NOTICE:  Selects completed in last second: 1519
NOTICE:  Selects completed in last second: 1546
NOTICE:  Selects completed in last second: 1489
NOTICE:  Selects completed in last second: 1578
NOTICE:  Selects completed in last second: 1508
@problame problame changed the title WIP: bypass PageCache for InMemoryLayer::get_values_reconstruct_data bypass PageCache for InMemoryLayer + avoid Value::deser on L0 flush Aug 15, 2024
@problame problame marked this pull request as ready for review August 15, 2024 16:39
@problame problame requested a review from a team as a code owner August 15, 2024 16:39
@problame (Contributor, Author) commented:

@jcsp @VladLazar I think this is ready for a first review.

Wanted to get @jcsp 's eyes on it because of sharded ingest.

And @VladLazar 's eyes because of vectored get / direct IO.

@yliang412, you might also want to take another look at get_values_reconstruct_data, it evolved a bit since you last checked. But, no need to do a full review.

@VladLazar (Contributor) left a comment:

I focused on the write path and the new implementation of get_values_reconstruct_data. I need another pass for the rest.

Review threads (resolved): pageserver/src/tenant/ephemeral_file.rs, pageserver/src/tenant/storage_layer/inmemory_layer.rs
@problame (Contributor, Author) commented:

Reran benchmarks, still a slight improvement (see "edits" on PR description for prior benchmark results)

@VladLazar (Contributor) left a comment:

Went through all of it again minus vectored dio. Looks good; nice job!

Review threads (resolved): pageserver/src/tenant/storage_layer/inmemory_layer.rs, pageserver/src/tenant/ephemeral_file.rs, pageserver/src/tenant/timeline.rs
@jcsp (Collaborator) left a comment:

ok in principle, couple of suggestions on naming, and would like to confirm whether the TODO about strictly enforcing checkpoint/layer sizes is a risk to deploying this

@problame problame enabled auto-merge (squash) August 28, 2024 18:10
@problame problame merged commit 9627747 into main Aug 28, 2024
67 checks passed
@problame problame deleted the problame/inmemory-layer-offset-u32 branch August 28, 2024 18:31