pageserver: issue concurrent IO on the read path #9353

Merged
merged 166 commits into main from vlad/read-path-concurrent-io on Jan 22, 2025

Conversation

VladLazar
Contributor

@VladLazar VladLazar commented Oct 10, 2024

Refs

Co-authored-by: Vlad Lazar [email protected]
Co-authored-by: Christian Schwarz [email protected]

Problem

The read path does its IOs sequentially.
This means that if N values need to be read to reconstruct a page,
we will do N IOs and getpage latency is O(N*IoLatency).

Solution

With this PR we gain the ability to issue IO concurrently within one
layer visit and to move on to the next layer without waiting for IOs
from the previous visit to complete.

This is an evolved version of the work done at the Lisbon hackathon,
cf #9002.

Design

will_init now sourced from disk btree index keys

On the algorithmic level, the only change is that get_values_reconstruct_data
now sources will_init from the disk btree index key (which is PS-page_cache'd) instead
of from the Value, which is only available after the IO completes.
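A minimal sketch of that check, with hypothetical, heavily simplified types (the real BlobMeta and traversal state carry much more): the decision to stop descending into older layers for a key is now made from index metadata alone, before any value IO has completed.

```rust
// Hypothetical, simplified sketch; not the actual pageserver types.
struct BlobMeta {
    will_init: bool, // sourced from the disk btree index key (PS-page_cache'd)
}

enum KeyTraversalState {
    NeedsOlderLayers, // keep descending into older layers for this key
    Complete,         // an image / init record has been planned; stop here
}

fn after_planning_value(meta: &BlobMeta) -> KeyTraversalState {
    if meta.will_init {
        // Previously this was only known after reading the Value from disk.
        KeyTraversalState::Complete
    } else {
        KeyTraversalState::NeedsOlderLayers
    }
}
```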

Concurrent IOs, Submission & Completion

To separate IO submission from waiting for its completion, while simultaneously
feature-gating the change, we introduce the notion of an IoConcurrency struct
through which IO futures are "spawned".

An IO is an opaque future, and waiting for completions is handled through
tokio::sync::oneshot channels.
The oneshot Receivers take the place of the img and records fields
inside VectoredValueReconstructState.

When we're done visiting all the layers and submitting all the IOs along the way,
we concurrently collect_pending_ios for each value, which means that
for each value there is a future that awaits all of its oneshot receivers
and then calls into walredo to reconstruct the page image.
Walredo is now invoked concurrently for each value instead of sequentially.
Walredo itself remains unchanged.

The spawned IO futures are driven to completion by a sidecar tokio task that
is separate from the task that performs all the layer visiting and spawning of IOs.
That task receives the IO futures via an unbounded mpsc channel and
drives them to completion inside a FuturesUnordered.

(The behavior from before this PR is available through IoConcurrency::Sequential,
which awaits the IO futures in place, without "spawning" or "submitting" them
anywhere.)
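Here is a minimal sketch of that submission/completion split, under assumed, simplified names (IoConcurrencyHandle, IoFuture, start_sidecar_task; the real IoConcurrency also carries gate guards, error plumbing, and the Sequential variant): the caller "spawns" opaque IO futures over an unbounded mpsc channel, and a sidecar task drives them inside a FuturesUnordered.

```rust
use futures::{stream::FuturesUnordered, StreamExt};
use std::{future::Future, pin::Pin};
use tokio::sync::mpsc;

type IoFuture = Pin<Box<dyn Future<Output = ()> + Send>>;

struct IoConcurrencyHandle {
    tx: mpsc::UnboundedSender<IoFuture>,
}

impl IoConcurrencyHandle {
    /// "Spawn" an IO: hand the opaque future to the sidecar task. Returns
    /// immediately; completion is observed via whatever the future itself
    /// resolves (e.g. a tokio::sync::oneshot it holds the sender of).
    fn spawn_io(&self, fut: impl Future<Output = ()> + Send + 'static) {
        let _ = self.tx.send(Box::pin(fut));
    }
}

fn start_sidecar_task() -> IoConcurrencyHandle {
    let (tx, mut rx) = mpsc::unbounded_channel::<IoFuture>();
    tokio::spawn(async move {
        let mut in_flight = FuturesUnordered::new();
        loop {
            tokio::select! {
                maybe_fut = rx.recv() => match maybe_fut {
                    Some(fut) => in_flight.push(fut),
                    None => break, // all handles dropped; no more submissions
                },
                Some(()) = in_flight.next(), if !in_flight.is_empty() => {
                    // one IO future completed; nothing else to do here
                }
            }
        }
        // Drain whatever is still in flight so no IO is left dangling.
        while in_flight.next().await.is_some() {}
    });
    IoConcurrencyHandle { tx }
}
```

In terms of this sketch, the Sequential mode from this PR corresponds to awaiting the future directly inside spawn_io instead of sending it to the channel.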

Alternatives Explored

A few words on the rationale behind having a sidecar task and what
alternatives were considered.

One option is to queue up all IO futures in a FuturesUnordered that is polled
for the first time when we collect_pending_ios. This doesn't work, for two reasons.

First, the IO futures are opaque, compiler-generated futures that need
to be polled at least once to submit their IO. "At least once" because
tokio-epoll-uring may not be able to submit the IO to the kernel right away
on the first poll.

Second, there are deadlocks if we don't drive the IO futures to completion
independently of the spawning task.
The reason is that both the IO futures and the spawning task may hold some
shared, limited resources while trying to acquire more of them
(an abstract sketch follows the list below).
For example, both the spawning task and an IO future may try to acquire

  • a VirtualFile file descriptor cache slot async mutex (observed during impl)
  • a tokio-epoll-uring submission slot (observed during impl)
  • a PageCache slot (currently this is not the case but we may move more code into the IO futures in the future)
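To make the hazard concrete, here is an abstract, stand-alone illustration (not pageserver code): a 1-permit semaphore stands in for a limited resource such as a tokio-epoll-uring submission slot.

```rust
use std::{sync::Arc, time::Duration};
use tokio::sync::Semaphore;

#[tokio::main]
async fn main() {
    let slots = Arc::new(Semaphore::new(1));

    // "IO future": acquires a slot and only releases it when polled to completion.
    let io_future = {
        let slots = Arc::clone(&slots);
        async move {
            let _slot = slots.acquire_owned().await.unwrap();
            tokio::time::sleep(Duration::from_millis(10)).await; // the "IO"
            // _slot is released here, but only if someone keeps polling this future.
        }
    };

    // Hazard: an IO future that has been polled just far enough to take the slot,
    // but is then parked until collect time, keeps holding the slot; if the task
    // that parked it then needs a slot itself, neither side makes progress.
    // Driving the future in an independent task (the sidecar task, in miniature)
    // breaks that cycle:
    tokio::spawn(io_future);
    let _slot = slots.acquire_owned().await.unwrap(); // completes once the IO finishes
}
```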

Another option is to spawn a short-lived tokio::task for each IO future.
We implemented and benchmarked it during development, but found little
throughput improvement and moderate mean & tail latency degradation.
Concerns about pressure on the tokio scheduler made us discard this variant.

The sidecar task could be obsoleted if the IOs were not arbitrary code but a well-defined struct.
However,

  1. the opaque-futures approach taken in this PR leaves the existing
    code unchanged, which in turn
  2. allows us to implement the IoConcurrency::Sequential mode for feature-gating
    the change.

Once the new sidecar-task implementation is rolled out everywhere
and ::Sequential is removed, we can think about a descriptive submission & completion interface.
The deadlock problems pointed out earlier will need to be solved then.
For example, we could eliminate the VirtualFile file descriptor cache and the tokio-epoll-uring slots.
The latter has been drafted in neondatabase/tokio-epoll-uring#63.

See the lengthy doc comment on spawn_io() for more details.

Error handling

There are two error classes during reconstruct data retrieval:

  • traversal errors: index lookup, move to next layer, and the like
  • value read IO errors

A traversal error fails the entire get_vectored request, as before this PR.
A value read error only fails that value.

In any case, we preserve the existing behavior that once
get_vectored returns, all IOs are done. Panics, or failing
to poll get_vectored to completion, will leave the IOs dangling;
this is safe but shouldn't happen, so a rate-limited
log statement is emitted at warning level.
There is a doc comment on collect_pending_ios giving more code-level
details and rationale.
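The per-value error semantics can be sketched roughly as follows, with assumed, simplified types (Key, OnDiskValue) and walredo reduced to a placeholder; the point is only that each value awaits its own oneshot receivers, so a value-read error is recorded for that key alone.

```rust
use futures::{stream::FuturesUnordered, StreamExt};
use std::collections::BTreeMap;
use tokio::sync::oneshot;

type Key = u64;             // stand-in for the real key type
type OnDiskValue = Vec<u8>; // stand-in for the value read from a layer file

/// One future per value: await that value's IOs, then (in the real code)
/// run walredo. An IO error is recorded for this key only.
async fn collect_pending_ios(
    values: BTreeMap<Key, Vec<oneshot::Receiver<std::io::Result<OnDiskValue>>>>,
) -> BTreeMap<Key, std::io::Result<Vec<u8>>> {
    let mut per_value = FuturesUnordered::new();
    for (key, waiters) in values {
        per_value.push(async move {
            let mut records = Vec::new();
            for rx in waiters {
                match rx.await.expect("IO dropped without completing") {
                    Ok(v) => records.push(v),
                    Err(e) => return (key, Err(e)), // fails only this value
                }
            }
            // Placeholder for the concurrent walredo call; here we just concat.
            (key, Ok(records.concat()))
        });
    }
    let mut results = BTreeMap::new();
    while let Some((key, res)) = per_value.next().await {
        results.insert(key, res);
    }
    results
}
```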

Feature Gating

The new behavior is opt-in via pageserver config.
The Sequential mode is the default.
The only significant change in Sequential mode compared to before
this PR is the buffering of results in the oneshots.

Code-Level Changes

Prep work:

  • Make GateGuard clonable.

Core Feature:

  • Traversal code: track will_init in BlobMeta and source it from
    the Delta/Image/InMemory layer index, instead of determining will_init
    after we've read the value. This avoids having to read the value to
    determine whether traversal can stop.
  • Introduce IoConcurrency & its sidecar task.
    • IoConcurrency is the clonable handle.
    • It connects to the sidecar task via an mpsc.
  • Plumb through IoConcurrency from high level code to the
    individual layer implementations' get_values_reconstruct_data.
    We piggy-back on the ValuesReconstructState for this.
    • The sidecar task should be long-lived, so, IoConcurrency needs
      to be rooted up "high" in the call stack.
    • Roots as of this PR:
      • page_service: outside of pagestream loop
      • create_image_layers: when it is called
      • basebackup (only aux files + replorigin + SLRU segments)
    • Code paths without such a root use IoConcurrency::sequential
  • Transform Delta/Image/InMemoryLayer to
    • do their value IO in a distinct async {} block
    • extend the residence of the Delta/Image layer until the IO is done
    • buffer their results in a oneshot channel instead of straight
      in ValuesReconstructState
    • the oneshot channel is wrapped in OnDiskValueIo / OnDiskValueIoWaiter
      types that aid in expressiveness and are used to keep track of
      in-flight IOs so we can print warnings if we leave them dangling
      (see the sketch after this list).
  • Change ValuesReconstructState to hold the receiving end of the
    oneshot channel aka OnDiskValueIoWaiter.
  • Change get_vectored_impl to collect_pending_ios and issue walredo concurrently, in a FuturesUnordered.
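For illustration, the oneshot wrapper pair and the dangling-IO accounting could look roughly like this (hypothetical and simplified; the real types track more state): the sender half travels with the layer's IO future, the waiter half sits in ValuesReconstructState, and a shared counter lets a Drop impl (not shown) warn about IOs that are still in flight.

```rust
use std::sync::{
    atomic::{AtomicUsize, Ordering},
    Arc,
};
use tokio::sync::oneshot;

type OnDiskValue = Vec<u8>; // stand-in

/// Sender half: travels with the layer's IO future.
pub struct OnDiskValueIo {
    tx: oneshot::Sender<std::io::Result<OnDiskValue>>,
    num_active_ios: Arc<AtomicUsize>,
}

/// Receiver half: stored in ValuesReconstructState.
pub struct OnDiskValueIoWaiter {
    rx: oneshot::Receiver<std::io::Result<OnDiskValue>>,
}

pub fn new_io(num_active_ios: Arc<AtomicUsize>) -> (OnDiskValueIo, OnDiskValueIoWaiter) {
    num_active_ios.fetch_add(1, Ordering::Relaxed);
    let (tx, rx) = oneshot::channel();
    (
        OnDiskValueIo { tx, num_active_ios },
        OnDiskValueIoWaiter { rx },
    )
}

impl OnDiskValueIo {
    /// Called by the IO future once the read is done (or has failed).
    pub fn complete(self, res: std::io::Result<OnDiskValue>) {
        self.num_active_ios.fetch_sub(1, Ordering::Relaxed);
        let _ = self.tx.send(res); // the waiter may already be gone; that's fine
    }
}

impl OnDiskValueIoWaiter {
    pub async fn wait(self) -> std::io::Result<OnDiskValue> {
        self.rx.await.expect("OnDiskValueIo dropped without calling complete()")
    }
}
```

A Drop impl on ValuesReconstructState can then check num_active_ios and emit the rate-limited warning mentioned above (as seen in the "num_active_ios=1" log line quoted later in this thread).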

Testing / Benchmarking:

  • Support queue-depth in pagebench for manual benchmarking.
  • Add test suite support for setting the concurrency-mode pageserver config
    field via a) an env var and b) NeonEnvBuilder.
  • Hacky helper to have sidecar-based IoConcurrency in tests.
    This will be cleaned up later.

More benchmarking will happen post-merge in nightly benchmarks, plus in staging/pre-prod.
Some intermediate helpers for manual benchmarking have been preserved in #10466 and will be landed in later PRs.
(L0 layer stack generator!)

Drive-By:

  • the test suite actually didn't enable batching by default because
    config.compatibility_neon_binpath is always truthy in our CI environment
    => https://neondb.slack.com/archives/C059ZC138NR/p1737490501941309
  • initial logical size calculation wasn't always polled to completion, which was
    surfaced through the added WARN logs emitted when dropping a
    ValuesReconstructState that still has inflight IOs.
  • remove the timing histograms pageserver_getpage_get_reconstruct_data_seconds
    and pageserver_getpage_reconstruct_seconds because with planning, value read
    IO, and walredo happening concurrently, one can no longer attribute latency
    to any one of them; we'll revisit this when Vlad's work on tracing/sampling
    through RequestContext lands.
  • remove code related to get_cached_lsn().
    The logic around this has been dead at runtime for a long time,
    ever since the removal of the materialized page cache in #8105 (remove materialized page cache).

Testing

Unit tests use the sidecar task by default and run both modes in CI.
Python regression tests and benchmarks also use the sidecar task by default.
We'll test more in staging and possibly preprod.

Future Work

Please refer to the parent epic for the full plan.

The next step will be to fold the plumbing of IoConcurrency
into RequestContext so that the function signatures get cleaned up.

Once Sequential isn't used anymore, we can take the next
big leap which is replacing the opaque IOs with structs
that have well-defined semantics.
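Purely as a speculative illustration of what "structs with well-defined semantics" could mean (none of these names exist in the codebase): the IO would be described as data rather than arbitrary code, so the submission/completion machinery can inspect, batch, and cancel it.

```rust
// Speculative sketch only; not part of this PR or the codebase.
/// A fully descriptive on-disk read: no hidden code, just parameters.
struct OnDiskReadDescriptor {
    file_id: u64, // identifies the (pinned) layer file
    offset: u64,
    len: u32,
}

/// Completion delivered back to the reconstruct state.
struct OnDiskReadCompletion {
    descriptor: OnDiskReadDescriptor,
    result: std::io::Result<Vec<u8>>,
}
```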


github-actions bot commented Oct 10, 2024

7370 tests run: 6985 passed, 0 failed, 385 skipped (full report)


Flaky tests (7)

Code coverage* (full report)

  • functions: 33.5% (8493 of 25334 functions)
  • lines: 49.3% (71391 of 144677 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
b0b9206 at 2025-01-22T14:39:32.695Z :recycle:

@VladLazar VladLazar marked this pull request as ready for review October 14, 2024 14:01
@VladLazar VladLazar requested a review from a team as a code owner October 14, 2024 14:01
@VladLazar VladLazar requested review from arssher and erikgrinaker and removed request for arssher October 14, 2024 14:01
@erikgrinaker
Contributor

As of this commit, one IO failure does not stop any other IO requests. When awaiting the IOs to complete, we stop waiting on the first failure, but we do not signal any other pending IOs to complete and they will just fail silently.

Is this true? It really depends on how the IO futures are implemented, but in general, dropping a future should cancel the in-flight operation and stop polling it. Assuming they're implemented that way, it should be sufficient to ensure that the caller receives the error as soon as it happens and then drops the in-flight futures by returning the error. I don't think we need any synchronization beyond that, or am I missing something?

Contributor

@erikgrinaker erikgrinaker left a comment


Still reviewing, just flushing some comments for now. All nits, take them or leave them.

Previously, the read path would wait for all IO in one layer visit to
complete before visiting the next layer (if a subsequent visit is
required). IO within one layer visit was also sequential.

With this patch we gain the ability to issue IO concurrently within one
layer visit **and** to move on to the next layer without waiting for IOs
from the previous visit to complete.

This is a slightly cleaned up version of the work done at the Lisbon
hackathon.

It's obvious the method is unused, but let's break down error handling
of the read path. Before this patch set, all IO was done sequentially
for a given read. If one IO failed, then the error would stop the
processing of the read path.

Now that we are doing IO concurrently when serving a read request
it's not trivial to implement the same error handling approach.
As of this commit, one IO failure does not stop any other IO requests.
When awaiting the IOs to complete, we stop waiting on the first
failure, but we do not signal any other pending IOs to complete and
they will just fail silently.

Long term, we need a better approach for this. Two broad ideas:
1. Introduce some synchronization between pending IO tasks such
that new IOs are not issued after the first failure
2. Cancel any pending IOs when the first error is discovered

Previously, each pending IO sent a stupid buffer which was just what it
read from the layer file for the key. This made the awaiter code
confusing because on disk images in layer files don't keep the enum wrapper,
but the ones in delta layers do.

This commit introduces a type to make this a bit easier and cleans up
the IO awaiting code a bit. We also avoid some rather silly serialize,
deserialize dance.

We now only store indices in the page cache.
This commit removes any caching support from the read path.

`BlobMeta::will_init` is not actually used on these code paths,
but let's be kind to our future selves and make sure it's correct.

One can configure this via the NEON_PAGESERVER_VALUE_RECONSTRUCT_IO_CONCURRENCY
env var. A config is possible as well, but it's more work and this is
enough for experimentation.
@VladLazar VladLazar force-pushed the vlad/read-path-concurrent-io branch from 73aa1c6 to dba6968 Compare November 4, 2024 13:43
@VladLazar
Contributor Author

As of this commit, one IO failure does not stop any other IO requests. When awaiting the IOs to complete, we stop waiting on the first failure, but we do not signal any other pending IOs to complete and they will just fail silently.

Is this true? It really depends on how the IO futures are implemented, but in general, dropping a future should cancel the in-flight operation and stop polling it. Assuming they're implemented that way, it should be sufficient to ensure that the caller receives the error as soon as it happens and then drops the in-flight futures by returning the error. I don't think we need any synchronization beyond that, or am I missing something?

I missed this question.

All "IO futures" are collected in VectoredValueReconstructState::on_disk_values. With this PR, the read path does not wait for the outcome of one IO before issuing the next one. Hence, at the end of the layer traversal, we will have created "IO futures" for all required values. Each "IO future" is independent. If the first one fails, the others are still present.

Let's also consider what happens when all the "IO futures" after the failed one are not complete. We bail out in collect_pending_ios, but the tasks for all the incomplete IOs still run since we don't call abort on the JoinHandle or have cancellation wired in.

Contributor

@erikgrinaker erikgrinaker left a comment


I've done another pass to check that there aren't any issues with ordering or races, and I can't see any -- even though we dispatch IOs concurrently, we always access the results in a predetermined order.

I think this should be good to go, once we resolve the tasks vs. futures discussion above.

it would cause an assertion failure because we wouldn't be consuming all IOs
christian@neon-hetzner-dev-christian:[~/src/neon-work-1]: NEON_PAGESERVER_USE_ONE_RUNTIME=current_thread DEFAULT_PG_VERSION=14 BUILD_TYPE=release poetry run pytest -k 'test_ancestor_detach_branched_from[release-pg14-False-True-after]'

2025-01-21T18:42:38.794431Z  WARN initial_size_calculation{tenant_id=cb106e50ddedc20995b0b1bb065ebcd9 shard_id=0000 timeline_id=e362ff10e7c4e116baee457de5c766d9}:logical_size_calculation_task: dropping ValuesReconstructState while some IOs have not been completed num_active_ios=1 sidecar_task_id=None backtrace=   0: <pageserver::tenant::storage_layer::ValuesReconstructState as core::ops::drop::Drop>::drop
             at /home/christian/src/neon-work-1/pageserver/src/tenant/storage_layer.rs:553:24
   1: core::ptr::drop_in_place<pageserver::tenant::storage_layer::ValuesReconstructState>
             at /home/christian/.rustup/toolchains/1.84.0-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:521:1
   2: core::ptr::drop_in_place<pageserver::tenant::timeline::Timeline::get::{{closure}}>
             at /home/christian/src/neon-work-1/pageserver/src/tenant/timeline.rs:1042:5
   3: core::ptr::drop_in_place<pageserver::pgdatadir_mapping::<impl pageserver::tenant::timeline::Timeline>::get_current_logical_size_non_incremental::{{closure}}>
             at /home/christian/src/neon-work-1/pageserver/src/pgdatadir_mapping.rs:1001:67
   4: core::ptr::drop_in_place<pageserver::tenant::timeline::Timeline::calculate_logical_size::{{closure}}>
             at /home/christian/src/neon-work-1/pageserver/src/tenant/timeline.rs:3100:18
   5: core::ptr::drop_in_place<pageserver::tenant::timeline::Timeline::logical_size_calculation_task::{{closure}}::{{closure}}::{{closure}}>
             at /home/christian/src/neon-work-1/pageserver/src/tenant/timeline.rs:3050:22
   6: core::ptr::drop_in_place<pageserver::tenant::timeline::Timeline::logical_size_calculation_task::{{closure}}::{{closure}}>
             at /home/christian/src/neon-work-1/pageserver/src/tenant/timeline.rs:3060:5
…eaning and utility is dubious with concurrent IO; #9353 (comment)

The issue is that get_vectored_reconstruct_data latency means something
very different now with concurrent IO than what it did before, because
all the time we spend on the data blocks is no longer part of the
get_vectored_reconstruct_data().await wall clock time.

GET_RECONSTRUCT_DATA_TIME: all three dashboards that use it are in my /personal/christian folder. I guess I'm free to break them 😄
https://github.com/search?q=repo%3Aneondatabase%2Fgrafana-dashboard-export%20pageserver_getpage_get_reconstruct_data_seconds&type=code

RECONSTRUCT_TIME: used in a couple of dashboards I think nobody uses
- Timeline Inspector
- Sharding WAL streaming
- Pageserver
- walredo time throwaway

Vlad agrees with removing them for now.
Maybe in the future we'll add some back.

pageserver_getpage_get_reconstruct_data_seconds -> pageserver_getpage_io_plan_seconds
pageserver_getpage_reconstruct_data_seconds -> pageserver_getpage_io_execute_seconds
@problame
Contributor

I remain irritated by the CI logs not showing =SidecarTask.

Asked devprod team on Slack: https://neondb.slack.com/archives/C059ZC138NR/p1737490501941309

…awn_for_test like we do in all the other tests

This is a remnant from the early times of this PR.
Contributor Author

@VladLazar VladLazar left a comment


🚢

github-merge-queue bot pushed a commit that referenced this pull request Jan 22, 2025
# Refs

- extracted from #9353

# Problem

Before this PR, when task_mgr shutdown is signalled, e.g. during
pageserver shutdown or Tenant shutdown, initial logical size calculation
stops polling and drops the future that represents the calculation.

This is against the current policy that we poll all futures to
completion.

This became apparent during development of concurrent IO which warns if
we drop a `Timeline::get_vectored` future that still has in-flight IOs.

We may revise the policy in the future, but, right now initial logical
size calculation is the only part of the codebase that doesn't adhere to
the policy, so let's fix it.

## Code Changes

- make initial logical size calculation sensitive exclusively to `Timeline::cancel`
- This should be sufficient for all cases of shutdowns; the sensitivity
to task_mgr shutdown is unnecessary.
- this broke the various cancel tests in `test_timeline_size.py`, e.g.,
`test_timeline_initial_logical_size_calculation_cancellation`
- the tests would time out because the await point was not sensitive to
cancellation
- to fix this, refactor `pausable_failpoint` so that it accepts a
cancellation token
- side note: we _really_ should write our own failpoint library; maybe
after we get heap-allocated RequestContext, we can plumb failpoints
through there.
…rent-io

Conflicts:
	pageserver/src/tenant/timeline.rs
	test_runner/fixtures/neon_fixtures.py
@problame problame enabled auto-merge January 22, 2025 15:07
@problame problame removed the request for review from a team January 22, 2025 15:13
Contributor

@Bodobolero Bodobolero left a comment


I verified that the two additional Rust tests only add one minute to build time, which should be OK.

@problame problame added this pull request to the merge queue Jan 22, 2025
Merged via the queue into main with commit 414ed82 Jan 22, 2025
87 checks passed
@problame problame deleted the vlad/read-path-concurrent-io branch January 22, 2025 15:43
github-merge-queue bot pushed a commit that referenced this pull request Feb 14, 2025
…e_at_lsn` (#10476)

I noticed the opportunity to simplify here while working on
#9353.

The only difference is the zero-fill behavior: if one reads past rel
size,
`get_rel_page_at_lsn` returns a zeroed page whereas `Timeline::get`
returns an error.

However, the `endblk` is at most the rel size, because `nblocks` equals
`get_rel_size`; see a few lines above this change.
We're using the same LSN (`self.lsn`) for everything, so there is no
chance of non-determinism.

Refs:

- Slack discussion debating correctness:
https://neondb.slack.com/archives/C033RQ5SPDH/p1737457010607119