pageserver: issue concurrent IO on the read path #9353

Merged
merged 166 commits into main from vlad/read-path-concurrent-io on Jan 22, 2025

Conversation

VladLazar
Contributor

@VladLazar VladLazar commented Oct 10, 2024

Refs

Co-authored-by: Vlad Lazar [email protected]
Co-authored-by: Christian Schwarz [email protected]

Problem

The read path does its IOs sequentially.
This means that if N values need to be read to reconstruct a page,
we will do N IOs and getpage latency is O(N*IoLatency).

Solution

With this PR we gain the ability to issue IO concurrently within one
layer visit and to move on to the next layer without waiting for IOs
from the previous visit to complete.

This is an evolved version of the work done at the Lisbon hackathon,
cf #9002.

Design

will_init now sourced from disk btree index keys

On the algorithmic level, the only change is that get_values_reconstruct_data
now sources will_init from the disk btree index key (which is PS-page_cache'd) instead
of from the Value, which is only available after the IO completes.
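A minimal sketch of that check, with hypothetical, heavily simplified types (the real BlobMeta and traversal state carry much more): the decision to stop descending into older layers for a key is now made from index metadata alone, before any value IO has completed.

```rust
// Hypothetical, simplified sketch; not the actual pageserver types.
struct BlobMeta {
    will_init: bool, // sourced from the disk btree index key (PS-page_cache'd)
}

enum KeyTraversalState {
    NeedsOlderLayers, // keep descending into older layers for this key
    Complete,         // an image / init record has been planned; stop here
}

fn after_planning_value(meta: &BlobMeta) -> KeyTraversalState {
    if meta.will_init {
        // Previously this was only known after reading the Value from disk.
        KeyTraversalState::Complete
    } else {
        KeyTraversalState::NeedsOlderLayers
    }
}
```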

Concurrent IOs, Submission & Completion

To separate IO submission from waiting for its completion, while simultaneously
feature-gating the change, we introduce the notion of an IoConcurrency struct
through which IO futures are "spawned".

An IO is an opaque future, and waiting for completions is handled through
tokio::sync::oneshot channels.
The oneshot Receivers take the place of the img and records fields
inside VectoredValueReconstructState.

When we're done visiting all the layers and submitting all the IOs along the way,
we concurrently collect_pending_ios for each value, which means that
for each value there is a future that awaits all of its oneshot receivers
and then calls into walredo to reconstruct the page image.
Walredo is now invoked concurrently for each value instead of sequentially.
Walredo itself remains unchanged.

The spawned IO futures are driven to completion by a sidecar tokio task that
is separate from the task that performs all the layer visiting and spawning of IOs.
That task receives the IO futures via an unbounded mpsc channel and
drives them to completion inside a FuturesUnordered.

(The behavior from before this PR is available through IoConcurrency::Sequential,
which awaits the IO futures in place, without "spawning" or "submitting" them
anywhere.)
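Here is a minimal sketch of that submission/completion split, under assumed, simplified names (IoConcurrencyHandle, IoFuture, start_sidecar_task; the real IoConcurrency also carries gate guards, error plumbing, and the Sequential variant): the caller "spawns" opaque IO futures over an unbounded mpsc channel, and a sidecar task drives them inside a FuturesUnordered.

```rust
use futures::{stream::FuturesUnordered, StreamExt};
use std::{future::Future, pin::Pin};
use tokio::sync::mpsc;

type IoFuture = Pin<Box<dyn Future<Output = ()> + Send>>;

struct IoConcurrencyHandle {
    tx: mpsc::UnboundedSender<IoFuture>,
}

impl IoConcurrencyHandle {
    /// "Spawn" an IO: hand the opaque future to the sidecar task. Returns
    /// immediately; completion is observed via whatever the future itself
    /// resolves (e.g. a tokio::sync::oneshot it holds the sender of).
    fn spawn_io(&self, fut: impl Future<Output = ()> + Send + 'static) {
        let _ = self.tx.send(Box::pin(fut));
    }
}

fn start_sidecar_task() -> IoConcurrencyHandle {
    let (tx, mut rx) = mpsc::unbounded_channel::<IoFuture>();
    tokio::spawn(async move {
        let mut in_flight = FuturesUnordered::new();
        loop {
            tokio::select! {
                maybe_fut = rx.recv() => match maybe_fut {
                    Some(fut) => in_flight.push(fut),
                    None => break, // all handles dropped; no more submissions
                },
                Some(()) = in_flight.next(), if !in_flight.is_empty() => {
                    // one IO future completed; nothing else to do here
                }
            }
        }
        // Drain whatever is still in flight so no IO is left dangling.
        while in_flight.next().await.is_some() {}
    });
    IoConcurrencyHandle { tx }
}
```

In terms of this sketch, the Sequential mode from this PR corresponds to awaiting the future directly inside spawn_io instead of sending it to the channel.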

Alternatives Explored

A few words on the rationale behind having a sidecar task and what
alternatives were considered.

One option is to queue up all IO futures in a FuturesUnordered that is polled
for the first time when we collect_pending_ios. This doesn't work, for two reasons.

First, the IO futures are opaque, compiler-generated futures that need
to be polled at least once to submit their IO. "At least once" because
tokio-epoll-uring may not be able to submit the IO to the kernel right away
on the first poll.

Second, there are deadlocks if we don't drive the IO futures to completion
independently of the spawning task.
The reason is that both the IO futures and the spawning task may hold some
shared, limited resources while trying to acquire more of them
(an abstract sketch follows the list below).
For example, both the spawning task and an IO future may try to acquire

  • a VirtualFile file descriptor cache slot async mutex (observed during impl)
  • a tokio-epoll-uring submission slot (observed during impl)
  • a PageCache slot (currently this is not the case but we may move more code into the IO futures in the future)
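To make the hazard concrete, here is an abstract, stand-alone illustration (not pageserver code): a 1-permit semaphore stands in for a limited resource such as a tokio-epoll-uring submission slot.

```rust
use std::{sync::Arc, time::Duration};
use tokio::sync::Semaphore;

#[tokio::main]
async fn main() {
    let slots = Arc::new(Semaphore::new(1));

    // "IO future": acquires a slot and only releases it when polled to completion.
    let io_future = {
        let slots = Arc::clone(&slots);
        async move {
            let _slot = slots.acquire_owned().await.unwrap();
            tokio::time::sleep(Duration::from_millis(10)).await; // the "IO"
            // _slot is released here, but only if someone keeps polling this future.
        }
    };

    // Hazard: an IO future that has been polled just far enough to take the slot,
    // but is then parked until collect time, keeps holding the slot; if the task
    // that parked it then needs a slot itself, neither side makes progress.
    // Driving the future in an independent task (the sidecar task, in miniature)
    // breaks that cycle:
    tokio::spawn(io_future);
    let _slot = slots.acquire_owned().await.unwrap(); // completes once the IO finishes
}
```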

Another option is to spawn a short-lived tokio::task for each IO future.
We implemented and benchmarked it during development, but found little
throughput improvement and moderate mean & tail latency degradation.
Concerns about pressure on the tokio scheduler made us discard this variant.

The sidecar task could be obsoleted if the IOs were not arbitrary code but a well-defined struct.
However,

  1. the opaque-futures approach taken in this PR leaves the existing
    code unchanged, which in turn
  2. allows us to implement the IoConcurrency::Sequential mode for feature-gating
    the change.

Once the new sidecar-task implementation is rolled out everywhere
and ::Sequential is removed, we can think about a descriptive submission & completion interface.
The deadlock problems pointed out earlier will need to be solved then.
For example, we could eliminate the VirtualFile file descriptor cache and the tokio-epoll-uring slots.
The latter has been drafted in neondatabase/tokio-epoll-uring#63.

See the lengthy doc comment on spawn_io() for more details.

Error handling

There are two error classes during reconstruct data retrieval:

  • traversal errors: index lookup, move to next layer, and the like
  • value read IO errors

A traversal error fails the entire get_vectored request, as before this PR.
A value read error only fails that value.

In any case, we preserve the existing behavior that once
get_vectored returns, all IOs are done. Panics, or failing
to poll get_vectored to completion, will leave the IOs dangling;
this is safe but shouldn't happen, so a rate-limited
log statement is emitted at warning level.
There is a doc comment on collect_pending_ios giving more code-level
details and rationale.
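The per-value error semantics can be sketched roughly as follows, with assumed, simplified types (Key, OnDiskValue) and walredo reduced to a placeholder; the point is only that each value awaits its own oneshot receivers, so a value-read error is recorded for that key alone.

```rust
use futures::{stream::FuturesUnordered, StreamExt};
use std::collections::BTreeMap;
use tokio::sync::oneshot;

type Key = u64;             // stand-in for the real key type
type OnDiskValue = Vec<u8>; // stand-in for the value read from a layer file

/// One future per value: await that value's IOs, then (in the real code)
/// run walredo. An IO error is recorded for this key only.
async fn collect_pending_ios(
    values: BTreeMap<Key, Vec<oneshot::Receiver<std::io::Result<OnDiskValue>>>>,
) -> BTreeMap<Key, std::io::Result<Vec<u8>>> {
    let mut per_value = FuturesUnordered::new();
    for (key, waiters) in values {
        per_value.push(async move {
            let mut records = Vec::new();
            for rx in waiters {
                match rx.await.expect("IO dropped without completing") {
                    Ok(v) => records.push(v),
                    Err(e) => return (key, Err(e)), // fails only this value
                }
            }
            // Placeholder for the concurrent walredo call; here we just concat.
            (key, Ok(records.concat()))
        });
    }
    let mut results = BTreeMap::new();
    while let Some((key, res)) = per_value.next().await {
        results.insert(key, res);
    }
    results
}
```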

Feature Gating

The new behavior is opt-in via pageserver config.
The Sequential mode is the default.
The only significant change in Sequential mode compared to before
this PR is the buffering of results in the oneshots.

Code-Level Changes

Prep work:

  • Make GateGuard clonable.

Core Feature:

  • Traversal code: track will_init in BlobMeta and source it from
    the Delta/Image/InMemory layer index, instead of determining will_init
    after we've read the value. This avoids having to read the value to
    determine whether traversal can stop.
  • Introduce IoConcurrency & its sidecar task.
    • IoConcurrency is the clonable handle.
    • It connects to the sidecar task via an mpsc.
  • Plumb through IoConcurrency from high level code to the
    individual layer implementations' get_values_reconstruct_data.
    We piggy-back on the ValuesReconstructState for this.
    • The sidecar task should be long-lived, so, IoConcurrency needs
      to be rooted up "high" in the call stack.
    • Roots as of this PR:
      • page_service: outside of pagestream loop
      • create_image_layers: when it is called
      • basebackup (only aux files + replorigin + SLRU segments)
    • Code paths without such a root use IoConcurrency::sequential
  • Transform Delta/Image/InMemoryLayer to
    • do their value IO in a distinct async {} block
    • extend the residence of the Delta/Image layer until the IO is done
    • buffer their results in a oneshot channel instead of straight
      in ValuesReconstructState
    • the oneshot channel is wrapped in OnDiskValueIo / OnDiskValueIoWaiter
      types that aid in expressiveness and are used to keep track of
      in-flight IOs so we can print warnings if we leave them dangling
      (see the sketch after this list).
  • Change ValuesReconstructState to hold the receiving end of the
    oneshot channel aka OnDiskValueIoWaiter.
  • Change get_vectored_impl to collect_pending_ios and issue walredo concurrently, in a FuturesUnordered.
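For illustration, the oneshot wrapper pair and the dangling-IO accounting could look roughly like this (hypothetical and simplified; the real types track more state): the sender half travels with the layer's IO future, the waiter half sits in ValuesReconstructState, and a shared counter lets a Drop impl (not shown) warn about IOs that are still in flight.

```rust
use std::sync::{
    atomic::{AtomicUsize, Ordering},
    Arc,
};
use tokio::sync::oneshot;

type OnDiskValue = Vec<u8>; // stand-in

/// Sender half: travels with the layer's IO future.
pub struct OnDiskValueIo {
    tx: oneshot::Sender<std::io::Result<OnDiskValue>>,
    num_active_ios: Arc<AtomicUsize>,
}

/// Receiver half: stored in ValuesReconstructState.
pub struct OnDiskValueIoWaiter {
    rx: oneshot::Receiver<std::io::Result<OnDiskValue>>,
}

pub fn new_io(num_active_ios: Arc<AtomicUsize>) -> (OnDiskValueIo, OnDiskValueIoWaiter) {
    num_active_ios.fetch_add(1, Ordering::Relaxed);
    let (tx, rx) = oneshot::channel();
    (
        OnDiskValueIo { tx, num_active_ios },
        OnDiskValueIoWaiter { rx },
    )
}

impl OnDiskValueIo {
    /// Called by the IO future once the read is done (or has failed).
    pub fn complete(self, res: std::io::Result<OnDiskValue>) {
        self.num_active_ios.fetch_sub(1, Ordering::Relaxed);
        let _ = self.tx.send(res); // the waiter may already be gone; that's fine
    }
}

impl OnDiskValueIoWaiter {
    pub async fn wait(self) -> std::io::Result<OnDiskValue> {
        self.rx.await.expect("OnDiskValueIo dropped without calling complete()")
    }
}
```

A Drop impl on ValuesReconstructState can then check num_active_ios and emit the rate-limited warning mentioned above (as seen in the "num_active_ios=1" log line quoted later in this thread).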

Testing / Benchmarking:

  • Support queue-depth in pagebench for manual benchmarking.
  • Add test suite support for setting the concurrency-mode pageserver config
    field via a) an env var and b) NeonEnvBuilder.
  • Hacky helper to have sidecar-based IoConcurrency in tests.
    This will be cleaned up later.

More benchmarking will happen post-merge in nightly benchmarks, plus in staging/pre-prod.
Some intermediate helpers for manual benchmarking have been preserved in #10466 and will be landed in later PRs.
(L0 layer stack generator!)

Drive-By:

  • the test suite actually didn't enable batching by default because
    config.compatibility_neon_binpath is always truthy in our CI environment
    => https://neondb.slack.com/archives/C059ZC138NR/p1737490501941309
  • initial logical size calculation wasn't always polled to completion, which was
    surfaced through the added WARN logs emitted when dropping a
    ValuesReconstructState that still has inflight IOs.
  • remove the timing histograms pageserver_getpage_get_reconstruct_data_seconds
    and pageserver_getpage_reconstruct_seconds because with planning, value read
    IO, and walredo happening concurrently, one can no longer attribute latency
    to any one of them; we'll revisit this when Vlad's work on tracing/sampling
    through RequestContext lands.
  • remove code related to get_cached_lsn().
    The logic around this has been dead at runtime for a long time,
    ever since the removal of the materialized page cache in #8105 (remove materialized page cache).

Testing

Unit tests use the sidecar task by default and run both modes in CI.
Python regression tests and benchmarks also use the sidecar task by default.
We'll test more in staging and possibly preprod.

Future Work

Please refer to the parent epic for the full plan.

The next step will be to fold the plumbing of IoConcurrency
into RequestContext so that the function signatures get cleaned up.

Once Sequential isn't used anymore, we can take the next
big leap which is replacing the opaque IOs with structs
that have well-defined semantics.
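Purely as a speculative illustration of what "structs with well-defined semantics" could mean (none of these names exist in the codebase): the IO would be described as data rather than arbitrary code, so the submission/completion machinery can inspect, batch, and cancel it.

```rust
// Speculative sketch only; not part of this PR or the codebase.
/// A fully descriptive on-disk read: no hidden code, just parameters.
struct OnDiskReadDescriptor {
    file_id: u64, // identifies the (pinned) layer file
    offset: u64,
    len: u32,
}

/// Completion delivered back to the reconstruct state.
struct OnDiskReadCompletion {
    descriptor: OnDiskReadDescriptor,
    result: std::io::Result<Vec<u8>>,
}
```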


github-actions bot commented Oct 10, 2024

7370 tests run: 6985 passed, 0 failed, 385 skipped (full report)


Flaky tests (7)

Code coverage* (full report)

  • functions: 33.5% (8493 of 25334 functions)
  • lines: 49.3% (71391 of 144677 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
b0b9206 at 2025-01-22T14:39:32.695Z :recycle:

@VladLazar VladLazar marked this pull request as ready for review October 14, 2024 14:01
@VladLazar VladLazar requested a review from a team as a code owner October 14, 2024 14:01
@VladLazar VladLazar requested review from arssher and erikgrinaker and removed request for arssher October 14, 2024 14:01
@erikgrinaker
Contributor

As of this commit, one IO failure does not stop any other IO requests. When awaiting the IOs to complete, we stop waiting on the first failure, but we do not signal any other pending IOs to complete and they will just fail silently.

Is this true? It really depends on how the IO futures are implemented, but in general, dropping a future should cancel the in-flight operation and stop polling it. Assuming they're implemented that way, it should be sufficient to ensure that the caller receives the error as soon as it happens and then drops the in-flight futures by returning the error. I don't think we need any synchronization beyond that, or am I missing something?

Contributor

@erikgrinaker erikgrinaker left a comment


Still reviewing, just flushing some comments for now. All nits, take them or leave them.

Previously, the read path would wait for all IO in one layer visit to
complete before visiting the next layer (if a subsequent visit is
required). IO within one layer visit was also sequential.

With this patch we gain the ability to issue IO concurrently within one
layer visit **and** to move on to the next layer without waiting for IOs
from the previous visit to complete.

This is a slightly cleaned up version of the work done at the Lisbon
hackathon.

It's obvious the method is unused, but let's break down error handling
of the read path. Before this patch set, all IO was done sequentially
for a given read. If one IO failed, then the error would stop the
processing of the read path.

Now that we are doing IO concurrently when serving a read request
it's not trivial to implement the same error handling approach.
As of this commit, one IO failure does not stop any other IO requests.
When awaiting the IOs to complete, we stop waiting on the first
failure, but we do not signal any other pending IOs to complete and
they will just fail silently.

Long term, we need a better approach for this. Two broad ideas:
1. Introduce some synchronization between pending IO tasks such
that new IOs are not issued after the first failure
2. Cancel any pending IOs when the first error is discovered

Previously, each pending IO sent a stupid buffer which was just what it
read from the layer file for the key. This made the awaiter code
confusing because on disk images in layer files don't keep the enum wrapper,
but the ones in delta layers do.

This commit introduces a type to make this a bit easier and cleans up
the IO awaiting code a bit. We also avoid some rather silly serialize,
deserialize dance.

We now only store indices in the page cache.
This commit removes any caching support from the read path.

`BlobMeta::will_init` is not actually used on these code paths,
but let's be kind to our future selves and make sure it's correct.

One can configure this via the NEON_PAGESERVER_VALUE_RECONSTRUCT_IO_CONCURRENCY
env var. A config is possible as well, but it's more work and this is
enough for experimentation.
@VladLazar VladLazar force-pushed the vlad/read-path-concurrent-io branch from 73aa1c6 to dba6968 Compare November 4, 2024 13:43
@VladLazar
Contributor Author

As of this commit, one IO failure does not stop any other IO requests. When awaiting the IOs to complete, we stop waiting on the first failure, but we do not signal any other pending IOs to complete and they will just fail silently.

Is this true? It really depends on how the IO futures are implemented, but in general, dropping a future should cancel the in-flight operation and stop polling it. Assuming they're implemented that way, it should be sufficient to ensure that the caller receives the error as soon as it happens and then drops the in-flight futures by returning the error. I don't think we need any synchronization beyond that, or am I missing something?

I missed this question.

All "IO futures" are collected in VectoredValueReconstructState::on_disk_values. With this PR, the read path does not wait for the outcome of one IO before issuing the next one. Hence, at the end of the layer traversal, we will have created "IO futures" for all required values. Each "IO future" is independent. If the first one fails, the others are still present.

Let's also consider what happens when all the "IO futures" after the failed one are not complete. We bail out in collect_pending_ios, but the tasks for all the incomplete IOs still run since we don't call abort on the JoinHandle or have cancellation wired in.

Contributor

@erikgrinaker erikgrinaker left a comment


I've done another pass to check that there aren't any issues with ordering or races, and I can't see any -- even though we dispatch IOs concurrently, we always access the results in a predetermined order.

I think this should be good to go, once we resolve the tasks vs. futures discussion above.

it would cause an assertion failure because we wouldn't be consuming all IOs
christian@neon-hetzner-dev-christian:[~/src/neon-work-1]: NEON_PAGESERVER_USE_ONE_RUNTIME=current_thread DEFAULT_PG_VERSION=14 BUILD_TYPE=release poetry run pytest -k 'test_ancestor_detach_branched_from[release-pg14-False-True-after]'

2025-01-21T18:42:38.794431Z  WARN initial_size_calculation{tenant_id=cb106e50ddedc20995b0b1bb065ebcd9 shard_id=0000 timeline_id=e362ff10e7c4e116baee457de5c766d9}:logical_size_calculation_task: dropping ValuesReconstructState while some IOs have not been completed num_active_ios=1 sidecar_task_id=None backtrace=   0: <pageserver::tenant::storage_layer::ValuesReconstructState as core::ops::drop::Drop>::drop
             at /home/christian/src/neon-work-1/pageserver/src/tenant/storage_layer.rs:553:24
   1: core::ptr::drop_in_place<pageserver::tenant::storage_layer::ValuesReconstructState>
             at /home/christian/.rustup/toolchains/1.84.0-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:521:1
   2: core::ptr::drop_in_place<pageserver::tenant::timeline::Timeline::get::{{closure}}>
             at /home/christian/src/neon-work-1/pageserver/src/tenant/timeline.rs:1042:5
   3: core::ptr::drop_in_place<pageserver::pgdatadir_mapping::<impl pageserver::tenant::timeline::Timeline>::get_current_logical_size_non_incremental::{{closure}}>
             at /home/christian/src/neon-work-1/pageserver/src/pgdatadir_mapping.rs:1001:67
   4: core::ptr::drop_in_place<pageserver::tenant::timeline::Timeline::calculate_logical_size::{{closure}}>
             at /home/christian/src/neon-work-1/pageserver/src/tenant/timeline.rs:3100:18
   5: core::ptr::drop_in_place<pageserver::tenant::timeline::Timeline::logical_size_calculation_task::{{closure}}::{{closure}}::{{closure}}>
             at /home/christian/src/neon-work-1/pageserver/src/tenant/timeline.rs:3050:22
   6: core::ptr::drop_in_place<pageserver::tenant::timeline::Timeline::logical_size_calculation_task::{{closure}}::{{closure}}>
             at /home/christian/src/neon-work-1/pageserver/src/tenant/timeline.rs:3060:5
…eaning and utility is dubious with concurrent IO; #9353 (comment)

The issue is that get_vectored_reconstruct_data latency means something
very different now with concurrent IO than what it did before, because
all the time we spend on the data blocks is no longer part of the
get_vectored_reconstruct_data().await wall clock time.

GET_RECONSTRUCT_DATA_TIME: all three dashboards that use it are in my /personal/christian folder. I guess I'm free to break them 😄
https://github.com/search?q=repo%3Aneondatabase%2Fgrafana-dashboard-export%20pageserver_getpage_get_reconstruct_data_seconds&type=code

RECONSTRUCT_TIME: used in a couple of dashboards I think nobody uses
- Timeline Inspector
- Sharding WAL streaming
- Pageserver
- walredo time throwaway

Vlad agrees with removing them for now.
Maybe in the future we'll add some back.

pageserver_getpage_get_reconstruct_data_seconds -> pageserver_getpage_io_plan_seconds
pageserver_getpage_reconstruct_data_seconds -> pageserver_getpage_io_execute_seconds
@problame
Contributor

I remain irritated by the CI logs not showing =SidecarTask.

Asked devprod team on Slack: https://neondb.slack.com/archives/C059ZC138NR/p1737490501941309

…awn_for_test like we do in all the other tests

This is a remnant from the early times of this PR.
Contributor Author

@VladLazar VladLazar left a comment


🚢

github-merge-queue bot pushed a commit that referenced this pull request Jan 22, 2025
# Refs

- extracted from #9353

# Problem

Before this PR, when task_mgr shutdown is signalled, e.g. during
pageserver shutdown or Tenant shutdown, initial logical size calculation
stops polling and drops the future that represents the calculation.

This is against the current policy that we poll all futures to
completion.

This became apparent during development of concurrent IO which warns if
we drop a `Timeline::get_vectored` future that still has in-flight IOs.

We may revise the policy in the future, but, right now initial logical
size calculation is the only part of the codebase that doesn't adhere to
the policy, so let's fix it.

## Code Changes

- make initial logical size calculation sensitive exclusively to `Timeline::cancel`
- This should be sufficient for all cases of shutdowns; the sensitivity
to task_mgr shutdown is unnecessary.
- this broke the various cancel tests in `test_timeline_size.py`, e.g.,
`test_timeline_initial_logical_size_calculation_cancellation`
- the tests would time out because the await point was not sensitive to
cancellation
- to fix this, refactor `pausable_failpoint` so that it accepts a
cancellation token
- side note: we _really_ should write our own failpoint library; maybe
after we get heap-allocated RequestContext, we can plumb failpoints
through there.
…rent-io

Conflicts:
	pageserver/src/tenant/timeline.rs
	test_runner/fixtures/neon_fixtures.py
@problame problame enabled auto-merge January 22, 2025 15:07
@problame problame removed the request for review from a team January 22, 2025 15:13
Contributor

@Bodobolero Bodobolero left a comment


I verified that the two additional Rust tests only add one minute to build time, which should be OK.

@problame problame added this pull request to the merge queue Jan 22, 2025
Merged via the queue into main with commit 414ed82 Jan 22, 2025
87 checks passed
@problame problame deleted the vlad/read-path-concurrent-io branch January 22, 2025 15:43
github-merge-queue bot pushed a commit that referenced this pull request Feb 14, 2025
…e_at_lsn` (#10476)

I noticed the opportunity to simplify here while working on
#9353.

The only difference is the zero-fill behavior: if one reads past rel
size,
`get_rel_page_at_lsn` returns a zeroed page whereas `Timeline::get`
returns an error.

However, the `endblk` is at most the rel size, because `nblocks` equals
`get_rel_size`; see a few lines above this change.
We're using the same LSN (`self.lsn`) for everything, so there is no
chance of non-determinism.

Refs:

- Slack discussion debating correctness:
https://neondb.slack.com/archives/C033RQ5SPDH/p1737457010607119