pageserver: issue concurrent IO on the read path #9353
Conversation
7370 tests run: 6985 passed, 0 failed, 385 skipped (full report)
Flaky tests (7):
Postgres 17
Postgres 16
Postgres 15
Postgres 14
Code coverage* (full report)
* collected from Rust tests only
The comment gets automatically updated with the latest test results.
b0b9206 at 2025-01-22T14:39:32.695Z :recycle:
Is this true? It really depends on how the IO futures are implemented, but in general, dropping a future should cancel the in-flight operation and stop polling it. Assuming they're implemented that way, it should be sufficient to ensure that the caller receives the error as soon as it happens and then drops the in-flight futures by returning the error. I don't think we need any synchronization beyond that, or am I missing something?
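To illustrate the semantics being assumed here (a generic toy, not the pageserver's actual IO future): dropping a plain Rust future means its body is never polled again, so any side effects past the drop point never happen. Whether that also cancels the kernel-side IO depends on how the future is implemented, which is exactly the open question in this thread.

```rust
use std::time::Duration;

#[tokio::main]
async fn main() {
    let (tx, rx) = tokio::sync::oneshot::channel::<&str>();

    // Stand-in for an "IO future": it only delivers a result if it is
    // polled long enough to reach the send.
    let io_future = async move {
        tokio::time::sleep(Duration::from_millis(50)).await;
        let _ = tx.send("io result");
    };

    // Dropping the future without polling it to completion: the body never
    // reaches the send, and the sender half is dropped along with the future.
    drop(io_future);

    match rx.await {
        Ok(v) => println!("got {v}"),
        Err(_) => println!("the dropped future never produced a result"),
    }
}
```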
Still reviewing, just flushing some comments for now. All nits, take them or leave them.
Previously, the read path would wait for all IO in one layer visit to complete before visiting the next layer (if a subsequent visit is required). IO within one layer visit was also sequential. With this patch we gain the ability to issue IO concurrently within one layer visit **and** to move on to the next layer without waiting for IOs from the previous visit to complete. This is a slightly cleaned up version of the work done at the Lisbon hackathon.
It's obvious the method is unused, but let's break down error handling on the read path. Before this patch set, all IO for a given read was done sequentially, so one failed IO stopped processing of the read path. Now that we are doing IO concurrently when serving a read request, it's not trivial to implement the same error handling approach. As of this commit, one IO failure does not stop any other IO requests: when awaiting the IOs, we stop waiting on the first failure, but we do not signal the other pending IOs, and they just fail silently. Long term, we need a better approach for this. Two broad ideas:
1. Introduce some synchronization between pending IO tasks such that no new IOs are issued after the first failure.
2. Cancel any pending IOs when the first error is discovered.
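A minimal sketch of idea 1, assuming a shared flag that IO submission checks before issuing a new IO; the `FailureGate` name and shape are made up for illustration and are not the code in this PR:

```rust
use std::sync::{
    atomic::{AtomicBool, Ordering},
    Arc,
};

/// Illustrative only: gate IO submission on a shared "failed" flag so that
/// the first observed error stops further submissions. Note that this does
/// not cancel IOs that are already in flight (that is idea 2).
#[derive(Clone, Default)]
struct FailureGate {
    failed: Arc<AtomicBool>,
}

impl FailureGate {
    fn record_failure(&self) {
        self.failed.store(true, Ordering::Relaxed);
    }
    fn should_submit(&self) -> bool {
        !self.failed.load(Ordering::Relaxed)
    }
}

fn main() {
    let gate = FailureGate::default();
    assert!(gate.should_submit());
    gate.record_failure(); // first IO error observed
    assert!(!gate.should_submit()); // subsequent IOs are not issued
    println!("no new IOs after first failure");
}
```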
Previously, each pending IO sent back a raw buffer which was just what it read from the layer file for the key. This made the awaiter code confusing, because on-disk images in image layers don't keep the enum wrapper, but the ones in delta layers do. This commit introduces a type to make this a bit easier and cleans up the IO-awaiting code a bit. We also avoid a rather silly serialize/deserialize dance.
We now only store indices in the page cache. This commit removes any caching support from the read path.
`BlobMeta::will_init` is not actually used on these code paths, but let's be kind to our future selves and make sure it's correct.
One can configure this via the NEON_PAGESERVER_VALUE_RECONSTRUCT_IO_CONCURRENCY env var. A proper config option is possible as well, but it's more work and this is enough for experimentation.
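A rough sketch of how such an env var could be consumed, assuming a simple two-mode switch; the variable name comes from the comment above, while the accepted values and the parsing are purely illustrative:

```rust
/// Illustrative only: the two modes this PR describes elsewhere
/// (sequential vs. sidecar-task concurrency).
enum IoConcurrencyMode {
    Sequential,
    SidecarTask,
}

fn mode_from_env() -> IoConcurrencyMode {
    // The variable name is taken from the comment above; how its value is
    // interpreted here is an assumption for illustration only.
    match std::env::var("NEON_PAGESERVER_VALUE_RECONSTRUCT_IO_CONCURRENCY").as_deref() {
        Ok("sidecar-task") => IoConcurrencyMode::SidecarTask,
        _ => IoConcurrencyMode::Sequential, // default: pre-existing behavior
    }
}

fn main() {
    match mode_from_env() {
        IoConcurrencyMode::Sequential => println!("sequential (default)"),
        IoConcurrencyMode::SidecarTask => println!("sidecar task"),
    }
}
```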
I missed this question. All "IO futures" are collected in […]. Let's also consider what happens when all the "IO futures" after the failed one are not complete. We bail out in […].
I've done another pass to check that there aren't any issues with ordering or races, and I can't see any -- even though we dispatch IOs concurrently, we always access the results in a predetermined order.
I think this should be good to go, once we resolve the tasks vs. futures discussion above.
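The property being checked here can be seen in a small standalone sketch (illustrative, not pageserver code): even though the "IOs" complete in arbitrary order, results are read back in submission order because the oneshot receivers are kept in a list and awaited in that order.

```rust
use std::time::Duration;

#[tokio::main]
async fn main() {
    // Dispatch three "IOs" concurrently; they complete in arbitrary order
    // (here forced by different sleep durations).
    let mut receivers = Vec::new();
    for (i, delay_ms) in [(0u32, 30u64), (1, 10), (2, 20)] {
        let (tx, rx) = tokio::sync::oneshot::channel();
        tokio::spawn(async move {
            tokio::time::sleep(Duration::from_millis(delay_ms)).await;
            let _ = tx.send(i);
        });
        receivers.push(rx);
    }

    // Results are consumed in submission order, regardless of completion order.
    for (expected, rx) in receivers.into_iter().enumerate() {
        let got = rx.await.expect("sender completed");
        assert_eq!(got as usize, expected);
    }
    println!("results observed in submission order");
}
```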
Fix merge conflict with #9631.
it would cause an assertion failure because we wouldn't be consuming all IOs
christian@neon-hetzner-dev-christian:[~/src/neon-work-1]: NEON_PAGESERVER_USE_ONE_RUNTIME=current_thread DEFAULT_PG_VERSION=14 BUILD_TYPE=release poetry run pytest -k 'test_ancestor_detach_branched_from[release-pg14-False-True-after]'

2025-01-21T18:42:38.794431Z WARN initial_size_calculation{tenant_id=cb106e50ddedc20995b0b1bb065ebcd9 shard_id=0000 timeline_id=e362ff10e7c4e116baee457de5c766d9}:logical_size_calculation_task: dropping ValuesReconstructState while some IOs have not been completed num_active_ios=1 sidecar_task_id=None backtrace=
   0: <pageserver::tenant::storage_layer::ValuesReconstructState as core::ops::drop::Drop>::drop
         at /home/christian/src/neon-work-1/pageserver/src/tenant/storage_layer.rs:553:24
   1: core::ptr::drop_in_place<pageserver::tenant::storage_layer::ValuesReconstructState>
         at /home/christian/.rustup/toolchains/1.84.0-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:521:1
   2: core::ptr::drop_in_place<pageserver::tenant::timeline::Timeline::get::{{closure}}>
         at /home/christian/src/neon-work-1/pageserver/src/tenant/timeline.rs:1042:5
   3: core::ptr::drop_in_place<pageserver::pgdatadir_mapping::<impl pageserver::tenant::timeline::Timeline>::get_current_logical_size_non_incremental::{{closure}}>
         at /home/christian/src/neon-work-1/pageserver/src/pgdatadir_mapping.rs:1001:67
   4: core::ptr::drop_in_place<pageserver::tenant::timeline::Timeline::calculate_logical_size::{{closure}}>
         at /home/christian/src/neon-work-1/pageserver/src/tenant/timeline.rs:3100:18
   5: core::ptr::drop_in_place<pageserver::tenant::timeline::Timeline::logical_size_calculation_task::{{closure}}::{{closure}}::{{closure}}>
         at /home/christian/src/neon-work-1/pageserver/src/tenant/timeline.rs:3050:22
   6: core::ptr::drop_in_place<pageserver::tenant::timeline::Timeline::logical_size_calculation_task::{{closure}}::{{closure}}>
         at /home/christian/src/neon-work-1/pageserver/src/tenant/timeline.rs:3060:5
…t, to make tests pass (see prev commit for stack trace)

CI test failures: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9353/12892883355/index.html#suites/a1c2be32556270764423c495fad75d47/92cacda354b63fd7/
…eaning and utility is dubious with concurrent IO; #9353 (comment)

The issue is that get_vectored_reconstruct_data latency means something very different now with concurrent IO than what it did before, because all the time we spend on the data blocks is no longer part of the get_vectored_reconstruct_data().await wall clock time.

GET_RECONSTRUCT_DATA_TIME: all 3 dashboards that use it are in my /personal/christian folder. I guess I'm free to break them 😄 https://github.com/search?q=repo%3Aneondatabase%2Fgrafana-dashboard-export%20pageserver_getpage_get_reconstruct_data_seconds&type=code

RECONSTRUCT_TIME: used in a couple of dashboards I think nobody uses
- Timeline Inspector
- Sharding WAL streaming
- Pageserver - walredo time throaway

Vlad agrees with removing them for now. Maybe in the future we'll add some back:
- pageserver_getpage_get_reconstruct_data_seconds -> pageserver_getpage_io_plan_seconds
- pageserver_getpage_reconstruct_data_seconds -> pageserver_getpage_io_execute_seconds
This reverts commit a528b32.
I remain irritated by the CI logs now showing […]. Asked devprod team on Slack: https://neondb.slack.com/archives/C059ZC138NR/p1737490501941309
…e (still need to figure out how to not make compat test break) https://neondb.slack.com/archives/C059ZC138NR/p1737490501941309
…what we have for sleep
…o completion; fix that, to make tests pass) (requires previous commit)
…awn_for_test like we do in all the other tests. This is a remnant from the early times of this PR.
🚢
# Refs

- extracted from #9353

# Problem

Before this PR, when task_mgr shutdown is signalled, e.g. during pageserver shutdown or Tenant shutdown, initial logical size calculation stops polling and drops the future that represents the calculation.

This is against the current policy that we poll all futures to completion. It became apparent during development of concurrent IO, which warns if we drop a `Timeline::get_vectored` future that still has in-flight IOs.

We may revise the policy in the future, but right now initial logical size calculation is the only part of the codebase that doesn't adhere to the policy, so let's fix it.

## Code Changes

- make initial logical size calculation sensitive exclusively to `Timeline::cancel`
  - This should be sufficient for all cases of shutdowns; the sensitivity to task_mgr shutdown is unnecessary.
- this broke the various cancel tests in `test_timeline_size.py`, e.g., `test_timeline_initial_logical_size_calculation_cancellation`
  - the tests would time out because the await point was not sensitive to cancellation
- to fix this, refactor `pausable_failpoint` so that it accepts a cancellation token
- side note: we _really_ should write our own failpoint library; maybe after we get heap-allocated RequestContext, we can plumb failpoints through there.
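The shape of the `pausable_failpoint` change, as a sketch; the real macro lives in the neon repo, and the signature and types here are illustrative. The idea is that the wait at the failpoint is raced against a cancellation token, so a cancelled timeline no longer hangs at the pause point.

```rust
use tokio_util::sync::CancellationToken;

/// Illustrative only: wait on a "pause"-style failpoint, but give up as soon
/// as the provided cancellation token fires.
async fn pausable_failpoint_with_cancel(
    unpause: tokio::sync::oneshot::Receiver<()>, // stands in for the failpoint being lifted
    cancel: &CancellationToken,
) -> Result<(), &'static str> {
    tokio::select! {
        _ = unpause => Ok(()),
        _ = cancel.cancelled() => Err("cancelled while paused at failpoint"),
    }
}

#[tokio::main]
async fn main() {
    let (_keep_paused, rx) = tokio::sync::oneshot::channel::<()>();
    let cancel = CancellationToken::new();
    cancel.cancel(); // e.g. Timeline::cancel fired during shutdown
    assert!(pausable_failpoint_with_cancel(rx, &cancel).await.is_err());
    println!("failpoint wait returned instead of hanging");
}
```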
…rent-io

Conflicts:
    pageserver/src/tenant/timeline.rs
    test_runner/fixtures/neon_fixtures.py
I verified that the two additional Rust tests only add 1 minute to build time, which should be OK.
…e_at_lsn` (#10476)

I noticed the opportunity to simplify here while working on #9353.

The only difference is the zero-fill behavior: if one reads past rel size, `get_rel_page_at_lsn` returns a zeroed page whereas `Timeline::get` returns an error. However, the `endblk` is at most rel size large, because `nblocks` equals `get_rel_size`; see a few lines above this change.

We're using the same LSN (`self.lsn`) for everything, so there is no chance of non-determinism.

Refs:
- Slack discussion debating correctness: https://neondb.slack.com/archives/C033RQ5SPDH/p1737457010607119
Refs
Co-authored-by: Vlad Lazar [email protected]
Co-authored-by: Christian Schwarz [email protected]
Problem
The read path does its IOs sequentially.
This means that if N values need to be read to reconstruct a page,
we will do N IOs and getpage latency is O(N * IoLatency).

Solution
With this PR we gain the ability to issue IO concurrently within one
layer visit and to move on to the next layer without waiting for IOs
from the previous visit to complete.
This is an evolved version of the work done at the Lisbon hackathon,
cf #9002.
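To make the O(N * IoLatency) point concrete, here is a toy model (not pageserver code) contrasting awaiting N simulated IOs one after another with awaiting them concurrently:

```rust
use std::time::{Duration, Instant};

async fn fake_io(latency: Duration) {
    // Stand-in for one value-read IO.
    tokio::time::sleep(latency).await;
}

#[tokio::main]
async fn main() {
    let n = 8;
    let latency = Duration::from_millis(20);

    // Sequential: roughly N * IoLatency.
    let t = Instant::now();
    for _ in 0..n {
        fake_io(latency).await;
    }
    println!("sequential: {:?}", t.elapsed());

    // Concurrent: roughly one IoLatency, since the waits overlap.
    let t = Instant::now();
    futures::future::join_all((0..n).map(|_| fake_io(latency))).await;
    println!("concurrent: {:?}", t.elapsed());
}
```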
Design
`will_init` now sourced from disk btree index keys

On the algorithmic level, the only change is that `get_values_reconstruct_data` now sources `will_init` from the disk btree index key (which is PS-page_cache'd), instead of from the `Value`, which is only available after the IO completes.

Concurrent IOs, Submission & Completion
To separate IO submission from waiting for its completion, while simultaneously feature-gating the change, we introduce the notion of an `IoConcurrency` struct through which IO futures are "spawned".

An IO is an opaque future, and waiting for completions is handled through `tokio::sync::oneshot` channels. The oneshot Receivers take the place of the `img` and `records` fields inside `VectoredValueReconstructState`.

When we're done visiting all the layers and submitting all the IOs along the way, we concurrently `collect_pending_ios` for each value, which means for each value there is a future that awaits all the oneshot receivers and then calls into walredo to reconstruct the page image. Walredo is now invoked concurrently for each value instead of sequentially. Walredo itself remains unchanged.

The spawned IO futures are driven to completion by a sidecar tokio task that is separate from the task that performs all the layer visiting and spawning of IOs. That task receives the IO futures via an unbounded mpsc channel and drives them to completion inside a `FuturesUnordered`.

(The behavior from before this PR is available through `IoConcurrency::Sequential`, which awaits the IO futures in place, without "spawning" or "submitting" them anywhere.)
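The submission/completion split described above can be sketched in a few dozen lines; all names and types here are illustrative stand-ins, not the actual `IoConcurrency` API. IO futures are sent to a sidecar task over an unbounded mpsc channel, the sidecar drives them inside a `FuturesUnordered`, and each IO reports its result through a oneshot channel that the read path collects later.

```rust
use futures::{stream::FuturesUnordered, StreamExt};
use std::{future::Future, pin::Pin};
use tokio::sync::{mpsc, oneshot};

type IoFuture = Pin<Box<dyn Future<Output = ()> + Send>>;

/// Illustrative stand-in for the sidecar task: drain IO futures from the
/// channel and drive them concurrently in a FuturesUnordered.
async fn sidecar(mut rx: mpsc::UnboundedReceiver<IoFuture>) {
    let mut in_flight = FuturesUnordered::new();
    loop {
        tokio::select! {
            maybe_io = rx.recv() => match maybe_io {
                Some(io) => in_flight.push(io),
                None => break, // submitter is gone; finish what's left
            },
            Some(()) = in_flight.next(), if !in_flight.is_empty() => {}
        }
    }
    while in_flight.next().await.is_some() {}
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::unbounded_channel::<IoFuture>();
    let sidecar_task = tokio::spawn(sidecar(rx));

    // "Spawn" two IOs without waiting for them; each reports back via oneshot.
    let mut waiters = Vec::new();
    for key in 0u32..2 {
        let (done_tx, done_rx) = oneshot::channel();
        tx.send(Box::pin(async move {
            // pretend to read `key` from a layer file
            tokio::task::yield_now().await;
            let _ = done_tx.send(format!("value for key {key}"));
        }))
        .unwrap();
        waiters.push(done_rx);
    }
    drop(tx); // no more submissions

    // Later, collect the results in submission order (collect_pending_ios-style).
    for rx in waiters {
        println!("{}", rx.await.unwrap());
    }
    sidecar_task.await.unwrap();
}
```

The sketch omits error propagation and the warning-on-dangling-IO behavior described under Error handling below; it only shows the channel plumbing.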
Alternatives Explored
A few words on the rationale behind having a sidecar task and what
alternatives were considered.
One option is to queue up all IO futures in a `FuturesUnordered` that is polled the first time when we `collect_pending_ios`. That doesn't work, for two reasons.

Firstly, the IO futures are opaque, compiler-generated futures that need to be polled at least once to submit their IO. "At least once" because tokio-epoll-uring may not be able to submit the IO to the kernel on first poll right away.

Second, there are deadlocks if we don't drive the IO futures to completion independently of the spawning task. The reason is that both the IO futures and the spawning task may hold some, and try to acquire more, shared limited resources. For example, both the spawning task and an IO future may try to acquire slots in the VirtualFile file descriptor cache or in tokio-epoll-uring (see the note on eliminating these below).
Another option is to spawn a short-lived `tokio::task` for each IO future.
We implemented and benchmarked it during development, but found little
throughput improvement and moderate mean & tail latency degradation.
Concerns about pressure on the tokio scheduler made us discard this variant.
The sidecar task could be obsoleted if the IOs were not arbitrary code but a well-defined struct. However, we keep the existing IO code unchanged for now, which is also why we need the `IoConcurrency::Sequential` mode for feature-gating the change. Once the new sidecar-task mode is rolled out everywhere, and `::Sequential` is removed, we can think about a descriptive submission & completion interface. The problems around deadlocks pointed out earlier will need to be solved then. For example, we could eliminate the VirtualFile file descriptor cache and tokio-epoll-uring slots. The latter has been drafted in neondatabase/tokio-epoll-uring#63.

See the lengthy doc comment on `spawn_io()` for more details.

Error handling

There are two error classes during reconstruct data retrieval:
- A traversal error fails the entire get_vectored request, as before this PR.
- A value read error only fails that value.

In any case, we preserve the existing behavior that once `get_vectored` returns, all IOs are done. Panics and failing to poll `get_vectored` to completion will leave the IOs dangling, which is safe but shouldn't happen, and so a rate-limited log statement will be emitted at warning level. There is a doc comment on `collect_pending_ios` giving more code-level details and rationale.

Feature Gating

The new behavior is opt-in via pageserver config. The `Sequential` mode is the default. The only significant change in `Sequential` mode compared to before this PR is the buffering of results in the `oneshot`s.

Code-Level Changes

Prep work:
- Make `GateGuard` clonable.

Core Feature:
- Track `will_init` in `BlobMeta` and source it from the Delta/Image/InMemory layer index, instead of determining `will_init` after we've read the value. This avoids having to read the value to determine whether traversal can stop.
- Add `IoConcurrency` & its sidecar task. `IoConcurrency` is the clonable handle; IO futures reach the sidecar task through an unbounded `mpsc`.
- Plumb `IoConcurrency` from high-level code to the individual layer implementations' `get_values_reconstruct_data`. We piggy-back on the `ValuesReconstructState` for this. The `IoConcurrency` needs to be rooted up "high" in the call stack:
  - `page_service`: outside of the pagestream loop
  - `create_image_layers`: when it is called
  - `basebackup` (only auxfiles + replorigin + SLRU segments)
  - `IoConcurrency::sequential` around each one-off `Timeline::get` call; `collect_keyspace` is a good example, as are the `TimelineAdaptor` code used by the compaction simulator (unused in practice) and `ingest_xlog_dbase_create`.
- Turn the value read into an `async {}` block that reports its result through a `oneshot` channel instead of writing it straight into `ValuesReconstructState`. The `oneshot` channel is wrapped in `OnDiskValueIo` / `OnDiskValueIoWaiter` types that aid in expressiveness and are used to keep track of in-flight IOs so we can print warnings if we leave them dangling.
- Change `ValuesReconstructState` to hold the receiving end of the `oneshot` channel, aka `OnDiskValueIoWaiter`.
- Change `get_vectored_impl` to `collect_pending_ios` and issue walredo concurrently, in a `FuturesUnordered`.

Testing / Benchmarking:
- Expose the new config field via a) an env var and b) via NeonEnvBuilder. This will be cleaned up later.
- More benchmarking will happen post-merge in nightly benchmarks, plus in staging/pre-prod.
- Some intermediate helpers for manual benchmarking have been preserved in #10466 and will be landed in later PRs. (L0 layer stack generator!)

Drive-By:
- `config.compatibility_neon_binpath` is always truthy in our CI environment => https://neondb.slack.com/archives/C059ZC138NR/p1737490501941309
- Fixed an issue that surfaced through the added WARN logs emitted when dropping a `ValuesReconstructState` that still has in-flight IOs.
- Remove `pageserver_getpage_get_reconstruct_data_seconds` and `pageserver_getpage_reconstruct_seconds` because with planning, value read IO, and walredo happening concurrently, one can no longer attribute latency to any one of them; we'll revisit this when Vlad's work on tracing/sampling through RequestContext lands.
- Remove `get_cached_lsn()`. The logic around this has been dead at runtime for a long time, ever since the removal of the materialized page cache in #8105 (remove materialized page cache).
Testing
Unit tests use the sidecar task by default and run both modes in CI.
Python regression tests and benchmarks also use the sidecar task by default.
We'll test more in staging and possibly preprod.
Future Work
Please refer to the parent epic for the full plan.
The next step will be to fold the plumbing of IoConcurrency
into RequestContext so that the function signatures get cleaned up.
Once `Sequential` isn't used anymore, we can take the next big leap which is replacing the opaque IOs with structs that have well-defined semantics.