WIP: page_service: higher-resolution timer for batching #9822

problame · 2024-11-20T13:37:18Z

Problem

The page_service server-side batching does not support short batching timeouts (e.g. 10us).

The reason is that we use tokio::time::sleep , which doesn't have the required resolution.
(Tokio docs state millisecond-resolution).

Solution

Use the async-timer crate for high-resolution timers.
Specifically, we use the async-timer 1.0beta15 with features=["tokio1"].

On Linux, the timer is backed by a dedicated timerfd.
It is registered with tokio through the usual AsyncFd machinery.
This choice means each page_service connection consumes an additional (timer)filedescriptor, which is sub-optimal but tolerable.

Performance Testing

I used the benchmark to determine whether this change is moving things in the right direction.
I adjusted the runtime from 60 to 5 seconds for faster iteration.

For un-batchable workloads, we examine the wall clock time and CPU time spent.
The baseline is batching disabled, the configuration we measure is 10us.

For batchable workloads, we examine the wall clock time and batching factor.

Results TBD

Alternatives Explored

The Git history of this branch contains alternatives explored along with links to benchmark results

`async-timer` stable release 0.7.4

The stable release of async-timer is 0.7.4 at the time of writing.

It uses the signal-based POSIX timer APIs (timer_create, timer_settime, etc).

I don't have a lot of experience with signals but am generally quite wary about having signals at this incredibly high frequency.

Also, according to the man page, signal-based timer API comes with rlimit caveats, which would be something we have to keep in mind for the prod deployment.

Consequently, the number of timers is limited by the RLIMIT_SIGPENDING resource limit (see setrlimit(2)).

`tokio_timerfd::Delay`

On Linux, this performs identically to what is in this PR, i.e., to the async-timer 1.0 with features=["tokio1"].

However, timerfd is a Linux-only concept, so, we wouldn't be able to compile on macOS.

With this, 10us batching timeout works, but it has some other wrinkles: - it uses the signal-based timer APIs instead of going through epoll (=> timerfd) = it needs to make a syscall for each batch, which costs around 1-2us, so, probably significant CPU time wasted on this.

This reverts commit 1639b26.

batching at 10us doesn't work well enough, prob the future is ready too soon. batching factor is just 1.5 https://www.notion.so/neondatabase/benchmarking-notes-143f189e004780c4a630cb5f426e39ba?pvs=4#144f189e004780b79c8dd6d007dbb120

This reverts commit 81d9970.

Resolution not high enough to do _any_ batching at 10us or 20us https://www.notion.so/neondatabase/benchmarking-notes-143f189e004780c4a630cb5f426e39ba?pvs=4#144f189e0047800fb74bd8f4ab6cf8e2

This reverts commit 12124b2.

Yep, it's clearly the best one with best batching factor at lowest CPU usage. https://www.notion.so/neondatabase/benchmarking-notes-143f189e004780c4a630cb5f426e39ba?pvs=4#144f189e004780d0a205e081458b46db

problame · 2024-11-20T15:05:59Z

Kicked off some discussion on Slack about alternatives: https://neondb.slack.com/archives/C0277TKAJCA/p1732115135666759

github-actions · 2024-11-20T16:10:19Z

5535 tests run: 5309 passed, 0 failed, 226 skipped (full report)

Flaky tests (2)

Postgres 17

test_compute_pageserver_connection_stress: debug-x86-64

Postgres 15

test_cli_start_stop: release-arm64

Code coverage* (full report)

functions: 31.4% (7954 of 25321 functions)
lines: 49.4% (63098 of 127829 lines)

* collected from Rust tests only

_{The comment gets automatically updated with the latest test results
09e7485 at 2024-11-21T18:39:30.934Z :recycle:}

Results unchanged to 0.7.4 https://www.notion.so/neondatabase/benchmarking-notes-143f189e004780c4a630cb5f426e39ba?pvs=4#144f189e004780e18416cc0faf2aca65

Best batching factor so far with no worse degradation of un-batchable workloads than the other candidates. https://www.notion.so/neondatabase/benchmarking-notes-143f189e004780c4a630cb5f426e39ba?pvs=4#144f189e004780c0921fe99e1da0e8c9

=> zero batching https://www.notion.so/neondatabase/benchmarking-notes-143f189e004780c4a630cb5f426e39ba?pvs=4#144f189e00478065a9b3e51726082885

This reverts commit 721643b.

This reverts commit 68550f0.

This reverts commit c73e9e4.

This reverts commit 689788c.

Performs identically great to the async-timer::Timer features=tokio1 impl Makes sense because it's the same thing that's happening under the hood. https://www.notion.so/neondatabase/benchmarking-notes-143f189e004780c4a630cb5f426e39ba?pvs=4#144f189e004780ea9decc82281f6b8d1

This reverts commit fcda7a7.

This reverts commit 7be13bc.

This reverts commit 517dda8.

This PR adds two benchmark to demonstrate the effect of server-side getpage request batching added in #9321. For the CPU usage, I found the the `prometheus` crate's built-in CPU usage accounts the seconds at integer granularity. That's not enough you reduce the target benchmark runtime for local iteration. So, add a new `libmetrics` metric and report that. The benchmarks are disabled because [on our benchmark nodes, timer resolution isn't high enough](https://neondb.slack.com/archives/C059ZC138NR/p1732264223207449). They work (no statement about quality) on my bare-metal devbox. They will be refined and enabled once we find a fix. Candidates at time of writing are: - #9822 - #9851 Refs: - Epic: #9376 - Extracted from #9792

problame · 2024-11-29T14:50:45Z

Abandoned in favor of a timeout-less pipelined approach in #9851

# Problem The timeout-based batching adds latency to unbatchable workloads. We can choose a short batching timeout (e.g. 10us) but that requires high-resolution timers, which tokio doesn't have. I thoroughly explored options to use OS timers (see [this](#9822) abandoned PR). In short, it's not an attractive option because any timer implementation adds non-trivial overheads. # Solution The insight is that, in the steady state of a batchable workload, the time we spend in `get_vectored` will be hundreds of microseconds anyway. If we prepare the next batch concurrently to `get_vectored`, we will have a sizeable batch ready once `get_vectored` of the current batch is done and do not need an explicit timeout. This can be reasonably described as **pipelining of the protocol handler**. # Implementation We model the sub-protocol handler for pagestream requests (`handle_pagrequests`) as two futures that form a pipeline: 2. Batching: read requests from the connection and fill the current batch 3. Execution: `take` the current batch, execute it using `get_vectored`, and send the response. The Reading and Batching stage are connected through a new type of channel called `spsc_fold`. See the long comment in the `handle_pagerequests_pipelined` for details. # Changes - Refactor `handle_pagerequests` - separate functions for - reading one protocol message; produces a `BatchedFeMessage` with just one page request in it - batching; tried to merge an incoming `BatchedFeMessage` into an existing `BatchedFeMessage`; returns `None` on success and returns back the incoming message in case merging isn't possible - execution of a batched message - unify the timeline handle acquisition & request span construction; it now happen in the function that reads the protocol message - Implement serial and pipelined model - serial: what we had before any of the batching changes - read one protocol message - execute protocol messages - pipelined: the design described above - optionality for execution of the pipeline: either via concurrent futures vs tokio tasks - Pageserver config - remove batching timeout field - add ability to configure pipelining mode - add ability to limit max batch size for pipelined configurations (required for the rollout, cf neondatabase/cloud#20620 ) - ability to configure execution mode - Tests - remove `batch_timeout` parametrization - rename `test_getpage_merge_smoke` to `test_throughput` - add parametrization to test different max batch sizes and execution moes - rename `test_timer_precision` to `test_latency` - rename the test case file to `test_page_service_batching.py` - better descriptions of what the tests actually do ## On the holding The `TimelineHandle` in the pending batch While batching, we hold the `TimelineHandle` in the pending batch. Therefore, the timeline will not finish shutting down while we're batching. This is not a problem in practice because the concurrently ongoing `get_vectored` call will fail quickly with an error indicating that the timeline is shutting down. This results in the Execution stage returning a `QueryError::Shutdown`, which causes the pipeline / entire page service connection to shut down. This drops all references to the `Arc<Mutex<Option<Box<BatchedFeMessage>>>>` object, thereby dropping the contained `TimelineHandle`s. - => fixes #9850 # Performance Local run of the benchmarks, results in [this empty commit](1cf5b14) in the PR branch. Key take-aways: * `concurrent-futures` and `tasks` deliver identical `batching_factor` * tail latency impact unknown, cf #9837 * `concurrent-futures` has higher throughput than `tasks` in all workloads (=lower `time` metric) * In unbatchable workloads, `concurrent-futures` has 5% higher `CPU-per-throughput` than that of `tasks`, and 15% higher than that of `serial`. * In batchable-32 workload, `concurrent-futures` has 8% lower `CPU-per-throughput` than that of `tasks` (comparison to tput of `serial` is irrelevant) * in unbatchable workloads, mean and tail latencies of `concurrent-futures` is practically identical to `serial`, whereas `tasks` adds 20-30us of overhead Overall, `concurrent-futures` seems like a slightly more attractive choice. # Rollout This change is disabled-by-default. Rollout plan: - neondatabase/cloud#20620 # Refs - epic: #9376 - this sub-task: #9377 - the abandoned attempt to improve batching timeout resolution: #9820 - closes #9850 - fixes #9835

This was referenced Nov 20, 2024

Epic: get page throughput improvements #9376

Open

pageserver: batch get page requests and serve them with one vectored get #9377

Open

problame added 6 commits November 20, 2024 14:49

Revert "async-timer based approach"

f3ed569

This reverts commit 1639b26.

tokio::time::Interval based approach

81d9970

batching at 10us doesn't work well enough, prob the future is ready too soon. batching factor is just 1.5 https://www.notion.so/neondatabase/benchmarking-notes-143f189e004780c4a630cb5f426e39ba?pvs=4#144f189e004780b79c8dd6d007dbb120

Revert "tokio::time::Interval based approach"

1d85bec

This reverts commit 81d9970.

tokio_timerfd::Interval

12124b2

Resolution not high enough to do _any_ batching at 10us or 20us https://www.notion.so/neondatabase/benchmarking-notes-143f189e004780c4a630cb5f426e39ba?pvs=4#144f189e0047800fb74bd8f4ab6cf8e2

Revert "tokio_timerfd::Interval"

f9bf038

This reverts commit 12124b2.

problame force-pushed the problame/batching-timer branch from 77521e6 to f9bf038 Compare November 20, 2024 14:26

problame added 2 commits November 20, 2024 15:36

async-timer based approach (again, with data)

689788c

Yep, it's clearly the best one with best batching factor at lowest CPU usage. https://www.notion.so/neondatabase/benchmarking-notes-143f189e004780c4a630cb5f426e39ba?pvs=4#144f189e004780d0a205e081458b46db

undo local modifications to benchmark

7be13bc

problame added 14 commits November 20, 2024 18:32

try async-timer 1.0.0-beta15 (still signal-based timers)

c73e9e4

Results unchanged to 0.7.4 https://www.notion.so/neondatabase/benchmarking-notes-143f189e004780c4a630cb5f426e39ba?pvs=4#144f189e004780e18416cc0faf2aca65

async-timer 1.0.0-beta15 with features=tokio1

68550f0

Best batching factor so far with no worse degradation of un-batchable workloads than the other candidates. https://www.notion.so/neondatabase/benchmarking-notes-143f189e004780c4a630cb5f426e39ba?pvs=4#144f189e004780c0921fe99e1da0e8c9

try interval-based impl to cross-chec

721643b

=> zero batching https://www.notion.so/neondatabase/benchmarking-notes-143f189e004780c4a630cb5f426e39ba?pvs=4#144f189e00478065a9b3e51726082885

Revert "try interval-based impl to cross-chec"

5f3e6f3

This reverts commit 721643b.

Revert "async-timer 1.0.0-beta15 with features=tokio1"

cbb5817

This reverts commit 68550f0.

Revert "try async-timer 1.0.0-beta15 (still signal-based timers)"

21866fa

This reverts commit c73e9e4.

Revert "async-timer based approach (again, with data)"

469ce81

This reverts commit 689788c.

Revert "tokio_timerfd::Delay based impl"

f22ad86

This reverts commit fcda7a7.

vanilla tokio based timer impl based on tokio::time::Sleep

517dda8

Revert "undo local modifications to benchmark"

c68661d

This reverts commit 7be13bc.

Revert "vanilla tokio based timer impl based on tokio::time::Sleep"

89b6cb8

This reverts commit 517dda8.

the final choice: async-timer 1.0beta15 with features=["tokio1"]

fa7ce2c

Merge branch 'problame/merge-getpage-test' into problame/batching-timer

09e7485

This was referenced Nov 21, 2024

page_service: getpage batching: refactor & minor fixes #9792

Closed

page_service: add benchmark for batching #9820

Merged

page_service: rewrite batching to work without a timeout #9851

Merged

problame closed this Nov 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WIP: page_service: higher-resolution timer for batching #9822

WIP: page_service: higher-resolution timer for batching #9822

problame commented Nov 20, 2024 •

edited

Loading

problame commented Nov 20, 2024

github-actions bot commented Nov 20, 2024 •

edited

Loading

Postgres 17

Postgres 15

problame commented Nov 29, 2024

WIP: page_service: higher-resolution timer for batching #9822

WIP: page_service: higher-resolution timer for batching #9822

Conversation

problame commented Nov 20, 2024 • edited Loading

Problem

Solution

Performance Testing

Alternatives Explored

async-timer stable release 0.7.4

tokio_timerfd::Delay

problame commented Nov 20, 2024

github-actions bot commented Nov 20, 2024 • edited Loading

5535 tests run: 5309 passed, 0 failed, 226 skipped (full report)

Postgres 17

Postgres 15

Code coverage* (full report)

problame commented Nov 29, 2024

problame commented Nov 20, 2024 •

edited

Loading

`async-timer` stable release 0.7.4

`tokio_timerfd::Delay`

github-actions bot commented Nov 20, 2024 •

edited

Loading