
WIP: page_service: higher-resolution timer for batching #9822

Closed · wants to merge 22 commits

Conversation


@problame (Contributor) commented Nov 20, 2024

Problem

The page_service server-side batching does not support short batching timeouts (e.g. 10us).

The reason is that we use tokio::time::sleep, which doesn't have the required resolution (the tokio docs state millisecond resolution).
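As a quick illustration (a hypothetical snippet, not code from this PR): tokio's timer operates at millisecond granularity, so a microsecond-scale sleep gets rounded up.

```rust
use std::time::{Duration, Instant};

#[tokio::main]
async fn main() {
    let start = Instant::now();
    // Ask for a 10us sleep; tokio's timer wheel rounds this up to
    // roughly 1ms (its documented granularity), plus scheduling noise.
    tokio::time::sleep(Duration::from_micros(10)).await;
    println!("slept for {:?}", start.elapsed());
}
```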

Solution

Use the async-timer crate for high-resolution timers.
Specifically, we use async-timer 1.0.0-beta.15 with features=["tokio1"].

On Linux, the timer is backed by a dedicated timerfd.
It is registered with tokio through the usual AsyncFd machinery.
This choice means each page_service connection consumes an additional (timer) file descriptor, which is sub-optimal but tolerable.
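To illustrate the mechanism (a minimal sketch of the general timerfd + AsyncFd pattern, not the async-timer crate's actual internals):

```rust
use std::os::fd::{AsRawFd, FromRawFd, OwnedFd};
use std::time::Duration;
use tokio::io::unix::AsyncFd;

/// One-shot high-resolution sleep backed by a dedicated timerfd.
async fn timerfd_sleep(dur: Duration) -> std::io::Result<()> {
    // Create a non-blocking, monotonic timerfd.
    let raw = unsafe {
        libc::timerfd_create(libc::CLOCK_MONOTONIC, libc::TFD_NONBLOCK | libc::TFD_CLOEXEC)
    };
    if raw < 0 {
        return Err(std::io::Error::last_os_error());
    }
    let fd = unsafe { OwnedFd::from_raw_fd(raw) };
    // Arm it as a one-shot timer with nanosecond-resolution expiry.
    let spec = libc::itimerspec {
        it_interval: libc::timespec { tv_sec: 0, tv_nsec: 0 },
        it_value: libc::timespec {
            tv_sec: dur.as_secs() as libc::time_t,
            tv_nsec: dur.subsec_nanos() as libc::c_long,
        },
    };
    if unsafe { libc::timerfd_settime(fd.as_raw_fd(), 0, &spec, std::ptr::null_mut()) } < 0 {
        return Err(std::io::Error::last_os_error());
    }
    // Register the fd with tokio's reactor (epoll) and wait until it
    // becomes readable, i.e., until the timer expires.
    let afd = AsyncFd::new(fd)?;
    let mut guard = afd.readable().await?;
    guard.clear_ready();
    Ok(())
}
```

This is also where the extra per-connection file descriptor comes from: every timer instance owns its own timerfd.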

Performance Testing

I used the benchmark to determine whether this change is moving things in the right direction.
I adjusted the runtime from 60 to 5 seconds for faster iteration.

For un-batchable workloads, we examine the wall clock time and CPU time spent.
The baseline is batching disabled; the configuration we measure uses a 10us batching timeout.

For batchable workloads, we examine the wall clock time and batching factor.

Results TBD

Alternatives Explored

The Git history of this branch contains the alternatives explored, along with links to benchmark results.

async-timer stable release 0.7.4

The stable release of async-timer is 0.7.4 at the time of writing.

It uses the signal-based POSIX timer APIs (timer_create, timer_settime, etc).

I don't have a lot of experience with signals, but I am generally quite wary of signal delivery at this incredibly high frequency.

Also, according to the man page, the signal-based timer API comes with rlimit caveats, which we would have to keep in mind for the prod deployment:

> Consequently, the number of timers is limited by the RLIMIT_SIGPENDING resource limit (see setrlimit(2)).

With this, the 10us batching timeout works, but it has some other wrinkles:

- it uses the signal-based timer APIs instead of going through epoll (=> timerfd), so it needs to make a syscall for each batch, which costs around 1-2us; probably significant CPU time is wasted on this
- batching at 10us doesn't work well enough, probably because the future is ready too soon; the batching factor is just 1.5

https://www.notion.so/neondatabase/benchmarking-notes-143f189e004780c4a630cb5f426e39ba?pvs=4#144f189e004780b79c8dd6d007dbb120

tokio_timerfd::Delay

On Linux, this performs identically to what is in this PR, i.e., to async-timer 1.0 with features=["tokio1"].

However, timerfd is a Linux-only concept, so we wouldn't be able to compile on macOS.
@problame force-pushed the problame/batching-timer branch from 77521e6 to f9bf038 on November 20, 2024, 14:26
@problame (Contributor, Author) commented:

Kicked off some discussion on Slack about alternatives: https://neondb.slack.com/archives/C0277TKAJCA/p1732115135666759


github-actions bot commented Nov 20, 2024

5535 tests run: 5309 passed, 0 failed, 226 skipped (full report)


Flaky tests (2)

- Postgres 17
- Postgres 15

Code coverage* (full report)

  • functions: 31.4% (7954 of 25321 functions)
  • lines: 49.4% (63098 of 127829 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results: 09e7485 at 2024-11-21T18:39:30.934Z

Performs just as well as the async-timer::Timer features=["tokio1"] impl.
Makes sense, because it's the same thing happening under the hood.

https://www.notion.so/neondatabase/benchmarking-notes-143f189e004780c4a630cb5f426e39ba?pvs=4#144f189e004780ea9decc82281f6b8d1
github-merge-queue bot pushed a commit that referenced this pull request Nov 25, 2024
This PR adds two benchmarks to demonstrate the effect of server-side getpage request batching added in #9321.

For the CPU usage, I found that the `prometheus` crate's built-in CPU usage metric accounts seconds at integer granularity. That's not enough once you reduce the target benchmark runtime for local iteration. So, add a new `libmetrics` metric and report that.
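A sketch of the idea (hypothetical code, not the actual `libmetrics` change), assuming we read CPU time via getrusage(2) through the libc crate:

```rust
/// Process CPU time (user + system) as fractional seconds, at
/// microsecond granularity, instead of the integer seconds reported
/// by the `prometheus` crate's built-in process collector.
fn cpu_seconds() -> f64 {
    let mut ru: libc::rusage = unsafe { std::mem::zeroed() };
    // SAFETY: getrusage only writes into `ru`.
    unsafe { libc::getrusage(libc::RUSAGE_SELF, &mut ru) };
    let tv = |t: libc::timeval| t.tv_sec as f64 + t.tv_usec as f64 / 1e6;
    tv(ru.ru_utime) + tv(ru.ru_stime)
}
```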

The benchmarks are disabled because [on our benchmark nodes, timer
resolution isn't high
enough](https://neondb.slack.com/archives/C059ZC138NR/p1732264223207449).
They work (no statement about quality) on my bare-metal devbox.

They will be refined and enabled once we find a fix. Candidates at time
of writing are:
- #9822
- #9851


Refs:

- Epic: #9376
- Extracted from #9792
@problame (Contributor, Author) commented:

Abandoned in favor of a timeout-less pipelined approach in #9851

@problame closed this Nov 29, 2024
github-merge-queue bot pushed a commit that referenced this pull request Nov 30, 2024
# Problem

The timeout-based batching adds latency to unbatchable workloads.

We can choose a short batching timeout (e.g. 10us), but that requires high-resolution timers, which tokio doesn't have.
I thoroughly explored options to use OS timers (see
[this](#9822) abandoned PR).
In short, it's not an attractive option because any timer implementation
adds non-trivial overheads.
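For context, a minimal sketch of the timeout-based batching loop this commit replaces (hypothetical names, with a channel standing in for the page_service connection):

```rust
use std::time::Duration;
use tokio::sync::mpsc;

/// Hypothetical request type, for the sketch only.
struct Request;

async fn next_batch(rx: &mut mpsc::Receiver<Request>, timeout: Duration) -> Vec<Request> {
    let mut batch = Vec::new();
    // Wait for the first request without a deadline.
    match rx.recv().await {
        Some(first) => batch.push(first),
        None => return batch, // connection closed
    }
    // Keep adding requests until the batching timeout fires. With
    // tokio::time::sleep, the smallest useful timeout is ~1ms; hence
    // the search for high-resolution timers in #9822.
    let deadline = tokio::time::sleep(timeout);
    tokio::pin!(deadline);
    loop {
        tokio::select! {
            _ = &mut deadline => break, // timeout: flush what we have
            req = rx.recv() => match req {
                Some(req) => batch.push(req),
                None => break,
            },
        }
    }
    batch
}
```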

# Solution

The insight is that, in the steady state of a batchable workload, the
time we spend in `get_vectored` will be hundreds of microseconds anyway.

If we prepare the next batch concurrently with `get_vectored`, we will have a sizeable batch ready once `get_vectored` of the current batch is done, and we do not need an explicit timeout.

This can be reasonably described as **pipelining of the protocol
handler**.

# Implementation

We model the sub-protocol handler for pagestream requests (`handle_pagerequests`) as two futures that form a pipeline:

1. Batching: read requests from the connection and fill the current batch.
2. Execution: `take` the current batch, execute it using `get_vectored`, and send the response.

The Batching and Execution stages are connected through a new type of channel called `spsc_fold`.

See the long comment in `handle_pagerequests_pipelined` for details.
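To give the flavor of such a channel (a minimal sketch of the folding idea, not the actual `spsc_fold` implementation or its API):

```rust
use std::sync::{Arc, Mutex};
use tokio::sync::Notify;

/// Single-producer single-consumer slot: a send folds the new value
/// into the pending one if the consumer hasn't taken it yet.
struct SpscFold<T> {
    slot: Mutex<Option<T>>,
    notify: Notify,
}

impl<T> SpscFold<T> {
    fn new() -> Arc<Self> {
        Arc::new(Self { slot: Mutex::new(None), notify: Notify::new() })
    }

    /// `fold` merges the new value into the pending one, e.g. appends
    /// a page request to the pending batch.
    fn send(&self, value: T, fold: impl FnOnce(&mut T, T)) {
        let mut slot = self.slot.lock().unwrap();
        match slot.as_mut() {
            Some(pending) => fold(pending, value),
            None => *slot = Some(value),
        }
        // Notify::notify_one stores a permit if the receiver isn't
        // currently waiting, so no wakeup is lost.
        self.notify.notify_one();
    }

    async fn recv(&self) -> T {
        loop {
            if let Some(v) = self.slot.lock().unwrap().take() {
                return v;
            }
            self.notify.notified().await;
        }
    }
}
```

The batching stage `send`s with a fold function that merges requests into the pending batch; the execution stage `recv`s a whole batch at once. (The real implementation additionally handles the case where merging isn't possible; see the code.)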

# Changes

- Refactor `handle_pagerequests`
    - separate functions for
        - reading one protocol message; produces a `BatchedFeMessage` with just one page request in it
        - batching; tries to merge an incoming `BatchedFeMessage` into an existing `BatchedFeMessage`; returns `None` on success and returns the incoming message back in case merging isn't possible
        - execution of a batched message
    - unify the timeline handle acquisition & request span construction; it now happens in the function that reads the protocol message
- Implement serial and pipelined models
    - serial: what we had before any of the batching changes
        - read one protocol message
        - execute protocol messages
    - pipelined: the design described above
        - optionality for execution of the pipeline: either via concurrent futures or via tokio tasks (see the sketch after this list)
- Pageserver config
    - remove batching timeout field
    - add ability to configure pipelining mode
    - add ability to limit max batch size for pipelined configurations (required for the rollout, cf neondatabase/cloud#20620)
    - ability to configure execution mode
- Tests
    - remove `batch_timeout` parametrization
    - rename `test_getpage_merge_smoke` to `test_throughput`
    - add parametrization to test different max batch sizes and execution modes
    - rename `test_timer_precision` to `test_latency`
    - rename the test case file to `test_page_service_batching.py`
    - better descriptions of what the tests actually do
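A minimal sketch of the two execution modes (hypothetical stage futures, not the actual pageserver plumbing):

```rust
use std::future::Future;

/// Poll both pipeline stages concurrently on the same task.
async fn run_concurrent_futures(
    batching: impl Future<Output = ()>,
    execution: impl Future<Output = ()>,
) {
    tokio::join!(batching, execution);
}

/// Run each pipeline stage as its own tokio task, allowing the stages
/// to execute on different threads (at the cost of cross-task overhead).
async fn run_tasks(
    batching: impl Future<Output = ()> + Send + 'static,
    execution: impl Future<Output = ()> + Send + 'static,
) {
    let b = tokio::spawn(batching);
    let e = tokio::spawn(execution);
    let _ = tokio::join!(b, e);
}
```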

## On holding the `TimelineHandle` in the pending batch

While batching, we hold the `TimelineHandle` in the pending batch.
Therefore, the timeline will not finish shutting down while we're
batching.

This is not a problem in practice because the concurrently ongoing
`get_vectored` call will fail quickly with an error indicating that the
timeline is shutting down.
This results in the Execution stage returning a `QueryError::Shutdown`,
which causes the pipeline / entire page service connection to shut down.
This drops all references to the
`Arc<Mutex<Option<Box<BatchedFeMessage>>>>` object, thereby dropping the
contained `TimelineHandle`s.

=> fixes #9850

# Performance

Local run of the benchmarks; results are in [this empty commit](1cf5b14) in the PR branch.

Key take-aways:

* `concurrent-futures` and `tasks` deliver identical `batching_factor`
* tail latency impact unknown, cf #9837
* `concurrent-futures` has higher throughput than `tasks` in all workloads (= lower `time` metric)
* in unbatchable workloads, `concurrent-futures` has 5% higher `CPU-per-throughput` than `tasks`, and 15% higher than `serial`
* in the batchable-32 workload, `concurrent-futures` has 8% lower `CPU-per-throughput` than `tasks` (comparison to the tput of `serial` is irrelevant)
* in unbatchable workloads, mean and tail latencies of `concurrent-futures` are practically identical to `serial`, whereas `tasks` adds 20-30us of overhead

Overall, `concurrent-futures` seems like a slightly more attractive
choice.

# Rollout

This change is disabled by default.

Rollout plan:
- neondatabase/cloud#20620

# Refs

- epic: #9376
- this sub-task: #9377
- the abandoned attempt to improve batching timeout resolution: #9822
- closes #9850
- fixes #9835
awarus pushed a commit that referenced this pull request Dec 5, 2024