
[DRAFT, don't review yet] Rewrite the IO scheduling algorithm to ensure cross-shard fairness of tokens #2596

Closed
wants to merge 4 commits

Conversation

@michoecho (Contributor) commented Dec 23, 2024

Refs #1083. This is a dirty attempt to fix the lack of cross-shard fairness.

This draft was only created as an anchor for some comments posted in a different thread. Please don't review it (at least yet).

@@ -160,6 +160,10 @@ public:
_rovers.release(tokens);
}

void refund(T tokens) noexcept {
@michoecho (Contributor, Author) commented on this line on Dec 23, 2024

Reposting a comment by @xemul which he wrote in michoecho@b0ec97d#r150664671. (I created this PR just to anchor his comment to something. Comments attached to commits are hard to find).

We had some experience with returning tokens to the bucket. It didn't work well: mixing time-based replenishment with token-based replenishment had a weird effect. Mind reading #1766, specifically #1766 (comment) and #1766 (comment), for details. If we're going to go with this fix, we need some justification for why we won't step on the same problem again.

@michoecho michoecho changed the title Rewrite the IO scheduling algorithm to ensure cross-shard fairness of tokens [DRAFT, don't review yet] Rewrite the IO scheduling algorithm to ensure cross-shard fairness of tokens Dec 23, 2024
@xemul (Contributor) commented Dec 23, 2024

This is a dirty attempt to fix the lack of cross-shard fairness.

Why cross-shard fairness? Not dropping the preempted capacity on the floor sounds like "fix the shard-local preemption" to me (#2591)

@michoecho (Contributor, Author) commented Dec 23, 2024

This is a dirty attempt to fix the lack of cross-shard fairness.

Why cross-shard fairness? Not dropping the preempted capacity on the floor sounds like "fix the shard-local preemption" to me

It fixes both. The main aim of this patch is to add cross-shard fairness by grabbing tokens for many requests at once. The local preemption problem is handled as a byproduct.

_queued_cap is the sum of the capacities of all queued (i.e. not yet dispatched and not cancelled) requests. When we grab tokens, we grab min(per_tick_grab_threshold, _queued_cap) tokens. Grabbing a fixed amount of tokens at a time, rather than a fixed number of requests (i.e. one request) at a time, is the key part that gives us fairness of tokens instead of fairness of IOPS.

Preemption is handled like this: conceptually, there is no dedicated "the pending request". Rather, there is a pending token reservation, and when we finally get some tokens out of it, we just spend them on the highest-priority request at the moment. If, due to this, we are left with a bunch of tokens we can't use immediately (e.g. because a request with 100k tokens butted into a reservation made earlier for a 1.5M request, so we are left with 1.4M after dispatching the 100k), we "roll them over" to the next reservation, essentially by grabbing min(per_tick_grab_threshold + leftovers, _queued_cap) (and then immediately refunding the unused tokens to the group head so that they are used by others rather than wasted).

For example, if we do a pending reservation of cap=1.5M at wanthead=10M, and we call dispatch_requests after grouphead=11M, and the highest-priority request at the moment is 100k tokens, then we dispatch 100k, grab another pending allocation with cap=2.9M at wanthead=13.9M, and atomically advance the grouphead by 1.4M (so to 12.4M assuming no interference).

(Note that this also means that requests bigger than per_tick_grab_threshold can take several calls to dispatch_requests to finally accumulate enough tokens to execute.)
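Here is a rough, self-contained sketch of one dispatch_requests() poll under the scheme described above. All names (toy_bucket, toy_shard, the per_tick_grab_threshold member, etc.) are illustrative, not the actual code in this PR; it just restates the rules from the previous paragraphs, and it deliberately does not model the cancel-and-re-reserve case discussed further down the thread.

```cpp
#include <algorithm>
#include <cstdint>
#include <deque>

using capacity_t = std::uint64_t;

struct toy_bucket {
    capacity_t tail = 0;   // tokens reserved so far, by all shards together
    capacity_t head = 0;   // tokens replenished (or refunded) so far

    // Reserve `cap` tokens; returns the "want head" this reservation waits for.
    capacity_t reserve(capacity_t cap) { tail += cap; return tail; }

    // How much of the reservation [want_head - cap, want_head) is covered by `head`.
    capacity_t granted(capacity_t want_head, capacity_t cap) const {
        capacity_t start = want_head - cap;
        return head <= start ? capacity_t(0) : std::min(head - start, cap);
    }

    // Hand unused tokens back to the group head so other shards can use them.
    void refund(capacity_t cap) { head += cap; }
};

struct toy_shard {
    std::deque<capacity_t> queue;              // queued request costs, front = highest priority
    capacity_t queued_cap = 0;                 // sum of the costs of all queued requests
    capacity_t per_tick_grab_threshold = 0;
    capacity_t pending_cap = 0;                // size of the outstanding reservation (0 = none)
    capacity_t want_head = 0;                  // its position in the bucket

    // One dispatch_requests() poll.
    void poll(toy_bucket& b) {
        if (pending_cap && b.granted(want_head, pending_cap) < pending_cap) {
            return;  // reservation not fully covered yet (the real patch can also
                     // cancel and re-reserve here; see the "hoarding" discussion below)
        }
        capacity_t tokens = pending_cap;       // the fully granted reservation
        pending_cap = 0;
        // Spend the tokens on the highest-priority requests of the moment.
        while (tokens > 0 && !queue.empty() && queue.front() <= tokens) {
            tokens -= queue.front();
            queued_cap -= queue.front();
            queue.pop_front();                 // "dispatch"
        }
        // Leftovers are not hoarded: roll them into the next, bigger reservation
        // and refund them immediately so other shards can use them in the meantime.
        capacity_t leftovers = tokens;
        capacity_t want = std::min(per_tick_grab_threshold + leftovers, queued_cap);
        if (want > 0) {
            pending_cap = want;
            want_head = b.reserve(want);
        }
        b.refund(leftovers);
    }
};
```

With a 6-token request and a 5-token threshold, this model reserves 5, rolls the 5 leftovers into a 6-token reservation on the next poll, and dispatches on the poll after that, which is the multi-grab behaviour described above.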

Note that this change means that the worst-case I/O latency effectively increases by one io-latency-goal (because each shard can now allocate up to max_request_size + io-latency-goal / smp::count per dispatch_requests due to the leftovers, instead of the old max_request_size), but I don't really see a way around this. Every algorithm I could think of which solves the local-preemption waste problem does that.
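(A sketch of where that bound comes from, assuming per_tick_grab_threshold is sized at one io-latency-goal's worth of tokens divided by smp::count, which is what the claim above implies:)

```latex
\text{per-poll reservation}
  \;\le\; \underbrace{\texttt{per\_tick\_grab\_threshold}}_{\approx\ \text{io-latency-goal}/\texttt{smp::count}}
  \;+\; \underbrace{\text{leftovers}}_{<\ \texttt{max\_request\_size}}
```

The leftovers are always smaller than the cost of the request they failed to cover, hence the max_request_size term; summed over all shards, the threshold terms add up to one io-latency-goal of extra worst-case latency.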

Also note that I didn't give any thought to the issues you mention in #1766 when I was writing this patch. I only glanced at #1766 and haven't had time to think about how this patch interacts with them yet.

@xemul (Contributor) commented Dec 23, 2024

The main aim of this patch is to add cross-shard fairness by grabbing tokens for many requests at once

Mind the following: all shards together cannot grab more than the full token-bucket limit, so there's a natural cap on the amount of tokens a shard can get. E.g. here's how the token bucket is configured for the io-properties that I have:

  token_bucket:
    limit: 12582912
    rate: 16777216
    threshold: 80612

And the request costs can grow as large as:

    131072: // request size in bytes
      read: 1312849
      write: 3038283

so you can charge at most ~9.6 128k reads against that limit for the whole node. That's not much.
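(A quick check of that figure, using only the numbers quoted above:)

```cpp
// Quick check of the "~9.6 reads per bucket" figure, using the values quoted above.
#include <cstdio>

int main() {
    const double limit = 12582912;            // token-bucket limit
    const double read_cost_128k = 1312849;    // cost of one 128k read
    const double write_cost_128k = 3038283;   // cost of one 128k write

    std::printf("128k reads per bucket:  %.1f\n", limit / read_cost_128k);   // ~9.6
    std::printf("128k writes per bucket: %.1f\n", limit / write_cost_128k);  // ~4.1
}
```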

@xemul (Contributor) commented Dec 23, 2024

It fixes both.

OK, let's consider some simple io-tester job:

- name: shard_0
  shard_info:
    parallelism: 32
    reqsize: 4kB
    shares: 100
  shards:
  - '0'
  type: randread
- name: shard_1
  shard_info:
    parallelism: 32
    reqsize: 4kB
    shares: 800
  shards:
  - '1'
  type: randread

What would the result be with this PR?

@xemul (Contributor) commented Dec 23, 2024

(Note that this also means that requests bigger than per_tick_grab_threshold can take several calls to dispatch_requests to finally accumulate enough tokens to execute.)

So you reserve capacity for a large request in several grabs. There's one thing that bothers me; below is a very simplified example that tries to demonstrate it.

There are 2 shards, a 10-token limit, and requests that cost 6 tokens each. Here's how it will move:

shard0
shard1
       ------------------------------------------------------------------
       |         |
       tail      head

// shard0 grabs 6 tokens and dispatches

shard0 |------|
shard1
       ------------------------------------------------------------------
              |  |
              t  head

// shard1 grabs 6 tokens and gets "pending"

shard0 |------|
shard1        |------|
       ------------------------------------------------------------------
                 |   |
                 h   t

// time to replenish 4 more tokens elapses, shard1 dispatches

shard0 |------|
shard1        |------|
       ------------------------------------------------------------------
                     |
                     t,h

Here, the disk gets one request in the middle of the timeline and another at the end of it. Now let's grab tokens in batches, with a per-tick threshold of 5 tokens:

shard0
shard1
       ------------------------------------------------------------------
       |         |
       tail      head

// shard0 grabs 5 tokens and waits

shard0 |-----|
shard1
       ------------------------------------------------------------------
             |   |
             t   head

// shard1 grabs 5 tokens and waits

shard0 |-----|
shard1       |-----|
       ------------------------------------------------------------------
                 | |
                 h t

// time to replenish 4 more tokens elapses

shard0 |-----|
shard1       |-----|
       ------------------------------------------------------------------
                   |   |
                   t   h

// both shards get 1 token each and dispatch

shard0 |-----|     |-|
shard1       |-----| |-|
       ------------------------------------------------------------------
                   |   |
                   t   h

Here, the disk idles up until the end of the timeline, then gets two requests in one go. Effectively, we redistributed the load by making it smaller (down to idle) and then compensating for the idleness after the next replenish took place (2x).

It's not necessarily a problem, as shards don't always line up as in the example above. However, in the current implementation the rovers serve two purposes: they account for the available and consumed tokens, and they form a queue of requests (group.grab_capacity(cap) returns the "position" in this queue). Batched grabbing, with one request requiring several "grabs", breaks the queuing facility. And while it's OK to overdispatch a disk with short requests, this will cause overdispatching with long requests, which is worse.

The same thing, btw, happens with the patch from #2591 :(

@michoecho (Contributor, Author) commented Dec 23, 2024

There's one thing that bothers me.
...
There are 2 shards, 10 tokens limit and requests cost 6 tokens each. Here's how it will move:

@xemul Not quite. The patch anticipates this, and does something different. The important part is: we avoid "hoarding" tokens — if we successfully grabbed some tokens, but they aren't enough to fulfill the highest-priority request, we don't keep the tokens and wait until we grab the necessary remainder, but we "roll over" the tokens by releasing them back to the bucket and immediately make a bigger combined reservation.

So in your example, if shard 0 calls dispatch_requests again after shard 1 made its reservation (even before any replenishment happens), and sees that it has successfully grabbed 5 tokens, it won't allocate the remaining 1 token right after shard 1's allocation, but it will instead immediately return its 5 tokens to the bucket and request a combined reservation of 6 tokens right after shard 1's. And then shard 1 on the next dispatch_requests call will see that it can't fulfill the request with the ongoing allocation of 5 tokens, so it will cancel the allocation by returning the 3 tokens it already has and moving the group head after the 2 tokens which are right after the head, and it will make a combined reservation of 6 tokens after shard 0's allocation. And then shard 0 on its next dispatch_requests call will see that its allocation is fulfilled, and will dispatch the request.

So a shard never "hoards" allocated-but-not-dispatched tokens for more than one poll period. If it sees that it can't dispatch in this grab cycle, it immediately hands the tokens over to the next shard in the queue and makes up for it in the next cycle. So the first actually dispatchable request will be dispatched as soon as there are enough tokens in the bucket and all shards have done one poll cycle to hand their tokens over to the dispatchable request. Dispatching isn't delayed to the end of the timeline; it's delayed by at most one poll cycle from the optimal dispatch point.

@xemul (Contributor) commented Dec 23, 2024

The important part is: we avoid "hoarding" tokens — if we successfully grabbed some tokens, but they aren't enough to fulfill the highest-priority request, we don't keep the tokens and wait until we grab the necessary remainder, but we "roll over" the tokens by releasing them back to the bucket and immediately make a bigger combined reservation.

I need to check the final version of the patch; this explanation is not clear.

First, please clarify what "not enough" means. Assume in my example shard-0 first tries to grab 5 tokens. That's not enough, right? But why does it grab 5 tokens if it knows that it will need 6? Or does it grab 6 from the very beginning?
Next, what if a shard grabs the 6 tokens that are needed for the request, but the tail overruns the head? Is that still/also "not enough"? Or is it something else?

@michoecho (Contributor, Author) commented Dec 30, 2024

@xemul Here's an io-tester-based reproducer, which tries to match the problematic Scylla workload closely:

io_properties.yaml:

disks:
- mountpoint: /home
  read_bandwidth: 1542559872
  read_iops: 218786
  write_bandwidth: 1130867072
  write_iops: 121499

conf.yaml:

- name: tablet-streaming
  data_size: 1GB
  shards: all
  type: seqread
  shard_info:
    parallelism: 50
    reqsize: 128kB
    shares: 200

- name: cassandra-stress
  shards: all
  type: randread
  data_size: 1GB
  shard_info:
    parallelism: 100
    reqsize: 1536
    shares: 1000
    rps: 50
    pause_distribution: poisson

- name: cassandra-stress-slight-imbalance
  shards: [0]
  type: randread
  data_size: 1GB
  shard_info:
    parallelism: 100
    reqsize: 1536
    class: cassandra-stress
    rps: 5
    pause_distribution: poisson

Note: this describes a workload which makes 5000 small (1.5kiB) high-priority reads per second per shard, and wants to use all spare capacity for a batch workload with low shares and large (128 kiB) request sizes. The disk can take 25k small requests per second per shard, so the high-priority part demands about 20% of the bandwidth. With those shares, it should be guaranteed ~83% of the bandwidth, so 20% should be no problem. Shard 0 is given a slightly bigger high-priority load (5500 requests/s instead of 5000 requests/s) just to ensure that it's always the bottleneck shard, to make results easier to compare. It's not strictly necessary for the problem to occur.
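(A recap of the arithmetic in that note, using the numbers from the job file; the 25k-requests-per-second-per-shard capacity figure is the note's own estimate:)

```cpp
// Arithmetic behind the note above. Values come from the job file and the note
// itself (the 25k req/s per-shard small-read capacity is the note's estimate).
#include <cstdio>

int main() {
    const double hp_rps       = 100 * 50;   // parallelism * rps = 5000 high-prio reads/s per shard
    const double disk_rps     = 25000;      // small-read capacity per shard (from the note)
    const double hp_shares    = 1000;       // cassandra-stress shares
    const double batch_shares = 200;        // tablet-streaming shares

    std::printf("high-prio demand:               %.0f%% of capacity\n",
                100 * hp_rps / disk_rps);                         // ~20%
    std::printf("high-prio entitlement by shares: %.0f%%\n",
                100 * hp_shares / (hp_shares + batch_shares));    // ~83%
}
```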

Shard 0 before this PR (i.e. Seastar master, except with the local-preemption token-loss fix from #2591 (comment) applied; without that fix it's naturally even worse):

  cassandra-stress:
    throughput: 4062.70776  # kB/s
    IOPS: 2708.47192
    latencies:  # usec
      average: 36657
      p0.5: 35452
      p0.95: 63897
      p0.99: 68938
      p0.999: 69235
      max: 69850
    stats:
      total_requests: 27140
      io_queue_total_exec_sec: 3.3438970700000334
      io_queue_total_delay_sec: 1183.7850868689975
      io_queue_total_operations: 32240
      io_queue_starvation_time_sec: 6.7070636189999808
      io_queue_consumption: 0.1794613218307495
      io_queue_adjusted_consumption: 0.00059541183710098262
      io_queue_activations: 6

Shard 0 after this PR:

  cassandra-stress:
    throughput: 7484.60596  # kB/s
    IOPS: 4989.93701
    latencies:  # usec
      average: 553
      p0.5: 488
      p0.95: 1023
      p0.99: 1328
      p0.999: 6769
      max: 24388
    stats:
      total_requests: 49992
      io_queue_total_exec_sec: 7.4905127110000276
      io_queue_total_delay_sec: 22.906126038999933
      io_queue_total_operations: 54992
      io_queue_starvation_time_sec: 3.2223555049999875
      io_queue_consumption: 0.30610846805572511
      io_queue_adjusted_consumption: 0.0061656232476234437
      io_queue_activations: 9568

(All other shards have latency better than shard 0. Disk bandwidth is saturated in both cases.)

@michoecho (Contributor, Author) commented Dec 30, 2024

By the way, this thing:

max: 24388

is caused by yet another scheduler bug. The tau thing intends to let a newly-activated class hog the queue for a moment (without breaking fairness across classes). But it doesn't take into account that the newly-activated class can be competing with other shards. As a result, the class can monopolize the queue for smp::count * tau, not for tau. And that's what happens here: tablet-streaming monopolizes the queue for ~25 ms when it starts, hence such a high max. (Confirmed with tracing.)

Streaming is only activated once in this case, but if you have a bursty low-priority workload combined with a high-priority interactive workload, this pause can happen regularly and make the tail latency of the highprio workload very high.

(Proof, with --smp=7 and same io-priorities as before, with this PR):

- name: filler
  data_size: 1GB
  shards: all
  type: seqread
  shard_info:
    parallelism: 10
    reqsize: 128kB
    shares: 10

- name: bursty_lowprio
  data_size: 1GB
  shards: all
  type: seqread
  shard_info:
    parallelism: 1
    reqsize: 128kB
    shares: 100
    batch: 50
    rps: 8

- name: highprio
  shards: all
  type: randread
  data_size: 1GB
  shard_info:
    parallelism: 100
    reqsize: 1536
    shares: 1000
    rps: 50
  options:
    pause_distribution: poisson
    sleep_type: steady

Result:

...
  highprio:
    throughput: 7037.13281  # kB/s
    IOPS: 4691.81738
    latencies:  # usec
      average: 3583
      p0.5: 550
      p0.95: 24929
      p0.99: 30138
      p0.999: 46101
      max: 60998
    stats:
      total_requests: 47422
      io_queue_total_exec_sec: 5.4919087659999608
      io_queue_total_delay_sec: 164.45132553200241
      io_queue_total_operations: 47423
      io_queue_starvation_time_sec: 7.2850427120000179
      io_queue_consumption: 0.26397624891996385
      io_queue_adjusted_consumption: 0.08064465415477752
      io_queue_activations: 11563

(Note how the total bandwidth consumption of highprio is only 0.26 / 10 * smp::count ~= 18%, and it has >90% shares across all classes, yet bursty_lowprio is allowed to butt in regularly and raise its p95 to 24ms).

@xemul (Contributor) commented Jan 9, 2025

@xemul Here's an io-tester-based reproducer, which tries to match the problematic Scylla workload closely:

How (and where) should I try it?

On my local node with --conf test.yaml --io-properties-file io.yaml -c16 --storage /home/xfs I don't see results that match yours, even in relative terms. In particular:

master:

  cassandra-stress:
    throughput: 7483.85986  # kB/s
    IOPS: 4989.30615

with "my suggested preemption fixlet"

  cassandra-stress:
    throughput: 7481.69922  # kB/s
    IOPS: 4987.96582

this PR

  cassandra-stress:
    throughput: 7471.59277  # kB/s
    IOPS: 4981.09521

No difference at all.

My local disk is naturally faster than the one described by your io-properties.yaml:

disks:
  - mountpoint: /home/xfs
    read_iops: 442784
    read_bandwidth: 3555216640
    write_iops: 369838
    write_bandwidth: 1489170944

but using it makes no difference either

@xemul (Contributor) commented Jan 9, 2025

I also tested this PR with the job above. The result is:

- shard: 0
  shard_0:
    throughput: 584428.438  # kB/s
    IOPS: 146107.703
- shard: 1
  shard_1:
    throughput: 587684.438  # kB/s
    IOPS: 146921.703

So I would suggest refraining from calling this PR "cross-shard fairness", because it's not. "Cross-shard fairness" historically refers to a different effect in several other PRs/issues/docs, so using the same wording here creates unwanted confusion.

@michoecho (Contributor, Author) commented

Closing in favor of #2616

@michoecho michoecho closed this Jan 14, 2025