[DRAFT, don't review yet] Rewrite the IO scheduling algorithm to ensure cross-shard fairness of tokens #2596
@@ -160,6 +160,10 @@ public:
        _rovers.release(tokens);
    }

    void refund(T tokens) noexcept {
Reposting a comment by @xemul which he wrote in michoecho@b0ec97d#r150664671. (I created this PR just to anchor his comment to something. Comments attached to commits are hard to find).
We had some experience with returning tokens to the bucket. It didn't work well: mixing time-based replenishment with token-based replenishment had a weird effect. Please read #1766, specifically the #1766 (comment) and #1766 (comment) comments, for details. If we're going to go with this fix, we need some justification of why we won't step on the same problem again.
Why cross-shard fairness? Not dropping the preempted capacity on the floor sounds like "fix the shard-local preemption" to me (#2591)
It fixes both. The main aim of this patch is to add cross-shard fairness by grabbing tokens for many requests at once. The local preemption problem is handled as a byproduct.
Preemption is handled like this: conceptually there is no dedicated "the pending request". Rather, there is a pending token reservation, and when we finally get some tokens out of it, we just spend them on the highest-priority request at that moment. If, due to this, we are left with a bunch of tokens we can't use immediately (e.g. because a request with 100k tokens butted into a reservation made earlier for a 1.5M request, so we are left with 1.4M after dispatching the 100k), we "roll them over" to the next reservation, essentially by grabbing
For example, if we do a pending reservation of cap=1.5M at wanthead=10M, and we call
(Note that this also means that requests bigger than
Note that this change means that the worst-case I/O latency effectively increases by one io-latency-goal (because each shard can now allocate up to
Also note that I didn't give any thought to the issues you mention in #1766 when I was writing this patch. I only glanced at #1766 and haven't yet had the time to think about how this patch interacts with them.
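To make the roll-over arithmetic in the example above concrete (a 100k request consuming part of a reservation made for a 1.5M request), here is a minimal sketch. It is not the code from this patch: `dispatch_result` and `dispatch_from_reservation` are hypothetical names introduced only for illustration, and the model ignores how the leftover is actually re-reserved.

```cpp
#include <cstdint>

// Hypothetical model of one matured token reservation (not Seastar's API).
struct dispatch_result {
    std::uint64_t dispatched;   // tokens spent on the request that actually went out
    std::uint64_t rolled_over;  // tokens left unused, to be carried into the next reservation
};

// The reservation is not tied to a particular request: when it matures, its
// tokens go to whichever request has the highest priority at that moment.
constexpr dispatch_result dispatch_from_reservation(std::uint64_t reserved,
                                                    std::uint64_t top_request_cost) {
    return top_request_cost <= reserved
        ? dispatch_result{top_request_cost, reserved - top_request_cost}
        : dispatch_result{0, reserved};  // not enough yet: everything rolls over
}
```

With `reserved = 1500000` and `top_request_cost = 100000`, the sketch dispatches 100k and rolls over the remaining 1.4M, matching the numbers in the comment.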
Mind the following. All shards together cannot grab more than the full token-bucket limit, so there's a natural limit on the number of tokens a shard can get. E.g. here's how the token bucket is configured for the io-properties that I have:
And the request costs can grow as large as:
so you can charge at most 9.6 128k reads against that limit for the whole node. That's not much.
OK, let's consider some simple io-tester job:
What would the result be with this PR?
So you reserve capacity for a large request in several grabs. There's one thing that bothers me. Below is a very simplified example that tries to demonstrate it. There are 2 shards, a 10-token limit, and requests cost 6 tokens each. Here's how it will move:
Here, the disk gets one request in the middle of the timeline and another request at the end of it. Now let's grab tokens in batches, with a per-tick threshold of 5 tokens:
Here, the disk idles until the end of the timeline, then gets two requests in one go. Effectively, we redistributed the load by making it smaller (down to idle) and then compensating for the idleness after the next replenish took place (with 2x the load). It's not necessarily a problem, as shards don't always line up as in the former example. However, in the current implementation rovers serve two purposes -- accounting for the available and consumed tokens, and for a queue of requests (
The same thing, btw, happens with the patch from #2591 :(
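The two timelines described above can be approximated with a toy model. This is only a sketch under stated assumptions, not Seastar's `fair_queue`: tokens are replenished at 1 per tick, reservations are served strictly in the order they were made, both shards grab round-robin, a shard holds on to partial grabs until its request is fully covered, and the 10-token bucket limit is not modeled. `dispatch_ticks` and `simulate_hoarding` are hypothetical names.

```cpp
#include <cstdint>

// Tick at which each shard can dispatch its request in the toy model.
struct dispatch_ticks {
    std::uint64_t shard0;
    std::uint64_t shard1;
};

// Two shards each need `cost` tokens and grab them round-robin in chunks of at
// most `batch` tokens, keeping partial grabs until the request is covered.
constexpr dispatch_ticks simulate_hoarding(std::uint64_t cost, std::uint64_t batch) {
    if (batch == 0) {
        batch = cost;  // treat "no batching" as grabbing the whole request at once
    }
    std::uint64_t head = 0;               // total tokens promised so far
    std::uint64_t need[2] = {cost, cost}; // tokens each shard still has to grab
    std::uint64_t ready_at[2] = {0, 0};   // head position after each shard's last grab
    while (need[0] > 0 || need[1] > 0) {
        for (int s = 0; s < 2; ++s) {
            if (need[s] == 0) {
                continue;
            }
            const std::uint64_t chunk = need[s] < batch ? need[s] : batch;
            head += chunk;
            need[s] -= chunk;
            // At 1 token per tick, the grab is backed by real tokens at tick `head`.
            ready_at[s] = head;
        }
    }
    return dispatch_ticks{ready_at[0], ready_at[1]};
}
```

Under these assumptions, `simulate_hoarding(6, 6)` dispatches at ticks 6 and 12 (one request mid-timeline, one at the end), while `simulate_hoarding(6, 5)` dispatches at ticks 11 and 12 (the disk idles, then gets both requests almost together).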
@xemul Not quite. The patch anticipates this, and does something different. The important part is: we avoid "hoarding" tokens. If we successfully grabbed some tokens, but they aren't enough to fulfill the highest-priority request, we don't keep the tokens and wait until we grab the necessary remainder. Instead, we "roll over" the tokens by releasing them back to the bucket and immediately making a bigger combined reservation. So in your example, if shard 0 calls
So a shard never "hoards" allocated-but-not-dispatched tokens for more than one poll period. If it sees that it can't dispatch in this grab cycle, it immediately hands over the tokens to the next in the queue, and makes up for it in the next cycle. So the first actually dispatchable request will be dispatched as soon as there are enough tokens in the bucket and all shards have done one poll cycle to hand over their tokens to the dispatchable request. So dispatching isn't delayed to the end of the timeline; it's delayed by at most one poll cycle from the optimal dispatch point.
I need to check the final version of the patch, this explanation is not clear. First, please clarify what "not enough" means. Assume in my example shard-0 first tries to grab 5 tokens. That's not enough, right? But why does it grab 5 tokens if it knows that it will need 6? Or does it grab 6 from the very beginning?
@xemul Here's an io_properties.yaml:
disks:
  - mountpoint: /home
    read_bandwidth: 1542559872
    read_iops: 218786
    write_bandwidth: 1130867072
    write_iops: 121499
conf.yaml:
- name: tablet-streaming
  data_size: 1GB
  shards: all
  type: seqread
  shard_info:
    parallelism: 50
    reqsize: 128kB
    shares: 200
- name: cassandra-stress
  shards: all
  type: randread
  data_size: 1GB
  shard_info:
    parallelism: 100
    reqsize: 1536
    shares: 1000
    rps: 50
    pause_distribution: poisson
- name: cassandra-stress-slight-imbalance
  shards: [0]
  type: randread
  data_size: 1GB
  shard_info:
    parallelism: 100
    reqsize: 1536
    class: cassandra-stress
    rps: 5
    pause_distribution: poisson
Note: this describes a workload which makes 5000 small (1.5 kiB) high-priority reads per second per shard, and wants to use all spare capacity for a batch workload with low shares and large (128 kiB) request sizes. The disk can take 25k small requests per second per shard, so the high-priority part demands about 20% of the bandwidth. With those shares, it should be guaranteed ~83% of the bandwidth, so 20% should be no problem. Shard 0 is given a slightly bigger high-priority load (5500 requests/s instead of 5000 requests/s) just to ensure that it's always the bottleneck shard, to make results easier to compare. It's not strictly necessary for the problem to occur.
Shard 0 before this PR. (I.e. Seastar master, except with the local preemption token loss fix from #2591 (comment) applied. Without that fix, it's naturally even worse).
Shard 0 after this PR:
(All other shards have latency better than shard 0. Disk bandwidth is saturated in both cases.)
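The figures in the note on conf.yaml can be re-derived with a few lines of arithmetic. The sketch below assumes that `rps` applies to each of the `parallelism` fibers (which is consistent with the note's 5000 requests/s for `parallelism: 100`, `rps: 50`) and that a class's guaranteed fraction is its shares divided by the sum of competing shares. The helper names are hypothetical.

```cpp
#include <cstdint>

// Per-shard request rate of a class, assuming each of `parallelism` fibers
// issues `rps` requests per second.
constexpr std::uint64_t requests_per_second(std::uint64_t parallelism, std::uint64_t rps) {
    return parallelism * rps;
}

// Integer percentage of `part` in `whole`, rounded down (0 if `whole` is 0).
constexpr std::uint64_t percent_of(std::uint64_t part, std::uint64_t whole) {
    return whole == 0 ? 0 : 100 * part / whole;
}

// Share of capacity guaranteed to a class competing with one other class.
constexpr std::uint64_t guaranteed_percent(std::uint64_t shares, std::uint64_t other_shares) {
    return percent_of(shares, shares + other_shares);
}
```

For the job above: `requests_per_second(100, 50)` is 5000, adding `requests_per_second(100, 5)` on shard 0 gives 5500, `percent_of(5000, 25000)` is 20, and `guaranteed_percent(1000, 200)` is 83.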
By the way, this thing:
is caused by yet another scheduler bug. The
(Proof, with
- name: filler
  data_size: 1GB
  shards: all
  type: seqread
  shard_info:
    parallelism: 10
    reqsize: 128kB
    shares: 10
- name: bursty_lowprio
  data_size: 1GB
  shards: all
  type: seqread
  shard_info:
    parallelism: 1
    reqsize: 128kB
    shares: 100
    batch: 50
    rps: 8
- name: highprio
  shards: all
  type: randread
  data_size: 1GB
  shard_info:
    parallelism: 100
    reqsize: 1536
    shares: 1000
    rps: 50
  options:
    pause_distribution: poisson
    sleep_type: steady
Result:
(Note how the total bandwidth consumption of
How (and where) should I try it? On my local node with master:
with "my suggested preemption fixlet"
this PR
No difference at all. My local disk is naturally faster than your io-properties.yaml
but using it makes no difference either
I also tested this PR with job from th
So I would suggest refraining from calling this PR "cross-shard fairness", because it's not. "Cross-shard fairness" historically refers to a different effect in several other PRs/issues/docs, so using the same wording here creates unwanted confusion.
Closing in favor of #2616
Refs #1083. This is a dirty attempt to fix the lack of cross-shard fairness.
This draft was only created as an anchor for some comments posted in a different thread. Please don't review it (at least yet).