forked from scylladb/seastar
-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Downstream steal and TCP buf fixes #148
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
reactor::_total_sleep, a std::chrono::duration, was left uninitialized in the constructor, potentially leading to wild values. (cherry picked from commit 1c930c4)
Conceptually at all wall-clock moments the reactor is in one of three states: awake-busy awake-idle sleeping Roughly speaking awake-busy is when the reactor has tasks or run or IO to submit or reap, or there are waiting messages for this shard on the cross-shard queues. This is the time counted as utilized in "reactor utilization". awake-idle means that the reactor is still running (or wanting to run on the CPU), but does not have any work to do _now_, rather it spends its time polling the cross-shard queues for new work, seeing if IO has arrived to reap etc. From the OS point of view it is still demanding CPU time just like the awake-busy state, but in the internal reactor accounting we know it is not making forward progress on work, rather just waiting for work to arrive. sleeping - after some time in the awake-idle state and if certain other conditions are met, the reactor may actually go to sleep. In this state is uses an OS blocking call like io_getevents with a timeout to deschedule itself from the CPU and be woken only when additional work arrives (which will be detected somehow by the OS blocking mechanism). In this state the reactor is not using additional CPU. Having a clear view of these states in the reactor metrics is useful and this change adds all three metrics with names similar to the above. (cherry picked from commit 4c310e8)
We track sleep time by taking timestamps before and after we start and stop "sleep". This accounting is key for things like steal time, which uses sleep time to subtract from real time to determine the time the reactor wanted to run. Currently we take the start timestamp far too early, and also in many cases where we do not actually sleep: we take it before we "reap events + maybe sleep" but in many cases we do not sleep, yet count it as sleep time. So we can do significant work in time which is being recorded as "sleep". On specific effect of this is that steal time can be negative, because of the way steal time is calculated: inflated sleep time leads to CPU time used (rusage) to be greater than reactor awake time (wall clock based). After this change, steal time can still be slightly negative but the effect is reduced by 1 or 2 orders of magnitude. (cherry picked from commit 0be77de)
Six years ago this function was added/emptied but it hasn't had anything added to give it a purpose in life so it seems like a good time to remove it. (cherry picked from commit 776fa65)
total_steal_time() can decrease in value from call to call, a violation of the rule that a counter metric must never decrease. This happens because steal time is "awake time (wall clock)" minus "cpu thread time (get_rusage style CPU time)". The awake time is itself composed of "time since reactor start" minus "accumulated sleep time", but it can be under-counted because sleep time is over-counted: as it's not possible to determine exactly the true sleep time, only get timestamps before and after a period you think might involve a sleep. Currently, sleep is even more significantly over-counted than the error described above as it is measured at a point which includes significant non-sleep work. The result is that when there is little to no true steal, CPU time will exceed measured awake wall clock time, resulting in negative steal. This change "fixes" this by enforcing that steal time is monotonically increasing. This occurs at measurement time by checking if "true steal" (i.e., the old definition of steal) has increased since the last measurement and adding that delta to our monotonic steal counter if so. Otherwise the delta is dropped. While not totally ideal this leads to a useful metric which mostly clamps away the error related to negative steal times, and more importantly avoids the catastrophic failure of PromQL functions when used on non-monotonic functions. Fixes scylladb#1521. Fixes scylladb#2374. (cherry picked from commit 8efeeb0)
We add two options to set the recv and send (SO_RCVBUF, ...) buffer sizes on a listening socket (server_socket). This is mostly useful to propagate said sizes to all sockets returned by accept(). It is already possible to set the socket option directly on the connected socket after it returned by accept() but experimentally this results in a socket with the specified buffer size but whose receive window will not be advertised to the client beyond the default (64K for current typical kernel defaults). So you get only some of the benefit of the larger buffer. Setting the buffer size on the listening socket, however, is mentioned as the correct approach in tcp(7) and does not suffer from the same limitation. A test is included which checks that the mechanism, including the inheritance, works. Closes scylladb#2458 (cherry picked from commit 4cb7f8e)
860747e
to
4663e75
Compare
travisdowns
approved these changes
Oct 23, 2024
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Downstream: