Downstream steal and TCP buf fixes #148

Merged
merged 6 commits into v24.3.x from stephan/downstream-234234 on Oct 23, 2024

Conversation

reactor::_total_sleep, a std::chrono::duration, was left uninitialized
in the constructor, potentially leading to wild values.
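A minimal sketch of the kind of fix involved (the struct and member here are illustrative, not the actual reactor class): value-initializing the duration member makes it start at zero instead of holding an indeterminate value.

```cpp
#include <chrono>

// Illustrative only: std::chrono::duration has a defaulted default
// constructor, so without an explicit initializer the member below would
// hold an indeterminate ("wild") value.
struct reactor_sketch {
    std::chrono::steady_clock::duration _total_sleep{}; // value-initialized to zero
};
```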

(cherry picked from commit 1c930c4)
Conceptually, at every wall-clock moment the reactor is in one of three
states:

awake-busy
awake-idle
sleeping

Roughly speaking, awake-busy is when the reactor has tasks to run, IO
to submit or reap, or waiting messages for this shard on the
cross-shard queues. This is the time counted as utilized in
"reactor utilization".

awake-idle means that the reactor is still running (or wanting to run
on the CPU), but does not have any work to do _now_; rather, it spends
its time polling the cross-shard queues for new work, seeing if IO has
arrived to reap, etc. From the OS point of view it is still demanding
CPU time just like in the awake-busy state, but in the internal reactor
accounting we know it is not making forward progress on work, just
waiting for work to arrive.

sleeping - after some time in the awake-idle state, and if certain
other conditions are met, the reactor may actually go to sleep. In this
state it uses an OS blocking call like io_getevents with a timeout to
deschedule itself from the CPU and is woken only when additional work
arrives (which will be detected by the OS blocking mechanism).
In this state the reactor is not using additional CPU.

Having a clear view of these states in the reactor metrics is useful,
so this change adds all three metrics, with names similar to the above.
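As a rough illustration of the accounting (the names here are made up for this sketch and are not the actual Seastar metric names), every wall-clock interval is attributed to exactly one of the three states:

```cpp
#include <chrono>

// Hypothetical counters, one per reactor state described above.
struct reactor_state_counters {
    using clock = std::chrono::steady_clock;
    clock::duration awake_busy{}; // running tasks, submitting/reaping IO, draining cross-shard queues
    clock::duration awake_idle{}; // on-CPU but only polling for new work
    clock::duration sleeping{};   // blocked in the OS, e.g. io_getevents() with a timeout
    // Intended invariant: awake_busy + awake_idle + sleeping ~= wall-clock time since reactor start.
};
```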

(cherry picked from commit 4c310e8)
We track sleep time by taking timestamps before and after we start and
stop "sleep". This accounting is key for things like steal time, which
subtracts sleep time from wall-clock time to determine how long the
reactor wanted to run.

Currently we take the start timestamp far too early, and also in many
cases where we do not actually sleep: we take it before the "reap
events + maybe sleep" step, yet often we do not sleep at all and still
count the elapsed time as sleep. So significant work can be done in
time that is recorded as "sleep".

One specific effect of this is that steal time can be negative, because
of the way steal time is calculated: inflated sleep time leads to
CPU time used (rusage) being greater than reactor awake time (wall-clock
based).

After this change, steal time can still be slightly negative but the
effect is reduced by 1 or 2 orders of magnitude.
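A sketch of the idea behind the fix (the helper functions are hypothetical stand-ins, not the real reactor code): only the blocking call itself is bracketed by the sleep timestamps, so work done while reaping events is no longer recorded as sleep.

```cpp
#include <chrono>
#include <thread>

using sclock = std::chrono::steady_clock;

// Hypothetical stand-ins for the real reactor steps.
void reap_completed_io() { /* real work: completions, cross-shard messages, ... */ }
void block_in_kernel_with_timeout() {
    // Models an OS blocking call such as io_getevents() with a timeout.
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
}

void reap_events_and_maybe_sleep(bool should_sleep, sclock::duration& total_sleep) {
    reap_completed_io(); // must not be counted as sleep
    if (should_sleep) {
        auto before = sclock::now();           // start timestamp taken right before blocking
        block_in_kernel_with_timeout();
        total_sleep += sclock::now() - before; // only the blocked interval is accumulated
    }
}
```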

(cherry picked from commit 0be77de)
Six years ago this function was added/emptied, but nothing has been
added since to give it a purpose in life, so it seems like a good time
to remove it.

(cherry picked from commit 776fa65)
total_steal_time() can decrease in value from call to call, a violation
of the rule that a counter metric must never decrease.

This happens because steal time is "awake time (wall clock)" minus
"cpu thread time (getrusage-style CPU time)". The awake time is itself
"time since reactor start" minus "accumulated sleep time", but it can
be under-counted because sleep time is over-counted: it is not possible
to determine the true sleep time exactly, only to take timestamps
before and after a period you think might involve a sleep.

Currently, sleep is over-counted even more than the inherent error
described above, because it is measured over a window which includes
significant non-sleep work.

The result is that when there is little to no true steal, CPU time will
exceed measured awake wall clock time, resulting in negative steal.

This change "fixes" this by enforcing that steal time is monotonically
increasing. This occurs at measurement time by checking if "true steal"
(i.e., the old definition of steal) has increased since the last
measurement and adding that delta to our monotonic steal counter if so.
Otherwise the delta is dropped.

While not totally ideal, this leads to a useful metric which mostly
clamps away the error related to negative steal times and, more
importantly, avoids the catastrophic failure of PromQL functions when
applied to non-monotonic counters.
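A sketch of the clamping logic described above, with illustrative names (not the actual Seastar members): only positive increments of "true steal" are folded into the reported counter.

```cpp
#include <algorithm>
#include <chrono>

struct steal_time_tracker {
    using dur = std::chrono::steady_clock::duration;
    dur _last_true_steal{};
    dur _monotonic_steal{};

    // awake = wall-clock time since reactor start minus accumulated sleep;
    // cpu   = thread CPU time (rusage-style).
    dur sample(dur awake, dur cpu) {
        dur true_steal = awake - cpu; // may shrink when sleep is over-counted
        _monotonic_steal += std::max(true_steal - _last_true_steal, dur::zero());
        _last_true_steal = true_steal;
        return _monotonic_steal;      // never decreases
    }
};
```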

Fixes scylladb#1521.
Fixes scylladb#2374.

(cherry picked from commit 8efeeb0)
We add two options to set the recv and send (SO_RCVBUF, ...) buffer
sizes on a listening socket (server_socket). This is mostly useful to
propagate said sizes to all sockets returned by accept().

It is already possible to set the socket option directly on the
connected socket after it is returned by accept(), but experimentally
this results in a socket with the specified buffer size whose receive
window is nevertheless not advertised to the client beyond the default
(64K for typical current kernel defaults). So you get only some of the
benefit of the larger buffer.

Setting the buffer size on the listening socket, however, is mentioned
as the correct approach in tcp(7) and does not suffer from the same
limitation.
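For reference, the underlying mechanism in plain POSIX terms (this is not the new Seastar option itself, just a sketch of what tcp(7) describes): the buffer sizes are set on the listening socket before listen(), and sockets returned by accept() inherit them.

```cpp
#include <arpa/inet.h>
#include <cstdint>
#include <netinet/in.h>
#include <sys/socket.h>

// Sketch: create a listening socket whose accepted connections inherit the
// requested buffer sizes (error handling omitted for brevity).
int make_listener(uint16_t port, int rcvbuf_bytes, int sndbuf_bytes) {
    int fd = ::socket(AF_INET, SOCK_STREAM, 0);
    // Per tcp(7), buffer sizes must be set before listen()/connect() to
    // influence the advertised TCP window.
    ::setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf_bytes, sizeof(rcvbuf_bytes));
    ::setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf_bytes, sizeof(sndbuf_bytes));

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    ::bind(fd, reinterpret_cast<const sockaddr*>(&addr), sizeof(addr));
    ::listen(fd, 128);
    return fd;
}
```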

A test is included which checks that the mechanism, including the
inheritance, works.

Closes scylladb#2458

(cherry picked from commit 4cb7f8e)
StephanDollberg force-pushed the stephan/downstream-234234 branch from 860747e to 4663e75 on October 23, 2024 10:38
StephanDollberg merged commit 7820b86 into v24.3.x on Oct 23, 2024
0 of 14 checks passed