Downstream steal and TCP buf fixes #148

Merged
merged 6 commits into v24.3.x from stephan/downstream-234234 on Oct 23, 2024

Conversation

reactor::_total_sleep, a std::chrono::duration, was left uninitialized
in the constructor, potentially leading to wild values.
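A minimal sketch of the kind of fix involved (the struct and member here are illustrative, not the actual reactor class): value-initializing the duration member makes it start at zero instead of holding an indeterminate value.

```cpp
#include <chrono>

// Illustrative only: std::chrono::duration has a defaulted default
// constructor, so without an explicit initializer the member below would
// hold an indeterminate ("wild") value.
struct reactor_sketch {
    std::chrono::steady_clock::duration _total_sleep{}; // value-initialized to zero
};
```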

(cherry picked from commit 1c930c4)
Conceptually, at every wall-clock moment the reactor is in one of three
states:

awake-busy
awake-idle
sleeping

Roughly speaking, awake-busy is when the reactor has tasks to run, IO
to submit or reap, or waiting messages for this shard on the
cross-shard queues. This is the time counted as utilized in
"reactor utilization".

awake-idle means that the reactor is still running (or wanting to run
on the CPU), but does not have any work to do _now_; rather, it spends
its time polling the cross-shard queues for new work, seeing if IO has
arrived to reap, etc. From the OS point of view it is still demanding
CPU time just like in the awake-busy state, but in the internal reactor
accounting we know it is not making forward progress on work, just
waiting for work to arrive.

sleeping - after some time in the awake-idle state, and if certain
other conditions are met, the reactor may actually go to sleep. In this
state it uses an OS blocking call like io_getevents with a timeout to
deschedule itself from the CPU and is woken only when additional work
arrives (which will be detected by the OS blocking mechanism).
In this state the reactor is not using additional CPU.

Having a clear view of these states in the reactor metrics is useful,
so this change adds all three metrics, with names similar to the above.
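As a rough illustration of the accounting (the names here are made up for this sketch and are not the actual Seastar metric names), every wall-clock interval is attributed to exactly one of the three states:

```cpp
#include <chrono>

// Hypothetical counters, one per reactor state described above.
struct reactor_state_counters {
    using clock = std::chrono::steady_clock;
    clock::duration awake_busy{}; // running tasks, submitting/reaping IO, draining cross-shard queues
    clock::duration awake_idle{}; // on-CPU but only polling for new work
    clock::duration sleeping{};   // blocked in the OS, e.g. io_getevents() with a timeout
    // Intended invariant: awake_busy + awake_idle + sleeping ~= wall-clock time since reactor start.
};
```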

(cherry picked from commit 4c310e8)
We track sleep time by taking timestamps before and after we start and
stop "sleep". This accounting is key for things like steal time, which
subtracts sleep time from wall-clock time to determine how long the
reactor wanted to run.

Currently we take the start timestamp far too early, and also in many
cases where we do not actually sleep: we take it before the "reap
events + maybe sleep" step, yet often we do not sleep at all and still
count the elapsed time as sleep. So significant work can be done in
time that is recorded as "sleep".

One specific effect of this is that steal time can be negative, because
of the way steal time is calculated: inflated sleep time leads to
CPU time used (rusage) being greater than reactor awake time (wall-clock
based).

After this change, steal time can still be slightly negative but the
effect is reduced by 1 or 2 orders of magnitude.
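A sketch of the idea behind the fix (the helper functions are hypothetical stand-ins, not the real reactor code): only the blocking call itself is bracketed by the sleep timestamps, so work done while reaping events is no longer recorded as sleep.

```cpp
#include <chrono>
#include <thread>

using sclock = std::chrono::steady_clock;

// Hypothetical stand-ins for the real reactor steps.
void reap_completed_io() { /* real work: completions, cross-shard messages, ... */ }
void block_in_kernel_with_timeout() {
    // Models an OS blocking call such as io_getevents() with a timeout.
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
}

void reap_events_and_maybe_sleep(bool should_sleep, sclock::duration& total_sleep) {
    reap_completed_io(); // must not be counted as sleep
    if (should_sleep) {
        auto before = sclock::now();           // start timestamp taken right before blocking
        block_in_kernel_with_timeout();
        total_sleep += sclock::now() - before; // only the blocked interval is accumulated
    }
}
```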

(cherry picked from commit 0be77de)
Six years ago this function was added/emptied, but nothing has been
added since to give it a purpose in life, so it seems like a good time
to remove it.

(cherry picked from commit 776fa65)
total_steal_time() can decrease in value from call to call, a violation
of the rule that a counter metric must never decrease.

This happens because steal time is "awake time (wall clock)" minus
"cpu thread time (getrusage-style CPU time)". The awake time is itself
"time since reactor start" minus "accumulated sleep time", but it can
be under-counted because sleep time is over-counted: it is not possible
to determine the true sleep time exactly, only to take timestamps
before and after a period you think might involve a sleep.

Currently, sleep is over-counted even more than the inherent error
described above, because it is measured over a window which includes
significant non-sleep work.

The result is that when there is little to no true steal, CPU time will
exceed measured awake wall clock time, resulting in negative steal.

This change "fixes" this by enforcing that steal time is monotonically
increasing. This occurs at measurement time by checking if "true steal"
(i.e., the old definition of steal) has increased since the last
measurement and adding that delta to our monotonic steal counter if so.
Otherwise the delta is dropped.

While not totally ideal, this leads to a useful metric which mostly
clamps away the error related to negative steal times and, more
importantly, avoids the catastrophic failure of PromQL functions when
applied to non-monotonic counters.
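A sketch of the clamping logic described above, with illustrative names (not the actual Seastar members): only positive increments of "true steal" are folded into the reported counter.

```cpp
#include <algorithm>
#include <chrono>

struct steal_time_tracker {
    using dur = std::chrono::steady_clock::duration;
    dur _last_true_steal{};
    dur _monotonic_steal{};

    // awake = wall-clock time since reactor start minus accumulated sleep;
    // cpu   = thread CPU time (rusage-style).
    dur sample(dur awake, dur cpu) {
        dur true_steal = awake - cpu; // may shrink when sleep is over-counted
        _monotonic_steal += std::max(true_steal - _last_true_steal, dur::zero());
        _last_true_steal = true_steal;
        return _monotonic_steal;      // never decreases
    }
};
```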

Fixes scylladb#1521.
Fixes scylladb#2374.

(cherry picked from commit 8efeeb0)
We add two options to set the recv and send (SO_RCVBUF, ...) buffer
sizes on a listening socket (server_socket). This is mostly useful to
propagate said sizes to all sockets returned by accept().

It is already possible to set the socket option directly on the
connected socket after it is returned by accept(), but experimentally
this results in a socket with the specified buffer size whose receive
window is nevertheless not advertised to the client beyond the default
(64K for typical current kernel defaults). So you get only some of the
benefit of the larger buffer.

Setting the buffer size on the listening socket, however, is mentioned
as the correct approach in tcp(7) and does not suffer from the same
limitation.
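For reference, the underlying mechanism in plain POSIX terms (this is not the new Seastar option itself, just a sketch of what tcp(7) describes): the buffer sizes are set on the listening socket before listen(), and sockets returned by accept() inherit them.

```cpp
#include <arpa/inet.h>
#include <cstdint>
#include <netinet/in.h>
#include <sys/socket.h>

// Sketch: create a listening socket whose accepted connections inherit the
// requested buffer sizes (error handling omitted for brevity).
int make_listener(uint16_t port, int rcvbuf_bytes, int sndbuf_bytes) {
    int fd = ::socket(AF_INET, SOCK_STREAM, 0);
    // Per tcp(7), buffer sizes must be set before listen()/connect() to
    // influence the advertised TCP window.
    ::setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf_bytes, sizeof(rcvbuf_bytes));
    ::setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf_bytes, sizeof(sndbuf_bytes));

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    ::bind(fd, reinterpret_cast<const sockaddr*>(&addr), sizeof(addr));
    ::listen(fd, 128);
    return fd;
}
```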

A test is included which checks that the mechanism, including the
inheritance, works.

Closes scylladb#2458

(cherry picked from commit 4cb7f8e)
StephanDollberg force-pushed the stephan/downstream-234234 branch from 860747e to 4663e75 on October 23, 2024 10:38
StephanDollberg merged commit 7820b86 into v24.3.x on Oct 23, 2024
0 of 14 checks passed