servstate: test stability improvements #266

flotter · 2023-07-28T15:19:03Z

servstate: backoff svcs should be stopped on exit

The default behaviour today for a service that exits is that Pebble will
auto-restart the service. This process has a built in back-off strategy
which means that on exit a timer is set to restart the service after
the back-off period expires.

The way to prevent a service in back-off to restart is to explicitly stop
the service (this case is covered in a unit test).

When the Pebble daemon starts a graceful exit, it has to first stop all
running services by requesting this explicitly, and waiting for the change
request to complete.

However, services currently scheduled for restart (back-off) is not currently
stopped through this request. This is an issue because the restart mechanism
relies on the reaper package, which is stopped straight after running services
are stopped.

This means any service in back-off can find itself in one of the following
situations:

The most common case is that the service restart thread is killed by the
garbage collector on process exit, so this appears clean.
The second scenario is the service restarts between the services stop
request and the reaper kill request, in which case the service will
continue to run after the Pebble process is dead (of course this is
statistically very unlikely to happen, but possible).
The last scenario is that the service restart after the reaper is killed,
but before the Pebble process exits. This will result in the "reaper not
started" panic we often see in unit tests (this is also a statistically
unlikely scenario, but possible).

This PR modifies the servstate function used for terminating all services
during shutdown (only) to also explicitly stop services in back-off.

servstate: unit test teardown must stop svcs

Currently running unit test services are overlapping each other as there
is no explicit services stop on test tear-down. This is not directly a
problem although it makes interpreting the unit test output impossible,
as the servstate messages you see during the test also includes output
from service events of previous tests.

There is however one real issue: services that restart after a
unit test exit (due to expiring back-off), requires the original
test reaper instance to start and wait on the service. The reaper is
correctly destroyed during test teardown, so as a result we constantly
get error messages like:

panic: reaper not started

This PR includes the following work:

Reorganise the test setup and tear-down logic to allow consistent
management of servstate manager instances (some tests use a default
setup template, while others create a test specific instance).
Enforce clean service shutdown (running and back-off) during test
teardown.
Arrange the unit tests to be first in the file, with helper
functions at the end, to allow for easy reading.

servstate: remove test races on svcs file output

Unit Tests for servstate is extremely timing sensitive. If you run the unit
tests under heavy CPU load, changes are very good you will start to see
various failures. The easiest way to demonstrate this is to use the stress
utility to simulate heavy CPU load (stress --io 12 --cpu 12).

Unit Test output snippet:

START: manager_test.go:828: S.TestEnvironment

... obtained string = ""
... expected string = "" +
...     "PEBBLE_ENV_TEST_1=foo\n" +
...     "PEBBLE_ENV_TEST_2=bar bazz\n" +
...     "PEBBLE_ENV_TEST_PARENT=from-parent\n"

FAIL: manager_test.go:828: S.TestEnvironment

This specific race condition is caused because unit tests often make
assumptions on how long it takes from the moment the service is started
until the moment the service command completes.

This PR will only address this very specific kind of race, and hopefully
in the process provide a solution that will be used in future tests.

Please note that using the stress utility on tests from this package will
still trigger lots of remaining races caused by very timing sensitive tests.

servstate: unify log assert and done check

The startTestServices() test helper uses a special entry in the
service command under test to write the service standard output
also to a log file that can be inspected.

This mechanism suffers from a race condition (as highlighted in
#264) because when the
content of the log file is loaded, the service may not yet have
completed writing to the log.

Since standard output is also verified separately through a
different mechanism, the follservstate: unify log assert and done check

The startTestServices() test helper uses a special entry in the
service command under test to write the service standard output
also to a log file that can be inspected.

This mechanism suffers from a race condition (as highlighted in
#264) because when the
content of the log file is loaded, the service may not yet have
completed writing to the log.

Since standard output is also verified separately through a
different mechanism, the following changes are made:

Enhance the global "done check" (previous called "done file") to
check completion per service. This effectively adds the
capability previously provided by the log assert mechanism.
Use the existing "done check" mechanism to wait until the service
command-line is complete up to the point of the check.
Only now verify the stdout buffer content as checked previously.
Remove the log mechanism all together.

benhoyt

This is great, thanks for the TLC, @flotter! I'd been thinking of this kind of cleanup to this test file for a while, but had never gotten to it.

I've added a bunch of comments, but they're all nits / minor.

One thing I wondered about was whether S should have the fields like manager at all. Maybe it'd be clearer if the setup method just returned the manager, so everything is scoped to just the test at hand. I'm guessing it's a bit awkward because then you have to pass around the log buffer and state and runner to the relevant funcs. But I'm interested in your take. I'm happy leaving that too (forever or for future work) -- it's fine to make incremental progress.

internals/overlord/servstate/manager.go

internals/overlord/servstate/manager_test.go

paul-rodriguez

Reaper vs stopRunningServices race?

internals/overlord/servstate/manager_test.go

internals/overlord/servstate/manager.go

The default behaviour today for a service that exits is that Pebble will auto-restart the service. This process has a built in back-off strategy which means that on exit a timer is set to restart the service after the back-off period expires. The way to prevent a service in back-off to restart is to explicitly stop the service (this case is covered in a unit test). When the Pebble daemon starts a graceful exit, it has to first stop all running services by requesting this explicitly, and waiting for the change request to complete. However, services currently scheduled for restart (back-off) is not currently stopped through this request. This is an issue because the restart mechanism relies on the reaper package, which is stopped straight after running services are stopped. This means any service in back-off can find itself in one of the following situations: 1. The most common case is that the service restart thread is killed by the garbage collector on process exit, so this appears clean. 2. The second scenario is the service restarts between the services stop request and the reaper kill request, in which case the service will continue to run after the Pebble process is dead (of course this is statistically very unlikely to happen, but possible). 3. The last scenario is that the service restart after the reaper is killed, but before the Pebble process exits. This will result in the "reaper not started" panic we often see in unit tests (this is also a statistically unlikely scenario, but possible). This PR modifies the servstate function used for terminating all services during shutdown (only) to also explicitly stop services in back-off.

Currently running unit test services are overlapping each other as there is no explicit services stop on test tear-down. This is not directly a problem although it makes interpreting the unit test output impossible, as the servstate messages you see during the test also includes output from service events of previous tests. There is however one real issue: services that restart after a unit test exits (due to expiring back-off), requires the original test reaper instance to start and wait on the service. The reaper is correctly destroyed during test teardown, so as a result we constantly get error messages like: panic: reaper not started This PR includes the following work: - Reorganise the test setup and tear-down logic to allow consistent management of servstate manager instances (some tests use a default setup template, while others create a test specific instance). - Enforce clean service shutdown (running and back-off) during test teardown. - Arrange the unit tests to be first in the file, with helper functions at the end, to allow for easy reading.

Unit Tests for servstate is extremely timing sensitive. If you run the unit tests under heavy CPU load, changes are very good you will start to see various failures. The easiest way to demonstrate this is to use the stress utility to simulate heavy CPU load (stress --io 12 --cpu 12). Unit Test output snippet: START: manager_test.go:828: S.TestEnvironment ... obtained string = "" ... expected string = "" + ... "PEBBLE_ENV_TEST_1=foo\n" + ... "PEBBLE_ENV_TEST_2=bar bazz\n" + ... "PEBBLE_ENV_TEST_PARENT=from-parent\n" FAIL: manager_test.go:828: S.TestEnvironment This specific race condition is caused because unit tests often make assumptions on how long it takes from the moment the service is started until the moment the service command completes. This PR will only address this very specific kind of race, and hopefully in the process provide a solution that will be used in future tests. Please note that using the stress utility on tests from this package will still trigger lots of remaining races caused by very timing sensitive tests.

paul-rodriguez · 2023-07-31T16:01:35Z

Thanks for the clarifications ! 🟢

benhoyt

Thanks! This is a great maintainability and reliability improvement.

benhoyt · 2023-08-01T00:43:34Z

@flotter Do you think this is ready to merge? If so, I can go ahead and do that.

The startTestServices() test helper uses a special entry in the service command under test to write the service standard output also to a log file that can be inspected. This mechanism suffers from a race condition (as highlighted in canonical#264) because when the content of the log file is loaded, the service may not yet have completed writing to the log. Since standard output is also verified separately through a different mechanism, the following changes are made: - Enhance the global "done check" (previous called "done file") to check completion per service. This effectively adds the capability previously provided by the log assert mechanism. - Use the existing "done check" mechanism to wait until the service command-line is complete up to the point of the check. - Only now verify the stdout buffer content as checked previously. - Remove the log mechanism all together.

#16066 Mostly includes test fixes and CI changes, but significant user-visible changes are: * Do not inherit HOME and USER env vars from pebble daemon environment (canonical/pebble#262) * servstate: backoff svcs should be stopped on exit (canonical/pebble#266)

flotter requested a review from benhoyt July 28, 2023 15:21

flotter mentioned this pull request Jul 28, 2023

Test race conditions (including cases only seen under heavy cpu load) #181

Closed

flotter requested review from anpep and paul-rodriguez July 28, 2023 15:27

flotter mentioned this pull request Jul 28, 2023

Data race in internals/overlord/servstate:S.startTestServices(c) #264

Closed

benhoyt reviewed Jul 28, 2023

View reviewed changes

paul-rodriguez reviewed Jul 31, 2023

View reviewed changes

internals/overlord/servstate/manager_test.go Show resolved Hide resolved

internals/overlord/servstate/manager.go Show resolved Hide resolved

flotter added 3 commits July 31, 2023 16:40

paul-rodriguez approved these changes Jul 31, 2023

View reviewed changes

flotter force-pushed the servstate-race-fix branch from c6e4f6d to 5566aae Compare July 31, 2023 16:03

benhoyt approved these changes Aug 1, 2023

View reviewed changes

flotter mentioned this pull request Aug 1, 2023

Intermittent failure in servstate tests #224

Closed

This was linked to issues Aug 1, 2023

Intermittent failure in servstate tests #224

Closed

Data race in internals/overlord/servstate:S.startTestServices(c) #264

Closed

jnsgruk merged commit bf8a4d4 into canonical:master Aug 1, 2023
18 checks passed

barrettj12 mentioned this pull request Aug 2, 2023

Intermittent test failures: servstate.TestOnCheckFailure* #270

Closed

benhoyt mentioned this pull request Aug 9, 2023

Update Pebble version to latest juju/juju#16066

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

servstate: test stability improvements #266

servstate: test stability improvements #266

flotter commented Jul 28, 2023 •

edited

Loading

benhoyt left a comment

paul-rodriguez left a comment

paul-rodriguez commented Jul 31, 2023

benhoyt left a comment

benhoyt commented Aug 1, 2023

servstate: test stability improvements #266

servstate: test stability improvements #266

Conversation

flotter commented Jul 28, 2023 • edited Loading

benhoyt left a comment

Choose a reason for hiding this comment

paul-rodriguez left a comment

Choose a reason for hiding this comment

paul-rodriguez commented Jul 31, 2023

benhoyt left a comment

Choose a reason for hiding this comment

benhoyt commented Aug 1, 2023

flotter commented Jul 28, 2023 •

edited

Loading