Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

servstate: test stability improvements #266

Merged
merged 4 commits into from
Aug 1, 2023

Conversation

flotter
Copy link
Contributor

@flotter flotter commented Jul 28, 2023

servstate: backoff svcs should be stopped on exit

The default behaviour today for a service that exits is that Pebble will
auto-restart the service. This process has a built in back-off strategy
which means that on exit a timer is set to restart the service after
the back-off period expires.

The way to prevent a service in back-off to restart is to explicitly stop
the service (this case is covered in a unit test).

When the Pebble daemon starts a graceful exit, it has to first stop all
running services by requesting this explicitly, and waiting for the change
request to complete.

However, services currently scheduled for restart (back-off) is not currently
stopped through this request. This is an issue because the restart mechanism
relies on the reaper package, which is stopped straight after running services
are stopped.

This means any service in back-off can find itself in one of the following
situations:

  1. The most common case is that the service restart thread is killed by the
    garbage collector on process exit, so this appears clean.

  2. The second scenario is the service restarts between the services stop
    request and the reaper kill request, in which case the service will
    continue to run after the Pebble process is dead (of course this is
    statistically very unlikely to happen, but possible).

  3. The last scenario is that the service restart after the reaper is killed,
    but before the Pebble process exits. This will result in the "reaper not
    started" panic we often see in unit tests (this is also a statistically
    unlikely scenario, but possible).

This PR modifies the servstate function used for terminating all services
during shutdown (only) to also explicitly stop services in back-off.

servstate: unit test teardown must stop svcs

Currently running unit test services are overlapping each other as there
is no explicit services stop on test tear-down. This is not directly a
problem although it makes interpreting the unit test output impossible,
as the servstate messages you see during the test also includes output
from service events of previous tests.

There is however one real issue: services that restart after a
unit test exit (due to expiring back-off), requires the original
test reaper instance to start and wait on the service. The reaper is
correctly destroyed during test teardown, so as a result we constantly
get error messages like:

panic: reaper not started

This PR includes the following work:

  • Reorganise the test setup and tear-down logic to allow consistent
    management of servstate manager instances (some tests use a default
    setup template, while others create a test specific instance).

  • Enforce clean service shutdown (running and back-off) during test
    teardown.

  • Arrange the unit tests to be first in the file, with helper
    functions at the end, to allow for easy reading.

servstate: remove test races on svcs file output

Unit Tests for servstate is extremely timing sensitive. If you run the unit
tests under heavy CPU load, changes are very good you will start to see
various failures. The easiest way to demonstrate this is to use the stress
utility to simulate heavy CPU load (stress --io 12 --cpu 12).

Unit Test output snippet:

START: manager_test.go:828: S.TestEnvironment

... obtained string = ""
... expected string = "" +
...     "PEBBLE_ENV_TEST_1=foo\n" +
...     "PEBBLE_ENV_TEST_2=bar bazz\n" +
...     "PEBBLE_ENV_TEST_PARENT=from-parent\n"

FAIL: manager_test.go:828: S.TestEnvironment

This specific race condition is caused because unit tests often make
assumptions on how long it takes from the moment the service is started
until the moment the service command completes.

This PR will only address this very specific kind of race, and hopefully
in the process provide a solution that will be used in future tests.

Please note that using the stress utility on tests from this package will
still trigger lots of remaining races caused by very timing sensitive tests.

servstate: unify log assert and done check

The startTestServices() test helper uses a special entry in the
service command under test to write the service standard output
also to a log file that can be inspected.

This mechanism suffers from a race condition (as highlighted in
#264) because when the
content of the log file is loaded, the service may not yet have
completed writing to the log.

Since standard output is also verified separately through a
different mechanism, the follservstate: unify log assert and done check

The startTestServices() test helper uses a special entry in the
service command under test to write the service standard output
also to a log file that can be inspected.

This mechanism suffers from a race condition (as highlighted in
#264) because when the
content of the log file is loaded, the service may not yet have
completed writing to the log.

Since standard output is also verified separately through a
different mechanism, the following changes are made:

  • Enhance the global "done check" (previous called "done file") to
    check completion per service. This effectively adds the
    capability previously provided by the log assert mechanism.

  • Use the existing "done check" mechanism to wait until the service
    command-line is complete up to the point of the check.

  • Only now verify the stdout buffer content as checked previously.

  • Remove the log mechanism all together.

Copy link
Contributor

@benhoyt benhoyt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, thanks for the TLC, @flotter! I'd been thinking of this kind of cleanup to this test file for a while, but had never gotten to it.

I've added a bunch of comments, but they're all nits / minor.

One thing I wondered about was whether S should have the fields like manager at all. Maybe it'd be clearer if the setup method just returned the manager, so everything is scoped to just the test at hand. I'm guessing it's a bit awkward because then you have to pass around the log buffer and state and runner to the relevant funcs. But I'm interested in your take. I'm happy leaving that too (forever or for future work) -- it's fine to make incremental progress.

internals/overlord/servstate/manager.go Show resolved Hide resolved
internals/overlord/servstate/manager_test.go Outdated Show resolved Hide resolved
internals/overlord/servstate/manager_test.go Outdated Show resolved Hide resolved
internals/overlord/servstate/manager_test.go Outdated Show resolved Hide resolved
internals/overlord/servstate/manager_test.go Outdated Show resolved Hide resolved
internals/overlord/servstate/manager_test.go Show resolved Hide resolved
internals/overlord/servstate/manager_test.go Show resolved Hide resolved
internals/overlord/servstate/manager_test.go Outdated Show resolved Hide resolved
internals/overlord/servstate/manager_test.go Show resolved Hide resolved
internals/overlord/servstate/manager_test.go Outdated Show resolved Hide resolved
Copy link
Contributor

@paul-rodriguez paul-rodriguez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reaper vs stopRunningServices race?

internals/overlord/servstate/manager_test.go Show resolved Hide resolved
internals/overlord/servstate/manager.go Show resolved Hide resolved
The default behaviour today for a service that exits is that Pebble will
auto-restart the service. This process has a built in back-off strategy
which means that on exit a timer is set to restart the service after
the back-off period expires.

The way to prevent a service in back-off to restart is to explicitly stop
the service (this case is covered in a unit test).

When the Pebble daemon starts a graceful exit, it has to first stop all
running services by requesting this explicitly, and waiting for the change
request to complete.

However, services currently scheduled for restart (back-off) is not currently
stopped through this request. This is an issue because the restart mechanism
relies on the reaper package, which is stopped straight after running services
are stopped.

This means any service in back-off can find itself in one of the following
situations:

1. The most common case is that the service restart thread is killed by the
   garbage collector on process exit, so this appears clean.

2. The second scenario is the service restarts between the services stop
   request and the reaper kill request, in which case the service will
   continue to run after the Pebble process is dead (of course this is
   statistically very unlikely to happen, but possible).

3. The last scenario is that the service restart after the reaper is killed,
   but before the Pebble process exits. This will result in the "reaper not
   started" panic we often see in unit tests (this is also a statistically
   unlikely scenario, but possible).

This PR modifies the servstate function used for terminating all services
during shutdown (only) to also explicitly stop services in back-off.
Currently running unit test services are overlapping each other as there
is no explicit services stop on test tear-down. This is not directly a
problem although it makes interpreting the unit test output impossible,
as the servstate messages you see during the test also includes output
from service events of previous tests.

There is however one real issue: services that restart after a
unit test exits (due to expiring back-off), requires the original
test reaper instance to start and wait on the service. The reaper is
correctly destroyed during test teardown, so as a result we constantly
get error messages like:

panic: reaper not started

This PR includes the following work:

- Reorganise the test setup and tear-down logic to allow consistent
  management of servstate manager instances (some tests use a default
  setup template, while others create a test specific instance).

- Enforce clean service shutdown (running and back-off) during test
  teardown.

- Arrange the unit tests to be first in the file, with helper
  functions at the end, to allow for easy reading.
Unit Tests for servstate is extremely timing sensitive. If you run the unit
tests under heavy CPU load, changes are very good you will start to see
various failures. The easiest way to demonstrate this is to use the stress
utility to simulate heavy CPU load (stress --io 12 --cpu 12).

Unit Test output snippet:

START: manager_test.go:828: S.TestEnvironment

... obtained string = ""
... expected string = "" +
...     "PEBBLE_ENV_TEST_1=foo\n" +
...     "PEBBLE_ENV_TEST_2=bar bazz\n" +
...     "PEBBLE_ENV_TEST_PARENT=from-parent\n"

FAIL: manager_test.go:828: S.TestEnvironment

This specific race condition is caused because unit tests often make
assumptions on how long it takes from the moment the service is started
until the moment the service command completes.

This PR will only address this very specific kind of race, and hopefully
in the process provide a solution that will be used in future tests.

Please note that using the stress utility on tests from this package will
still trigger lots of remaining races caused by very timing sensitive tests.
@paul-rodriguez
Copy link
Contributor

Thanks for the clarifications ! 🟢

Copy link
Contributor

@benhoyt benhoyt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! This is a great maintainability and reliability improvement.

@benhoyt
Copy link
Contributor

benhoyt commented Aug 1, 2023

@flotter Do you think this is ready to merge? If so, I can go ahead and do that.

The startTestServices() test helper uses a special entry in the
service command under test to write the service standard output
also to a log file that can be inspected.

This mechanism suffers from a race condition (as highlighted in
canonical#264) because when the
content of the log file is loaded, the service may not yet have
completed writing to the log.

Since standard output is also verified separately through a
different mechanism, the following changes are made:

- Enhance the global "done check" (previous called "done file") to
  check completion per service. This effectively adds the
  capability previously provided by the log assert mechanism.

- Use the existing "done check" mechanism to wait until the service
  command-line is complete up to the point of the check.

- Only now verify the stdout buffer content as checked previously.

- Remove the log mechanism all together.
@jnsgruk jnsgruk merged commit bf8a4d4 into canonical:master Aug 1, 2023
18 checks passed
jujubot added a commit to juju/juju that referenced this pull request Aug 10, 2023
#16066

Mostly includes test fixes and CI changes, but significant user-visible changes are:

* Do not inherit HOME and USER env vars from pebble daemon environment (canonical/pebble#262)
* servstate: backoff svcs should be stopped on exit (canonical/pebble#266)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Data race in internals/overlord/servstate:S.startTestServices(c) Intermittent failure in servstate tests
4 participants