servstate: test stability improvements #266
Conversation
This is great, thanks for the TLC, @flotter! I'd been thinking of this kind of cleanup to this test file for a while, but had never gotten to it.
I've added a bunch of comments, but they're all nits / minor.
One thing I wondered about was whether S should have fields like manager at all. Maybe it'd be clearer if the setup method just returned the manager, so everything is scoped to just the test at hand. I'm guessing it's a bit awkward because then you have to pass around the log buffer and state and runner to the relevant funcs. But I'm interested in your take. I'm happy leaving that too (forever or for future work) -- it's fine to make incremental progress.
Reaper vs stopRunningServices race?
The default behaviour today for a service that exits is that Pebble will auto-restart the service. This process has a built-in back-off strategy, which means that on exit a timer is set to restart the service after the back-off period expires. The way to prevent a service in back-off from restarting is to explicitly stop the service (this case is covered in a unit test).

When the Pebble daemon starts a graceful exit, it first has to stop all running services by requesting this explicitly and waiting for the change request to complete. However, services currently scheduled for restart (back-off) are not stopped through this request. This is an issue because the restart mechanism relies on the reaper package, which is stopped straight after running services are stopped. This means any service in back-off can find itself in one of the following situations:

1. The most common case is that the service restart thread is killed by the garbage collector on process exit, so this appears clean.
2. The service restarts between the services stop request and the reaper kill request, in which case the service will continue to run after the Pebble process is dead (of course this is statistically very unlikely to happen, but possible).
3. The service restarts after the reaper is killed, but before the Pebble process exits. This will result in the "reaper not started" panic we often see in unit tests (this is also a statistically unlikely scenario, but possible).

This PR modifies the servstate function used for terminating all services during shutdown (only) to also explicitly stop services in back-off.
Currently, running unit test services overlap each other, as there is no explicit service stop on test tear-down. This is not directly a problem, although it makes interpreting the unit test output impossible, as the servstate messages you see during the test also include output from service events of previous tests. There is, however, one real issue: services that restart after a unit test exits (due to expiring back-off) require the original test reaper instance to start and wait on the service. The reaper is correctly destroyed during test teardown, so as a result we constantly get error messages like:

panic: reaper not started

This PR includes the following work:

- Reorganise the test setup and tear-down logic to allow consistent management of servstate manager instances (some tests use a default setup template, while others create a test-specific instance).
- Enforce clean service shutdown (running and back-off) during test teardown.
- Arrange the unit tests to be first in the file, with helper functions at the end, to allow for easy reading.
Unit tests for servstate are extremely timing sensitive. If you run the unit tests under heavy CPU load, chances are very good that you will start to see various failures. The easiest way to demonstrate this is to use the stress utility to simulate heavy CPU load (stress --io 12 --cpu 12). Unit test output snippet:

START: manager_test.go:828: S.TestEnvironment
... obtained string = ""
... expected string = "" +
...     "PEBBLE_ENV_TEST_1=foo\n" +
...     "PEBBLE_ENV_TEST_2=bar bazz\n" +
...     "PEBBLE_ENV_TEST_PARENT=from-parent\n"
FAIL: manager_test.go:828: S.TestEnvironment

This specific race condition is caused because unit tests often make assumptions about how long it takes from the moment the service is started until the moment the service command completes. This PR will only address this very specific kind of race, and hopefully in the process provide a solution that will be used in future tests. Please note that using the stress utility on tests from this package will still trigger lots of remaining races caused by very timing-sensitive tests.
Thanks for the clarifications! 🟢
Force-pushed from c6e4f6d to 5566aae
Thanks! This is a great maintainability and reliability improvement.
@flotter Do you think this is ready to merge? If so, I can go ahead and do that.
The startTestServices() test helper uses a special entry in the service command under test to write the service standard output also to a log file that can be inspected. This mechanism suffers from a race condition (as highlighted in canonical#264) because when the content of the log file is loaded, the service may not yet have completed writing to the log. Since standard output is also verified separately through a different mechanism, the following changes are made:

- Enhance the global "done check" (previously called "done file") to check completion per service. This effectively adds the capability previously provided by the log assert mechanism.
- Use the existing "done check" mechanism to wait until the service command line is complete up to the point of the check.
- Only now verify the stdout buffer content as checked previously.
- Remove the log mechanism altogether.
#16066 Mostly includes test fixes and CI changes, but significant user-visible changes are:

* Do not inherit HOME and USER env vars from pebble daemon environment (canonical/pebble#262)
* servstate: backoff svcs should be stopped on exit (canonical/pebble#266)
servstate: backoff svcs should be stopped on exit
The default behaviour today for a service that exits is that Pebble will
auto-restart the service. This process has a built in back-off strategy
which means that on exit a timer is set to restart the service after
the back-off period expires.
The way to prevent a service in back-off from restarting is to explicitly
stop the service (this case is covered in a unit test).
When the Pebble daemon starts a graceful exit, it has to first stop all
running services by requesting this explicitly, and waiting for the change
request to complete.
However, services currently scheduled for restart (back-off) are not
stopped through this request. This is an issue because the restart mechanism
relies on the reaper package, which is stopped straight after running services
are stopped.
This means any service in back-off can find itself in one of the following
situations:
1. The most common case is that the service restart thread is killed by the
   garbage collector on process exit, so this appears clean.
2. The second scenario is that the service restarts between the services stop
   request and the reaper kill request, in which case the service will
   continue to run after the Pebble process is dead (of course this is
   statistically very unlikely to happen, but possible).
3. The last scenario is that the service restarts after the reaper is killed,
   but before the Pebble process exits. This will result in the "reaper not
   started" panic we often see in unit tests (this is also a statistically
   unlikely scenario, but possible).
This PR modifies the servstate function used for terminating all services
during shutdown (only) to also explicitly stop services in back-off.
servstate: unit test teardown must stop svcs
Currently, running unit test services overlap each other, as there
is no explicit service stop on test tear-down. This is not directly a
problem, although it makes interpreting the unit test output impossible,
as the servstate messages you see during the test also include output
from service events of previous tests.
There is, however, one real issue: services that restart after a
unit test exits (due to expiring back-off) require the original
test reaper instance to start and wait on the service. The reaper is
correctly destroyed during test teardown, so as a result we constantly
get error messages like:
panic: reaper not started
This PR includes the following work:
- Reorganise the test setup and tear-down logic to allow consistent
  management of servstate manager instances (some tests use a default
  setup template, while others create a test-specific instance).
- Enforce clean service shutdown (running and back-off) during test
  teardown.
- Arrange the unit tests to be first in the file, with helper
  functions at the end, to allow for easy reading.
servstate: remove test races on svcs file output
Unit tests for servstate are extremely timing sensitive. If you run the unit
tests under heavy CPU load, chances are very good that you will start to see
various failures. The easiest way to demonstrate this is to use the stress
utility to simulate heavy CPU load (stress --io 12 --cpu 12).
Unit test output snippet:

START: manager_test.go:828: S.TestEnvironment
... obtained string = ""
... expected string = "" +
...     "PEBBLE_ENV_TEST_1=foo\n" +
...     "PEBBLE_ENV_TEST_2=bar bazz\n" +
...     "PEBBLE_ENV_TEST_PARENT=from-parent\n"
FAIL: manager_test.go:828: S.TestEnvironment
This specific race condition is caused because unit tests often make
assumptions about how long it takes from the moment the service is started
until the moment the service command completes.
This PR will only address this very specific kind of race, and hopefully
in the process provide a solution that will be used in future tests.
Please note that using the stress utility on tests from this package will
still trigger lots of remaining races caused by very timing-sensitive tests.
servstate: unify log assert and done check
The startTestServices() test helper uses a special entry in the
service command under test to write the service standard output
also to a log file that can be inspected.
This mechanism suffers from a race condition (as highlighted in
#264) because when the
content of the log file is loaded, the service may not yet have
completed writing to the log.
Since standard output is also verified separately through a
different mechanism, the following changes are made:
- Enhance the global "done check" (previously called "done file") to
  check completion per service. This effectively adds the
  capability previously provided by the log assert mechanism.
- Use the existing "done check" mechanism to wait until the service
  command line is complete up to the point of the check.
- Only now verify the stdout buffer content as checked previously.
- Remove the log mechanism altogether.