fix(daemon): avoid acquiring state lock in health check endpoint #369
Conversation
I don't love the addition of the (exported) State.LockCount method, but I couldn't think of another way to ensure that State.Lock was not being called, apart from significant surgery allowing us to swap out the state lock with an interface.

In addition, I'm not sure we need the maintenance/warnings stuff at all in Pebble, as they're not used, but I'm leaving that for a separate discussion and PR.

I tested by saving the script https://go.dev/play/p/LJDLEaXeBpd to cmd/hithealth/main.go, running a Pebble server, and then running a task to heavily load my CPU cores (though the latter doesn't seem to make much of a difference!).

Without the fix, GET /v1/health response times are all over 20ms, and commonly in the 100-300ms range. In other words, very long for a do-nothing health check endpoint. This is because the other clients (in the hithealth script) are starting services, which modifies state and holds the state lock for relatively long periods.

```
$ go run ./cmd/hithealth/
0 > 20ms: 0.028897s
1 > 20ms: 0.192796s
2 > 20ms: 0.268904s
3 > 20ms: 0.082985s
4 > 20ms: 0.030554s
5 > 20ms: 0.156912s
6 > 20ms: 0.245212s
7 > 20ms: 0.047436s
8 > 20ms: 0.099474s
9 > 20ms: 0.070440s
10 > 20ms: 0.060641s
11 > 20ms: 0.202535s
12 > 20ms: 0.206226s
...
```

With the fix, GET /v1/health doesn't hold the state lock at all. Response times exceed 20ms only every 10,000 or so requests, and never go above 100ms even under heavy load:

```
$ go run ./cmd/hithealth/
13891 > 20ms: 0.023703s
15923 > 20ms: 0.024769s
21915 > 20ms: 0.076423s
...
```
@benhoyt I only have a general question. The amount of latency on the health request's state lock (~200ms) in the worst case seems quite bad. I wonder if the WaitChange() that is also part of the service-starting loop (10x concurrently) in your test script is also contributing to state lock contention. I am probably missing some data on why this happens, but I would be surprised if any one state lock user holds the lock for very long, so even if a lot of state lock operations are interleaved, I am still a bit surprised that a single state lock request has to wait that long (even the average feels very long).
@flotter Thanks for the review. Yes, I did some further timing tests of the different slow parts.
Two things to note:

1. It's quite slow just to do the JSON marshalling (this is because state has grown quite large, like several hundred KB of JSON, with quite a few Changes being recorded rapidly).
2. The checkpoint/save is slow, not due to the file writing, but due to the fsync.

I'd like to consider improving either or both of these separately, for example not doing the fsync (maybe a bad idea?) or only saving every (say) 100ms at most. Or moving to a real database like SQLite... :-)
Going to merge this as is, seeing as this is a fairly high-priority issue; we can always improve on this implementation detail later.
This Pebble version adds the following since v1.7.1:

- canonical/pebble#368: replace "select=all" with "users=all". Strictly speaking this is a breaking change; however, it was a design mistake and we want to fix it as soon as possible. It's almost certain that no one is using this feature yet.
- canonical/pebble#369: avoid acquiring the state lock in the health check endpoint. This fixes (or is one aspect of the fix for) the issue described in https://bugs.launchpad.net/juju/+bug/2052517, so that the GET /v1/health endpoint returns much more quickly even when under load.

In addition, this has the side effect of bumping a few of the golang.org/x packages, as Pebble depends on higher versions.
Approving retroactively; it seems fine. Thank you for trying to get to the bottom of this.
```go
//
// - https://github.com/canonical/pebble/issues/366
// - https://bugs.launchpad.net/juju/+bug/2052517
func (s *apiSuite) TestStateLockNotHeld(c *C) {
```
Rather than adding LockCount, I would have just acquired the lock, then made the call to the health endpoint. If it times out, then it depends on the lock. But at least the lock count lets you know when you should start counting from.
Ah, that's a very good idea! And done that way, it doesn't actually matter when the test starts: whether the test acquires the lock before or after the overlord startup goroutine does, the result is the same. Fixed in #373 (and while I was in there, I found that the /v1/health error case still acquired the state lock, so I fixed that too).
As Harry pointed out at canonical#369 (comment), there's a much simpler way to test this without needing a new State method like LockCount: just acquire the state lock, then call the endpoint. If the call times out, we know it was trying to acquire the lock. In addition, fix an issue where the health endpoint would still hold the state lock if it returned an error, and add a test for that too.
This PR avoids acquiring the state lock in the health check endpoint: a localised fix for the issue described in https://bugs.launchpad.net/juju/+bug/2052517 and #366.