Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(daemon): avoid acquiring state lock in health check endpoint #369

Merged
merged 1 commit into from
Feb 28, 2024

Commits on Feb 27, 2024

  1. fix(daemon): avoid acquiring state lock in health check endpoint

    I don't love the addition of the (exported) State.LockCount method, but
    I couldn't think of another way to ensure that State.Lock was not being
    called, apart from significant surgery allowing us to swap out the
    state lock with an interface.
    
    In addition, I'm not sure we need the maintenance/warnings stuff at all
    in Pebble, as they're not used, but leaving that for a separate
    discussion and PR.
    
    I tested by saving the script https://go.dev/play/p/LJDLEaXeBpd to
    cmd/hithealth/main.go, running a Pebble server, and then running a
    task to heavily load my CPU cores (though the latter doesn't seem to
    make much of a difference!).
    
    Without the fix, GET /v1/health response times are all over 20ms, and
    commonly in the 100-300ms range. In other words, very long for a
    do-nothing health check endpoint. This is because the other clients
    (in the hithealth script) are starting services which modifies state
    and holds the state lock for a relatively long time.
    
    $ go run ./cmd/hithealth/
         0 > 20ms: 0.028897s
         1 > 20ms: 0.192796s
         2 > 20ms: 0.268904s
         3 > 20ms: 0.082985s
         4 > 20ms: 0.030554s
         5 > 20ms: 0.156912s
         6 > 20ms: 0.245212s
         7 > 20ms: 0.047436s
         8 > 20ms: 0.099474s
         9 > 20ms: 0.070440s
        10 > 20ms: 0.060641s
        11 > 20ms: 0.202535s
        12 > 20ms: 0.206226s
        ...
    
    With the fix, GET /v1/health doesn't hold the state lock at all.
    Response times are only over 20ms every 10,000 or so requests, and
    never above 100ms even under heavy load:
    
    $ go run ./cmd/hithealth/
     13891 > 20ms: 0.023703s
     15923 > 20ms: 0.024769s
     21915 > 20ms: 0.076423s
     ...
    benhoyt committed Feb 27, 2024
    Configuration menu
    Copy the full SHA
    d28cf9f View commit details
    Browse the repository at this point in the history