Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ratelimits: Auto pause zombie clients #7763

Merged
merged 35 commits into from
Nov 8, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
35 commits
Select commit Hold shift + click to select a range
d122f18
edited names, bucket, overrides, etc
kruti-s Oct 15, 2024
e966a9a
more changes made
kruti-s Oct 16, 2024
d6fed53
change error message for limiter
kruti-s Oct 16, 2024
fb3a966
edit limiter_test for err msg
kruti-s Oct 16, 2024
bdc2a13
Increment ratelimit for IssuancePaused in ra.go
kruti-s Oct 18, 2024
5bd14ab
SpendorCheck to SpendAndCheck
kruti-s Oct 18, 2024
2a02e4d
Merge remote-tracking branch 'origin/main' into 7738-auto-pause-zombi…
kruti-s Oct 22, 2024
03c3e71
addressed comments for ra.go
kruti-s Oct 24, 2024
73712f0
Merge remote-tracking branch 'refs/remotes/origin/main' into 7738-aut…
kruti-s Oct 24, 2024
d860f88
change IssuancePausedPerDomainPerAccount to FailedAuthorizationsForPa…
kruti-s Oct 24, 2024
542b7c5
change count and period for FailedAuthorizationsForPausingPerDomainPe…
kruti-s Oct 24, 2024
7857899
Merge remote-tracking branch 'refs/remotes/origin/main' into 7738-aut…
kruti-s Oct 25, 2024
35a6f63
override and default explanation changes
kruti-s Oct 25, 2024
6c4e2fe
change working override test vals
kruti-s Oct 25, 2024
97aebfc
Merge remote-tracking branch 'refs/remotes/origin/main' into 7738-aut…
kruti-s Oct 29, 2024
061f498
limiter test fixes
kruti-s Oct 30, 2024
a5ce802
Fix test
pgporada Oct 30, 2024
8a28d9c
started writing TesetResetAccountPausingLimit in ra_test.go
kruti-s Oct 30, 2024
a207b74
Progress
pgporada Oct 31, 2024
f983373
More progress
pgporada Nov 1, 2024
49b959f
Progress is progressing
pgporada Nov 4, 2024
37fb882
Comment wrapping
pgporada Nov 4, 2024
794a9bf
Add new feature flag
pgporada Nov 4, 2024
9a59484
Merge branch 'main' into 7738-auto-pause-zombie-clients
pgporada Nov 4, 2024
bfbc51a
Log error from calling sa.PauseIdentifiers
pgporada Nov 4, 2024
5ccd589
Fix spelling error
pgporada Nov 4, 2024
4bce078
Enable new feature for specific tests
pgporada Nov 5, 2024
12bf5e5
Fix tests
pgporada Nov 6, 2024
85f21e8
Fix?
pgporada Nov 6, 2024
a79d881
Merge branch 'main' into 7738-auto-pause-zombie-clients
pgporada Nov 7, 2024
cc01553
Fix tests from merging main
pgporada Nov 7, 2024
fb547cd
Cleanup comments
pgporada Nov 7, 2024
30ea374
Address comment
pgporada Nov 8, 2024
2e65dc3
Address comments
pgporada Nov 8, 2024
4c7be8c
Address next round of comments
pgporada Nov 8, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 7 additions & 0 deletions features/features.go
Original file line number Diff line number Diff line change
Expand Up @@ -119,6 +119,13 @@ type Config struct {
// unique "INSERT ... RETURNING" functionality.
InsertAuthzsIndividually bool

// AutomaticallyPauseZombieClients configures the RA to automatically track
// limiter to be the authoritative source of rate limiting information for
// automatically pausing clients who systemically fail every validation
// attempt. When disabled, only manually paused accountID:identifier pairs
Comment on lines +122 to +125
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't parse this sentence. Specifically "to automatically track limiter to be the authoritative source ..." seems to be an editing error?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Weird this looks like some kind of hybrid of what was there and what I put in my suggestion: #7763 (comment)

// will be rejected.
AutomaticallyPauseZombieClients bool

// IncrementRateLimits uses Redis' IncrBy, instead of Set, for rate limit
// accounting. This catches and denies spikes of requests much more
// reliably.
Expand Down
71 changes: 66 additions & 5 deletions ra/ra.go
Original file line number Diff line number Diff line change
Expand Up @@ -122,6 +122,7 @@ type RegistrationAuthorityImpl struct {
orderAges *prometheus.HistogramVec
inflightFinalizes prometheus.Gauge
certCSRMismatch prometheus.Counter
pauseCounter *prometheus.CounterVec
}

var _ rapb.RegistrationAuthorityServer = (*RegistrationAuthorityImpl)(nil)
Expand Down Expand Up @@ -241,6 +242,12 @@ func NewRegistrationAuthorityImpl(
})
stats.MustRegister(certCSRMismatch)

pauseCounter := prometheus.NewCounterVec(prometheus.CounterOpts{
Name: "paused_pairs",
Help: "Number of times a pause operation is performed, labeled by paused=[bool], repaused=[bool], grace=[bool]",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps it's too much for the help string here, but it would be good to document what repaused and grace mean. Perhaps in a comment either here or where they are incremented?

Alternately, we could remove paused=[bool], repaused=[bool], grace=[bool] from the help string since that information is available direct from Prometheus, and use some of the saved space to explain the two less obvious labels.

}, []string{"paused", "repaused", "grace"})
stats.MustRegister(pauseCounter)

issuersByNameID := make(map[issuance.NameID]*issuance.Certificate)
for _, issuer := range issuers {
issuersByNameID[issuer.NameID()] = issuer
Expand Down Expand Up @@ -276,6 +283,7 @@ func NewRegistrationAuthorityImpl(
orderAges: orderAges,
inflightFinalizes: inflightFinalizes,
certCSRMismatch: certCSRMismatch,
pauseCounter: pauseCounter,
}
return ra
}
Expand Down Expand Up @@ -1810,15 +1818,17 @@ func (ra *RegistrationAuthorityImpl) recordValidation(ctx context.Context, authI
}

// countFailedValidation increments the failed authorizations per domain per
// account rate limit. There is no reason to surface errors from this function
// to the Subscriber, spends against this limit are best effort.
func (ra *RegistrationAuthorityImpl) countFailedValidation(ctx context.Context, regId int64, name string) {
// account rate limit. If the AutomaticallyPauseZombieClients feature has been
// enabled, it also increments the failed authorizations for pausing per domain
// per account rate limit. There is no reason to surface errors from this
Comment on lines 1820 to +1823
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// countFailedValidation increments the FailedAuthorizationsPerDomainPerAccount. If the AutomaticallyPauseZombieClients feature has been enabled, it also increments the FailedAuthorizationsForPausingPerDomainPerAccountTransaction rate limit

// function to the Subscriber, spends against this limit are best effort.
func (ra *RegistrationAuthorityImpl) countFailedValidation(ctx context.Context, regId int64, ident identifier.ACMEIdentifier) {
if ra.limiter == nil || ra.txnBuilder == nil {
// Limiter is disabled.
return
}

txn, err := ra.txnBuilder.FailedAuthorizationsPerDomainPerAccountSpendOnlyTransaction(regId, name)
txn, err := ra.txnBuilder.FailedAuthorizationsPerDomainPerAccountSpendOnlyTransaction(regId, ident.Value)
if err != nil {
ra.log.Warningf("building rate limit transaction for the %s rate limit: %s", ratelimits.FailedAuthorizationsPerDomainPerAccount, err)
}
Expand All @@ -1830,6 +1840,54 @@ func (ra *RegistrationAuthorityImpl) countFailedValidation(ctx context.Context,
}
ra.log.Warningf("spending against the %s rate limit: %s", ratelimits.FailedAuthorizationsPerDomainPerAccount, err)
}

pgporada marked this conversation as resolved.
Show resolved Hide resolved
if features.Get().AutomaticallyPauseZombieClients {
txn, err = ra.txnBuilder.FailedAuthorizationsForPausingPerDomainPerAccountTransaction(regId, ident.Value)
if err != nil {
ra.log.Warningf("building rate limit transaction for the %s rate limit: %s", ratelimits.FailedAuthorizationsForPausingPerDomainPerAccount, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize this is a copy of some existing code, but I think both line 1833 and this line should instead return an error. I know this function tries to avoid returning errors to the caller, but a failure to build the rate limit transaction represents some sort of internal logic error, and that should become a 500 (which helps ensure it shows up in our metrics, and gets logged in the WFE with some useful context).

Also, now that there are two places within this function where we look for and discard Canceled / DeadlineExceeded errors, it makes more sense to return those to the caller as well, and have the caller look for Canceled / DeadlineExceeded (so we won't have to duplicate logic as much).

}

decision, err := ra.limiter.Spend(ctx, txn)
if err != nil {
if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
return
}
ra.log.Warningf("spending against the %s rate limit: %s", ratelimits.FailedAuthorizationsForPausingPerDomainPerAccount, err)
}

if decision.Result(ra.clk.Now()) != nil {
resp, err := ra.SA.PauseIdentifiers(ctx, &sapb.PauseRequest{
RegistrationID: regId,
Identifiers: []*corepb.Identifier{
{
Type: string(ident.Type),
Value: ident.Value,
},
},
})
if err != nil {
ra.log.Warningf("failed to pause %d/%q: %s", regId, ident.Value, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's another place where we should simply return the error, and let the caller filter out Canceled / DeadlineExceeded if it wants to.

}
ra.pauseCounter.With(prometheus.Labels{
"paused": strconv.FormatBool(resp.Paused > 0),
"repaused": strconv.FormatBool(resp.Repaused > 0),
"grace": strconv.FormatBool(resp.Paused <= 0 && resp.Repaused <= 0),
}).Inc()
}
}
}

// resetAccountPausingLimit resets bucket to maximum capacity for given account.
// There is no reason to surface errors from this function to the Subscriber.
func (ra *RegistrationAuthorityImpl) resetAccountPausingLimit(ctx context.Context, regId int64, ident identifier.ACMEIdentifier) {
pgporada marked this conversation as resolved.
Show resolved Hide resolved
bucketKey, err := ratelimits.NewRegIdDomainBucketKey(ratelimits.FailedAuthorizationsForPausingPerDomainPerAccount, regId, ident.Value)
if err != nil {
ra.log.Warningf("creating bucket key for regID=[%d] identifier=[%s]: %s", regId, ident.Value, err)
}
err = ra.limiter.Reset(ctx, bucketKey)
if err != nil {
ra.log.Warningf("resetting bucket for regID=[%d] identifier=[%s]: %s", regId, ident.Value, err)
}
}

// PerformValidation initiates validation for a specific challenge associated
Expand Down Expand Up @@ -1953,9 +2011,12 @@ func (ra *RegistrationAuthorityImpl) PerformValidation(
if prob != nil {
challenge.Status = core.StatusInvalid
challenge.Error = prob
go ra.countFailedValidation(vaCtx, authz.RegistrationID, authz.Identifier.Value)
go ra.countFailedValidation(vaCtx, authz.RegistrationID, authz.Identifier)
} else {
challenge.Status = core.StatusValid
if features.Get().AutomaticallyPauseZombieClients {
ra.resetAccountPausingLimit(vaCtx, authz.RegistrationID, authz.Identifier)
}
}
challenge.Validated = &vStart
authz.Challenges[challIndex] = *challenge
Expand Down
Loading
Loading