[Feature] Add support for async OAuth token refreshes #1135

Open · renaudhartert-db wants to merge 9 commits into main

Conversation

@renaudhartert-db (Contributor) commented Jan 22, 2025

What changes are proposed in this pull request?

This PR aims to eliminate long-tail latency caused by OAuth token refreshes in scenarios where a single client is responsible for relatively high (e.g. > 1 QPS), continuous outbound traffic. The feature is disabled by default, which arguably makes this PR a functional no-op.

More precisely, the PR introduces a new token cache that attempts to always keep its token fresh by asynchronously refreshing it before it expires. We distinguish three token states:

  • fresh: The token is valid and is not close to its expiration.
  • stale: The token is valid but will expire soon.
  • expired: The token has expired and cannot be used.

Each time a request tries to access the token, we do the following:

  • If the token is fresh, return the current token;
  • If the token is stale, trigger an asynchronous refresh and return the current token;
  • If the token is expired, make a blocking refresh call to update the token and return it.

In particular, asynchronous refreshes use a lock to guarantee that there can only be one pending refresh at any given time.

The performance of the algorithm depends on the lengths of the stale and fresh periods. On the one hand, the stale period must be long enough to prevent tokens from entering the expired state. On the other hand, a long stale period reduces the length of the fresh period, thus increasing the refresh frequency.

Right now, the stale period is configured to 3 minutes by default (i.e. 5% of the expected token lifespan of 1 hour). This value might be changed in the future to guarantee that the default behavior achieves the best performance for the majority of users.
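
To make the description above concrete, here is a minimal, self-contained sketch of such a cache. Only the names cachedTokenSource, staleDuration, defaultStaleDuration, cachedToken, mu, and blockingToken appear in the diff excerpts quoted further down this page; everything else (stateOf, refreshPending, asyncRefresh, and the exact locking layout) is assumed for illustration, and the ability to disable the async path (disableAsync / WithAsyncRefresh) is omitted.

```go
package auth

import (
	"sync"
	"time"

	"golang.org/x/oauth2"
)

type tokenState int

const (
	fresh   tokenState = iota // valid and not close to expiration
	stale                     // valid but will expire soon
	expired                   // can no longer be used
)

// Assumed default: 3 minutes, i.e. 5% of the expected 1-hour token lifespan.
const defaultStaleDuration = 3 * time.Minute

type cachedTokenSource struct {
	tokenSource   oauth2.TokenSource
	staleDuration time.Duration

	mu             sync.Mutex
	cachedToken    *oauth2.Token
	refreshPending bool // at most one asynchronous refresh at a time
}

// stateOf classifies a token relative to its expiry time. A nil token or a
// zero expiry is treated as expired here for simplicity.
func stateOf(t *oauth2.Token, staleDuration time.Duration) tokenState {
	switch {
	case t == nil || !time.Now().Before(t.Expiry):
		return expired
	case !time.Now().Before(t.Expiry.Add(-staleDuration)):
		return stale
	default:
		return fresh
	}
}

func (c *cachedTokenSource) Token() (*oauth2.Token, error) {
	c.mu.Lock()
	t := c.cachedToken
	switch stateOf(t, c.staleDuration) {
	case fresh:
		c.mu.Unlock()
		return t, nil
	case stale:
		// Serve the still-valid token, but refresh it in the background.
		// refreshPending guarantees a single pending refresh at a time.
		if !c.refreshPending {
			c.refreshPending = true
			go c.asyncRefresh()
		}
		c.mu.Unlock()
		return t, nil
	default: // expired
		c.mu.Unlock()
		return c.blockingToken()
	}
}

func (c *cachedTokenSource) asyncRefresh() {
	t, err := c.tokenSource.Token() // network call, done outside the lock
	c.mu.Lock()
	defer c.mu.Unlock()
	c.refreshPending = false
	if err == nil {
		c.cachedToken = t
	}
	// On failure the cached token keeps being served while it is valid;
	// once it expires, callers fall back to blockingToken.
}

func (c *cachedTokenSource) blockingToken() (*oauth2.Token, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	// Another call (blocking or async) may already have refreshed the token.
	if stateOf(c.cachedToken, c.staleDuration) != expired {
		return c.cachedToken, nil
	}
	t, err := c.tokenSource.Token()
	if err != nil {
		return nil, err
	}
	c.cachedToken = t
	return t, nil
}
```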

For reviewers:

  • This PR only uses the new cache in control-plane auth flows; I plan to send a follow-up PR to enable asynchronous refresh in data-plane flows once this one has been merged.
  • The oauth2.TokenSource interface is likely not sufficient for us; we would need one whose Token method takes a context.Context as a parameter: Token(context.Context) (Token, error) (a possible shape is sketched below).
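
For reference, a context-aware variant of the interface could look like the following. The shape is an assumption based on the comment above; in particular, the return type is assumed to be *oauth2.Token, which may differ from the PR's own Token type.

```go
package auth

import (
	"context"

	"golang.org/x/oauth2"
)

// TokenSource mirrors oauth2.TokenSource but threads a context through,
// so that refresh calls can honor deadlines and cancellation.
type TokenSource interface {
	Token(ctx context.Context) (*oauth2.Token, error)
}
```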

How is this tested?

Complete test coverage with a focus on various concurrency scenarios.

@renaudhartert-db renaudhartert-db changed the title [Feature] Add async token cache [Feature] Add support for async OAuth token refreshes Jan 23, 2025
@renaudhartert-db renaudhartert-db marked this pull request as ready for review January 23, 2025 08:43
cts := &cachedTokenSource{
    tokenSource:   ts,
    staleDuration: defaultStaleDuration,
    disableAsync:  defaultDisableAsyncRefresh,

Contributor commented:

disable -> enable + flip the defaults

By including negation in the property name, expressions often become double negations, which are harder to read and interpret than a simple if enableAsync.

@renaudhartert-db (Author) replied Jan 27, 2025:

Agreed in general but I think this should be a little more nuanced.

Ultimately, the intended default behavior is to enable async. Using disable as the field guarantees that the zero value is the default. This makes it easy to reason about the code, as a non-zero value always corresponds to a special case, which would typically be handled at a higher level of indentation.

I'd prefer to keep the code as it is (note that the public interface WithAsyncRefresh has no negation) unless you feel strongly about it.
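
As a hedged illustration of that argument, the public option can stay positive while the internal field keeps a zero-value default. The option signature and wiring below are assumptions, not code from the PR; they assume the cachedTokenSource struct with a disableAsync field shown in the diff excerpt above.

```go
// Option configures a cachedTokenSource; the exact options API in the PR
// may differ.
type Option func(*cachedTokenSource)

// WithAsyncRefresh enables or disables asynchronous refreshes. The public
// name carries no negation; it maps onto the internal disableAsync field
// so that the struct's zero value (disableAsync == false) keeps the
// default behavior, i.e. async refresh enabled.
func WithAsyncRefresh(enabled bool) Option {
	return func(c *cachedTokenSource) {
		c.disableAsync = !enabled
	}
}
```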

config/experimental/auth/auth.go (conversation resolved)
case fresh:
    cts.mu.Lock()
    defer cts.mu.Unlock()
    return cts.cachedToken, nil

Contributor commented:

Acquiring the mutex twice is a race condition.

The cached token itself and its state are coupled and should be retrieved atomically.

@renaudhartert-db (Author) replied:

I agree that making the operation atomic would speed up the code, though I'm not sure I see what the race condition would be. It's true that we might return a token that is older than what the state suggests, but that can happen even if the operation is atomic; it is just a little more likely right now.
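
One way to make the read atomic, as the reviewer suggests, is to take the lock once and return the token together with its state. The helper name below is illustrative, and stateOf refers to the classification helper assumed in the sketch earlier on this page.

```go
// tokenAndState reads the cached token and derives its state inside a
// single critical section, so the two can never disagree.
func (c *cachedTokenSource) tokenAndState() (*oauth2.Token, tokenState) {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.cachedToken, stateOf(c.cachedToken, c.staleDuration)
}
```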

    defer cts.mu.Unlock()
    return cts.cachedToken, nil
default: // expired
    return cts.blockingToken()

Contributor commented:

By locking inside this function and not deduping, the calls to Token() become sequential if there is a failure.

I think it is fair to try and acquire the token once and also cache the failure.

Doing this would also allow for a single token refresh path instead of having separate async refresh and blocking refresh functions.

@renaudhartert-db (Author) replied Jan 27, 2025:

> By locking inside this function and not deduping, the calls to Token() become sequential if there is a failure.

I might be misunderstanding your comment, but shouldn't the calls be sequential even if we were to cache the error? The blockingToken function already contains a path (lines 186-188) to avoid calling the TokenSource if a call (either blocking or async) successfully refreshed the token.

> I think it is fair to try and acquire the token once and also cache the failure.

Treating blockingToken errors as transient is intended. This guarantees that the errors returned by cachedTokenSource have the same semantics as the errors returned by the wrapped TokenSource. I'd like to keep that property for now; what do you think?
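
For completeness, here is a sketch of the kind of single, deduplicated refresh path the reviewer describes. golang.org/x/sync/singleflight is used purely for illustration; the PR itself uses a mutex with separate async and blocking refresh functions, and does not cache failures.

```go
package auth

import (
	"golang.org/x/oauth2"
	"golang.org/x/sync/singleflight"
)

// sharedRefresher funnels every refresh through one in-flight call:
// concurrent callers that find the token stale or expired share a single
// call to the wrapped TokenSource (and its error) instead of refreshing
// one after the other.
type sharedRefresher struct {
	ts    oauth2.TokenSource
	group singleflight.Group
}

func (r *sharedRefresher) refresh() (*oauth2.Token, error) {
	v, err, _ := r.group.Do("token", func() (interface{}, error) {
		return r.ts.Token()
	})
	if err != nil {
		return nil, err // the same error is returned to all concurrent callers
	}
	return v.(*oauth2.Token), nil
}
```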


If integration tests don't run automatically, an authorized user can run them manually by following the instructions below:

Trigger:
go/deco-tests-run/sdk-go

Inputs:

  • PR number: 1135
  • Commit SHA: 8ac8c6e3ed8cd1a2350c495bee1d73eca9404fe4

Checks will be approved automatically on success.
