-
Notifications
You must be signed in to change notification settings - Fork 71
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Apply retry logics in confidential computing API + workload image puller #511
Conversation
b462410
to
9145a14
Compare
verifier/rest/rest.go
Outdated
@@ -89,7 +95,7 @@ func NewClient(ctx context.Context, projectID string, region string, opts ...opt | |||
} | |||
|
|||
type restClient struct { | |||
v1Client *v1.Client | |||
v1Client *retryableClient |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't we do the retry in the restClient
's functions?
The retry behavior is specific to the rest client, and then we don't need to create a wrapper around v1.Client anymore
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. We manipulated the CallOptions directly in the returned client.
fa4e5fd
to
116999b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, its really nice that we are able to override callOptions like that.
116999b
to
b912537
Compare
b912537
to
1adb6c2
Compare
/gcbrun |
launcher/agent/agent.go
Outdated
} | ||
|
||
// Attest executes doAttest with retries when 500 errors originate from VerifyAttestation API. | ||
func (a *agent) AttestWithRetries(ctx context.Context, opts AttestAgentOpts, retry func() backoff.BackOff) ([]byte, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still feel we shouldn't do the retries in the agent
. Instead this is the job for the caller, in our case, it should be handled in container_runner
After looking into all the retry logic in "Attest", I feel like the part is adding too much complexity without much value added.
Change for rest.go and image pulling retry LGTM, let me know if this makes sense! |
Removed retry in the agent layer. |
d25d88a
to
3e478d1
Compare
verifier/rest/rest.go
Outdated
RequestedRegion string | ||
AvailableRegions []string | ||
err error | ||
func confComputeBackoffPolicy() backoff.BackOff { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like this not not needed, it's not referenced anywhere
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, removed.
codes.Unavailable, | ||
codes.Internal, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like your retry option only going to retry on these two codes. Do we need to worry about other codes (like in here: https://cloud.google.com/storage/docs/retry-strategy#go)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These codes are actually the mapping of gRPC error codes in https://github.com/grpc/grpc/blob/master/doc/statuscodes.md. And yes, these are only two error codes we want to retry on because 1. code.Unavailable is the default error code to retry 2. we only observed 500 internal error codes so far which caused test flakiness.
@@ -51,6 +69,9 @@ func NewClient(ctx context.Context, projectID string, region string, opts ...opt | |||
return nil, fmt.Errorf("can't create ConfidentialComputing v1 API client: %w", err) | |||
} | |||
|
|||
// Override the default retry CallOptions with specific retry policies. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reference to the default retry CallOptions, why can't we just rely on the default retry options?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default retry options can be found in here google3/google/cloud/confidentialcomputing/confidentialcomputing_v1_grpc_service_config.json and only retry on the code "UNAVAILABLE".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cool, we may want to change the default json config later in the API, especially if we want to add more retry codes.
Add the OSS version here for reference:
https://github.com/googleapis/googleapis/blob/master/google/cloud/confidentialcomputing/v1/confidentialcomputing_v1_grpc_service_config.json
3e478d1
to
78dbb56
Compare
/gcbrun |
/gcbrun |
This PR serves the purpose of mitigating the 404 error
metadata entry test/token not found
from client prober tests.After investigation, several factors contribute to this 404 error:
v1.GetLocation
.v1.VerifyAttestation
. see cloud loggingsTo mitigate the 404 test flakes, we should apply retry strategies to the following places:
GetLocation
.VerifyAttestation
.