Apply retry logics in confidential computing API + workload image puller #511

Merged
merged 4 commits into from
Dec 18, 2024

Collaborator

@yawangwang yawangwang commented Nov 2, 2024

This PR mitigates the 404 error "metadata entry test/token not found" seen in client prober tests.

Investigation showed that several failures contribute to this 404 error:

  • 401 Unauthorized error when pulling the container image (see Cloud Logging).
  • 500 Internal error when calling v1.GetLocation:
I1004 11:07:17.754233   13330 instance.go:634] [LogVMSerialConsole] 2024/10/04 18:05:54 failed to create REST verifier client: invalid region "southamerica-east1", available regions are [africa-south1, asia-east1, asia-east2, asia-northeast1, asia-northeast2, asia-northeast3, asia-south1, asia-south2, asia-southeast1, asia-southeast2, australia-southeast1, australia-southeast2, europe-central2, europe-north1, europe-southwest1, europe-west1, europe-west10, europe-west12, europe-west2, europe-west3, europe-west4, europe-west6, europe-west8, europe-west9, me-central1, me-central2, me-west1, northamerica-northeast1, northamerica-northeast2, southamerica-east1, southamerica-west1, us-central1, us-east1, us-east4, us-east5, us-east7, us-south1, us-west1, us-west2, us-west3, us-west4]: googleapi: Error 500: Internal error encountered.
  • 500 Internal error when calling v1.VerifyAttestation (see Cloud Logging).
  • 502 Bad Gateway error when pulling the container image (see Cloud Logging).

To mitigate the 404 test flakes, we should apply retry strategies in the following places:

  • when the launcher pulls the workload image.
  • when the rest client calls GetLocation.
  • when the rest client calls VerifyAttestation.

@yawangwang yawangwang force-pushed the retry_in_launcher branch 3 times, most recently from b462410 to 9145a14 Compare November 4, 2024 23:03
@yawangwang yawangwang changed the title Apply retry logics in launcher Apply retry logics in confidential computing API + workload image puller Nov 4, 2024
@yawangwang yawangwang marked this pull request as ready for review November 4, 2024 23:50
@yawangwang yawangwang requested a review from jkl73 November 4, 2024 23:50
verifier/rest/retry.go (outdated)
@@ -89,7 +95,7 @@ func NewClient(ctx context.Context, projectID string, region string, opts ...opt
}

type restClient struct {
-	v1Client *v1.Client
+	v1Client *retryableClient
Contributor

Why can't we do the retry in the restClient's functions?

The retry behavior is specific to the REST client, so we would no longer need a wrapper around v1.Client.

Collaborator Author

Done. We now set the CallOptions directly on the returned client.

@yawangwang yawangwang force-pushed the retry_in_launcher branch 3 times, most recently from fa4e5fd to 116999b Compare November 8, 2024 18:53
Contributor

@JoshuaKrstic JoshuaKrstic left a comment

Looks great, it's really nice that we are able to override CallOptions like that.

@yawangwang
Collaborator Author

/gcbrun

}

// AttestWithRetries executes doAttest with retries when 500 errors originate from the VerifyAttestation API.
func (a *agent) AttestWithRetries(ctx context.Context, opts AttestAgentOpts, retry func() backoff.BackOff) ([]byte, error) {
Contributor

I still feel we shouldn't do the retries in the agent. This is the caller's job; in our case, it should be handled in container_runner.

@jkl73
Contributor

jkl73 commented Dec 9, 2024

After looking into all the retry logic in "Attest", I feel that this part adds too much complexity without much value.

  • fetchAndWriteTokenWithRetry in container_runner already handles the attest with retry.
  • If Attest is invoked by the workload through the TEE server API, the caller (the workload) should handle the retry logic.

The changes to rest.go and the image pulling retry LGTM. Let me know if this makes sense!

@yawangwang
Collaborator Author

> After looking into all the retry logic in "Attest", I feel that this part adds too much complexity without much value.
>
>   • fetchAndWriteTokenWithRetry in container_runner already handles the attest with retry.
>   • If Attest is invoked by the workload through the TEE server API, the caller (the workload) should handle the retry logic.
>
> The changes to rest.go and the image pulling retry LGTM. Let me know if this makes sense!

Removed retry in the agent layer.

@yawangwang yawangwang force-pushed the retry_in_launcher branch 2 times, most recently from d25d88a to 3e478d1 Compare December 16, 2024 23:17
RequestedRegion string
AvailableRegions []string
err error
func confComputeBackoffPolicy() backoff.BackOff {
Contributor

Seems like this is not needed; it's not referenced anywhere.

Collaborator Author

Good catch, removed.

Comment on lines +46 to +41
codes.Unavailable,
codes.Internal,
Contributor

It seems your retry option will only retry on these two codes. Do we need to worry about other codes (like the ones listed here: https://cloud.google.com/storage/docs/retry-strategy#go)?

Collaborator Author

These codes map to the gRPC status codes listed in https://github.com/grpc/grpc/blob/master/doc/statuscodes.md. And yes, these are the only two codes we want to retry on, because (1) codes.Unavailable is the code retried by default, and (2) the only errors we have observed causing test flakiness so far are 500 internal errors.

@@ -51,6 +69,9 @@ func NewClient(ctx context.Context, projectID string, region string, opts ...opt
return nil, fmt.Errorf("can't create ConfidentialComputing v1 API client: %w", err)
}

// Override the default retry CallOptions with specific retry policies.
Contributor

Is there a reference for the default retry CallOptions? Why can't we just rely on the defaults?

Collaborator Author

The default retry options are defined in google3/google/cloud/confidentialcomputing/confidentialcomputing_v1_grpc_service_config.json; they only retry on the code "UNAVAILABLE".

Contributor

Cool. We may want to change the default JSON config in the API later, especially if we want to add more retry codes.

Adding the OSS version here for reference:
https://github.com/googleapis/googleapis/blob/master/google/cloud/confidentialcomputing/v1/confidentialcomputing_v1_grpc_service_config.json

@yawangwang
Collaborator Author

/gcbrun

@yawangwang yawangwang requested a review from jkl73 December 17, 2024 18:04
@jkl73
Contributor

jkl73 commented Dec 18, 2024

/gcbrun

@jkl73 jkl73 merged commit 545a4bc into google:main Dec 18, 2024
11 checks passed