
Transient RefreshError exceptions from compute_engine credentials are not retried #1562

Open
droyo opened this issue Jul 25, 2024 · 2 comments

droyo commented Jul 25, 2024

It's not unheard of for requests to the GCE metadata server to fail with a transient error, like a 503, 500, or 429. These requests can and should be retried, and they are in certain code paths. However, one very important code path, the credentials.refresh(...) method for credentials obtained from the GCE metadata server, does not retry them.

Environment details

  • OS: Debian 11
  • Python version: 3.9.2
  • pip version: 20.3.4
  • google-auth version: 2.32.0

Steps to reproduce

  1. Set up an HTTP proxy on http://localhost:8080/ that injects transient 429 errors, e.g. one that makes every other request to a /token endpoint fail with status code 429 (a sketch of such a proxy appears after the example output below).
  2. Save getcreds.py:
import google.auth
import google.auth.transport.requests

import logging

# Surface google-auth's debug logs so each metadata-server request is visible.
logging.getLogger('google.auth').setLevel(logging.DEBUG)
logging.getLogger('google.auth').addHandler(logging.StreamHandler())

# Fetch default (metadata-server) credentials and force a token refresh.
request = google.auth.transport.requests.Request()
creds, project = google.auth.default()
creds.refresh(request)
  3. Run while true ; do http_proxy=http://localhost:8080/ python3 getcreds.py; sleep 1; done
  4. The first or second request should fail. Here is some example output:
Checking None for explicit credentials as part of auth process...
Checking Cloud SDK credentials as part of auth process...
Cloud SDK credentials not found on disk; not using them
Making request: GET http://169.254.169.254
Making request: GET http://metadata.google.internal/computeMetadata/v1/project/project-id
Making request: GET http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/?recursive=true
Making request: GET http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/REDACTED/token
Traceback (most recent call last):
  File "/home/REDACTED/.local/lib/python3.9/site-packages/google/auth/compute_engine/credentials.py", line 127, in refresh
    self.token, self.expiry = _metadata.get_service_account_token(
  File "/home/REDACTED/.local/lib/python3.9/site-packages/google/auth/compute_engine/_metadata.py", line 351, in get_service_account_token
    token_json = get(request, path, params=params, headers=metrics_header)
  File "/home/REDACTED/.local/lib/python3.9/site-packages/google/auth/compute_engine/_metadata.py", line 243, in get
    raise exceptions.TransportError(
google.auth.exceptions.TransportError: ("Failed to retrieve http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/REDACTED/token from the Google Compute Engine metadata service. Status: 429 Response:\nb'Too many requests\\n'", <google.auth.transport.requests._Response object at 0x7fe0a13849d0>)

Note that only one attempt is logged for the /token endpoint.
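
For reference, here is a minimal sketch of such a flaky proxy using only the Python standard library (purely illustrative; any proxy that returns 429 for every other /token request will do):

# flaky_proxy.py (illustrative): an HTTP forward proxy on localhost:8080 that
# answers every other request to a .../token path with 429 and forwards
# everything else to the real destination.
import http.server
import urllib.error
import urllib.request

class FlakyProxy(http.server.BaseHTTPRequestHandler):
    token_hits = 0

    def do_GET(self):
        # When the client uses http_proxy, the request line carries the
        # absolute URL, so self.path is the full upstream URL.
        if "/token" in self.path:
            FlakyProxy.token_hits += 1
            if FlakyProxy.token_hits % 2 == 1:
                self.send_response(429)
                self.end_headers()
                self.wfile.write(b"Too many requests\n")
                return
        try:
            upstream = urllib.request.urlopen(
                urllib.request.Request(self.path, headers=dict(self.headers))
            )
            status = upstream.status
        except urllib.error.HTTPError as err:
            upstream, status = err, err.code  # pass upstream errors through
        body = upstream.read()
        self.send_response(status)
        for name, value in upstream.headers.items():
            if name.lower() not in ("transfer-encoding", "connection"):
                self.send_header(name, value)
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    http.server.HTTPServer(("localhost", 8080), FlakyProxy).serve_forever()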

We do retry on certain types of errors:

while retries < retry_count:
    try:
        response = request(url=url, method="GET", headers=headers_to_use)
        break
    except exceptions.TransportError as e:
        _LOGGER.warning(
            "Compute Engine Metadata server unavailable on "
            "attempt %s of %s. Reason: %s",
            retries + 1,
            retry_count,
            e,
        )
        retries += 1

In the scenario I'm complaining about, however, the HTTP request itself completes, but the response code indicates a transient error, so we hit this code path instead:

else:
    raise exceptions.TransportError(
        "Failed to retrieve {} from the Google Compute Engine "
        "metadata service. Status: {} Response:\n{}".format(
            url, response.status, response.data
        ),
        response,
    )

The following patch to google/auth/compute_engine/_metadata.py "fixes" the reproduction:

--- _metadata.py.orig   2024-07-25 01:51:12.567167923 +0000
+++ _metadata.py        2024-07-25 01:51:57.026072333 +0000
@@ -28,6 +28,7 @@
 from google.auth import environment_vars
 from google.auth import exceptions
 from google.auth import metrics
+from google.auth import transport
 
 _LOGGER = logging.getLogger(__name__)
 
@@ -202,7 +203,11 @@
     while retries < retry_count:
         try:
             response = request(url=url, method="GET", headers=headers_to_use)
-            break
+            if response.status in transport.DEFAULT_RETRYABLE_STATUS_CODES:
+                retries += 1
+                continue
+            else:
+                break
 
         except exceptions.TransportError as e:
             _LOGGER.warning(

I put "fixes" in quotes because in a real failure, a transient error is likely caused by the GCE metadata server or one of its dependencies being overwhelmed, and some degree of exponential backoff should be used. The existing logic makes sense for a timeout, because some time has already been spent waiting.

A separate but related request would be for the RefreshError raised to have the retryable property set appropriately, so library users can decide what to do on transient failures.
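
For illustration, if that property were set, a caller could handle transient refresh failures along these lines (a sketch; the retry policy is the caller's choice):

import time

import google.auth
import google.auth.transport.requests
from google.auth import exceptions

request = google.auth.transport.requests.Request()
creds, project = google.auth.default()

# Sketch: retry refresh() when the library marks the failure as retryable,
# backing off between attempts.
for attempt in range(3):
    try:
        creds.refresh(request)
        break
    except exceptions.RefreshError as exc:
        if not exc.retryable or attempt == 2:
            raise
        time.sleep(2 ** attempt)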

clundin25 (Contributor) commented

I think this is a reasonable thing to add. This repo has an exponential backoff implementation that can be used for retries.
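
A sketch of what the metadata fetch could look like with that helper, assuming google.auth._exponential_backoff.ExponentialBackoff (an iterable that sleeps between attempts); details are illustrative, not the final implementation:

from google.auth import _exponential_backoff, exceptions, transport

def get_with_retries(request, url, headers, retry_count=5):
    # Sketch only: combine the existing TransportError retry with a check for
    # retryable status codes, pausing with exponential backoff between
    # attempts instead of retrying immediately.
    backoff = _exponential_backoff.ExponentialBackoff(total_attempts=retry_count)
    for _ in backoff:  # each iteration after the first sleeps before retrying
        try:
            response = request(url=url, method="GET", headers=headers)
        except exceptions.TransportError:
            continue  # connection-level failure: try again
        if response.status in transport.DEFAULT_RETRYABLE_STATUS_CODES:
            continue  # e.g. 429/500/503: try again after backing off
        return response
    raise exceptions.TransportError(
        "Failed to retrieve {} from the metadata service after {} attempts".format(
            url, retry_count
        )
    )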

clundin25 self-assigned this Jul 26, 2024
droyo (Author) commented Jul 29, 2024

If it helps, I did a brief survey of implementations of this library in other languages.

The Go implementation does not seem to retry on 429 errors.

The Java implementation also does not seem to retry, but looking at this mapping, it seems the authors may have intended for a 429 response to be retryable.

I tried looking into the NodeJS implementation but lost interest. I think it would be retried.
