-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
🐛 retry pulling image layer when they timeout #7051
base: master
Are you sure you want to change the base?
🐛 retry pulling image layer when they timeout #7051
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a few comments, I would like you to reconsider the user friendliness of the progress bar changes. we can discuss
@@ -197,6 +197,9 @@ async def update(self, steps: float = 1) -> None: | |||
await self._update_parent(parent_update_value) | |||
await self._report_external(new_progress_value) | |||
|
|||
def reset_progress(self) -> None: | |||
self._current_steps = _INITIAL_VALUE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that what you mean by the progress never goes "down"?
you silently reset the progress and the user does not know about it.
Let's discuss the following use-case:
- a large image of 20-60GiB,
- pulling it takes a very long time, and the progress is at 80%,
- now comes an error and it will retry, if I get that right then you reset the progress bar which kind of make sense,
--> but the progress stays at 80%??
--> what does the user see until it eventually reaches 80% again? nothing?
if that is the case I don't think that is a nice user experience at all:
--> it makes user not believe our progress bars at all,
--> it probably makes the user shutdown the website and call it a day, it's too much waiting without feedback.
As I said, I think what makes sense if I am a user is:
- the progress bar goes back to 0
- I get a message saying "sorry something bad happened while pulling your image, we try again, please be patient..."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think one option when you reset is to change the message by adding something like:
new_description = current_progress_bar.description += f"(retrying {attemp})"
current_progress_bar.reset_progress(new_message=new_description)
and reset_progress should then call set_
so that the update is transmitted
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took the opportunity to extend the progress bar to also include retry attempts. These are also now displayed directly by the frontend.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see screenshots above in the description
async for attempt in AsyncRetrying( | ||
wait=wait_random_exponential(), | ||
stop=stop_after_attempt(retry_upon_error_count), | ||
reraise=True, | ||
retry=retry_if_exception_type(asyncio.TimeoutError), | ||
): | ||
try: | ||
parsed_progress = TypeAdapter(_DockerPullImage).validate_python( | ||
pull_progress | ||
) | ||
except ValidationError: | ||
_logger.exception( | ||
"Unexpected error while validating '%s'. " | ||
"TIP: This is probably an unforeseen pull status text that shall be added to the code. " | ||
"The pulling process will still continue.", | ||
f"{pull_progress=}", | ||
) | ||
else: | ||
await _parse_pull_information( | ||
parsed_progress, layer_id_to_size=layer_id_to_size | ||
) | ||
|
||
# compute total progress | ||
total_downloaded_size = sum( | ||
layer.downloaded for layer in layer_id_to_size.values() | ||
) | ||
total_extracted_size = sum( | ||
layer.extracted for layer in layer_id_to_size.values() | ||
) | ||
total_progress = (total_downloaded_size + total_extracted_size) / 2.0 | ||
progress_to_report = total_progress - reported_progress | ||
await progress_bar.update(progress_to_report) | ||
reported_progress = total_progress | ||
|
||
await log_cb( | ||
f"pulling {image_short_name}: {pull_progress}...", | ||
logging.DEBUG, | ||
# each time there is an error progress starts from zero | ||
progress_bar.reset_progress() | ||
_logger.info( | ||
"attempt='%s' to pull image='%s'", | ||
attempt.retry_state.attempt_number, | ||
image, | ||
) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I thought we already discussed that
AsyncRetrying
like so creates bad logs where you do not see what function is retried, so I would recommend creating a separate function and using theretry
decorator instead. - you do not log as warning like we ususally do, which would allow to not have the log at line 273
- you should not reset the progress on the very first attempt as a consequence of the discussion in the progress bar module
- I personally think that 10 retries for pulling images is too much. 10 times pulling a large image is equivalent to infinity. I guess 3 is already more than enough, I guess people will get bored before it tried 3 times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Wrapped in a function for better error messages.
- Redisegned, let's see if it's OK now.
- This one depends a bit on the image. I manage to get the same amount of errors on some images, but regardless of the size 3 might not be enough. Since these are timeout errors, it's most likely safe to continue pulling.
…ling-with-progress-reset
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #7051 +/- ##
==========================================
- Coverage 87.69% 87.58% -0.12%
==========================================
Files 1635 1463 -172
Lines 63943 60042 -3901
Branches 1178 959 -219
==========================================
- Hits 56075 52586 -3489
+ Misses 7556 7204 -352
+ Partials 312 252 -60
Continue to review full report in Codecov by Sentry.
|
…ling-with-progress-reset
…ling-with-progress-reset
Quality Gate passedIssues Measures |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
What do these changes do?
Sometimes when pulling images from a registry they timeout. Usually this is more evident when pulling very big images, especially if the registry is not located close by to your machine.
Most likely the Timeout error is a symptom of a lower networking error.
Avoids services failing upon start when this error happens.
How is this shown to the user?
now retry occurred
with retry
Related issue/s
How to test
Progress already reached maximum
in the dy-sidecarattempt='%s' to pull image='%s'
log entriesDev-ops checklist