
🐛 retry pulling image layers when they time out #7051

Open
wants to merge 10 commits into base: master

Conversation

Contributor

@GitHK GitHK commented Jan 16, 2025

What do these changes do?

Sometimes, pulling images from a registry times out. This is usually more evident when pulling very big images, especially if the registry is not located close to your machine.
Most likely the Timeout error is a symptom of a lower-level networking error.

  • Added a retry around the image pulling.
  • Extended the progress bar so its progress can be reset, in order to cope nicely with the progress bar display after an error and to avoid warnings in the logs.

This avoids services failing on start when this error happens.
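
In short, the idea is to wrap the whole pull in a retry on asyncio.TimeoutError and reset the progress bar between attempts. A minimal, self-contained sketch of that idea (simplified names and stub helpers, not the exact code in packages/service-library/src/servicelib/docker_utils.py):

```python
# Minimal sketch of the retry-and-reset idea (simplified names, not the PR's
# exact code); tenacity is used, as in the diff discussed below.
import asyncio

from tenacity import (
    AsyncRetrying,
    retry_if_exception_type,
    stop_after_attempt,
    wait_random_exponential,
)


class _ProgressBarStub:
    """Stand-in for servicelib's progress bar; only what this sketch needs."""

    def reset_progress(self) -> None:
        print("progress reset to 0")


async def _pull_image(image: str, progress_bar: _ProgressBarStub) -> None:
    # placeholder for the real streaming docker pull; may raise asyncio.TimeoutError
    print(f"pulling {image}")


async def pull_with_retry(
    image: str, progress_bar: _ProgressBarStub, *, attempts: int = 10
) -> None:
    async for attempt in AsyncRetrying(
        wait=wait_random_exponential(),
        stop=stop_after_attempt(attempts),
        reraise=True,
        retry=retry_if_exception_type(asyncio.TimeoutError),
    ):
        with attempt:
            # start each attempt from a clean progress bar so the displayed
            # progress only ever moves forward within an attempt
            progress_bar.reset_progress()
            await _pull_image(image, progress_bar)
```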

How is this shown to the user?

no retry occurred
Screenshot 2025-02-03 at 15 00 45

with retry
Screenshot 2025-02-03 at 15 45 52

Related issue/s

How to test

  • point the registry in your deployment to master-zmt and try to open a notebook
  • the progress bar always goes up (never goes down)
  • there are no "Progress already reached maximum" error logs in the dy-sidecar
  • if the error occurs, you will find multiple attempt='%s' to pull image='%s' log entries

Dev-ops checklist

@GitHK GitHK self-assigned this Jan 16, 2025
@GitHK GitHK added the a:services-library (issues on packages/service-libs), a:dynamic-sidecar (dynamic-sidecar service), t:maintenance (Some planned maintenance work), and bug (buggy, it does not work as expected) labels Jan 16, 2025
@GitHK GitHK added this to the Singularity milestone Jan 16, 2025
@GitHK GitHK marked this pull request as ready for review January 16, 2025 15:46
@GitHK GitHK requested a review from pcrespov as a code owner January 16, 2025 15:46
Member

@sanderegg sanderegg left a comment


I added a few comments. I would like you to reconsider the user-friendliness of the progress bar changes; we can discuss.

@@ -197,6 +197,9 @@ async def update(self, steps: float = 1) -> None:
        await self._update_parent(parent_update_value)
        await self._report_external(new_progress_value)

    def reset_progress(self) -> None:
        self._current_steps = _INITIAL_VALUE
Member


Is that what you mean by the progress never going "down"?
You silently reset the progress and the user does not know about it.

Let's discuss the following use case:

  • a large image of 20-60 GiB,
  • pulling it takes a very long time, and the progress is at 80%,
  • now an error comes and it retries; if I get that right, you reset the progress bar, which kind of makes sense,
    --> but the displayed progress stays at 80%??
    --> what does the user see until it eventually reaches 80% again? nothing?
    If that is the case, I don't think that is a nice user experience at all:
    --> it makes users not believe our progress bars at all,
    --> it probably makes the user shut down the website and call it a day; it's too much waiting without feedback.

As I said, I think what makes sense, if I am a user, is:

  • the progress bar goes back to 0,
  • I get a message saying "sorry, something bad happened while pulling your image, we will try again, please be patient...".

Member


I think one option when you reset is to change the message by adding something like:

new_description = current_progress_bar.description + f" (retrying {attempt})"
current_progress_bar.reset_progress(new_message=new_description)

and reset_progress should then call set_ so that the update is transmitted
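
A minimal sketch of what such a reset could look like; the attribute names mirror the diff above (_current_steps, _INITIAL_VALUE, _report_external), but the new_message handling, the async signature, and the stripped-down class are assumptions based on this suggestion, not the PR's code:

```python
# Sketch under assumptions: a stripped-down progress bar showing how
# reset_progress could also transmit the reset; not the PR's implementation.
_INITIAL_VALUE = 0.0  # placeholder; the real module defines its own constant


class _ProgressBarSketch:
    def __init__(self, description: str) -> None:
        self.description = description
        self._current_steps = _INITIAL_VALUE

    async def _report_external(self, value: float) -> None:
        # stand-in for the real external reporting of progress updates
        print(f"{self.description}: {value:.0%}")

    async def reset_progress(self, *, new_message: str | None = None) -> None:
        self._current_steps = _INITIAL_VALUE
        if new_message is not None:
            self.description = new_message  # e.g. "pulling image (retrying 2)"
        # transmit the reset so the frontend shows 0% together with the new message
        await self._report_external(0.0)
```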

Contributor Author


I took the opportunity to extend the progress bar to also include retry attempts. These are also now displayed directly by the frontend.
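
A hedged sketch of how an attempt count could travel alongside the progress value so the frontend can render it; the model and field names here are hypothetical illustrations, not the actual models_library changes:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProgressReportSketch:
    # hypothetical example, not the PR's actual model
    percent: float  # fraction in [0, 1] of the current attempt's progress
    attempt: int = 1  # 1 on the first try; the frontend can show "retrying (N)" when > 1
```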

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see the screenshots above in the description.

packages/service-library/tests/test_progress_bar.py (outdated review thread, resolved)
packages/service-library/src/servicelib/docker_utils.py (outdated review thread, resolved)
Comment on lines 265 to 278
async for attempt in AsyncRetrying(
    wait=wait_random_exponential(),
    stop=stop_after_attempt(retry_upon_error_count),
    reraise=True,
    retry=retry_if_exception_type(asyncio.TimeoutError),
):
    try:
        parsed_progress = TypeAdapter(_DockerPullImage).validate_python(
            pull_progress
        )
    except ValidationError:
        _logger.exception(
            "Unexpected error while validating '%s'. "
            "TIP: This is probably an unforeseen pull status text that shall be added to the code. "
            "The pulling process will still continue.",
            f"{pull_progress=}",
        )
    else:
        await _parse_pull_information(
            parsed_progress, layer_id_to_size=layer_id_to_size
        )

        # compute total progress
        total_downloaded_size = sum(
            layer.downloaded for layer in layer_id_to_size.values()
        )
        total_extracted_size = sum(
            layer.extracted for layer in layer_id_to_size.values()
        )
        total_progress = (total_downloaded_size + total_extracted_size) / 2.0
        progress_to_report = total_progress - reported_progress
        await progress_bar.update(progress_to_report)
        reported_progress = total_progress

        await log_cb(
            f"pulling {image_short_name}: {pull_progress}...",
            logging.DEBUG,
        )

    # ... (lines omitted in this review excerpt)

    # each time there is an error progress starts from zero
    progress_bar.reset_progress()
    _logger.info(
        "attempt='%s' to pull image='%s'",
        attempt.retry_state.attempt_number,
        image,
    )

Member


  1. I thought we already discussed that using AsyncRetrying like this creates bad logs where you do not see which function is being retried, so I would recommend creating a separate function and using the retry decorator instead (see the sketch after this list).
  2. You do not log a warning like we usually do, which would also make the log at line 273 unnecessary.
  3. You should not reset the progress on the very first attempt, as a consequence of the discussion in the progress bar module.
  4. I personally think that 10 retries for pulling images is too much. Pulling a large image 10 times is equivalent to infinity. I guess 3 is already more than enough; people will get bored before it has retried 3 times.
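
For illustration, a minimal sketch of the decorator-based alternative from point 1, assuming tenacity; `_pull_single_image` and its body are hypothetical placeholders, and `stop_after_attempt(3)` reflects the count suggested in point 4:

```python
import asyncio
import logging

from tenacity import (
    before_sleep_log,
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_random_exponential,
)

_logger = logging.getLogger(__name__)


@retry(
    wait=wait_random_exponential(),
    stop=stop_after_attempt(3),  # 3 attempts, as suggested in point 4
    reraise=True,
    retry=retry_if_exception_type(asyncio.TimeoutError),
    # emits a WARNING naming the retried function before each sleep,
    # which makes the retry visible without a separate manual log line
    before_sleep=before_sleep_log(_logger, logging.WARNING),
)
async def _pull_single_image(image: str) -> None:
    # hypothetical placeholder: the real body streams the docker pull
    # and updates the progress bar
    _logger.debug("pulling image=%s", image)
```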

Contributor Author


  1. Wrapped in a function for better error messages.
  2. Redesigned; let's see if it's OK now.
  3. This one depends a bit on the image. I managed to get the same number of errors on some images, but regardless of the size, 3 might not be enough. Since these are timeout errors, it is most likely safe to continue pulling.


codecov bot commented Jan 30, 2025

Codecov Report

Attention: Patch coverage is 95.00000% with 2 lines in your changes missing coverage. Please review.

Project coverage is 87.58%. Comparing base (1dcd369) to head (a4a2526).
Report is 1 commit behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7051      +/-   ##
==========================================
- Coverage   87.69%   87.58%   -0.12%     
==========================================
  Files        1635     1463     -172     
  Lines       63943    60042    -3901     
  Branches     1178      959     -219     
==========================================
- Hits        56075    52586    -3489     
+ Misses       7556     7204     -352     
+ Partials      312      252      -60     
Flag Coverage Δ
integrationtests 64.66% <100.00%> (+<0.01%) ⬆️
unittests 85.76% <92.50%> (-0.22%) ⬇️
Components Coverage Δ
api ∅ <ø> (∅)
pkg_aws_library ∅ <ø> (∅)
pkg_dask_task_models_library ∅ <ø> (∅)
pkg_models_library 91.51% <100.00%> (ø)
pkg_notifications_library ∅ <ø> (∅)
pkg_postgres_database ∅ <ø> (∅)
pkg_service_integration 70.03% <ø> (ø)
pkg_service_library 74.46% <93.93%> (+0.06%) ⬆️
pkg_settings_library ∅ <ø> (∅)
pkg_simcore_sdk 85.46% <100.00%> (-0.05%) ⬇️
agent 96.46% <ø> (ø)
api_server 90.55% <ø> (ø)
autoscaling 96.08% <ø> (ø)
catalog 90.33% <ø> (ø)
clusters_keeper 99.24% <ø> (ø)
dask_sidecar 91.25% <ø> (-0.02%) ⬇️
datcore_adapter 93.19% <ø> (ø)
director 76.92% <ø> (ø)
director_v2 91.27% <ø> (-0.02%) ⬇️
dynamic_scheduler 97.21% <ø> (ø)
dynamic_sidecar 89.74% <ø> (-0.01%) ⬇️
efs_guardian 90.25% <ø> (ø)
invitations 93.28% <ø> (ø)
osparc_gateway_server ∅ <ø> (∅)
payments 92.66% <ø> (ø)
resource_usage_tracker 88.73% <ø> (-0.11%) ⬇️
storage 89.57% <ø> (ø)
webclient ∅ <ø> (∅)
webserver 86.23% <ø> (+0.02%) ⬆️

Continue to review full report in Codecov by Sentry.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1dcd369...a4a2526. Read the comment docs.


sonarqubecloud bot commented Feb 3, 2025

Member

@odeimaiz odeimaiz left a comment


ok

@GitHK GitHK requested a review from sanderegg February 4, 2025 07:54
Labels
a:dynamic-sidecar (dynamic-sidecar service), a:services-library (issues on packages/service-libs), bug (buggy, it does not work as expected), t:maintenance (Some planned maintenance work)
Projects
None yet

6 participants