exp hangs forever when logging images #858

Open
f-fuchs opened this issue Jan 23, 2025 · 6 comments
Labels
bug Did we break something?

Comments

@f-fuchs

f-fuchs commented Jan 23, 2025

Hey,

I started logging images, more specifically matplotlib figures, using log_image.
Everything seemed to be working, but yesterday I started two experiments and they both got stuck in their last epoch.
When I checked the logs, both of them reported the following RuntimeError:

Exception in thread Thread-3 (worker):
Traceback (most recent call last):
  File "/home/fuchsfa/.local/share/uv/python/cpython-3.12.7-linux-x86_64-gnu/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
    self.run()
  File "/home/fuchsfa/.local/share/uv/python/cpython-3.12.7-linux-x86_64-gnu/lib/python3.12/threading.py", line 1012, in run
    self._target(*self._args, **self._kwargs)
  File "/home/fuchsfa/wavelet-extraction-model/.venv/lib/python3.12/site-packages/dvclive/live.py", line 911, in worker
    post_to_studio(item, "data")
  File "/home/fuchsfa/wavelet-extraction-model/.venv/lib/python3.12/site-packages/dvclive/utils.py", line 182, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/fuchsfa/wavelet-extraction-model/.venv/lib/python3.12/site-packages/dvclive/studio.py", line 114, in post_to_studio
    metrics, params, plots = get_studio_updates(live)
                             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/fuchsfa/wavelet-extraction-model/.venv/lib/python3.12/site-packages/dvclive/studio.py", line 82, in get_studio_updates
    plots_to_send.update(_adapt_images(live))
                         ^^^^^^^^^^^^^^^^^^^
  File "/home/fuchsfa/wavelet-extraction-model/.venv/lib/python3.12/site-packages/dvclive/studio.py", line 56, in _adapt_images
    for image in live._images.values()
                 ^^^^^^^^^^^^^^^^^^^^^
RuntimeError: dictionary changed size during iteration

So first off, why did this not crash and mark the experiment as failed, instead of running indefinitely?
Secondly, what is happening to cause the error?

The error does not occur when using dvc repro, which makes sense because it originates in the post_to_studio method.

Also, is there a way for me to stop posting to Studio for just this repository?

@shcheklein
Member

So first off, why did this not crash and mark the experiment as failed, instead of running indefinitely?
Secondly, what is happening to cause the error?

Possibly because the Live thread is made to be isolated; we would need to check this.

When I checked the logs, both of them reported the following RuntimeError:

could you share some code that can reproduce this?

Also, is there a way for me to stop posting to Studio for just this repository?

Give DVC_STUDIO_OFFLINE a try to disable it, or use dvc config studio.offline.
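
For reference, a minimal sketch of the environment-variable route. This assumes DVC_STUDIO_OFFLINE is read at runtime, so setting it in the process before constructing Live is enough to skip Studio requests for that run; the repository config is left untouched:

import os

# Assumption: setting DVC_STUDIO_OFFLINE before Live is created disables
# Studio posting for this process only.
os.environ["DVC_STUDIO_OFFLINE"] = "true"

from dvclive import Live

with Live() as live:
    live.log_metric("loss", 0.1)
    live.next_step()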

@shcheklein shcheklein added awaiting response we are waiting for your reply, please respond! :) bug Did we break something? labels Jan 24, 2025
@shcheklein
Member

Okay, I think I know where the issue is. get_studio_updates is called from a different thread and collects data to send to Studio, while the main thread keeps running and might be updating Live instance values. This can lead to race conditions like the one in this ticket, but also to inconsistencies in the data we are sending, or even to some data being skipped altogether.

I think the best solution is to "package" a snapshot of the data to send on the main thread and queue that, instead of queueing just a reference to the Live instance.
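
A minimal, self-contained sketch (not dvclive code) of both the race and the snapshot idea; the dict and queue names below are made up for illustration:

import queue
import threading

# `images` stands in for live._images: the main thread keeps adding to it
# while a background worker posts updates.
images = {}
work = queue.Queue()

def worker():
    while True:
        snapshot = work.get()
        if snapshot is None:
            break
        # Safe: this dict is a copy taken on the main thread, so later
        # main-thread updates cannot change its size mid-iteration.
        # (Iterating `images` itself here could raise
        # "RuntimeError: dictionary changed size during iteration".)
        list(snapshot.values())

t = threading.Thread(target=worker)
t.start()

for step in range(1000):
    images[f"img_{step}.png"] = object()  # main thread logs another image
    work.put(dict(images))                # queue a snapshot, not a reference
work.put(None)
t.join()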

@shcheklein shcheklein removed the awaiting response we are waiting for your reply, please respond! :) label Jan 25, 2025
@f-fuchs
Author

f-fuchs commented Jan 27, 2025

Do you still need code to reproduce it? I probably can't share the original code and would have to recreate a minimal example, hoping that it triggers the error as well.

@shcheklein
Member

No, I think I understand where the problem is. I've prepared PR #860; I think it solves the root cause, and we probably need one more PR to prevent experiments from hanging forever if an exception happens.
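
For illustration only (not the actual PR), one generic way to keep a logging worker from hanging the whole run is to catch exceptions in the worker and surface them to the main thread, with a bounded join, instead of letting the thread die silently; `post` below is a hypothetical placeholder for the real network call:

import queue
import threading

work = queue.Queue()
worker_error = None

def post(item):
    """Placeholder for the real 'send to Studio' call."""

def worker():
    global worker_error
    while True:
        item = work.get()
        if item is None:
            return
        try:
            post(item)
        except Exception as exc:   # don't let the thread die silently
            worker_error = exc
            return

t = threading.Thread(target=worker)
t.start()
work.put({"metric": 1})
work.put(None)
t.join(timeout=30)                # bounded join: never wait forever
if worker_error is not None:
    raise RuntimeError("logging worker failed") from worker_error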

@shcheklein
Member

@f-fuchs could you please give the main branch a try and see if that is still happening?

@f-fuchs
Author

f-fuchs commented Jan 30, 2025

Not sure I can reproduce it again. The repository this code came from should never have posted to Studio, since it is hosted in an internal GitLab repository. After this error I realized my setup was incorrect and removed the global Studio access.

Therefore the error is not really a problem for me anymore; I just wanted to let you know that there is a potential bug.

If you want to recreate it, I think the way I triggered it was having something like this:

from dvclive import Live

with Live() as live:
    for epoch in range(num_epochs):   # num_epochs and fig are placeholders
        # training code that produces a matplotlib figure `fig`
        if epoch % 10 == 0:
            live.log_image(f"epoch_{epoch}.png", fig)
        live.next_step()
    live.log_image("final.png", fig)

where two log_image calls could basically happen directly after one another if the last epoch was divisible by ten.
