exp hangs forever when logging images #858
Comments
can because
could you share some code that can reproduce this?
Give a |
Okay, I think I know where the issue is. I think the best solution is to "package" a snapshot of data to send in the main thread and queue it instead of the reference to the |
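A rough sketch of the "snapshot in the main thread" idea, not the actual fix in the library. The `state` dictionary, `take_snapshot`, and `sender` are hypothetical names made up for illustration. The point is only that the worker thread receives a private copy, so later changes made by the main thread cannot affect what is being sent.

```python
import copy
import queue
import threading


def take_snapshot(state):
    """Deep-copy the mutable state so later changes cannot affect the copy."""
    return copy.deepcopy(state)


def sender(q, sent):
    """Worker thread: consume snapshots until the None sentinel arrives."""
    while True:
        item = q.get()
        if item is None:
            break
        # Iterating is safe here: no other thread holds a reference to this copy.
        sent.append({name: value for name, value in item.items()})


def run_demo():
    state = {"step": 0, "images": []}  # hypothetical logger state
    q = queue.Queue()
    sent = []
    worker = threading.Thread(target=sender, args=(q, sent))
    worker.start()
    for step in range(3):
        state["step"] = step
        state["images"].append(f"plot_{step}.png")
        q.put(take_snapshot(state))  # the copy is made in the main thread
    q.put(None)  # tell the worker to stop
    worker.join()
    return sent
```

Queuing `state` itself instead of `take_snapshot(state)` would hand the worker a reference that the main thread keeps mutating, which is the kind of sharing the comment above suggests avoiding.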
Do you still need code to reproduce it? I probably can't share the original code and would have to recreate a minimal example, hoping that it triggers the error as well.
No, I think I understand where the problem is. I've prepared PR #860, which I think solves the root cause of the problem. We probably need one more PR to prevent experiments from hanging forever when an exception occurs.
@f-fuchs could you please give the main branch a try and see if that is still happening?
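For illustration, here is one general way a background worker can avoid blocking the main thread forever when a task raises. This is a sketch under assumptions, not the content of either PR: `post` is a hypothetical stand-in for whatever call fails, and the queue-based layout is assumed rather than taken from the library.

```python
import queue
import threading


def post(item):
    """Hypothetical stand-in for a network call that may raise."""
    if item == "bad":
        raise RuntimeError("simulated failure while posting")
    return item


def worker(q, done, errors):
    """Process items until the None sentinel; never leave a task unacknowledged."""
    while True:
        item = q.get()
        try:
            if item is None:
                return
            done.append(post(item))
        except Exception as exc:
            errors.append(exc)  # record the failure instead of dying silently
        finally:
            q.task_done()  # always runs, so q.join() cannot wait on a lost task


def run_demo():
    q = queue.Queue()
    done, errors = [], []
    t = threading.Thread(target=worker, args=(q, done, errors), daemon=True)
    t.start()
    for item in ["a", "bad", "b", None]:
        q.put(item)
    q.join()  # returns once every item has been acknowledged
    t.join(timeout=5)
    return done, errors, t.is_alive()
```

If the worker died on the exception without calling `task_done()`, `q.join()` in the main thread would wait indefinitely. The main thread can inspect `errors` afterwards and decide whether to fail the run.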
Not sure I can reproduce it again. The repository this code came from should never have posted to studio, since it is hosted on an internal GitLab repository. After this error I realized my setup was incorrect and removed the global studio access. So the error is not really a problem for me anymore; I just wanted to let you know that there is a potential bug. If you want to recreate it, I think the way I triggered it was having something like this
where two log_image calls could happen directly after one another if the last epoch was divisible by ten
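The original snippet is not preserved here. The following is only a hypothetical reconstruction of the pattern described, using a stand-in `RecordingLive` class that records calls instead of a real logger. The image names, the 1-based epoch counter, and the `train` function are assumptions for illustration.

```python
class RecordingLive:
    """Hypothetical stand-in for a logger; it only records the calls it gets."""

    def __init__(self):
        self.calls = []

    def log_image(self, name, image):
        self.calls.append(name)


def train(live, epochs):
    image = None
    for epoch in range(1, epochs + 1):
        image = f"figure for epoch {epoch}"  # placeholder for a matplotlib figure
        if epoch % 10 == 0:
            live.log_image(f"progress_{epoch}.png", image)  # periodic image
    # Final image after training. When epochs is divisible by ten, this call
    # directly follows the periodic one from the last epoch.
    live.log_image("final.png", image)
    return live.calls
```

With `epochs=20` the last two recorded calls are `progress_20.png` and `final.png` back to back; with `epochs=15` the final call is separated from the last periodic one by several epochs.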
Hey,
I started logging images, more specifically matplotlib figures, using log_image.
Everything seemed to be working, but yesterday I started two experiments and both got stuck in their last epoch.
When I checked the logs, both of them reported the following RuntimeError:
So first off, why did this not crash and mark the experiment as failed, instead of running indefinitely?
Secondly, what is happening to cause the error?
The error does not occur using dvc repro, which makes sense because it occurs in the post_to_studio method.
Also, is there a way for me to stop posting to studio for just this repository?