Potential incompatibility with keras model checkpointing #23
Comments
Managed to fix this issue by slightly adapting @tstandley's solution in #3. Can confirm that the original
@munsanje do you have example code for the solution? I have a similar problem and unfortunately don't understand where to put the code you mentioned.
@CeadeS Yeah sure. I modified the last segment of the module with the code:
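The code block from this reply did not survive in the page text. Going by the surrounding discussion (checkpointing ends up saving the parallel wrapper rather than the shared-weight template model), a sketch of the kind of change being described might look like the following. It assumes the last segment of `make_parallel` in `multi_gpu.py` merges the per-GPU outputs on the CPU and returns a new `Model`; the delegation helpers (`checkpoint_save`, `checkpoint_save_weights`) are illustrative names, not the poster's actual code.

```python
    # merge outputs on CPU (as in the original module)
    with tf.device('/cpu:0'):
        merged = []
        for outputs in outputs_all:
            merged.append(merge(outputs, mode='concat', concat_axis=0))

        new_model = Model(input=model.inputs, output=merged)

        # The GPU replicas share weights with the original `model`, so
        # saving `model` captures everything ModelCheckpoint needs while
        # skipping the slicing/merging layers of the parallel graph.
        # Sketch only -- not the verbatim fix posted in the thread.
        def checkpoint_save(filepath, overwrite=True):
            model.save(filepath, overwrite=overwrite)

        def checkpoint_save_weights(filepath, overwrite=True):
            model.save_weights(filepath, overwrite=overwrite)

        new_model.save = checkpoint_save
        new_model.save_weights = checkpoint_save_weights
        return new_model
```

With this change, a `ModelCheckpoint` callback attached to the parallel model transparently writes out the single-GPU template model instead, so the checkpoint files can be loaded and evaluated without the parallel graph.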
I recently adopted the multi_gpu module to parallelize learning across multiple GPUs. On 8 K80 Teslas I get a speed-up of roughly 4x, and learning appears to take place, as the loss goes down per iteration. However, when I actually test the model and visualize the results, it performs exactly as it did without training. Previously, at the same loss I achieved while training with multi_gpu, I'd get drastically different performance. I've been working with this model for months and have proven the learnability of the problem and the soundness of the architecture, so these results make no sense. I'm using Keras's built-in ModelCheckpoint callback to automatically save my model after every epoch in which the validation loss has decreased. My guess is that there is a silent conflict between how the model is saved and this module. Any help debugging this would be greatly appreciated.