Potential incompatibility with keras model checkpointing #23

Closed · munsanje opened this issue Aug 10, 2017 · 3 comments

Comments

munsanje commented Aug 10, 2017

I recently adopted the multi_gpu module to parallelize training across multiple GPUs. On 8 Tesla K80s I get a speed-up of roughly 4x, and learning appears to take place, since the loss goes down each iteration. However, when I actually test the saved model and visualize the results, it performs exactly as if it had never been trained. Previously, at the same loss I now reach with multi_gpu, I got drastically better performance. I've been working with this model for months, so the learnability of the problem and the soundness of the architecture are well established; these results make no sense. I'm using Keras's built-in ModelCheckpoint callback to automatically save the model after every epoch in which the validation loss has decreased. My guess is that there is a silent conflict between this module and how the model is saved. Any help debugging this would be greatly appreciated.
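
For context, this is roughly the training setup in which the problem shows up. It's only a minimal sketch: the toy model, the random data, and the make_parallel import path are placeholders for my actual code, not an exact reproduction.

    import numpy as np
    from keras.models import Sequential
    from keras.layers import Dense
    from keras.callbacks import ModelCheckpoint
    from multi_gpu import make_parallel  # placeholder import for this repo's multi_gpu module

    # Original single-GPU model (a toy stand-in for the real architecture).
    model = Sequential([Dense(32, input_dim=64, activation='relu'), Dense(1)])

    # Parallel model used to train on 8 GPUs.
    parallel_model = make_parallel(model, 8)
    parallel_model.compile(optimizer='adam', loss='mse')

    x, y = np.random.rand(256, 64), np.random.rand(256, 1)

    # ModelCheckpoint calls save() on the model it is attached to, i.e. the
    # parallel model, which is where I suspect the silent conflict lives.
    checkpoint = ModelCheckpoint('best_model.h5', monitor='val_loss', save_best_only=True)
    parallel_model.fit(x, y, validation_split=0.2, nb_epoch=10, callbacks=[checkpoint])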

munsanje (Author) commented

Managed to fix this issue by slightly adapting @tstandley's solution in #3, which confirms that the original multi_gpu code was the cause. Redefining both model.save and model.save_weights as described in #3 solved the problem.
Code:

    # model is the original single-GPU model and new_model is the parallel
    # model built from it (see the full context in a later comment)
    save_model_function = type(model.save)

    def save_old_model(self_, model_path, overwrite=True):
        model.save(model_path, overwrite)

    new_model.save = save_model_function(save_old_model, new_model)

    # update weight saving scheme to save underlying model weights
    save_weights_function = type(model.save_weights)

    def save_old_weights(self_, weights_path, overwrite=True):
        model.save_weights(weights_path, overwrite)

    new_model.save_weights = save_weights_function(save_old_weights, new_model)
    return new_model
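
For what it's worth, the same override can also be written with types.MethodType, which some may find easier to read. This is just an equivalent sketch of the snippet above, still assuming model is the original template and new_model is the parallel model built from it:

    import types

    def save_underlying_model(self_, filepath, overwrite=True):
        # Delegate to the wrapped single-GPU model so that checkpoints store
        # its topology and weights rather than the parallel wrapper's.
        model.save(filepath, overwrite)

    def save_underlying_weights(self_, filepath, overwrite=True):
        model.save_weights(filepath, overwrite)

    new_model.save = types.MethodType(save_underlying_model, new_model)
    new_model.save_weights = types.MethodType(save_underlying_weights, new_model)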


CeadeS commented Aug 21, 2017

@munsanje do you have example code for the solution? I have a similar problem and unfortunately don't understand where to put the code you mentioned.

munsanje (Author) commented

@CeadeS Yeah, sure. I modified the last segment of the module with the following code:

# merge outputs on CPU
with tf.device('/cpu:0'):
    merged = []
    for outputs in outputs_all:
        merged.append(merge(outputs, mode='concat', concat_axis=0))

    # update model saving scheme to save underlying model rather than parallel
    new_model = Model(input=model.inputs, output=merged)
    save_model_function = type(model.save)

    def save_old_model(self_, model_path, overwrite=True):
        model.save(model_path, overwrite)

    new_model.save = save_model_function(save_old_model, new_model)

    # update weight saving scheme to save underlying model weights
    save_weights_function = type(model.save_weights)

    def save_old_weights(self_, weights_path, overwrite=True):
        model.save_weights(weights_path, overwrite)

    new_model.save_weights = save_weights_function(save_old_weights, new_model)
    return new_model
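
With these overrides in place, the file written by ModelCheckpoint during parallel training describes the underlying single-GPU model, so it can be reloaded directly for evaluation. A quick usage sketch (the filename is whatever you passed to ModelCheckpoint):

    from keras.models import load_model

    # The checkpoint now contains the underlying model, without any of the
    # parallel model's slicing layers, so it loads and predicts as usual.
    single_gpu_model = load_model('best_model.h5')
    single_gpu_model.summary()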
