Potential incompatibility with keras model checkpointing #23

Closed · munsanje opened this issue Aug 10, 2017 · 3 comments

Comments

munsanje commented Aug 10, 2017

I recently adopted the multi_gpu module to parallelize training across multiple GPUs. On 8 Tesla K80s I get a speed-up of roughly 4x, and learning appears to take place, since the loss goes down each iteration. However, when I actually test the saved model and visualize the results, it performs exactly as if it had never been trained. Previously, at the same loss I now reach with multi_gpu, I got drastically better performance. I've been working with this model for months, so the learnability of the problem and the soundness of the architecture are well established; these results make no sense. I'm using Keras's built-in ModelCheckpoint callback to automatically save the model after every epoch in which the validation loss has decreased. My guess is that there is a silent conflict between this module and how the model is saved. Any help debugging this would be greatly appreciated.
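
For context, this is roughly the training setup in which the problem shows up. It's only a minimal sketch: the toy model, the random data, and the make_parallel import path are placeholders for my actual code, not an exact reproduction.

    import numpy as np
    from keras.models import Sequential
    from keras.layers import Dense
    from keras.callbacks import ModelCheckpoint
    from multi_gpu import make_parallel  # placeholder import for this repo's multi_gpu module

    # Original single-GPU model (a toy stand-in for the real architecture).
    model = Sequential([Dense(32, input_dim=64, activation='relu'), Dense(1)])

    # Parallel model used to train on 8 GPUs.
    parallel_model = make_parallel(model, 8)
    parallel_model.compile(optimizer='adam', loss='mse')

    x, y = np.random.rand(256, 64), np.random.rand(256, 1)

    # ModelCheckpoint calls save() on the model it is attached to, i.e. the
    # parallel model, which is where I suspect the silent conflict lives.
    checkpoint = ModelCheckpoint('best_model.h5', monitor='val_loss', save_best_only=True)
    parallel_model.fit(x, y, validation_split=0.2, nb_epoch=10, callbacks=[checkpoint])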

munsanje (Author) commented

Managed to fix this issue by slightly adapting @tstandley's solution in #3, which confirms that the original multi_gpu code was the cause. Redefining both model.save and model.save_weights as described in #3 solved the problem.
Code:

    # model is the original single-GPU model and new_model is the parallel
    # model built from it (see the full context in a later comment)
    save_model_function = type(model.save)

    def save_old_model(self_, model_path, overwrite=True):
        model.save(model_path, overwrite)

    new_model.save = save_model_function(save_old_model, new_model)

    # update weight saving scheme to save underlying model weights
    save_weights_function = type(model.save_weights)

    def save_old_weights(self_, weights_path, overwrite=True):
        model.save_weights(weights_path, overwrite)

    new_model.save_weights = save_weights_function(save_old_weights, new_model)
    return new_model
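
For what it's worth, the same override can also be written with types.MethodType, which some may find easier to read. This is just an equivalent sketch of the snippet above, still assuming model is the original template and new_model is the parallel model built from it:

    import types

    def save_underlying_model(self_, filepath, overwrite=True):
        # Delegate to the wrapped single-GPU model so that checkpoints store
        # its topology and weights rather than the parallel wrapper's.
        model.save(filepath, overwrite)

    def save_underlying_weights(self_, filepath, overwrite=True):
        model.save_weights(filepath, overwrite)

    new_model.save = types.MethodType(save_underlying_model, new_model)
    new_model.save_weights = types.MethodType(save_underlying_weights, new_model)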


CeadeS commented Aug 21, 2017

@munsanje do you have example code for the solution? I have a similar problem and unfortunately don't understand where to put the code you mentioned.

munsanje (Author) commented

@CeadeS Yeah, sure. I modified the last segment of the module with the following code:

# merge outputs on CPU
with tf.device('/cpu:0'):
    merged = []
    for outputs in outputs_all:
        merged.append(merge(outputs, mode='concat', concat_axis=0))

    # update model saving scheme to save underlying model rather than parallel
    new_model = Model(input=model.inputs, output=merged)
    save_model_function = type(model.save)

    def save_old_model(self_, model_path, overwrite=True):
        model.save(model_path, overwrite)

    new_model.save = save_model_function(save_old_model, new_model)

    # update weight saving scheme to save underlying model weights
    save_weights_function = type(model.save_weights)

    def save_old_weights(self_, weights_path, overwrite=True):
        model.save_weights(weights_path, overwrite)

    new_model.save_weights = save_weights_function(save_old_weights, new_model)
    return new_model
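
With these overrides in place, the file written by ModelCheckpoint during parallel training describes the underlying single-GPU model, so it can be reloaded directly for evaluation. A quick usage sketch (the filename is whatever you passed to ModelCheckpoint):

    from keras.models import load_model

    # The checkpoint now contains the underlying model, without any of the
    # parallel model's slicing layers, so it loads and predicts as usual.
    single_gpu_model = load_model('best_model.h5')
    single_gpu_model.summary()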
