multi_gpu error #6

Open
gattia opened this issue Jan 31, 2017 · 6 comments
Comments


gattia commented Jan 31, 2017

Hi,

I've tried to use the multi_gpu utility to parallelize a convolutional neural network based on U-Net. I'm running on a g2.8xlarge to take advantage of its 4 GPUs.

Anyway, when I try to run the code I get an error. Below are the full traceback and the function used to define the model and call the multi_gpu function (make_parallel); the call to make_parallel is near the very end of the script/this post.

This may be something simple, but I have no experience with tf and am just starting with keras. Any suggestions would be greatly appreciated.

Thanks,

Anthony.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-20-b2175f08a7a9> in <module>()
----> 1 modelTest = get_unet()
      2 print(modelTest.summary())

<ipython-input-18-7318b7dd3672> in get_unet()
     40 
     41     model = Model(input=inputs, output=conv10)
---> 42     model = make_parallel(model,3)
     43     model.compile(optimizer=Adam(lr=1e-5), loss=dice_coef_loss, metrics=[dice_coef])
     44 

/vol/programs/keras-extras/utils/multi_gpu.pyc in make_parallel(model, gpu_count)
     29                     inputs.append(slice_n)
     30 
---> 31                 outputs = model(inputs)
     32 
     33                 if not isinstance(outputs, list):

/home/ubuntu/miniconda/lib/python2.7/site-packages/Keras-1.0.4-py2.7.egg/keras/engine/topology.pyc in __call__(self, x, mask)
    483         if inbound_layers:
    484             # this will call layer.build() if necessary
--> 485             self.add_inbound_node(inbound_layers, node_indices, tensor_indices)
    486             input_added = True
    487 

/home/ubuntu/miniconda/lib/python2.7/site-packages/Keras-1.0.4-py2.7.egg/keras/engine/topology.pyc in add_inbound_node(self, inbound_layers, node_indices, tensor_indices)
    541         # creating the node automatically updates self.inbound_nodes
    542         # as well as outbound_nodes on inbound layers.
--> 543         Node.create_node(self, inbound_layers, node_indices, tensor_indices)
    544 
    545     def get_output_shape_for(self, input_shape):

/home/ubuntu/miniconda/lib/python2.7/site-packages/Keras-1.0.4-py2.7.egg/keras/engine/topology.pyc in create_node(cls, outbound_layer, inbound_layers, node_indices, tensor_indices)
    146 
    147         if len(input_tensors) == 1:
--> 148             output_tensors = to_list(outbound_layer.call(input_tensors[0], mask=input_masks[0]))
    149             output_masks = to_list(outbound_layer.compute_mask(input_tensors[0], input_masks[0]))
    150             # TODO: try to auto-infer shape if exception is raised by get_output_shape_for

/home/ubuntu/miniconda/lib/python2.7/site-packages/Keras-1.0.4-py2.7.egg/keras/engine/topology.pyc in call(self, input, mask)
   1920             return self._output_tensor_cache[cache_key]
   1921         else:
-> 1922             output_tensors, output_masks, output_shapes = self.run_internal_graph(inputs, masks)
   1923             return output_tensors
   1924 

/home/ubuntu/miniconda/lib/python2.7/site-packages/Keras-1.0.4-py2.7.egg/keras/engine/topology.pyc in run_internal_graph(self, inputs, masks)
   2062                     if len(computed_data) == 1:
   2063                         computed_tensor, computed_mask = computed_data[0]
-> 2064                         output_tensors = to_list(layer.call(computed_tensor, computed_mask))
   2065                         output_masks = to_list(layer.compute_mask(computed_tensor, computed_mask))
   2066                         computed_tensors = [computed_tensor]

/home/ubuntu/miniconda/lib/python2.7/site-packages/Keras-1.0.4-py2.7.egg/keras/layers/convolutional.pyc in call(self, x, mask)
   1062     def call(self, x, mask=None):
   1063         return K.resize_images(x, self.size[0], self.size[1],
-> 1064                                self.dim_ordering)
   1065 
   1066     def get_config(self):

/home/ubuntu/miniconda/lib/python2.7/site-packages/Keras-1.0.4-py2.7.egg/keras/backend/tensorflow_backend.pyc in resize_images(X, height_factor, width_factor, dim_ordering)
    506         X = tf.image.resize_nearest_neighbor(X, new_shape)
    507         X = permute_dimensions(X, [0, 3, 1, 2])
--> 508         X.set_shape((None, None, original_shape[2] * height_factor, original_shape[3] * width_factor))
    509         return X
    510     elif dim_ordering == 'tf':

TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'
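For anyone hitting the same traceback: the failure happens inside resize_images, which multiplies the static shape of the tensor by the upsampling factor. A plain-Python sketch of the failure mode (assumption: after make_parallel slices the batch across GPUs, Keras can no longer infer the static shape, so original_shape contains None entries):

```python
# Minimal, dependency-free reproduction of the TypeError above.
# Assumption: the sliced tensor's static shape is partially unknown,
# i.e. original_shape holds None for the dims Keras could not infer.
original_shape = (None, 32, None, None)  # (batch, channels, rows, cols)
height_factor, width_factor = 2, 2

try:
    # This mirrors what resize_images attempts in tensorflow_backend.py:
    new_shape = (None, None,
                 original_shape[2] * height_factor,
                 original_shape[3] * width_factor)
except TypeError as err:
    print(err)  # unsupported operand type(s) for *: 'NoneType' and 'int'
```

So the error isn't in get_unet itself; it's that UpSampling2D's shape bookkeeping assumes the spatial dimensions are statically known, and the slicing done by make_parallel breaks that assumption.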
def get_unet():
    inputs = Input((1, img_rows, img_cols))
    conv1 = Convolution2D(32, 3, 3, activation='relu', border_mode='same')(inputs)
    conv1 = Convolution2D(32, 3, 3, activation='relu', border_mode='same')(conv1)
    pool1 = MaxPooling2D(pool_size=(2, 2))(conv1)

    conv2 = Convolution2D(64, 3, 3, activation='relu', border_mode='same')(pool1)
    conv2 = Convolution2D(64, 3, 3, activation='relu', border_mode='same')(conv2)
    pool2 = MaxPooling2D(pool_size=(2, 2))(conv2)

    conv3 = Convolution2D(128, 3, 3, activation='relu', border_mode='same')(pool2)
    conv3 = Convolution2D(128, 3, 3, activation='relu', border_mode='same')(conv3)
    pool3 = MaxPooling2D(pool_size=(2, 2))(conv3)

    conv4 = Convolution2D(256, 3, 3, activation='relu', border_mode='same')(pool3)
    conv4 = Convolution2D(256, 3, 3, activation='relu', border_mode='same')(conv4)
    pool4 = MaxPooling2D(pool_size=(2, 2))(conv4)

    conv5 = Convolution2D(512, 3, 3, activation='relu', border_mode='same')(pool4)
    conv5 = Convolution2D(512, 3, 3, activation='relu', border_mode='same')(conv5)

    up6 = merge([UpSampling2D(size=(2, 2))(conv5), conv4], mode='concat', concat_axis=1)
    
    conv6 = Convolution2D(256, 3, 3, activation='relu', border_mode='same')(up6)
    conv6 = Convolution2D(256, 3, 3, activation='relu', border_mode='same')(conv6)

    up7 = merge([UpSampling2D(size=(2, 2))(conv6), conv3], mode='concat', concat_axis=1)
    conv7 = Convolution2D(128, 3, 3, activation='relu', border_mode='same')(up7)
    conv7 = Convolution2D(128, 3, 3, activation='relu', border_mode='same')(conv7)

    up8 = merge([UpSampling2D(size=(2, 2))(conv7), conv2], mode='concat', concat_axis=1)
    conv8 = Convolution2D(64, 3, 3, activation='relu', border_mode='same')(up8)
    conv8 = Convolution2D(64, 3, 3, activation='relu', border_mode='same')(conv8)

    up9 = merge([UpSampling2D(size=(2, 2))(conv8), conv1], mode='concat', concat_axis=1)
    conv9 = Convolution2D(32, 3, 3, activation='relu', border_mode='same')(up9)
    conv9 = Convolution2D(32, 3, 3, activation='relu', border_mode='same')(conv9)

    conv10 = Convolution2D(1, 1, 1, activation='sigmoid')(conv9)

    model = Model(input=inputs, output=conv10)
    model = make_parallel(model,3)
    
    model.compile(optimizer=Adam(lr=1e-5), loss=dice_coef_loss, metrics=[dice_coef])
    
    

    return model 
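One way around the crash, sketched below as a hypothetical guard rather than the actual upstream fix, is to propagate unknown dimensions instead of multiplying them, so a None dim stays None rather than raising:

```python
def scaled_dim(dim, factor):
    """Multiply a static dimension by an upsampling factor, propagating None.

    Hypothetical helper: if the backend's resize_images guarded its shape
    arithmetic like this, a dimension Keras could not infer (None) would
    simply remain unknown instead of raising a TypeError.
    """
    return None if dim is None else dim * factor
```

For example, `scaled_dim(None, 2)` returns None while `scaled_dim(240, 2)` returns 480. In practice this likely means patching the backend or ensuring the sliced tensors keep a fully defined static shape before they reach UpSampling2D.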
@jakubLangr

+1, same problem here. @gattia did you solve it in the end?


gattia commented Apr 2, 2017

@jakubLangr I actually didn't. My solution ended up being to use a p2.xlarge. They are a bit more expensive, but if you use spot instances you can get them for an average of about $0.14/hr, or at least that's been my experience for the last month or so.

jakubLangr commented Apr 2, 2017 via email

jakubLangr commented Apr 3, 2017 via email

gattia commented Apr 11, 2017

Sorry, I don't really have time to look into this and haven't been using multi-GPU at the moment. I think that once Keras is integrated into TensorFlow, distributed training will be relatively easy.

I will report back if I do get around to looking into this.

Anthony.

@zippeurfou

Any update on this? I am seeing a similar behavior.
