The speed of using multi gpus #21
Yes, I observe similar behavior. For a single GPU (GTX 1070) the time per epoch converges to 4 s, for 2 GPUs to 6 s, whereas the optimum would be 2 s. With more GPUs the time gets even worse. When I compared the code to the CIFAR10 TensorFlow tutorial, their code computes the gradients in parallel and then averages them on the CPU; this code computes only the predictions in parallel. When I logged the operation placement, the model parameters were not located on the GPU, and gpu:0 also had many more operations than gpu:1.
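For reference, a minimal sketch of how such placement logging can be enabled with the TensorFlow 1.x backend (the session setup here is an assumption for illustration, not code from this repo):

```python
# Sketch (TF 1.x + Keras TF backend): log where each op is placed, so you can
# see which device holds the variables and how ops are spread across GPUs.
import tensorflow as tf
from keras import backend as K

config = tf.ConfigProto(log_device_placement=True, allow_soft_placement=True)
K.set_session(tf.Session(config=config))
# Any model.compile() / model.fit() after this prints op placements to stderr,
# e.g. "dense_1/kernel: (VariableV2): /job:localhost/replica:0/task:0/gpu:0".
```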
Yes, I also see this. So if we want to "really" use multiple GPUs, do we have to use TensorFlow directly? I am wondering whether there is some way to improve multi-GPU performance in Keras.
I found an interesting observation and was actually able to make the Keras model parallelize well! The basic model has to be placed on the cpu:0 device; by default it is placed on gpu:0. Working example with an MNIST MLP:
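A minimal sketch of that setup (the dataset handling, layer sizes, and the make_parallel import path are assumptions; the point is only the tf.device('/cpu:0') placement of the base model):

```python
# Sketch (Keras + TF 1.x): build the base model on cpu:0, then replicate it
# across GPUs with make_parallel from multi_gpu.py.
import tensorflow as tf
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical
from multi_gpu import make_parallel   # assumed local import of this repo's script

(x_train, y_train), _ = mnist.load_data()
x_train = x_train.reshape(-1, 784).astype('float32') / 255
y_train = to_categorical(y_train, 10)

with tf.device('/cpu:0'):                      # master weights live on the CPU
    model = Sequential([
        Dense(512, activation='relu', input_shape=(784,)),
        Dense(10, activation='softmax'),
    ])

parallel_model = make_parallel(model, gpu_count=2)
parallel_model.compile(optimizer='rmsprop', loss='categorical_crossentropy',
                       metrics=['accuracy'])
parallel_model.fit(x_train, y_train, batch_size=256, epochs=10)
```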
After a few epochs the time per epoch stabilizes. Results were measured on GTX 1070 GPUs.
Ahh, in the case of 1 GPU we should leave the model on gpu:0, not cpu:0. Hm... with this fixed, the basic model on 1 GPU runs at 2 s/epoch (better than with more GPUs). The slow speed in the 1-GPU setting was actually due to the model running on the CPU.
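In code, that fix could be as simple as choosing the device conditionally (a sketch; build_model() is a hypothetical helper that constructs the Keras model):

```python
# Sketch: place the base model on cpu:0 only when it will be replicated;
# with a single GPU, build it directly on gpu:0.
import tensorflow as tf

gpu_count = 2                                   # illustrative
base_device = '/gpu:0' if gpu_count == 1 else '/cpu:0'

with tf.device(base_device):
    model = build_model()                       # hypothetical model constructor

if gpu_count > 1:
    model = make_parallel(model, gpu_count)
```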
I will check it and tell you the result.
It seems to me that the problem might be caused by the fact that only the predictions are computed in parallel. They are then moved to the parameter server (cpu:0 or gpu:0), and the gradients for the whole batch (gpu_count * batch_size) are computed on a single device! Computing gradients can be expensive. In the TensorFlow CIFAR10 tutorial they compute the gradients on each device, move them to the PS device, average them there, and update the weights. Another difference is that they do variable sharing explicitly via scopes; I'd assume Keras does something similar under the hood when applying the base model multiple times, but I'm not sure about that. Another possible pain point might be the slightly unusual GPU topology of our machine (many cards, PCIe riser); I'll try to run the experiments on a standard cloud VM. As nicely noted in the Caffe docs, we have two options for setting the batch size: either keep the total batch size and split it among the GPUs, or give each GPU the batch size that is optimal for a single GPU (so the total batch size grows to gpu_count times that).
In the first case we would expect a lower training time, but because small per-GPU batches are less efficient, it might actually be slower. The second case (giving each GPU the batch size optimal for one GPU) is therefore preferred. In that case we would expect the time per batch to stay the same after parallelizing to N GPUs, while the time per epoch would ideally drop to 1/N of the single-GPU time.
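The CIFAR10-tutorial pattern mentioned above, roughly sketched (TF 1.x; tower_loss(), batch_slices, and gpu_count are assumptions, and the variable-sharing details are simplified):

```python
# Sketch: compute gradients on each GPU tower, then average and apply them
# on the parameter-server device, in the spirit of the TF CIFAR10 tutorial.
import tensorflow as tf

optimizer = tf.train.GradientDescentOptimizer(0.01)
tower_grads = []

with tf.device('/cpu:0'):                                    # PS device
    for i in range(gpu_count):
        with tf.device('/gpu:%d' % i):
            with tf.variable_scope('model', reuse=(i > 0)):  # share weights
                loss = tower_loss(batch_slices[i])           # hypothetical
                tower_grads.append(optimizer.compute_gradients(loss))

    # Average each variable's gradient over the towers on the PS device.
    averaged = []
    for grads_and_vars in zip(*tower_grads):
        grads = [g for g, _ in grads_and_vars]
        var = grads_and_vars[0][1]
        averaged.append((tf.reduce_mean(tf.stack(grads), axis=0), var))

    train_op = optimizer.apply_gradients(averaged)
```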
@bzamecnik Thanks for looking into this; I thought this should be resolved since I patched Keras to pass colocate_gradients_with_ops=True to TensorFlow. I wonder if I missed a place in Keras or if the option isn't working the way I expected it to.
@kuza55 Aah, thanks for mentioning this! I saw your PR regarding colocate_gradients_with_ops.
Actually, now I see it was merged on 30 Aug 2016, so it should have been released in 1.1.0.
@bzamecnik Hey, just trying to understand your version of make_parallel(). Why do you concatenate the model outputs into one "merged" output? I sort of expected some kind of averaging operation. Is it because you train each output on its own batch slice of the data, and that automatically causes the weight updates to average since they're all applied to the same model/hidden layers?
Can somebody explain how gradients are calculated for such a model? It seems like parallelism only happens during the forward pass, unless TensorFlow magically parallelizes the backward pass in the same way; and if it does, where is the code that averages the gradients?
@shivamkalra To parallelize the backpropagation of gradients, the colocate_gradients_with_ops flag should have been set to true. This should ensure that the gradient ops run on the same device as the original ops.
@normanheckscher Thanks. Just trying to understand: so the gradients are calculated on each slice of the data during the backward pass on multiple GPUs simultaneously, and the final descent (parameter update) happens on the CPU? And is the final update the average of the gradients from all the slices?
Hi @kuza55, where did you pass colocate_gradients_with_ops=True to TensorFlow? I couldn't find it anywhere.
It's merged into Keras: the TensorFlow backend passes it when it calls tf.gradients.
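A sketch of what that call looks like in TF 1.x (the toy model here only gives the flag some context; loss and the variable are placeholders):

```python
# Sketch (TF 1.x): with colocate_gradients_with_ops=True the gradient of each
# op is placed on the same device as the forward op it differentiates.
import tensorflow as tf

x = tf.placeholder(tf.float32, shape=(None, 4))
with tf.device('/gpu:0'):
    w = tf.Variable(tf.ones((4, 1)))
    loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))

# Roughly what the Keras TF backend does when computing gradients:
grads = tf.gradients(loss, [w], colocate_gradients_with_ops=True)
```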
@kuza55 Thanks. When I set …
@shivamkalra What happens under the hood with gradients is a bit opaque, but there is indeed implicit gradient averaging. In the Keras code we just place the computation of each slice's outputs on its own GPU, then merge them on the CPU and compute the loss there. If I understand it correctly, this is what happens: the loss is the sum of the losses over the slices, and the gradient of the loss for a batch is the average of the gradients for each sample, i.e. also over the slices. Thanks to colocate_gradients_with_ops, the gradient ops for each slice should be placed on the same GPU as the corresponding forward ops.

As for exchanging gradients and weights between the PS device and the GPUs, it's very interesting to observe what TF does with implicit copies. I thought there would be one big transfer there and one back. Not so: not all weights/gradients are needed to compute each layer, so TF only implicitly copies what's necessary, and possibly overlaps some of those copies with earlier computations. This means the time cost of exchanging weights/gradients is lower than you might expect. The NVIDIA profiler and visual profiler are really good tools for exploring what is actually happening there.

@jiang1st I'm not sure. When I examined the runs with …
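A small self-contained check of that claim (TF 1.x; the toy weights and data are purely illustrative): the gradient of the mean loss over the concatenated slices equals the average of the per-slice gradients, provided the slices have equal size.

```python
# Sketch: gradient of the loss over concatenated slices == average of the
# per-slice gradients (for equally sized slices).
import tensorflow as tf

w = tf.Variable([1.0, 2.0])
x_a = tf.constant([[1.0, 0.0], [0.0, 1.0]])        # slice for "gpu:0"
x_b = tf.constant([[2.0, 1.0], [1.0, 2.0]])        # slice for "gpu:1"

y_a = tf.reduce_sum(x_a * w, axis=1)
y_b = tf.reduce_sum(x_b * w, axis=1)
merged = tf.concat([y_a, y_b], axis=0)             # like the merged model output

loss_merged = tf.reduce_mean(tf.square(merged))
loss_a = tf.reduce_mean(tf.square(y_a))
loss_b = tf.reduce_mean(tf.square(y_b))

g_merged = tf.gradients(loss_merged, w)[0]
g_avg = (tf.gradients(loss_a, w)[0] + tf.gradients(loss_b, w)[0]) / 2.0

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run([g_merged, g_avg]))             # both gradients are [7., 8.]
```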
Thanks @bzamecnik. Will try as you suggested.
I also experienced this problem yesterday. When I increased the batch_size, multi-GPU became faster than a single GPU. Maybe this is because increasing the batch_size increases the GPU computation per step, while the communication cost between CPU and GPU stays the same.
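A sketch of that adjustment (per_gpu_batch_size and the parallel_model from the earlier sketch are assumptions): keep the per-GPU batch size at the single-GPU optimum, so the effective batch grows with the number of GPUs.

```python
# Sketch: scale the batch passed to fit() with the number of GPUs, so each
# card keeps its optimal per-GPU batch size and stays busy.
per_gpu_batch_size = 256                      # whatever saturates one GPU
gpu_count = 2

parallel_model.fit(x_train, y_train,
                   batch_size=per_gpu_batch_size * gpu_count,
                   epochs=10)
```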
Hi, |
I use TensorFlow as the backend and multi_gpu.py for multi-GPU training. However, I find that the speed with two GPUs is almost the same as with one GPU. Besides, with one GPU the GPU utilization is almost 100%, but with two GPUs the utilization of each GPU is only about 40%-60%. How can I solve this problem?
My environment:
CPU: 40x Intel E5-2630 v4
Mem: 384GB
GPU: 4x NVIDIA GTX 1080 Ti