Float16 does not work #43
Comments
Hi, it might also not be worth it. If I am not wrong, float16 is artificially capped on gamer hardware, e.g. the GTX 1080, to laughable performance: roughly 30x slower GEMM. Not sure about the Titan X though.
We'll hopefully get access to some subset of Jade (22 DGX-1s, even though everybody lobbied them to buy normal Pascals), Peta5 (P100s on PCI Express), and Azure has a private beta for Pascals. Totally worth it for those.
Oh. In that case carry on :)
Using a learning rate 10 times smaller prevents the NaN, though I still get that strange warning, only during training. In terms of speed, training is slightly faster on our machines; I will try to benchmark on a P100 if I have the chance. I didn't measure accuracy.
Interesting. The thing is, it should not be faster. FP16 arithmetic is severely capped. We benchmarked cuBLAS hgemm vs sgemm on a GTX 1080 once, and hgemm was slower by a factor of 28x. From what I read, that's intentional.
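For reference, here is a minimal sketch (not the benchmark mentioned above) of how one could compare float16 and float32 GEMM throughput from Theano on the gpuarray backend; the matrix size and repetition count are arbitrary, and the result depends entirely on the GPU's native fp16 rate:

```python
import time
import numpy as np
import theano
import theano.tensor as T

def gemm_rate(dtype, n=4096, reps=20):
    """Rough GEMMs per second for n x n matrices of the given dtype."""
    # float16 shared variables need Theano's new gpuarray backend (device=cuda*).
    a = theano.shared(np.random.rand(n, n).astype(dtype))
    b = theano.shared(np.random.rand(n, n).astype(dtype))
    # Reduce to a scalar so we time the GEMM, not the device-to-host copy
    # (the sum is only there to force the computation).
    f = theano.function([], T.dot(a, b).sum())
    f()  # warm-up: compilation and first kernel launch
    start = time.time()
    for _ in range(reps):
        f()
    return reps / (time.time() - start)

for dt in ("float32", "float16"):
    print(dt, "%.2f GEMM/s" % gemm_rate(dt))
```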
Maybe Theano is doing something smart, or accidentally smart (like not using float16 for some Ops because they haven't been implemented yet).
Yeah, maybe on the CPU as well? Are float16 operations faster on our CPUs?
Current Intel CPUs have a float16 storage format but not float16 operations. So there's an instruction to read a 16-bit float and expand it to a 32-bit float, then do the usual multiply or add instruction.
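NumPy takes a similar approach on the CPU: float16 is essentially a storage format, and arithmetic is carried out after widening to float32. A small illustration (nothing here is specific to this project):

```python
import numpy as np

x = np.random.rand(8).astype(np.float16)   # stored as 16-bit values, 2 bytes each
y = np.random.rand(8).astype(np.float16)

# The usual pattern: widen to float32, compute, optionally narrow back.
z32 = x.astype(np.float32) * y.astype(np.float32)
z16 = z32.astype(np.float16)
print(x.itemsize, z32.itemsize, z16.itemsize)   # 2 4 2
```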
Out of interest, do you know if you're likely to get overflows when using fp16, and if you're doing anything about it?
Hieu Hoang
http://moses-smt.org/
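For context on the overflow question: float16 has a very small dynamic range, so intermediate values can hit infinity long before they would in float32. A quick illustration in NumPy:

```python
import numpy as np

print(np.finfo(np.float16).max)               # 65504.0, the largest finite float16
print(np.float16(300.0) * np.float16(300.0))  # 90000 does not fit, so the product is inf
```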
I ran more training benchmarks, including some on a Tesla P100 (thanks to Università di Pisa), and the result is that there is no noticeable difference in speed between float32 and float16. As for the difference between the P100 and the TITAN X (Pascal), the TITAN X is actually equal to or slightly faster, except when training with float64 (which is probably not very useful). I've tried with full-size models (--dim_word 512 --dim 1024) and batch sizes up to 256 and still got roughly the same speed on the different machines.
Feedback from my own work with fp16 in amun: when running on a P100 (Wilkes) it gives about a 20% speedup over using fp32. Most of the speedup is in the large matrix multiplication at the output layer. I'm about to try again to speed up the rest of the code (element-wise operations etc.), which requires much more work.
In this branch, I removed all hardcoded references to float32 and I tried to train with float16, but it does not work:
Using cuDNN version 5105 on context None
Mapped name None to device cuda0: TITAN X (Pascal) (0000:02:00.0)
Loading data
Building model
Building sampler
Building f_init... Done
Building f_next.. Done
Building f_log_probs... Done
Computing gradient... Done
Building optimizers...Disabling C code for Elemwise{Cast{float32}} due to unsupported float16
Done
Total compilation time: 198.4s
Optimization
Seen 846 samples
NaN detected
I've also tried increasing the epsilon in the Adam optimizer, but it doesn't solve the issue.
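One possible contributing factor (an assumption, not confirmed in this thread): Adam's usual epsilon is far below what float16 can represent, so if the optimizer state is kept in float16 the stabilising term effectively vanishes and the update can blow up. A quick check in NumPy:

```python
import numpy as np

print(np.finfo(np.float16).eps)    # ~0.000977, the float16 machine epsilon
print(np.finfo(np.float16).tiny)   # ~6.1e-05, smallest normal float16
print(np.float16(1e-8))            # 0.0: a typical Adam epsilon rounds to zero in float16
```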