Hard error using dice loss #2383
My MWE from the discourse thread is this:

```julia
julia> using Flux, CUDA

julia> let x = randn(3,5) |> cu
           y = Flux.onehotbatch("abcab", 'a':'c') |> cu
           Flux.dice_coeff_loss(x, y)  # works forward
       end
1.1841338f0

julia> let x = randn(3,5) |> cu
           y = Flux.onehotbatch("abcab", 'a':'c') |> cu
           gradient(Flux.mse, x, y)  # some gradients work
       end
(Float32[-0.16939788 -0.19461282 … -0.30000073 -0.017194644; 0.07464689 -0.15628384 … -0.17090265 -0.007114268; -0.22359066 -0.06903434 … 0.1566836 -0.022250716], nothing)

julia> let x = randn(3,5) |> cu
           y = Flux.onehotbatch("abcab", 'a':'c') |> cu
           gradient(Flux.dice_coeff_loss, x, y)
       end
ERROR: a exception was thrown during kernel execution.
Run Julia on debug level 2 for device stack traces.
...
ERROR: KernelException: exception thrown during kernel execution on device Tesla V100-PCIE-16GB
Stacktrace:
  [1] check_exceptions()
    @ CUDA ~/.julia/packages/CUDA/htRwP/src/compiler/exceptions.jl:34
  [2] device_synchronize(; blocking::Bool, spin::Bool)
    @ CUDA ~/.julia/packages/CUDA/htRwP/lib/cudadrv/synchronization.jl:180
```
```julia
(@v1.10) pkg> st Flux CUDA
Status `~/.julia/environments/v1.10/Project.toml`
  [052768ef] CUDA v5.2.0
  [587475ba] Flux v0.14.11
```

I don't know if this is the same error as yours, but it's surprising, and it is a bug. What "Run Julia on debug level 2 for device stack traces" means is that starting the REPL with `julia -g2` and reproducing the error should make the message include a stack trace from inside the failing GPU kernel.
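Concretely, that presumably means launching Julia from the shell with the `-g2` flag before loading CUDA and re-running the failing gradient, roughly:

```
$ julia -g2               # start Julia with device-side debug info (debug level 2)

julia> using Flux, CUDA   # then reproduce the error; the KernelException should
                          # now include a stack trace from the GPU kernel
```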
Can you try pulling apart Flux.jl/src/losses/functions.jl, line 519 (at commit 20d516b), to see which part of the expression triggers the error?
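For reference, what that line computes is essentially the smoothed dice coefficient loss; a rough reconstruction, not a verbatim copy of the source at that commit:

```julia
# Approximate form of Flux.dice_coeff_loss in Flux 0.14.x (reconstruction; the
# exact code at functions.jl line 519 may differ in detail):
dice_coeff(ŷ, y; smooth = 1f0) =
    1 - (2 * sum(y .* ŷ) + smooth) / (sum(y .^ 2) + sum(ŷ .^ 2) + smooth)
```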
Cheers, and sorry for the long delay. To ease finding the root cause, I have made my own dice_loss as follows:

If the … However, if either of the first two …
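As a sketch of the kind of decomposition meant (assuming the standard smoothed dice formulation; `my_dice_loss` and the term-by-term gradient calls below are illustrative placeholders, not the exact code from this comment):

```julia
using Flux, CUDA

# Smoothed dice coefficient loss, split into named pieces so that each term
# can be differentiated on its own to localize the kernel exception.
function my_dice_loss(ŷ, y; smooth = 1f0)
    num   = 2 * sum(y .* ŷ) + smooth   # numerator: 2·sum(y .* ŷ) + smooth
    den_y = sum(y .^ 2)                # first denominator term (targets)
    den_ŷ = sum(ŷ .^ 2)                # second denominator term (predictions)
    return 1 - num / (den_y + den_ŷ + smooth)
end

x = cu(randn(Float32, 3, 5))
y = cu(Flux.onehotbatch("abcab", 'a':'c'))

my_dice_loss(x, y)                      # forward pass

# Differentiating each piece separately shows which one triggers the error:
gradient((a, b) -> sum(b .* a), x, y)   # mixed term
gradient((a, b) -> sum(b .^ 2), x, y)   # target-squared term
gradient((a, b) -> sum(a .^ 2), x, y)   # prediction-squared term
gradient(my_dice_loss, x, y)            # full loss
```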
And this is the attempt at capturing the error dump:

…
Cheers,
Regardless of the model, data, or any other condition, I've never been able to use the built-in Flux.dice_coeff_loss() function. A very long error dump shows up, apparently tied to CUDA and memory usage.
The issue has been confirmed and reproduced on the Discourse forum; for details, please check this link.