Printing results changes results in CPU kernel #485
Comments
This has no effect since you are not performing shared memory operations. |
That's what I thought, still mentioned it for completeness. I guess the KA.synchronize(backend) call should synchronize global memory, but it currently does nothing looking in the |
A |
All kernel launches are synchronized on the CPU (see KernelAbstractions.jl/src/cpu.jl, lines 89 to 90 in 4285051).
This is very weird: when I comment out the call to Enzyme, this spooky action at a distance disappears as well. The same happens when moving the FD calls before the Enzyme calls. So I would still suspect that Enzyme is doing something here that, for some reason, overlaps with the FD code and corrupts some data... Check whether the mesh is correct after Enzyme.
I'll check that! The one caveat with Enzyme being the culprit is that getting rid of the MOKA mesh component and replacing it with a stand-in struct (see the commented-out portion of bug.jl) also removed the bug, even with Enzyme still there. That doesn't rule out the idea of Enzyme affecting the mesh, though.
Yeah my one hypothesis is that Enzyme creates a task that is not waited upon... and then modifies the mesh as well... |
Looks like the autodiff call permutes |
Ok. So setting
Where old_mesh is a deepcopy() of mesh before the autodiff call. Interesting that dcEdge is used in the kernel, but not modified. |
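The deepcopy() comparison described above can be sketched roughly as follows. This is a minimal illustration only: StandInMesh, suspicious_call! and changed_fields are hypothetical names, and the in-place reverse! merely stands in for whatever reorders dcEdge in the real code; none of this is taken from MOKA, Enzyme, or bug.jl.

```julia
# Hypothetical stand-in for the mesh; only the field discussed in the thread.
struct StandInMesh
    dcEdge::Vector{Float64}
end

# Hypothetical function that is expected to only read the mesh but,
# as a deliberate bug for illustration, reorders dcEdge in place.
function suspicious_call!(mesh::StandInMesh)
    reverse!(mesh.dcEdge)
    return sum(mesh.dcEdge)
end

# Return the names of all fields whose contents differ between two objects.
function changed_fields(before, after)
    return [name for name in fieldnames(typeof(before))
            if getfield(before, name) != getfield(after, name)]
end

mesh = StandInMesh([1.0, 2.0, 3.0])
old_mesh = deepcopy(mesh)        # snapshot taken before the call
suspicious_call!(mesh)
println(changed_fields(old_mesh, mesh))
```

Comparing sorted copies of the field afterwards (sort(mesh.dcEdge) == sort(old_mesh.dcEdge)) is one way to tell a permutation apart from a change in values.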
Since this bug seems to be more of an Enzyme issue I've opened one there: EnzymeAD/Enzyme.jl#1569. Enzyme incorrectly permutes the array |
What happens if you do autodiff but all the values are constant |
Also can you isolate the error without as many dependencies? |
Per discussion here, I believe Enzyme is correct: EnzymeAD/Enzyme.jl#1569 (comment). Specifically, it behaves the same as calling the original code. Enzyme.Const != immutable; it just means not differentiated. A variable which is Const but modified in place in the function to be differentiated will also be updated in place the same way during AD.
Wait never mind, it looks like my computer just simply did not reproduce your issue. What version of Enzyme/KA/etc are you using? |
I'm at:
wmoses@beast:~/git/Enzyme.jl/MPAS-Ocean.jl (bug-reduce) $ ~/git/Enzyme.jl/julia-1.10.2/bin/julia --project
Julia Version 1.10.2 (2024-03-01), official https://julialang.org/ release
(MOKA) pkg> st
Project MOKA v0.1.0
Status `~/git/Enzyme.jl/MPAS-Ocean.jl/Project.toml`
⌃ [21141c5a] AMDGPU v0.9.5
[7d9f7c33] Accessors v0.1.36
[79e6a3ab] Adapt v4.0.4
[6e4b80f9] BenchmarkTools v1.5.0
[179af706] CFTime v0.1.3
[052768ef] CUDA v5.4.2
[8bb1440f] DelimitedFiles v1.9.1
[7da242da] Enzyme v0.12.20 `..`
⌃ [28b8d3ca] GR v0.73.5
⌃ [63c18a36] KernelAbstractions v0.9.20
⌃ [da04e1cc] MPI v0.20.8
[3da0fdf6] MPIPreferences v0.1.11
[85f8d34a] NCDatasets v0.14.4
[91a5bcdd] Plots v1.40.4
[295af30f] Revise v3.5.14
[09ab397b] StructArrays v0.6.18
[3a884ed6] UnPack v1.0.2
[ddb6d928] YAML v0.4.11
[7cb0a576] MPICH_jll v4.2.1+1
[ade2ca70] Dates
[56ddb016] Logging
[10745b16] Statistics v1.10.0
Info Packages marked with ⌃ have new versions available and may be upgradable.
(MOKA) pkg> ^C
julia>
wmoses@beast:~/git/Enzyme.jl/MPAS-Ocean.jl (bug-reduce) $ cd ..
wmoses@beast:~/git/Enzyme.jl (uparm) $ git log
commit 758e7b42c45d1667c774fdbef938c349a414fa40 (HEAD -> uparm, origin/uparm)
Author: William S. Moses <[email protected]>
Date: Tue Jun 25 14:05:19 2024 -0400
uparm
commit f7ef35364f818539a494e6381100e6ac921ff339 (tag: v0.12.19, origin/main, origin/HEAD)
Author: William Moses <[email protected]>
Date: Tue Jun 25 00:43:06 2024 -0400
Update Project.toml |
grad and mesh are an array and a struct respectively, so they're duplicated objects. I tried replacing HorzMesh with a self-written struct in the file to remove the MOKA dependency; I'll try again.
@jlk9 can try using but even then I am confused about why a |
Mentioned in the Enzyme issue too, I reduced the example to eliminate dependency on MOKA.jl. Now I'm just using a stand-in struct in bugMesh.jl that contains the dcEdges array |
@wsmoses Missed your question about Enzyme and KA versions, sorry! I was responding to some things on my phone. I'm on Enzyme 0.12.15 (what you get with just naive add Enzyme) and KA 0.9.20 |
Replied in other issue, but repasting here (EnzymeAD/Enzyme.jl#1569 (comment)) Okay I see what's happening here, HorzMesh should likely not be marked inactive since you are differentiating wrt data within. This is implicitly causing runtime activity style issues, but not throwing an error for them. |
Let's continue this on Enzyme.jl. The only KA relevant part might be the show that is needed for it to trigger. |
This function computes a numerical gradient, then sums up the results using a KA kernel run on the CPU:
Computing numerical derivatives of this function produces the following output:
where Enzyme is an AD tool, and the finite difference computation uses these values:
This makes sense: finite differences and an AD tool produce very similar results. However, what if we comment out the @show statement in gradient_normSq? Then we get this result:
The AD tool still works fine, but the perturbed norm computations for FD are now wrong. Nothing was changed except for removing an @show statement.
The code producing this error can be found in https://github.com/jlk9/MPAS-Ocean.jl/tree/bug-reduce, in the file bug.jl. I tried reducing it further, but getting rid of the mesh objects from the MPAS-Ocean package or the Enzyme call caused the bug to disappear. Those don't directly affect the results of normSqP and normSqM, so maybe it's an issue of timing and CPU kernels being asynchronous? I call @synchronize() at the end of GradientOnEdgeModified, though.