
Printing results changes results in CPU kernel #485

Closed
jlk9 opened this issue Jun 24, 2024 · 21 comments
jlk9 (Contributor) commented Jun 24, 2024

This function computes a numerical gradient, then sums up the results using a KA kernel run on the CPU:

function gradient_normSq(grad, mesh::HorzMesh; backend=KA.CPU())

    nEdges = size(grad)[2]
    vert_levels = 1

    # New modified kernel:
    kernel! = GradientOnEdgeModified(backend)
    kernel!(mesh.Edges.dcEdge, grad, workgroupsize=64, ndrange=(nEdges, vert_levels))

    KA.synchronize(backend)

    @show grad

    normSq = 0.0
    for i = 1:nEdges
        normSq += grad[i]^2
    end

    return normSq
end

Computing numerical derivatives of this function produces the following output:

┌ Info:  (gradients)
│ 
│ For edge global index 1837
│ Enzyme computed 8.464896000000272e6
└ Finite differences computed 8.46489600001701e6

where Enzyme is an AD tool, and the finite difference computation uses these values:

dnorm_dgrad_fd = (normSqP - normSqM) / (gradNumPk - gradNumMk)
normSqP = 2.5367019692426506e14
normSqM = 2.5366708692147466e14
gradNumPk = 2020.7
gradNumMk = 1653.3

This makes sense: finite differences and AD produce very similar results. However, if we comment out the @show statement in gradient_normSq, we get this result instead:

normSqP = 3.362040598830664e-11
normSqM = 3.362040356541349e-11
gradNumPk = 2020.7
gradNumMk = 1653.3

┌ Info:  (gradients)
│ 
│ For edge global index 1837
│ Enzyme computed 8.464896000000272e6
└ Finite differences computed 6.5947010218182634e-21

The AD tool still works fine, but the perturbed norm computations for FD are now wrong. Nothing was changed except removing an @show statement.

The code producing this error can be found at https://github.com/jlk9/MPAS-Ocean.jl/tree/bug-reduce, in the file bug.jl. I tried reducing it further, but getting rid of either the mesh objects from the MPAS-Ocean package or the Enzyme call made the bug disappear. Neither directly affects the results of normSqP and normSqM, so maybe it's a timing issue with CPU kernels being asynchronous? I do call @synchronize() at the end of GradientOnEdgeModified, though.

@vchuravy (Member)

I call @synchronize() at the end of GradientOnEdgeModified, though.

This has no effect since you are not performing shared memory operations.

jlk9 (Contributor, Author) commented Jun 24, 2024

That's what I thought; I still mentioned it for completeness. I would guess the KA.synchronize(backend) call should synchronize global memory, but looking at the cpu.jl file, it currently does nothing.

@vchuravy (Member)

A yield() in the position of @show is sufficient.
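To make the suggestion concrete, here is a hedged sketch (not the actual MOKA code) of what replacing the @show with a bare yield() looks like. The idea is that yield() lets any pending task run before the reduction reads the array, and @show presumably masked the bug only because printing can also yield. The function name and simplified signature are illustrative:

```julia
# Sketch: same reduction as gradient_normSq, with yield() where @show was.
# The array argument stands in for the kernel's output buffer.
function normSq_with_yield(grad::AbstractArray{<:Real})
    yield()              # give other scheduled tasks a chance to finish writing
    normSq = 0.0
    for g in grad
        normSq += g^2
    end
    return normSq
end
```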

@vchuravy (Member)

I guess the KA.synchronize(backend) call should synchronize global memory, but it currently does nothing looking

All kernel launches are synchronized on the CPU.

@sync for tid in 1:Nthreads
Threads.@spawn __thread_run(tid, len, rem, obj, ndrange, iterspace, args, dynamic)
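The snippet above is the launch path being referenced: the enclosing @sync waits for every spawned task, so the kernel call only returns once all tasks have finished. A self-contained sketch of the same @sync / Threads.@spawn pattern, with a made-up workload and chunking rather than KernelAbstractions' real internals:

```julia
# Illustrative only: each task writes a disjoint index range, and @sync
# blocks until every Threads.@spawn task inside it has completed.
function threaded_square!(out::Vector{Float64}, x::Vector{Float64}; ntasks::Int = 4)
    n = length(out)
    chunk = cld(n, ntasks)
    @sync for tid in 1:ntasks
        Threads.@spawn begin
            lo = (tid - 1) * chunk + 1
            hi = min(tid * chunk, n)
            for i in lo:hi
                out[i] = x[i]^2    # disjoint chunks, so no data race
            end
        end
    end
    return out
end
```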

vchuravy (Member) commented Jun 24, 2024

This is very weird: when I comment out the call to Enzyme, this spooky action at a distance disappears as well.

The same happens when moving the FD calls before the Enzyme calls. So I would still suspect Enzyme is doing something here that for some reason causes it to overlap with the FD code and corrupt some data...

Check if the Mesh is correct after Enzyme.

jlk9 (Contributor, Author) commented Jun 24, 2024

I'll check that! The one caveat with Enzyme being the culprit is that getting rid of the MOKA mesh component and replacing it with a stand-in struct (you can see that in the commented-out portion of bug.jl) also removed the bug, even with Enzyme still there. That doesn't rule out Enzyme affecting the mesh, though.

@vchuravy (Member)

Yeah, my one hypothesis is that Enzyme creates a task that is not waited upon... and then modifies the mesh as well.
Each alone would be a very weird bug, but together they make me feel ridiculous for proposing it.

jlk9 (Contributor, Author) commented Jun 24, 2024

Looks like the autodiff call permutes mesh.Edges.dcEdge

jlk9 (Contributor, Author) commented Jun 24, 2024

Ok. So setting mesh.Edges.dcEdge back to its pre-Enzyme value after calling Enzyme.autodiff makes the FD results correct:

mesh.Edges.dcEdge[:] = old_mesh.Edges.dcEdge[:]

where old_mesh is a deepcopy() of mesh taken before the autodiff call. Interesting, since dcEdge is read in the kernel but not modified there.
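A hedged sketch of this snapshot-and-restore workaround. The struct definitions are hypothetical stand-ins for the MOKA mesh types, and run_autodiff! stands in for the Enzyme.autodiff call that was observed to permute dcEdge:

```julia
# Minimal stand-in types, just to make the sketch runnable (not MOKA's real types):
struct EdgeData
    dcEdge::Vector{Float64}
end
struct StandInMesh
    Edges::EdgeData
end

# Snapshot dcEdge before AD, run the (possibly scrambling) call, then restore.
function with_dcEdge_restored!(run_autodiff!::Function, mesh::StandInMesh)
    old_dcEdge = deepcopy(mesh.Edges.dcEdge)   # pre-Enzyme snapshot
    run_autodiff!(mesh)                        # may permute dcEdge
    mesh.Edges.dcEdge[:] = old_dcEdge          # restore pre-Enzyme values
    return mesh
end
```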

jlk9 (Contributor, Author) commented Jun 25, 2024

Since this bug seems to be more of an Enzyme issue, I've opened one there: EnzymeAD/Enzyme.jl#1569. Enzyme incorrectly permutes the array mesh.Edges.dcEdge, which alters the FD results.

wsmoses (Collaborator) commented Jun 25, 2024

What happens if you do autodiff but all the values are constant?

wsmoses (Collaborator) commented Jun 25, 2024

Also can you isolate the error without as many dependencies?

wsmoses (Collaborator) commented Jun 25, 2024

Per discussion here, I believe Enzyme is correct: EnzymeAD/Enzyme.jl#1569 (comment)

Specifically, it behaves the same as calling the original code.

Enzyme.Const != immutable. It just means "not differentiated". A variable that is Const but modified in place by the function being differentiated will also be updated in place the same way during AD.
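This point can be illustrated without Enzyme itself, since it is about the primal function's own behavior: if the function mutates an argument in place, that mutation happens regardless of whether AD later marks the argument Const. The function below is a made-up example, not code from this issue:

```julia
# In-place mutation of an argument the caller might think of as "constant".
# Marking `offset` as Enzyme.Const would only exclude it from differentiation;
# it would not stop this mutation, because the primal itself performs it.
function shifted_normSq!(x::Vector{Float64}, offset::Vector{Float64})
    offset .+= 1.0               # in-place update of the "constant" argument
    return sum((x .+ offset) .^ 2)
end
```

Calling shifted_normSq! directly already mutates offset; differentiating it with offset marked Const would reproduce exactly the same mutation.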

wsmoses (Collaborator) commented Jun 25, 2024

Wait, never mind; it looks like my computer simply did not reproduce your issue.

What version of Enzyme/KA/etc are you using?

wsmoses (Collaborator) commented Jun 25, 2024

I'm at:

wmoses@beast:~/git/Enzyme.jl/MPAS-Ocean.jl (bug-reduce) $ ls^C
wmoses@beast:~/git/Enzyme.jl/MPAS-Ocean.jl (bug-reduce) $ ~/git/Enzyme.jl/julia-1.10.2/bin/julia --project
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.10.2 (2024-03-01)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

(MOKA) pkg> st
Project MOKA v0.1.0
Status `~/git/Enzyme.jl/MPAS-Ocean.jl/Project.toml`
⌃ [21141c5a] AMDGPU v0.9.5
  [7d9f7c33] Accessors v0.1.36
  [79e6a3ab] Adapt v4.0.4
  [6e4b80f9] BenchmarkTools v1.5.0
  [179af706] CFTime v0.1.3
  [052768ef] CUDA v5.4.2
  [8bb1440f] DelimitedFiles v1.9.1
  [7da242da] Enzyme v0.12.20 `..`
⌃ [28b8d3ca] GR v0.73.5
⌃ [63c18a36] KernelAbstractions v0.9.20
⌃ [da04e1cc] MPI v0.20.8
  [3da0fdf6] MPIPreferences v0.1.11
  [85f8d34a] NCDatasets v0.14.4
  [91a5bcdd] Plots v1.40.4
  [295af30f] Revise v3.5.14
  [09ab397b] StructArrays v0.6.18
  [3a884ed6] UnPack v1.0.2
  [ddb6d928] YAML v0.4.11
  [7cb0a576] MPICH_jll v4.2.1+1
  [ade2ca70] Dates
  [56ddb016] Logging
  [10745b16] Statistics v1.10.0
Info Packages marked with ⌃ have new versions available and may be upgradable.

(MOKA) pkg> ^C

julia> 
wmoses@beast:~/git/Enzyme.jl/MPAS-Ocean.jl (bug-reduce) $ cd ..
wmoses@beast:~/git/Enzyme.jl (uparm) $ git log
commit 758e7b42c45d1667c774fdbef938c349a414fa40 (HEAD -> uparm, origin/uparm)
Author: William S. Moses <[email protected]>
Date:   Tue Jun 25 14:05:19 2024 -0400

    uparm

commit f7ef35364f818539a494e6381100e6ac921ff339 (tag: v0.12.19, origin/main, origin/HEAD)
Author: William Moses <[email protected]>
Date:   Tue Jun 25 00:43:06 2024 -0400

    Update Project.toml

jlk9 (Contributor, Author) commented Jun 25, 2024

grad and mesh are an array and a struct respectively, so they're Duplicated objects. I tried replacing HorzMesh with a self-written struct in the file to remove the MOKA dependency; I'll try again.

@vchuravy (Member)

@jlk9 can you try using Duplicated instead of Const? If that fixes it, it's a case of "shadow confusion",

but even then I am confused about why a @show or yield() changes things.

jlk9 (Contributor, Author) commented Jun 25, 2024

Mentioned in the Enzyme issue too: I reduced the example to eliminate the dependency on MOKA.jl. Now I'm just using a stand-in struct in bugMesh.jl that contains the dcEdges array.

jlk9 (Contributor, Author) commented Jun 25, 2024

@wsmoses Missed your question about the Enzyme and KA versions, sorry! I was responding to some things on my phone. I'm on Enzyme 0.12.15 (what you get with just a naive add Enzyme) and KA 0.9.20.

wsmoses (Collaborator) commented Jun 26, 2024

Replied in the other issue, but repasting here (EnzymeAD/Enzyme.jl#1569 (comment)):

Okay, I see what's happening here. HorzMesh should likely not be marked inactive, since you are differentiating with respect to data within it. This is implicitly causing runtime-activity-style issues without throwing an error for them.

@vchuravy (Member)

Let's continue this on Enzyme.jl.

The only KA-relevant part might be the @show that is needed to trigger it.
