-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Enzyme functionality for both the GPU and CPU #6
Conversation
…a loop helper function of ocn_run. Had to temporarily comment out kernels, will re-add them
…quire @allowscalar macros. Still one summation that isn't working with AD
…rward run otherwise
…owscalar or broadcasting in part that it differentiated
show_output.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this file? Between the hard coded path and it being python, I'd opt to leave it out if possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, we can remove this. I just had it in for convenience.
d_Prog = PrognosticVars(zeros(Float64, size(Prog.ssh)), | ||
zeros(Float64, size(Prog.normalVelocity)), | ||
zeros(Float64, size(Prog.layerThickness)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just want to make sure the shadow arrays don't need to be/can't be on GPU? If that's the case, then maybe this function doesn't need the backend
kwarg.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, this function isn't actually used in the Enzyme tests. It's a holdout from before, we can remove it.
Add @test, forward, and CPU backend
…r test since forward mdoe on GPU no longer errors but still produces incorrect result
In order to avoid redefinitions of variables/function warnings.
Hey @jlk9, Those most recent commits fixed the enzyme test fails we talked about a couple weeks ago. Unfortunately, something is now broken in the module load julia/1.10.4
julia -O0 --color=yes --debug-info=2 --project=@. -e 'using Pkg; Pkg.test()' which results in: Error Log
which is failry confusing and annoying, because if I run:
I don't get any errors... The output looks as expected:
Because I can't reproduce the erros outside of
From checking the warnings,
results in
so there are |
@JBlaschke can you maybe help us out here? I usually prefer not to use the system CUDA when using CUDA.jl. At least not at first. @andrewdnolan if possible open a ticket at NERSC as described here. I had such issues on other systems. Can you also look at the CUDA.jl docs and paste your Edit: It might also help to add |
Also, can you use Julia 1.10.5? There were important bugfixes in that release that impact Enzyme. I recommend your own Julia installation. But maybe @JBlaschke knows more. |
@michel2323 I can take a look when I come back from vacation. RE the CUDA version: using the artifact can have two problems:
Problem 1 is easy enough to fix by pinning the artifact version to a CUDA version that's compatible with GTL. Problem 2 can be fixed by disabling CUDA-aware MPI (which entails a hit to performance) When I get back I can bump the CUDA module versions also. The Julia modules do two things:
You can try your own local install and with only the MPI part of the LocalPreferences.toml (i.e. removing the CUDA part). The MPI preferences are important to make MPI work at all. Anyway, that was just advice to help you be productive while I'm away. @andrewdnolan and @michel2323 when I get back what exactly would you like me to look at? Also: why do you prefer the artifact install in lieu of the system version? Our high-performance file systems have limited space so I'm weary of thousands of users installing the same system software over and over. |
@michel2323 Unfortunately, I can't use Julia 1.10.5 yet. I have ticket open with NERSC about some permission issue's I'm having with I fear the issue may have just been something silly on my end, mainly related to the
everything works as expected (and all test pass). With only those modules loaded,
and
Whereas with the default modules upon sshing into perlmutter,
and
So, long story short I think this may just be an issue with the default modules on perlmutter. Not really sure when this changed, since everything was working ok up until this PR. But, anyway I'll add a @JBlaschke Thanks for your guidance! Sorry to bother you on vacation! |
What's the INC number? I can take a look (should have some time between now and Thursday. Permission issues are probably a bug.
Oh my! (Clutches pearls) can you tell me why you're doing this. Not judging, this has a story which I would love to hear.
That makes sense. It's worth the effort though to figure out what aspects of the default modules breaks this (and fix the default modules) rather than doing extreme surgery on the NERSC software environment every time the defaults get in the way.
No worries -- this is interesting, so I will look at it whenever I get a moment |
The
Unfortunately, not a very good explanation here besides with the above warnings re the
So I added that to the list of commands. Is the
That sounds good to me! Thanks for taking the time to look into this! |
I'll apply for an account, and hopefully, I won't create more work for you two but remove some. Thanks so much @JBlaschke ! Happy vacation! |
TestingWith the environment/CUDA work around tested above I'm now able to run the unit tests and everything passes for me! To run the tests on
which resulted in: Unit Test Logs
Tests all pass okay, but there are lots of warnings related to duplicate I just added two commits that deal with those warnings, and a third that gets rid of that unneeded python file. I'll do the testing one more time around and post results here. |
Testing (Cont.)With those last couple commits the unit test log now looks like: Unit Test Logs
So, using that version of the code I also ran the inertial gravity wave convergence tests on both CPU and GPU on Tests on CPU produce produce the same results. So, with that I'll go ahead and finally merge this. @jlk9 sorry about the hold up here! Thanks for all this work! |
This PR builds on the enhancements for CPU Enzyme, shown here:
#5 (comment)
In addition to the changes in that PR, we have added a few more:
Vector{Array}
objects, to avoid excessive array slicing.forward/run_loop.jl
, which has the main loop for the forward model.We have also created a new test file,
test/enzyme/test_Enzyme_end2end.jl
, which tests an AD run of the entire model and compares with FD approximations. A separate test environment has been established with its own Project and Manifest files.