Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test Enzyme and reexport ADTypes.AutoEnzyme #1887

Draft
wants to merge 87 commits into
base: master
Choose a base branch
from
Draft

Conversation

devmotion
Copy link
Member

@devmotion devmotion commented Sep 28, 2022

Note: This does not work yet


I opened this PR to make it easier to debug (and possibly fix) issues with Enzyme.

Currently, the following example does not work (note that the snippet does not require the PR which solely reexports AutoEnzyme at this point):

using Turing
using Enzyme
using ADTypes
Enzyme.API.runtimeActivity!(true);
Enzyme.API.typeWarning!(false);

@model function model()
    m ~ Normal(0, 1)
    s ~ InverseGamma()
    x ~ Normal(m, s)
end

sample(model() | (; x=0.5), NUTS(; adtype = ADTypes.AutoEnzyme()), 10)

With Enzyme#main my Julia (1.8.1) segfaults. An incomplete (it filled my whole terminal) output: https://gist.github.com/devmotion/1352197f2354c6fecddd7b778ec4bcf7#file-log-txt

The example works (latest releases of Turing, Enzyme, and ADTypes on Julia 1.10.0) but the following warnings show up:

warning: didn't implement memmove, using memcpy as fallback which can result in errors
warning: didn't implement memmove, using memcpy as fallback which can result in errors

@coveralls
Copy link

coveralls commented Nov 13, 2022

Pull Request Test Coverage Report for Build 12181846564

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 4 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-0.3%) to 37.831%

Files with Coverage Reduction New Missed Lines %
src/mcmc/Inference.jl 2 52.44%
src/mcmc/abstractmcmc.jl 2 92.86%
Totals Coverage Status
Change from base Build 12110228665: -0.3%
Covered Lines: 586
Relevant Lines: 1549

💛 - Coveralls

@codecov
Copy link

codecov bot commented Nov 13, 2022

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 44.46%. Comparing base (c0a4ee9) to head (20b055e).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1887      +/-   ##
==========================================
- Coverage   44.72%   44.46%   -0.26%     
==========================================
  Files          22       22              
  Lines        1554     1554              
==========================================
- Hits          695      691       -4     
- Misses        859      863       +4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

src/essential/ad.jl Outdated Show resolved Hide resolved
src/essential/ad.jl Outdated Show resolved Hide resolved
@wsmoses
Copy link
Collaborator

wsmoses commented Jun 26, 2023

Also if you want to disable the warnings you can set it like so (https://github.com/EnzymeAD/Enzyme.jl/blob/c29e6119c7963ddb22f1363726f762455748e193/src/api.jl#L414
)

Enzyme.API.typeWarning!(false)

@wsmoses
Copy link
Collaborator

wsmoses commented Jun 26, 2023

You also may want to set the version to 0.11.2 since your CI currently is running at 0.11.0 (⌃ [7da242da] Enzyme v0.11.0)

@wsmoses
Copy link
Collaborator

wsmoses commented Jun 27, 2023

@devmotion this PR (EnzymeAD/Enzyme.jl#914) should fix the immediate issues you see on CI if you want to try.


using AdvancedPS: AdvancedPS

include("container.jl")

export @model,
@varname,
AutoEnzyme,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's export this as Turing.Experimental.AutoEnzyme until Enzyme becomes more stable.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the threshold for being considered stable here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mhauru and @penelopeysm probably have a lot more experience on this.

My heuristic threshold:

  • Enzyme passes all Distributions.jl and Turing.jl tests
  • No known segfaults for Enzyme

for a continuous period of 8 weeks.

@mhauru
Copy link
Member

mhauru commented Oct 24, 2024

I've merged the latest master and upgraded to Enzyme v0.12. We are still being held back from v0.13 by Bijectors.jl. There are a number of new test failures, because

  1. There seem to be some regressions, tests that used to pass but now don't. I need to investigate.
  2. We had previously only tested Enzyme on the HMC and SGHMC tests. I hadn't realised/had forgotten that we weren't testing Enzyme with the full test suite. I've now added it to test/mcmc/gibbs.jl,test/mcmc/abstractmcmc.jl, and a couple of others as well. Everywhere where there is a loop over AD backends.
  3. Runtime activity is no longer a global setting in Enzyme. Having to change how we set runtime activity I think is a good opportunity to take stock of its effect, so I've just removed using it for now. Let's try to get to a point where the only test failures are ones where Enzyme says "you may need to use runtime activity", see how many there are, and only then enable it.

Getting Bijectors.jl to support Enzyme v0.13 I think has to be the next step, because otherwise any of the failures we see here might already be fixed on v0.13, and thus minimising and reporting them is pointless.

@wsmoses
Copy link
Collaborator

wsmoses commented Nov 6, 2024

gentle bump here

@yebai
Copy link
Member

yebai commented Nov 11, 2024

It would be good to address EnzymeAD/Enzyme.jl#1812, #2307 and TuringLang/Bijectors.jl#341 before merging this PR.

@mhauru
Copy link
Member

mhauru commented Nov 12, 2024

TuringLang/Bijectors.jl#341 at the very least needs addressing, because it's currently holding us back from running a recent Enzyme version here, and thus we don't know if the test suite would pass on a recent version. TuringLang/Bijectors.jl#341 passes tests on 1.10, but on 1.11 the Enzyme tests fail because of the accursed extension load order issue. The fix for that is currently waiting on this one: TuringLang/Bijectors.jl#346 EDIT and consequently this: TuringLang/Bijectors.jl#349

@wsmoses
Copy link
Collaborator

wsmoses commented Nov 29, 2024

With the bijectors fix landed, I suppose this is (again) ready to go?

@wsmoses
Copy link
Collaborator

wsmoses commented Nov 29, 2024

separately @yebai you appear to have removedd my permissions to run tests, if that can be restored

@yebai
Copy link
Member

yebai commented Dec 1, 2024

separately @yebai you appear to have removedd my permissions to run tests, if that can be restored

I don't know what happened precisely -- some changes were made to the TuringLang repos permissions to make CI work more robustly.

@wsmoses
Copy link
Collaborator

wsmoses commented Dec 2, 2024

┌ Warning: Could not use exact versions of packages in manifest, re-resolving
└ @ Pkg.Operations /opt/hostedtoolcache/julia/1.11.1/x86/share/julia/stdlib/v1.11/Pkg/src/Operations.jl:1[9](https://github.com/TuringLang/Turing.jl/actions/runs/12122531930/job/33795922092?pr=1887#step:8:10)02
ERROR: Unsatisfiable requirements detected for package DynamicPPL [366bfd00]:
 DynamicPPL [366bfd00] log:
 ├─possible versions are: 0.1.0 - 0.31.0 or uninstalled
 ├─restricted to versions [0.29, 0.30.4 - 0.31] by Turing [fce5fe82], leaving only versions: [0.29.0 - 0.29.2, 0.30.4 - 0.31.0]
 │ └─Turing [fce5fe82] log:
 │   ├─possible versions are: 0.35.3 or uninstalled
 │   └─Turing [fce5fe82] is fixed to version 0.35.3
 ├─restricted by compatibility requirements with Mooncake [da2b9cff] to versions: 0.29.0 - 0.30.5 or uninstalled, leaving only versions: [0.29.0 - 0.29.2, 0.30.4 - 0.30.5]
 │ └─Mooncake [da2b9cff] log:
 │   ├─possible versions are: 0.3.0 - 0.4.53 or uninstalled
 │   └─restricted to versions 0.4.19 - 0.4 by project [23fc8c3f], leaving only versions: 0.4.19 - 0.4.53
 │     └─project [23fc8c3f] log:
 │       ├─possible versions are: 0.0.0 or uninstalled
 │       └─project [23fc8c3f] is fixed to version 0.0.0
 └─restricted by compatibility requirements with Bijectors [76274a88] to versions: 0.31.0 or uninstalled — no versions left
   └─Bijectors [76274a88] log:
     ├─possible versions are: 0.1.0 - 0.15.2 or uninstalled
     ├─restricted to versions 0.14 - 0.15 by Turing [fce5fe82], leaving only versions: 0.14.0 - 0.15.2
     │ └─Turing [fce5fe82] log: see above
     └─restricted by compatibility requirements with Enzyme [7da242da] to versions: [0.1.0 - 0.[13](https://github.com/TuringLang/Turing.jl/actions/runs/12122531930/job/33795922092?pr=1887#step:8:14).16, 0.15.0 - 0.15.2] or uninstalled, leaving only versions: 0.15.0 - 0.15.2
       └─Enzyme [7da242da] log:
         ├─possible versions are: 0.1.0 - 0.13.18 or uninstalled
         └─restricted to versions 0.13 by project [23fc8c3f], leaving only versions: 0.13.0 - 0.13.18
           └─project [23fc8c3f] log: see above
Stacktrace:

I'd try to help, but I don't have permission to edit things or rerun CI xD

@penelopeysm
Copy link
Member

penelopeysm commented Dec 2, 2024

It should resolve with Mooncake 0.4.54 as that allows for DPPL=0.31.0. Don't know why CI isn't picking up the new version.

└─Mooncake [da2b9cff] log:
     ├─possible versions are: 0.3.0 - 0.4.53 or uninstalled

0.4.54 should have been available a few hours ago.

@mhauru
Copy link
Member

mhauru commented Dec 4, 2024

The registry issue is sorted, now merrily running with the latest Enzyme.

@wsmoses
Copy link
Collaborator

wsmoses commented Dec 4, 2024

Check ADType: Error During Test at /home/runner/work/Turing.jl/Turing.jl/test/mcmc/hmc.jl:334
  Got exception outside of a @test
  ArgumentError: Unsupported ADType: ADTypes.AutoEnzyme{Nothing, Nothing}
  Stacktrace:
    [1] Main.ADUtils.ADTypeCheckContext(adbackend::ADTypes.AutoEnzyme{Nothing, Nothing}, child::DynamicPPL.DefaultContext)
      @ Main.ADUtils ~/work/Turing.jl/Turing.jl/test/test_utils/ad_utils.jl:102
    [2] macro expansion
      @ ~/work/Turing.jl/Turing.jl/test/mcmc/hmc.jl:336 [inlined]
    [3] macro expansion
      @ /opt/hostedtoolcache/julia/1.11.2/x86/share/julia/stdlib/v1.11/Test/src/Test.jl:1704 [inlined]
    [4] macro expansion
      @ ~/work/Turing.jl/Turing.jl/test/mcmc/hmc.jl:335 [inlined]
    [5] macro expansion
      @ /opt/hostedtoolcache/julia/1.11.2/x86/share/julia/stdlib/v1.11/Test/src/Test.jl:1793 [inlined]
    [6] top-level scope
      @ ~/work/Turing.jl/Turing.jl/test/mcmc/hmc.jl:22
    [7] include(fname::String)
      @ Main ./sysimg.jl:38
    [8] macro expansion
      @ ~/.julia/packages/TimerOutputs/6KVfH/src/TimerOutput.jl:237 [inlined]
    [9] macro expansion
      @ ~/work/Turing.jl/Turing.jl/test/runtests.jl:26 [inlined]
   [10] macro expansion
      @ /opt/hostedtoolcache/julia/1.11.2/x86/share/julia/stdlib/v1.11/Test/src/Test.jl:1704 [inlined]
   [11] macro expansion
      @ ~/work/Turing.jl/Turing.jl/test/runtests.jl:56 [inlined]
   [12] macro expansion
      @ ~/.julia/packages/TimerOutputs/6KVfH/src/TimerOutput.jl:237 [inlined]
   [13] macro expansion
      @ ~/work/Turing.jl/Turing.jl/test/runtests.jl:54 [inlined]
   [14] macro expansion
      @ /opt/hostedtoolcache/julia/1.11.2/x86/share/julia/stdlib/v1.11/Test/src/Test.jl:1704 [inlined]
   [15] top-level scope
      @ ~/work/Turing.jl/Turing.jl/test/runtests.jl:34
   [16] include(fname::String)
      @ Main ./sysimg.jl:38
   [17] top-level scope
      @ none:6
   [18] eval
      @ ./boot.jl:430 [inlined]
   [19] exec_options(opts::Base.JLOptions)
      @ Base ./client.jl:296
   [20] _start()
      @ Base ./client.jl:531

Looks like something in turing needs to be updated?

@mhauru
Copy link
Member

mhauru commented Dec 5, 2024

Fixed the above issue that @wsmoses pointed out.

We are seeing a lot of illegal type analysis errors, which I suspect are all instances of EnzymeAD/Enzyme.jl#2169.
Too many to reasonably mark as broken, I think we need to get that fixed first and then have another look.

@wsmoses
Copy link
Collaborator

wsmoses commented Dec 5, 2024

So this is indicative of a union (which isn't presently fully supported, at least without setting Enzyme.API.strictAliasing!(false) which may permit it).

Something around here https://github.com/TuringLang/DynamicPPL.jl/blob/2252a9b6012da8e2ac56353770a0f848f6874357/src/abstract_varinfo.jl#L791 is sometimes an int and other times a double. I think this will need to be fixed on the turing side.

@wsmoses wsmoses mentioned this pull request Dec 6, 2024
10 tasks
@mhauru
Copy link
Member

mhauru commented Dec 6, 2024

If

@model function gdemo_copy()
    s ~ InverseGamma(2, 3)
end

fails, and it does, then I assume most Turing models are affected, since they don't really get simpler than that. We could look into trying to chase down that Union somewhere, but I'm surprised that this is an issue given that we should have type stability at most function boundaries for such a simple model, especially as "deep" in as invlink_with_logpdf (we've made sure of that for performance reasons). It could end up taking quite a lot of time to track down the issue on the Turing side.

Enzyme.API.strictAliasing!(false) doesn't seem to save us, the simplest MWE in the issue I made still fails.

@wsmoses has something in Enzyme gotten stricter so that these illegal type analysis errors come up more often nowadays? Some of the errors are from tests that already passed at an earlier point.

@yebai
Copy link
Member

yebai commented Dec 16, 2024

So this is indicative of a union (which isn't presently fully supported, at least without setting Enzyme.API.strictAliasing!(false) which may permit it). Something around here https://github.com/TuringLang/DynamicPPL.jl/blob/2252a9b6012da8e2ac56353770a0f848f6874357/src/abstract_varinfo.jl#L791 is sometimes an int and other times a double. I think this will need to be fixed on the turing side.

Ideally, a proper fix should be added to Enzyme instead of requiring packages like Turing.jl / DynamicPPL.jl to work around it. One good reason is that Turing allows arbitrary Julia code inside the @model macro, which will get hit again if users write code that involves union.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.