
Update onnxruntime to 1.20.1 #40

Merged - 9 commits, Dec 1, 2024
Conversation

@svilupp (Contributor) commented Nov 25, 2024

First of all, thank you for this amazing package!!

I've hit some issues with the outdated binary we were using (I needed IR v10), so I've updated the repo accordingly.
In addition, I've pointed the macOS aarch64 platform at the correct binary - upstream now produces a native one.

TODO list

- [x] Update all artifact links to 1.20.1; update SHA1 and SHA256 values (see the hashing sketch after this list)
- [x] Update src/versions.jl to CUDA 12.0 (as per the onnxruntime 1.19 announcement: "Default GPU packages use CUDA 12.x and Cudnn 9.x (previously CUDA 11.x/CuDNN 8.x). CUDA 11.x/CuDNN 8.x packages are moved to the aiinfra VS feed.")
- [x] Update the reference to CUDA 11.8 on the README page
- [x] Ran tests (locally) -- all passed
- [x] Verified that the package loads an IR10 model and everything works
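
As referenced in the list above, here is a minimal sketch of recomputing the hash values for an Artifacts.toml entry; the URL is illustrative and the snippet is not part of this repo:

```julia
using Downloads, SHA, Tar

# Illustrative URL; substitute each platform's release tarball.
url  = "https://github.com/microsoft/onnxruntime/releases/download/v1.20.1/onnxruntime-linux-x64-1.20.1.tgz"
path = Downloads.download(url)

sha256sum = bytes2hex(open(sha256, path))                     # `sha256` field in Artifacts.toml
tree_sha1 = open(io -> Tar.tree_hash(io), `gzip -cd $path`)   # `git-tree-sha1` field
println("sha256 = ", sha256sum)
println("git-tree-sha1 = ", tree_sha1)
```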

@jw3126 (Owner) commented Nov 26, 2024

Thanks a lot! I can help later with the Windows CI failure. The issue is that the artifact system only understands tar, while the official binaries are zip.
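
For the record, a minimal sketch of that repacking step in Julia, assuming CodecZlib is available for compression; file names are illustrative:

```julia
using Tar, CodecZlib  # CodecZlib is an assumption; any gzip writer works

# Repack the extracted zip contents as the .tgz that Pkg's artifact system expects.
src = "onnxruntime-win-x64-1.20.1"        # directory unpacked from the zip
io  = GzipCompressorStream(open(src * ".tgz", "w"))
Tar.create(src, io)
close(io)
```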

@svilupp (Contributor, Author) commented Nov 26, 2024

Ah, got it! Thank you

@jw3126 (Owner) commented Nov 26, 2024

I updated the binaries. Also, I changed the osx URL to universal2. Is that the right thing? I would like this to run out of the box on, say, 4-year-old Macs. I assume the arm binaries don't give you that, but I am not a Mac user and am interested in your comments.

@jw3126 (Owner) commented Nov 26, 2024

If the osx platform needs to be tweaked further, can you do it in a PR that updates https://github.com/jw3126/ONNXRunTimeArtifacts?

@svilupp (Contributor, Author) commented Nov 26, 2024

> I updated the binaries. Also, I changed the osx URL to universal2. Is that the right thing? I would like this to run out of the box on, say, 4-year-old Macs. I assume the arm binaries don't give you that, but I am not a Mac user and am interested in your comments.

The artifacts are platform-specific: old Macs are on x86_64 with the universal2 tarball, and new Macs (aarch64) get their own tarball. Both will work as they should.
I'm reverting the change back to what I had.
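
For context, platform selection happens through entries like these in Artifacts.toml (a sketch; the artifact name, URL, and hashes are placeholders):

```toml
[[onnxruntime]]
arch = "aarch64"
os = "macos"
git-tree-sha1 = "0000000000000000000000000000000000000000"  # placeholder

    [[onnxruntime.download]]
    url = "https://github.com/microsoft/onnxruntime/releases/download/v1.20.1/onnxruntime-osx-arm64-1.20.1.tgz"
    sha256 = "0000000000000000000000000000000000000000000000000000000000000000"  # placeholder
```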

@svilupp (Contributor, Author) commented Nov 26, 2024

I went through the CUDA test failures and I think they trace back to GPUCompiler.jl.

It seems to be a bug that has been resolved in newer versions: https://github.com/JuliaGPU/GPUCompiler.jl/blob/09b4708ba12e0b19e40f85c64e9105cf666c4d62/src/GPUCompiler.jl#L60C2-L63C64

It has this block, suggesting it's a known issue:

```julia
if pkgver !== nothing
    # XXX: Base.pkgversion is buggy and sometimes returns nothing, see e.g.
    # JuliaLang/PackageCompiler.jl#896 and JuliaGPU/GPUCompiler.jl#593
    dir = joinpath(dir, "v$(pkgver.major).$(pkgver.minor)")
end
```

I think we can ignore it.

@svilupp (Contributor, Author) commented Nov 27, 2024

Do you have any further thoughts on the PR?

@jw3126 (Owner) commented Nov 28, 2024

Thanks!

@svilupp (Contributor, Author) commented Nov 28, 2024

I'll update the script.

For the error, it's not solvable by us, since a dependency fails.
I believe the solution is to update the dep tree: we are at GPUCompiler v0.25 and there is already a 1.0. Somewhere in between is the fix (I linked it above).
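
A sketch of checking what holds GPUCompiler back in the test environment (the add call simply errors if compat blocks the newer version):

```julia
using Pkg
Pkg.activate("test")
Pkg.status("GPUCompiler")                        # shows the currently resolved version
Pkg.add(name="GPUCompiler", version="0.26.7")    # errors if compat blocks the resolve
```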

I'm happy to experiment with it, but can you give me permission to run the CI? I don't have a GPU, so I can't reproduce it, and it will take forever if I can't iterate quickly.

@svilupp (Contributor, Author) commented Nov 28, 2024

Saving my notes here.
The bug was fixed on July 4th: JuliaGPU/GPUCompiler.jl#594, which would make the fixed release GPUCompiler 0.26.7.

@svilupp (Contributor, Author) commented Nov 28, 2024

@jw3126 Why do we have a CuDNN dep in this package? I understand CUDA for the extension, but not CuDNN.

I'm fairly confident that the failing version of GPUCompiler is forced by the CuDNN version.

It seems that it's been phased out: https://github.com/JuliaAttic/CUDNN.jl

They suggest using CuArrays.jl, which has been fully absorbed into CUDA.jl.

So, all in all, I'd suggest removing the CuDNN dep instead of tweaking the versions.

@jw3126 (Owner) commented Nov 28, 2024

> @jw3126 Why do we have a CuDNN dep in this package? I understand CUDA for the extension, but not CuDNN.

Would love to get rid of that. In the past we had issues with libcudnn (or a similar name) not being found by onnxruntime, and this is how we dealt with it. Nowadays there may or may not be better ways. If you can make Linux (or Windows) GPU support work with a lighter dependency, that would be awesome.

@svilupp (Contributor, Author) commented Nov 28, 2024

I don't know enough to guarantee that it works, and I have no way to test it.

As a first pass, I've bumped cuDNN to 1.3, which resolves to patch version 1.3.2 and therefore pulls in the fixed GPUCompiler v0.26.7.

You can trigger the CI at your convenience.

@jw3126 (Owner) commented Nov 28, 2024

Thanks, I will check locally later whether it works on Linux.

@svilupp (Contributor, Author) commented Nov 28, 2024

The current errors are CUDA.jl + Julia 1.12 related. I don't think 1.12 is supported yet; see their own CI: https://buildkite.com/julialang/cuda-dot-jl/builds/5531#0193683c-531a-4f7e-ad30-23e4e167be72

@GunnarFarneback (Collaborator)

> Why do we have a CuDNN dep in this package? I understand CUDA for the extension, but not CuDNN.

Both the CUDA and cuDNN weak dependencies are sort of fake. We really only depend on them in order to get the right artifacts loaded, including libcudnn, so that libonnxruntime can link to them.
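
If one wants to sanity-check that directly, here is a sketch assuming the standard JuliaGPU packaging, where CUDNN_jll exposes a `libcudnn` product:

```julia
using Libdl
using CUDNN_jll  # assumption: standard JLL layout with a `libcudnn` product

println(CUDNN_jll.libcudnn)       # path to libcudnn.so.9 from the artifact
Libdl.dlopen(CUDNN_jll.libcudnn)  # throws if the shared object cannot be loaded
```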

@jw3126 (Owner) commented Nov 29, 2024

I checked and could not get this branch to work locally. libcudnn still does not ship with CUDA.jl.

@svilupp (Contributor, Author) commented Nov 29, 2024

Ah, that's a shame. Thanks for trying!
I guess I'll keep using my fork.

Btw, why do we need to pass CI for nightly builds? Shouldn't we check 1.11 instead?

@jw3126 (Owner) commented Nov 29, 2024

Yeah, sorry, but convnets are pretty common; we can't break them. Julia 1 is always the current release (1.11 right now). Breaking nightly is not a merge blocker.

@svilupp (Contributor, Author) commented Nov 29, 2024

Ah, yes, of course! I overlooked the Julia 1 CI on my phone. So everything is passing besides nightly? That's good!

So will this be merged after all? I took your comment to mean it doesn't work, but CI looks good.

@jw3126 (Owner) commented Nov 29, 2024

It will not be merged. CI passes because we have no GPU coverage (I don't know a way to run GPU CI for free), but my local testing shows that GPU does not work because there is no libcudnn. So if we merged this, we would break GPU support.

@svilupp (Contributor, Author) commented Nov 29, 2024

Got it! Thanks for explaining.

Is there value in updating just the macOS artifact on 1.15 in the current version? It wouldn't help me, but at least it's native - a lot of the Julia community has ARM Macs.

I don't fully understand the failures. Are there any next steps (something to do), or do we just sit and wait?

@jw3126 (Owner) commented Nov 29, 2024

> Is there value in updating just the macOS artifact on 1.15 in the current version? It wouldn't help me, but at least it's native - a lot of the Julia community has ARM Macs.

Sure, good suggestion.

@jw3126 (Owner) commented Nov 29, 2024

I am not exactly keen on this, but making the onnxruntime version a preference is something one could think about.
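
A sketch of what that could look like with Preferences.jl; the preference key is hypothetical, not something this package defines:

```julia
# Inside ONNXRunTime (sketch); "onnxruntime_version" is a hypothetical key.
using Preferences
const ONNXRUNTIME_VERSION = @load_preference("onnxruntime_version", "1.20.1")

# A user would then pick a version in their own environment via:
#   using Preferences, ONNXRunTime
#   set_preferences!(ONNXRunTime, "onnxruntime_version" => "1.19.2")
```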

@GunnarFarneback (Collaborator)

Presumably we could target the CUDA artifacts directly rather than the higher level packages, but first we would need to find where the relevant libraries are. I'll have a look at what has changed in the CUDA packaging.

@GunnarFarneback (Collaborator)

> It seems that it's been phased out: https://github.com/JuliaAttic/CUDNN.jl

That repository and any information there is irrelevant. The current cuDNN package lives here: https://github.com/JuliaGPU/CUDA.jl/tree/master/lib/cudnn.

@GunnarFarneback (Collaborator) commented Nov 29, 2024

As far as I can tell, nothing has changed in the CUDA packaging recently. I think the only update needed is:

```diff
diff --git a/test/LocalPreferences.toml b/test/LocalPreferences.toml
index 5da06c7..ef09e81 100644
--- a/test/LocalPreferences.toml
+++ b/test/LocalPreferences.toml
@@ -1,2 +1,2 @@
 [CUDA_Runtime_jll]
-version = "11.8"
+version = "12.0"
```

Possibly we can also loosen the cuDNN compat. The diff above is effectively what happens if you run `CUDA.set_runtime_version!(v"12.0")` in the test environment.
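
For completeness, a sketch of regenerating the file that way rather than hand-editing it (the activate step assumes the test environment owns LocalPreferences.toml, as above):

```julia
using Pkg
Pkg.activate("test")                 # the environment that owns LocalPreferences.toml
using CUDA
CUDA.set_runtime_version!(v"12.0")   # writes the CUDA_Runtime_jll version preference
```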

We get the right libcudnn (for this version of libonnxruntime) from cuDNN versions 1.3.1, 1.3.2, and 1.4.0 (latest). We have no idea when libcudnn will be bumped next, though (last time it changed version between cuDNN 1.3.0 and 1.3.1), so we should probably set the cuDNN compat to either the conservative `"~1.3.1, =1.4.0"` or the more optimistic `"~1.3.1, ~1.4"`.

For good measure, the cuDNN compat in test/Project.toml should also be bumped to `"1.4"`, although it doesn't matter in practice.
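
In Project.toml terms, the two options above would read as follows (a sketch of the compat entries discussed, not a committed change):

```toml
[compat]
# conservative: only the known-good releases
cuDNN = "~1.3.1, =1.4.0"
# or optimistic: allow future 1.4.x patches as well
# cuDNN = "~1.3.1, ~1.4"
```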

@jw3126 (Owner) commented Nov 29, 2024

@GunnarFarneback thanks! Did you try that? I still get:

```
/onnxruntime_src/onnxruntime/core/session/provider_bridge_ort.cc:1539 onnxruntime::Provider& onnxruntime::ProviderLibrary::Get() [ONNXRuntimeError] : 1 : FAIL : Failed to load library libonnxruntime_providers_cuda.so with error: libcudnn.so.9: cannot open shared object file: No such file or directory
```

with your suggestion.

@GunnarFarneback (Collaborator)

Yes; with the LocalPreferences update and a compat that resolves to cuDNN 1.3.1 or higher, it works locally for me.

@GunnarFarneback (Collaborator)

What versions of cuDNN and CUDNN_jll do you see in the environment when running the tests?

@jw3126 (Owner) commented Nov 29, 2024

I just saw this error:

```
┌ Error: cuDNN is not available for your platform (x86_64-linux-gnu-libgfortran5-cxx11-libstdcxx30-cuda+none-julia_version+1.11.1)
└ @ cuDNN ~/.julia/packages/cuDNN/P9S4N/src/cuDNN.jl:177
  [02a925ec] cuDNN v1.4.0
  [4ee394cb] CUDA_Driver_jll v0.10.4+0
  [76a88914] CUDA_Runtime_jll v0.15.5+0
  [62b44479] CUDNN_jll v9.4.0+0
```

@jw3126 (Owner) commented Nov 29, 2024

I rebooted and recreated the environment. Tests pass, but I get the following warning:

```
2024-11-29 15:48:12.845201050 [W:onnxruntime:defaultenv, conv.cc:425 UpdateState] OP Conv(Conv_0) running in Fallback mode. May be extremely slow.
(ONNXRunTime) pkg> test
     Testing ONNXRunTime
      Status `/tmp/jl_6OAx3Y/Project.toml`
  [052768ef] CUDA v5.5.2
  [e034b28e] ONNXRunTime v1.3.0 `~/.julia/dev/ONNXRunTime`
  [02a925ec] cuDNN v1.4.0
  [8dfed614] Test v1.11.0
      Status `/tmp/jl_6OAx3Y/Manifest.toml`
  [621f4979] AbstractFFTs v1.5.0
  [79e6a3ab] Adapt v4.1.1
  [dce04be8] ArgCheck v2.4.0
⌅ [a9b6321e] Atomix v0.1.0
  [ab4f0b2a] BFloat16s v0.5.0
  [fa961155] CEnum v0.5.0
  [052768ef] CUDA v5.5.2
  [1af6417a] CUDA_Runtime_Discovery v0.3.5
  [3da002f7] ColorTypes v0.12.0
  [5ae59095] Colors v0.13.0
  [34da2185] Compat v4.16.0
  [a8cc5b0e] Crayons v4.1.1
  [9a962f9c] DataAPI v1.16.0
  [a93c6f00] DataFrames v1.7.0
  [864edb3b] DataStructures v0.18.20
  [e2d170a0] DataValueInterfaces v1.0.0
  [ffbed154] DocStringExtensions v0.9.3
  [e2ba6199] ExprTools v0.1.10
  [53c48c17] FixedPointNumbers v0.8.5
⌅ [0c68f7d7] GPUArrays v10.3.1
⌅ [46192b85] GPUArraysCore v0.1.6
⌅ [61eb1bfa] GPUCompiler v0.27.8
  [842dd82b] InlineStrings v1.4.2
  [41ab1584] InvertedIndices v1.3.0
  [82899510] IteratorInterfaceExtensions v1.0.0
  [692b3bcd] JLLWrappers v1.6.1
  [63c18a36] KernelAbstractions v0.9.29
  [929cbde3] LLVM v9.1.3
  [8b046642] LLVMLoopInfo v1.0.0
  [b964fa9f] LaTeXStrings v1.4.0
  [1914dd2f] MacroTools v0.5.13
  [e1d29d7a] Missings v1.2.0
  [5da4648a] NVTX v0.3.5
  [e034b28e] ONNXRunTime v1.3.0 `~/.julia/dev/ONNXRunTime`
  [bac558e1] OrderedCollections v1.7.0
  [2dfb63ee] PooledArrays v1.4.3
  [aea7be01] PrecompileTools v1.2.1
  [21216c6a] Preferences v1.4.3
  [08abe8d2] PrettyTables v2.4.0
  [74087812] Random123 v1.7.0
  [e6cf234a] RandomNumbers v1.6.0
  [189a3867] Reexport v1.2.2
  [ae029012] Requires v1.3.0
  [6c6a2e73] Scratch v1.2.1
  [91c51154] SentinelArrays v1.4.7
  [a2af1166] SortingAlgorithms v1.2.1
  [90137ffa] StaticArrays v1.9.8
  [1e83bf80] StaticArraysCore v1.4.3
  [10745b16] Statistics v1.11.1
  [892a3eda] StringManipulation v0.4.0
  [3783bdb8] TableTraits v1.0.1
  [bd369af6] Tables v1.12.0
  [a759f4b9] TimerOutputs v0.5.25
  [013be700] UnsafeAtomics v0.2.1
  [d80eeb9a] UnsafeAtomicsLLVM v0.2.1
  [02a925ec] cuDNN v1.4.0
  [4ee394cb] CUDA_Driver_jll v0.10.4+0
  [76a88914] CUDA_Runtime_jll v0.15.5+0
  [62b44479] CUDNN_jll v9.4.0+0
  [9c1d0b0a] JuliaNVTXCallbacks_jll v0.2.1+0
  [dad2f222] LLVMExtra_jll v0.0.34+0
  [e98f9f5b] NVTX_jll v3.1.0+2
  [1e29f10c] demumble_jll v1.3.0+0
  [0dad84c5] ArgTools v1.1.2
  [56f22d72] Artifacts v1.11.0
  [2a0f44e3] Base64 v1.11.0
  [ade2ca70] Dates v1.11.0
  [f43a241f] Downloads v1.6.0
  [7b1f6079] FileWatching v1.11.0
  [9fa8497b] Future v1.11.0
  [b77e0a4c] InteractiveUtils v1.11.0
  [4af54fe1] LazyArtifacts v1.11.0
  [b27032c2] LibCURL v0.6.4
  [76f85450] LibGit2 v1.11.0
  [8f399da3] Libdl v1.11.0
  [37e2e46d] LinearAlgebra v1.11.0
  [56ddb016] Logging v1.11.0
  [d6f4376e] Markdown v1.11.0
  [ca575930] NetworkOptions v1.2.0
  [44cfe95a] Pkg v1.11.0
  [de0858da] Printf v1.11.0
  [9a3f8284] Random v1.11.0
  [ea8e919c] SHA v0.7.0
  [9e88b42a] Serialization v1.11.0
  [2f01184e] SparseArrays v1.11.0
  [fa267f1f] TOML v1.0.3
  [a4e569a6] Tar v1.10.0
  [8dfed614] Test v1.11.0
  [cf7118a7] UUIDs v1.11.0
  [4ec0a83e] Unicode v1.11.0
  [e66e0078] CompilerSupportLibraries_jll v1.1.1+0
  [deac9b47] LibCURL_jll v8.6.0+0
  [e37daf67] LibGit2_jll v1.7.2+0
  [29816b5a] LibSSH2_jll v1.11.0+1
  [c8ffd9c3] MbedTLS_jll v2.28.6+0
  [14a3606d] MozillaCACerts_jll v2023.12.12
  [4536629a] OpenBLAS_jll v0.3.27+1
  [bea87d4a] SuiteSparse_jll v7.7.0+0
  [83775a58] Zlib_jll v1.2.13+1
  [8e850b90] libblastrampoline_jll v5.11.0+0
  [8e850ede] nghttp2_jll v1.59.0+0
  [3f19e933] p7zip_jll v17.4.0+2
        Info Packages marked with ⌅ have new versions available but compatibility constraints restrict them from upgrading.
     Testing Running tests...
Test Summary:               | Pass  Total  Time
ONNXRunTime library version |    1      1  0.2s
Test Summary:                           | Pass  Total  Time
Minimum CUDA runtime version in README. |    3      3  0.0s
Test Summary: | Pass  Total  Time
high level    |  112    112  1.4s
Test Summary: | Pass  Total  Time
Session       |   25     25  0.5s
Test Summary:    | Pass  Total  Time
tensor roundtrip |    9      9  0.1s
2024-11-29 15:48:12.845201050 [W:onnxruntime:defaultenv, conv.cc:425 UpdateState] OP Conv(Conv_0) running in Fallback mode. May be extremely slow.
Test Summary:   | Pass  Total  Time
CUDA high level |   22     22  0.3s
Test Summary:  | Pass  Total  Time
CUDA low level |   11     11  0.1s
     Testing ONNXRunTime tests passed
```

@GunnarFarneback (Collaborator)

I get that warning too. It happens in https://github.com/jw3126/ONNXRunTime.jl/blob/main/test/test_cuda.jl#L35 when `conv_search` is `:DEFAULT`. It seems like a regression in the CUDA execution provider, but generally I guess the execution providers are allowed to fall back to CPU if they feel the need to. I don't think we can do much about it other than revising our tests.

@svilupp (Contributor, Author) commented Nov 29, 2024

Oh, that's amazing progress! Thank you both for looking into it.

I checked the 1.9 failures -- it's because cuDNN 1.4 dropped support for it: https://github.com/JuliaGPU/CUDA.jl/blob/7ff012f21ecaf9364a348289a136deebe299e8d9/lib/cudnn/Project.toml#L17

@jw3126 (Owner) commented Nov 30, 2024

@GunnarFarneback does this PR look good to you? In particular, are there any other places that need version adjustments?

@jw3126 (Owner) commented Nov 30, 2024

> Oh, that's amazing progress! Thank you both for looking into it.
>
> I checked the 1.9 failures -- it's because cuDNN 1.4 dropped support for it: https://github.com/JuliaGPU/CUDA.jl/blob/7ff012f21ecaf9364a348289a136deebe299e8d9/lib/cudnn/Project.toml#L17

Could you add 1.10 to CI, just so that we are conscious of what our minimum Julia version is? If it does not run on 1.10, that is also fine; just bump to 1.11, even if that means running CI twice.

@svilupp (Contributor, Author) commented Nov 30, 2024

I did that last night. Check the CI results and the compat in Project.toml.

Or do you mean something else?

@jw3126 (Owner) commented Nov 30, 2024

> I did that last night. Check the CI results and the compat in Project.toml.
>
> Or do you mean something else?

I missed the scrollbar, sorry 😄
[screenshot]

@jw3126 jw3126 merged commit 4a5a2bc into jw3126:main on Dec 1, 2024 (7 of 10 checks passed).