
Update onnxruntime to 1.20.1 #40

Merged - 9 commits, Dec 1, 2024
Conversation

@svilupp (Contributor) commented Nov 25, 2024

First of all, thank you for this amazing package!!

I've hit some issues with the outdated binary we were using (I needed IR v10), so I've updated the repo accordingly.
In addition, I've pointed the macOS aarch64 platform at the correct binary - upstream now produces a native one.

TODO list

- [x] Update all artifact links to 1.20.1; update SHA1 and SHA256 values (see the hashing sketch after this list)
- [x] Update src/versions.jl to CUDA 12.0 (as per the onnxruntime 1.19 announcement: "Default GPU packages use CUDA 12.x and Cudnn 9.x (previously CUDA 11.x/CuDNN 8.x). CUDA 11.x/CuDNN 8.x packages are moved to the aiinfra VS feed.")
- [x] Update the reference to CUDA 11.8 on the README page
- [x] Ran tests (locally) -- all passed
- [x] Verified that the package loads an IR10 model and everything works
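
As referenced in the list above, here is a minimal sketch of recomputing the hash values for an Artifacts.toml entry; the URL is illustrative and the snippet is not part of this repo:

```julia
using Downloads, SHA, Tar

# Illustrative URL; substitute each platform's release tarball.
url  = "https://github.com/microsoft/onnxruntime/releases/download/v1.20.1/onnxruntime-linux-x64-1.20.1.tgz"
path = Downloads.download(url)

sha256sum = bytes2hex(open(sha256, path))                     # `sha256` field in Artifacts.toml
tree_sha1 = open(io -> Tar.tree_hash(io), `gzip -cd $path`)   # `git-tree-sha1` field
println("sha256 = ", sha256sum)
println("git-tree-sha1 = ", tree_sha1)
```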

@jw3126 (Owner) commented Nov 26, 2024

Thanks a lot! I can help later with the Windows CI failure. The issue is that the artifact system only understands tar, while the official binaries are zip.
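
For the record, a minimal sketch of that repacking step in Julia, assuming CodecZlib is available for compression; file names are illustrative:

```julia
using Tar, CodecZlib  # CodecZlib is an assumption; any gzip writer works

# Repack the extracted zip contents as the .tgz that Pkg's artifact system expects.
src = "onnxruntime-win-x64-1.20.1"        # directory unpacked from the zip
io  = GzipCompressorStream(open(src * ".tgz", "w"))
Tar.create(src, io)
close(io)
```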

@svilupp (Contributor, Author) commented Nov 26, 2024

Ah, got it! Thank you

@jw3126 (Owner) commented Nov 26, 2024

I updated the binaries. Also, I changed the osx URL to universal2. Is that the right thing? I would like this to run out of the box on, say, 4-year-old Macs. I assume the arm binaries don't give you that, but I am not a Mac user and am interested in your comments.

@jw3126 (Owner) commented Nov 26, 2024

If the osx platform needs to be tweaked further, can you do it in a PR that updates https://github.com/jw3126/ONNXRunTimeArtifacts?

@svilupp (Contributor, Author) commented Nov 26, 2024

> I updated the binaries. Also, I changed the osx URL to universal2. Is that the right thing? I would like this to run out of the box on, say, 4-year-old Macs. I assume the arm binaries don't give you that, but I am not a Mac user and am interested in your comments.

The artifacts are platform-specific: old Macs are on x86_64 with the universal2 tarball, and new Macs (aarch64) get their own tarball. Both will work as they should.
I'm reverting the change back to what I had.
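
For context, platform selection happens through entries like these in Artifacts.toml (a sketch; the artifact name, URL, and hashes are placeholders):

```toml
[[onnxruntime]]
arch = "aarch64"
os = "macos"
git-tree-sha1 = "0000000000000000000000000000000000000000"  # placeholder

    [[onnxruntime.download]]
    url = "https://github.com/microsoft/onnxruntime/releases/download/v1.20.1/onnxruntime-osx-arm64-1.20.1.tgz"
    sha256 = "0000000000000000000000000000000000000000000000000000000000000000"  # placeholder
```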

@svilupp (Contributor, Author) commented Nov 26, 2024

I went through the CUDA test failures and I think they trace back to GPUCompiler.jl.

It seems to be a bug that has been resolved in newer versions: https://github.com/JuliaGPU/GPUCompiler.jl/blob/09b4708ba12e0b19e40f85c64e9105cf666c4d62/src/GPUCompiler.jl#L60C2-L63C64

It has this block, suggesting it's a known issue:

```julia
if pkgver !== nothing
    # XXX: Base.pkgversion is buggy and sometimes returns nothing, see e.g.
    # JuliaLang/PackageCompiler.jl#896 and JuliaGPU/GPUCompiler.jl#593
    dir = joinpath(dir, "v$(pkgver.major).$(pkgver.minor)")
end
```

I think we can ignore it.

@svilupp (Contributor, Author) commented Nov 27, 2024

Do you have any further thoughts on the PR?

@jw3126 (Owner) commented Nov 28, 2024

Thanks!

@svilupp (Contributor, Author) commented Nov 28, 2024

I'll update the script.

For the error, it's not solvable by us, since a dependency fails.
I believe the solution is to update the dep tree: we are at GPUCompiler v0.25 and there is already a 1.0. Somewhere in between is the fix (I linked it above).
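
A sketch of checking what holds GPUCompiler back in the test environment (the add call simply errors if compat blocks the newer version):

```julia
using Pkg
Pkg.activate("test")
Pkg.status("GPUCompiler")                        # shows the currently resolved version
Pkg.add(name="GPUCompiler", version="0.26.7")    # errors if compat blocks the resolve
```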

I'm happy to experiment with it, but can you give me permission to run the CI? I don't have a GPU, so I can't reproduce it, and it will take forever if I can't iterate quickly.

@svilupp (Contributor, Author) commented Nov 28, 2024

Saving my notes here.
The bug was fixed on July 4th: JuliaGPU/GPUCompiler.jl#594, which would make the fixed release GPUCompiler 0.26.7.

@svilupp (Contributor, Author) commented Nov 28, 2024

@jw3126 Why do we have a CuDNN dep in this package? I understand CUDA for the extension, but not CuDNN.

I'm fairly confident that the failing version of GPUCompiler is forced by the CuDNN version.

It seems that it's been phased out: https://github.com/JuliaAttic/CUDNN.jl

They suggest using CuArrays.jl, which has been fully absorbed into CUDA.jl.

So, all in all, I'd suggest removing the CuDNN dep instead of tweaking the versions.

@jw3126 (Owner) commented Nov 28, 2024

> @jw3126 Why do we have a CuDNN dep in this package? I understand CUDA for the extension, but not CuDNN.

Would love to get rid of that. In the past we had issues with libcudnn (or a similar name) not being found by onnxruntime, and this is how we dealt with it. Nowadays there may or may not be better ways. If you can make Linux (or Windows) GPU support work with a lighter dependency, that would be awesome.

@svilupp (Contributor, Author) commented Nov 28, 2024

I don't know enough to guarantee that it works, and I have no way to test it.

As a first pass, I've bumped cuDNN to 1.3, which resolves to patch version 1.3.2 and therefore pulls in the fixed GPUCompiler v0.26.7.

You can trigger the CI at your convenience.

@jw3126 (Owner) commented Nov 28, 2024

Thanks, I will check locally later whether it works on Linux.

@svilupp (Contributor, Author) commented Nov 28, 2024

The current errors are CUDA.jl + Julia 1.12 related. I don't think 1.12 is supported yet; see their own CI: https://buildkite.com/julialang/cuda-dot-jl/builds/5531#0193683c-531a-4f7e-ad30-23e4e167be72

@GunnarFarneback (Collaborator)

> Why do we have a CuDNN dep in this package? I understand CUDA for the extension, but not CuDNN.

Both the CUDA and cuDNN weak dependencies are sort of fake. We really only depend on them in order to get the right artifacts loaded, including libcudnn, so that libonnxruntime can link to them.
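
If one wants to sanity-check that directly, here is a sketch assuming the standard JuliaGPU packaging, where CUDNN_jll exposes a `libcudnn` product:

```julia
using Libdl
using CUDNN_jll  # assumption: standard JLL layout with a `libcudnn` product

println(CUDNN_jll.libcudnn)       # path to libcudnn.so.9 from the artifact
Libdl.dlopen(CUDNN_jll.libcudnn)  # throws if the shared object cannot be loaded
```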

@jw3126 (Owner) commented Nov 29, 2024

I checked and could not get this branch to work locally. libcudnn still does not ship with CUDA.jl.

@svilupp (Contributor, Author) commented Nov 29, 2024

Ah, that's a shame. Thanks for trying!
I guess I'll keep using my fork.

Btw, why do we need to pass CI for nightly builds? Shouldn't we check 1.11 instead?

@jw3126 (Owner) commented Nov 29, 2024

Yeah, sorry, but convnets are pretty common; we can't break them. Julia 1 is always the current release (1.11 right now). Breaking nightly is not a merge blocker.

@svilupp (Contributor, Author) commented Nov 29, 2024

Ah, yes, of course! I overlooked the Julia 1 CI on my phone. So everything is passing besides nightly? That's good!

So will this be merged after all? I took your comment to mean it doesn't work, but CI looks good.

@jw3126 (Owner) commented Nov 29, 2024

It will not be merged. CI passes because we have no GPU coverage (I don't know a way to run GPU CI for free), but my local testing shows that GPU does not work because there is no libcudnn. So if we merged this, we would break GPU support.

@svilupp (Contributor, Author) commented Nov 29, 2024

Got it! Thanks for explaining.

Is there value in updating just the macOS artifact on 1.15 in the current version? It wouldn't help me, but at least it's native - a lot of the Julia community has ARM Macs.

I don't fully understand the failures. Are there any next steps (something to do), or do we just sit and wait?

@jw3126 (Owner) commented Nov 29, 2024

> Is there value in updating just the macOS artifact on 1.15 in the current version? It wouldn't help me, but at least it's native - a lot of the Julia community has ARM Macs.

Sure, good suggestion.

@jw3126 (Owner) commented Nov 29, 2024

I am not exactly keen on this, but making the onnxruntime version a preference is something one could think about.
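
A sketch of what that could look like with Preferences.jl; the preference key is hypothetical, not something this package defines:

```julia
# Inside ONNXRunTime (sketch); "onnxruntime_version" is a hypothetical key.
using Preferences
const ONNXRUNTIME_VERSION = @load_preference("onnxruntime_version", "1.20.1")

# A user would then pick a version in their own environment via:
#   using Preferences, ONNXRunTime
#   set_preferences!(ONNXRunTime, "onnxruntime_version" => "1.19.2")
```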

@GunnarFarneback (Collaborator)

Presumably we could target the CUDA artifacts directly rather than the higher level packages, but first we would need to find where the relevant libraries are. I'll have a look at what has changed in the CUDA packaging.

@GunnarFarneback (Collaborator)

> It seems that it's been phased out: https://github.com/JuliaAttic/CUDNN.jl

That repository and any information there is irrelevant. The current cuDNN package lives here: https://github.com/JuliaGPU/CUDA.jl/tree/master/lib/cudnn.

@GunnarFarneback (Collaborator) commented Nov 29, 2024

As far as I can tell, nothing has changed in the CUDA packaging recently. I think the only update needed is:

```diff
diff --git a/test/LocalPreferences.toml b/test/LocalPreferences.toml
index 5da06c7..ef09e81 100644
--- a/test/LocalPreferences.toml
+++ b/test/LocalPreferences.toml
@@ -1,2 +1,2 @@
 [CUDA_Runtime_jll]
-version = "11.8"
+version = "12.0"
```

Possibly we can also loosen the cuDNN compat. The diff above is effectively what happens if you run `CUDA.set_runtime_version!(v"12.0")` in the test environment.
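
For completeness, a sketch of regenerating the file that way rather than hand-editing it (the activate step assumes the test environment owns LocalPreferences.toml, as above):

```julia
using Pkg
Pkg.activate("test")                 # the environment that owns LocalPreferences.toml
using CUDA
CUDA.set_runtime_version!(v"12.0")   # writes the CUDA_Runtime_jll version preference
```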

We get the right libcudnn (for this version of libonnxruntime) from cuDNN versions 1.3.1, 1.3.2, and 1.4.0 (latest). We have no idea when libcudnn will be bumped next, though (last time it changed version between cuDNN 1.3.0 and 1.3.1), so we should probably set the cuDNN compat to either the conservative `"~1.3.1, =1.4.0"` or the more optimistic `"~1.3.1, ~1.4"`.

For good measure, the cuDNN compat in test/Project.toml should also be bumped to `"1.4"`, although it doesn't matter in practice.
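
In Project.toml terms, the two options above would read as follows (a sketch of the compat entries discussed, not a committed change):

```toml
[compat]
# conservative: only the known-good releases
cuDNN = "~1.3.1, =1.4.0"
# or optimistic: allow future 1.4.x patches as well
# cuDNN = "~1.3.1, ~1.4"
```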

@jw3126 (Owner) commented Nov 29, 2024

@GunnarFarneback thanks! Did you try that? I still get:

```
/onnxruntime_src/onnxruntime/core/session/provider_bridge_ort.cc:1539 onnxruntime::Provider& onnxruntime::ProviderLibrary::Get() [ONNXRuntimeError] : 1 : FAIL : Failed to load library libonnxruntime_providers_cuda.so with error: libcudnn.so.9: cannot open shared object file: No such file or directory
```

with your suggestion.

@GunnarFarneback (Collaborator)

Yes; with the LocalPreferences update and a compat that resolves to cuDNN 1.3.1 or higher, it works locally for me.

@GunnarFarneback (Collaborator)

What versions of cuDNN and CUDNN_jll do you see in the environment when running the tests?

@jw3126 (Owner) commented Nov 29, 2024

I just saw this error:

```
┌ Error: cuDNN is not available for your platform (x86_64-linux-gnu-libgfortran5-cxx11-libstdcxx30-cuda+none-julia_version+1.11.1)
└ @ cuDNN ~/.julia/packages/cuDNN/P9S4N/src/cuDNN.jl:177
  [02a925ec] cuDNN v1.4.0
  [4ee394cb] CUDA_Driver_jll v0.10.4+0
  [76a88914] CUDA_Runtime_jll v0.15.5+0
  [62b44479] CUDNN_jll v9.4.0+0
```

@jw3126 (Owner) commented Nov 29, 2024

I rebooted and recreated the environment. Tests pass, but I get the following warning:

```
2024-11-29 15:48:12.845201050 [W:onnxruntime:defaultenv, conv.cc:425 UpdateState] OP Conv(Conv_0) running in Fallback mode. May be extremely slow.
(ONNXRunTime) pkg> test
     Testing ONNXRunTime
      Status `/tmp/jl_6OAx3Y/Project.toml`
  [052768ef] CUDA v5.5.2
  [e034b28e] ONNXRunTime v1.3.0 `~/.julia/dev/ONNXRunTime`
  [02a925ec] cuDNN v1.4.0
  [8dfed614] Test v1.11.0
      Status `/tmp/jl_6OAx3Y/Manifest.toml`
  [621f4979] AbstractFFTs v1.5.0
  [79e6a3ab] Adapt v4.1.1
  [dce04be8] ArgCheck v2.4.0
⌅ [a9b6321e] Atomix v0.1.0
  [ab4f0b2a] BFloat16s v0.5.0
  [fa961155] CEnum v0.5.0
  [052768ef] CUDA v5.5.2
  [1af6417a] CUDA_Runtime_Discovery v0.3.5
  [3da002f7] ColorTypes v0.12.0
  [5ae59095] Colors v0.13.0
  [34da2185] Compat v4.16.0
  [a8cc5b0e] Crayons v4.1.1
  [9a962f9c] DataAPI v1.16.0
  [a93c6f00] DataFrames v1.7.0
  [864edb3b] DataStructures v0.18.20
  [e2d170a0] DataValueInterfaces v1.0.0
  [ffbed154] DocStringExtensions v0.9.3
  [e2ba6199] ExprTools v0.1.10
  [53c48c17] FixedPointNumbers v0.8.5
⌅ [0c68f7d7] GPUArrays v10.3.1
⌅ [46192b85] GPUArraysCore v0.1.6
⌅ [61eb1bfa] GPUCompiler v0.27.8
  [842dd82b] InlineStrings v1.4.2
  [41ab1584] InvertedIndices v1.3.0
  [82899510] IteratorInterfaceExtensions v1.0.0
  [692b3bcd] JLLWrappers v1.6.1
  [63c18a36] KernelAbstractions v0.9.29
  [929cbde3] LLVM v9.1.3
  [8b046642] LLVMLoopInfo v1.0.0
  [b964fa9f] LaTeXStrings v1.4.0
  [1914dd2f] MacroTools v0.5.13
  [e1d29d7a] Missings v1.2.0
  [5da4648a] NVTX v0.3.5
  [e034b28e] ONNXRunTime v1.3.0 `~/.julia/dev/ONNXRunTime`
  [bac558e1] OrderedCollections v1.7.0
  [2dfb63ee] PooledArrays v1.4.3
  [aea7be01] PrecompileTools v1.2.1
  [21216c6a] Preferences v1.4.3
  [08abe8d2] PrettyTables v2.4.0
  [74087812] Random123 v1.7.0
  [e6cf234a] RandomNumbers v1.6.0
  [189a3867] Reexport v1.2.2
  [ae029012] Requires v1.3.0
  [6c6a2e73] Scratch v1.2.1
  [91c51154] SentinelArrays v1.4.7
  [a2af1166] SortingAlgorithms v1.2.1
  [90137ffa] StaticArrays v1.9.8
  [1e83bf80] StaticArraysCore v1.4.3
  [10745b16] Statistics v1.11.1
  [892a3eda] StringManipulation v0.4.0
  [3783bdb8] TableTraits v1.0.1
  [bd369af6] Tables v1.12.0
  [a759f4b9] TimerOutputs v0.5.25
  [013be700] UnsafeAtomics v0.2.1
  [d80eeb9a] UnsafeAtomicsLLVM v0.2.1
  [02a925ec] cuDNN v1.4.0
  [4ee394cb] CUDA_Driver_jll v0.10.4+0
  [76a88914] CUDA_Runtime_jll v0.15.5+0
  [62b44479] CUDNN_jll v9.4.0+0
  [9c1d0b0a] JuliaNVTXCallbacks_jll v0.2.1+0
  [dad2f222] LLVMExtra_jll v0.0.34+0
  [e98f9f5b] NVTX_jll v3.1.0+2
  [1e29f10c] demumble_jll v1.3.0+0
  [0dad84c5] ArgTools v1.1.2
  [56f22d72] Artifacts v1.11.0
  [2a0f44e3] Base64 v1.11.0
  [ade2ca70] Dates v1.11.0
  [f43a241f] Downloads v1.6.0
  [7b1f6079] FileWatching v1.11.0
  [9fa8497b] Future v1.11.0
  [b77e0a4c] InteractiveUtils v1.11.0
  [4af54fe1] LazyArtifacts v1.11.0
  [b27032c2] LibCURL v0.6.4
  [76f85450] LibGit2 v1.11.0
  [8f399da3] Libdl v1.11.0
  [37e2e46d] LinearAlgebra v1.11.0
  [56ddb016] Logging v1.11.0
  [d6f4376e] Markdown v1.11.0
  [ca575930] NetworkOptions v1.2.0
  [44cfe95a] Pkg v1.11.0
  [de0858da] Printf v1.11.0
  [9a3f8284] Random v1.11.0
  [ea8e919c] SHA v0.7.0
  [9e88b42a] Serialization v1.11.0
  [2f01184e] SparseArrays v1.11.0
  [fa267f1f] TOML v1.0.3
  [a4e569a6] Tar v1.10.0
  [8dfed614] Test v1.11.0
  [cf7118a7] UUIDs v1.11.0
  [4ec0a83e] Unicode v1.11.0
  [e66e0078] CompilerSupportLibraries_jll v1.1.1+0
  [deac9b47] LibCURL_jll v8.6.0+0
  [e37daf67] LibGit2_jll v1.7.2+0
  [29816b5a] LibSSH2_jll v1.11.0+1
  [c8ffd9c3] MbedTLS_jll v2.28.6+0
  [14a3606d] MozillaCACerts_jll v2023.12.12
  [4536629a] OpenBLAS_jll v0.3.27+1
  [bea87d4a] SuiteSparse_jll v7.7.0+0
  [83775a58] Zlib_jll v1.2.13+1
  [8e850b90] libblastrampoline_jll v5.11.0+0
  [8e850ede] nghttp2_jll v1.59.0+0
  [3f19e933] p7zip_jll v17.4.0+2
        Info Packages marked with ⌅ have new versions available but compatibility constraints restrict them from upgrading.
     Testing Running tests...
Test Summary:               | Pass  Total  Time
ONNXRunTime library version |    1      1  0.2s
Test Summary:                           | Pass  Total  Time
Minimum CUDA runtime version in README. |    3      3  0.0s
Test Summary: | Pass  Total  Time
high level    |  112    112  1.4s
Test Summary: | Pass  Total  Time
Session       |   25     25  0.5s
Test Summary:    | Pass  Total  Time
tensor roundtrip |    9      9  0.1s
2024-11-29 15:48:12.845201050 [W:onnxruntime:defaultenv, conv.cc:425 UpdateState] OP Conv(Conv_0) running in Fallback mode. May be extremely slow.
Test Summary:   | Pass  Total  Time
CUDA high level |   22     22  0.3s
Test Summary:  | Pass  Total  Time
CUDA low level |   11     11  0.1s
     Testing ONNXRunTime tests passed
```

@GunnarFarneback (Collaborator)

I get that warning too. It happens in https://github.com/jw3126/ONNXRunTime.jl/blob/main/test/test_cuda.jl#L35 when `conv_search` is `:DEFAULT`. It seems like a regression in the CUDA execution provider, but generally I guess the execution providers are allowed to fall back to CPU if they feel the need to. I don't think we can do much about it other than revising our tests.

@svilupp (Contributor, Author) commented Nov 29, 2024

Oh, that's amazing progress! Thank you both for looking into it.

I checked the 1.9 failures -- it's because cuDNN 1.4 dropped support for it: https://github.com/JuliaGPU/CUDA.jl/blob/7ff012f21ecaf9364a348289a136deebe299e8d9/lib/cudnn/Project.toml#L17

@jw3126 (Owner) commented Nov 30, 2024

@GunnarFarneback does this PR look good to you? In particular, are there any other places that need version adjustments?

@jw3126 (Owner) commented Nov 30, 2024

> Oh, that's amazing progress! Thank you both for looking into it.
>
> I checked the 1.9 failures -- it's because cuDNN 1.4 dropped support for it: https://github.com/JuliaGPU/CUDA.jl/blob/7ff012f21ecaf9364a348289a136deebe299e8d9/lib/cudnn/Project.toml#L17

Could you add 1.10 to CI, just so that we are conscious of what our minimum Julia version is? If it does not run on 1.10, that is also fine; just bump to 1.11, even if that means running CI twice.

@svilupp (Contributor, Author) commented Nov 30, 2024

I did that last night. Check the CI results and the compat in Project.toml.

Or do you mean something else?

@jw3126 (Owner) commented Nov 30, 2024

> I did that last night. Check the CI results and the compat in Project.toml.
>
> Or do you mean something else?

I missed the scrollbar, sorry 😄
[screenshot]

@jw3126 jw3126 merged commit 4a5a2bc into jw3126:main on Dec 1, 2024 (7 of 10 checks passed).