Support Cholesky factorization of CuSparseMatrixCSR #1855

shakedregev · 2023-04-04T22:14:37Z

Describe the bug

When you factorize a GPU matrix, the output is on a CPU. I believe
https://github.com/JuliaGPU/CUDA.jl/blob/6d2e9ab2f2a1bb1df7791bb30178e0fe956940a3/lib/cusolver/linalg.jl#L484 is the only mention of Cholesky in the entire package and it looks like it's coded to work this way (just copy to CPU and factorize). This is not desired behavior, as it makes the entire GPU operation pointless.

To reproduce

The Minimal Working Example (MWE) for this bug:

using Random
using SparseArrays
using LinearAlgebra
using CUDA
using CUDA.CUSPARSE

n= 4 #num variables
rng = MersenneTwister(1235) # this is the seed
Q1 = sprand(rng, n, n, 1.0/n)
Q = Q1'*Q1 + I
A = CuSparseMatrixCSR(Q)
cholA = cholesky(A)
b = CUDA.fill(1.0, n)
x = cholA\b
println("Type of cholA: ", typeof(cholA))
println("Type of b ", typeof(b))
println("Type of x ", typeof(x))

Output

Type of cholA: Cholesky{Float64, Matrix{Float64}}
Type of b CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}
Type of x Vector{Float64}

Manifest.toml

# This file is machine-generated - editing it directly is not advised

julia_version = "1.9.0-alpha1"
manifest_format = "2.0"
project_hash = "fe1c204c663e26329f68ccc2911a2d13920bea4e"

[[deps.AbstractFFTs]]
deps = ["ChainRulesCore", "LinearAlgebra"]
git-tree-sha1 = "69f7020bd72f069c219b5e8c236c1fa90d2cb409"
uuid = "621f4979-c628-5d54-868e-fcf4e3e8185c"
version = "1.2.1"

[[deps.Adapt]]
deps = ["LinearAlgebra"]
git-tree-sha1 = "0310e08cb19f5da31d08341c6120c047598f5b9c"
uuid = "79e6a3ab-5dfb-504d-930d-738a2a938a0e"
version = "3.5.0"

[[deps.ArgTools]]
uuid = "0dad84c5-d112-42e6-8d28-ef12dabb789f"
version = "1.1.1"

[[deps.Artifacts]]
uuid = "56f22d72-fd6d-98f1-02f0-08ddc0907c33"

[[deps.BFloat16s]]
deps = ["LinearAlgebra", "Printf", "Random", "Test"]
git-tree-sha1 = "dbf84058d0a8cbbadee18d25cf606934b22d7c66"
uuid = "ab4f0b2a-ad5b-11e8-123f-65d77653426b"
version = "0.4.2"

[[deps.Base64]]
uuid = "2a0f44e3-6c83-55bd-87e4-b1978d98bd5f"

[[deps.CEnum]]
git-tree-sha1 = "eb4cb44a499229b3b8426dcfb5dd85333951ff90"
uuid = "fa961155-64e5-5f13-b03f-caf6b980ea82"
version = "0.4.2"

[[deps.CUDA]]
deps = ["AbstractFFTs", "Adapt", "BFloat16s", "CEnum", "CUDA_Driver_jll", "CUDA_Runtime_Discovery", "CUDA_Runtime_jll", "CompilerSupportLibraries_jll", "ExprTools", "GPUArrays", "GPUCompiler", "LLVM", "LazyArtifacts", "Libdl", "LinearAlgebra", "Logging", "Preferences", "Printf", "Random", "Random123", "RandomNumbers", "Reexport", "Requires", "SparseArrays", "SpecialFunctions"]
git-tree-sha1 = "edff14c60784c8f7191a62a23b15a421185bc8a8"
uuid = "052768ef-5323-5732-b1bb-66c8b64840ba"
version = "4.0.1"

[[deps.CUDA_Driver_jll]]
deps = ["Artifacts", "JLLWrappers", "LazyArtifacts", "Libdl", "Pkg"]
git-tree-sha1 = "75d7896d1ec079ef10d3aee8f3668c11354c03a1"
uuid = "4ee394cb-3365-5eb0-8335-949819d2adfc"
version = "0.2.0+0"

[[deps.CUDA_Runtime_Discovery]]
deps = ["Libdl"]
git-tree-sha1 = "58dd8ec29f54f08c04b052d2c2fa6760b4f4b3a4"
uuid = "1af6417a-86b4-443c-805f-a4643ffb695f"
version = "0.1.1"

[[deps.CUDA_Runtime_jll]]
deps = ["Artifacts", "CUDA_Driver_jll", "JLLWrappers", "LazyArtifacts", "Libdl", "Pkg", "TOML"]
git-tree-sha1 = "d3e6ccd30f84936c1a3a53d622d85d7d3f9b9486"
uuid = "76a88914-d11a-5bdc-97e0-2f5a05c973a2"
version = "0.2.3+2"

[[deps.ChainRulesCore]]
deps = ["Compat", "LinearAlgebra", "SparseArrays"]
git-tree-sha1 = "c6d890a52d2c4d55d326439580c3b8d0875a77d9"
uuid = "d360d2e6-b24c-11e9-a2a3-2a2ae2dbcce4"
version = "1.15.7"

[[deps.ChangesOfVariables]]
deps = ["ChainRulesCore", "LinearAlgebra", "Test"]
git-tree-sha1 = "844b061c104c408b24537482469400af6075aae4"
uuid = "9e997f8a-9a97-42d5-a9f1-ce6bfc15e2c0"
version = "0.1.5"

[[deps.Compat]]
deps = ["Dates", "LinearAlgebra", "UUIDs"]
git-tree-sha1 = "61fdd77467a5c3ad071ef8277ac6bd6af7dd4c04"
uuid = "34da2185-b29b-5c13-b0c7-acf172513d20"
version = "4.6.0"

[[deps.CompilerSupportLibraries_jll]]
deps = ["Artifacts", "Libdl"]
uuid = "e66e0078-7015-5450-92f7-15fbd957f2ae"
version = "1.0.1+0"

[[deps.Dates]]
deps = ["Printf"]
uuid = "ade2ca70-3891-5945-98fb-dc099432e06a"

[[deps.DocStringExtensions]]
deps = ["LibGit2"]
git-tree-sha1 = "2fb1e02f2b635d0845df5d7c167fec4dd739b00d"
uuid = "ffbed154-4ef7-542d-bbb7-c09d3a79fcae"
version = "0.9.3"

[[deps.Downloads]]
deps = ["ArgTools", "FileWatching", "LibCURL", "NetworkOptions"]
uuid = "f43a241f-c20a-4ad4-852c-f6b1247861c6"
version = "1.6.0"

[[deps.ExprTools]]
git-tree-sha1 = "56559bbef6ca5ea0c0818fa5c90320398a6fbf8d"
uuid = "e2ba6199-217a-4e67-a87a-7c52f15ade04"
version = "0.1.8"

[[deps.FileIO]]
deps = ["Pkg", "Requires", "UUIDs"]
git-tree-sha1 = "7be5f99f7d15578798f338f5433b6c432ea8037b"
uuid = "5789e2e9-d7fb-5bc7-8068-2c6fae9b9549"
version = "1.16.0"

[[deps.FileWatching]]
uuid = "7b1f6079-737a-58dc-b8bc-7a2ca5c1b5ee"

[[deps.GPUArrays]]
deps = ["Adapt", "GPUArraysCore", "LLVM", "LinearAlgebra", "Printf", "Random", "Reexport", "Serialization", "Statistics"]
git-tree-sha1 = "4dfaff044eb2ce11a897fecd85538310e60b91e6"
uuid = "0c68f7d7-f131-5f86-a1c3-88cf8149b2d7"
version = "8.6.2"

[[deps.GPUArraysCore]]
deps = ["Adapt"]
git-tree-sha1 = "57f7cde02d7a53c9d1d28443b9f11ac5fbe7ebc9"
uuid = "46192b85-c4d5-4398-a991-12ede77f4527"
version = "0.1.3"

[[deps.GPUCompiler]]
deps = ["ExprTools", "InteractiveUtils", "LLVM", "Libdl", "Logging", "TimerOutputs", "UUIDs"]
git-tree-sha1 = "95185985a5d2388c6d0fedb06181ad4ddd40e0cb"
uuid = "61eb1bfa-7361-4325-ad38-22787b887f55"
version = "0.17.2"

[[deps.InteractiveUtils]]
deps = ["Markdown"]
uuid = "b77e0a4c-d291-57a0-90e8-8db25a27a240"

[[deps.InverseFunctions]]
deps = ["Test"]
git-tree-sha1 = "49510dfcb407e572524ba94aeae2fced1f3feb0f"
uuid = "3587e190-3f89-42d0-90ee-14403ec27112"
version = "0.1.8"

[[deps.IrrationalConstants]]
git-tree-sha1 = "7fd44fd4ff43fc60815f8e764c0f352b83c49151"
uuid = "92d709cd-6900-40b7-9082-c6be49f344b6"
version = "0.1.1"

[[deps.JLLWrappers]]
deps = ["Preferences"]
git-tree-sha1 = "abc9885a7ca2052a736a600f7fa66209f96506e1"
uuid = "692b3bcd-3c85-4b1f-b108-f13ce0eb3210"
version = "1.4.1"

[[deps.Krylov]]
deps = ["LinearAlgebra", "Printf", "SparseArrays"]
path = "/home/shakedregev/.julia/dev/Krylov"
uuid = "ba0b0d4f-ebba-5204-a429-3ac8c609bfb7"
version = "0.9.0"

[[deps.LLVM]]
deps = ["CEnum", "LLVMExtra_jll", "Libdl", "Printf", "Unicode"]
git-tree-sha1 = "df115c31f5c163697eede495918d8e85045c8f04"
uuid = "929cbde3-209d-540e-8aea-75f648917ca0"
version = "4.16.0"

[[deps.LLVMExtra_jll]]
deps = ["Artifacts", "JLLWrappers", "LazyArtifacts", "Libdl", "Pkg", "TOML"]
git-tree-sha1 = "771bfe376249626d3ca12bcd58ba243d3f961576"
uuid = "dad2f222-ce93-54a1-a47d-0025e8a3acab"
version = "0.0.16+0"

[[deps.LazyArtifacts]]
deps = ["Artifacts", "Pkg"]
uuid = "4af54fe1-eca0-43a8-85a7-787d91b784e3"

[[deps.LibCURL]]
deps = ["LibCURL_jll", "MozillaCACerts_jll"]
uuid = "b27032c2-a3e7-50c8-80cd-2d36dbcbfd21"
version = "0.6.3"

[[deps.LibCURL_jll]]
deps = ["Artifacts", "LibSSH2_jll", "Libdl", "MbedTLS_jll", "Zlib_jll", "nghttp2_jll"]
uuid = "deac9b47-8bc7-5906-a0fe-35ac56dc84c0"
version = "7.84.0+0"

[[deps.LibGit2]]
deps = ["Base64", "NetworkOptions", "Printf", "SHA"]
uuid = "76f85450-5226-5b5a-8eaa-529ad045b433"

[[deps.LibSSH2_jll]]
deps = ["Artifacts", "Libdl", "MbedTLS_jll"]
uuid = "29816b5a-b9ab-546f-933c-edad1886dfa8"
version = "1.10.2+0"

[[deps.Libdl]]
uuid = "8f399da3-3557-5675-b5ff-fb832c97cbdb"

[[deps.LinearAlgebra]]
deps = ["Libdl", "OpenBLAS_jll", "libblastrampoline_jll"]
uuid = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"

[[deps.LogExpFunctions]]
deps = ["ChainRulesCore", "ChangesOfVariables", "DocStringExtensions", "InverseFunctions", "IrrationalConstants", "LinearAlgebra"]
git-tree-sha1 = "680e733c3a0a9cea9e935c8c2184aea6a63fa0b5"
uuid = "2ab3a3ac-af41-5b50-aa03-7779005ae688"
version = "0.3.21"

[[deps.Logging]]
uuid = "56ddb016-857b-54e1-b83d-db4d58db5568"

[[deps.Markdown]]
deps = ["Base64"]
uuid = "d6f4376e-aef5-505a-96c1-9c027394607a"

[[deps.MbedTLS_jll]]
deps = ["Artifacts", "Libdl"]
uuid = "c8ffd9c3-330d-5841-b78e-0817d7145fa1"
version = "2.28.0+0"

[[deps.MozillaCACerts_jll]]
uuid = "14a3606d-f60d-562e-9121-12d972cd8159"
version = "2022.10.11"

[[deps.NPZ]]
deps = ["FileIO", "ZipFile"]
git-tree-sha1 = "60a8e272fe0c5079363b28b0953831e2dd7b7e6f"
uuid = "15e1cf62-19b3-5cfa-8e77-841668bca605"
version = "0.4.3"

[[deps.NetworkOptions]]
uuid = "ca575930-c2e3-43a9-ace4-1e988b2c1908"
version = "1.2.0"

[[deps.OpenBLAS_jll]]
deps = ["Artifacts", "CompilerSupportLibraries_jll", "Libdl"]
uuid = "4536629a-c528-5b80-bd46-f80d51c5b363"
version = "0.3.21+0"

[[deps.OpenLibm_jll]]
deps = ["Artifacts", "Libdl"]
uuid = "05823500-19ac-5b8b-9628-191a04bc5112"
version = "0.8.1+0"

[[deps.OpenSpecFun_jll]]
deps = ["Artifacts", "CompilerSupportLibraries_jll", "JLLWrappers", "Libdl", "Pkg"]
git-tree-sha1 = "13652491f6856acfd2db29360e1bbcd4565d04f1"
uuid = "efe28fd5-8261-553b-a9e1-b2916fc3738e"
version = "0.5.5+0"

[[deps.Pkg]]
deps = ["Artifacts", "Dates", "Downloads", "FileWatching", "LibGit2", "Libdl", "Logging", "Markdown", "Printf", "REPL", "Random", "SHA", "Serialization", "TOML", "Tar", "UUIDs", "p7zip_jll"]
uuid = "44cfe95a-1eb2-52ea-b672-e2afdf69b78f"
version = "1.8.0"

[[deps.Preferences]]
deps = ["TOML"]
git-tree-sha1 = "47e5f437cc0e7ef2ce8406ce1e7e24d44915f88d"
uuid = "21216c6a-2e73-6563-6e65-726566657250"
version = "1.3.0"

[[deps.Printf]]
deps = ["Unicode"]
uuid = "de0858da-6303-5e67-8744-51eddeeeb8d7"

[[deps.REPL]]
deps = ["InteractiveUtils", "Markdown", "Sockets", "Unicode"]
uuid = "3fa0cd96-eef1-5676-8a61-b3b8758bbffb"

[[deps.Random]]
deps = ["SHA", "Serialization"]
uuid = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"

[[deps.Random123]]
deps = ["Random", "RandomNumbers"]
git-tree-sha1 = "7a1a306b72cfa60634f03a911405f4e64d1b718b"
uuid = "74087812-796a-5b5d-8853-05524746bad3"
version = "1.6.0"

[[deps.RandomNumbers]]
deps = ["Random", "Requires"]
git-tree-sha1 = "043da614cc7e95c703498a491e2c21f58a2b8111"
uuid = "e6cf234a-135c-5ec9-84dd-332b85af5143"
version = "1.5.3"

[[deps.Reexport]]
git-tree-sha1 = "45e428421666073eab6f2da5c9d310d99bb12f9b"
uuid = "189a3867-3050-52da-a836-e630ba90ab69"
version = "1.2.2"

[[deps.Requires]]
deps = ["UUIDs"]
git-tree-sha1 = "838a3a4188e2ded87a4f9f184b4b0d78a1e91cb7"
uuid = "ae029012-a4dd-5104-9daa-d747884805df"
version = "1.3.0"

[[deps.SHA]]
uuid = "ea8e919c-243c-51af-8825-aaa63cd721ce"
version = "0.7.0"

[[deps.Serialization]]
uuid = "9e88b42a-f829-5b0c-bbe9-9e923198166b"

[[deps.Sockets]]
uuid = "6462fe0b-24de-5631-8697-dd941f90decc"

[[deps.SparseArrays]]
deps = ["Libdl", "LinearAlgebra", "Random", "Serialization", "SuiteSparse_jll"]
uuid = "2f01184e-e22b-5df5-ae63-d93ebab69eaf"

[[deps.SpecialFunctions]]
deps = ["ChainRulesCore", "IrrationalConstants", "LogExpFunctions", "OpenLibm_jll", "OpenSpecFun_jll"]
git-tree-sha1 = "d75bda01f8c31ebb72df80a46c88b25d1c79c56d"
uuid = "276daf66-3868-5448-9aa4-cd146d93841b"
version = "2.1.7"

[[deps.Statistics]]
deps = ["LinearAlgebra", "SparseArrays"]
uuid = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"
version = "1.9.0"

[[deps.SuiteSparse_jll]]
deps = ["Artifacts", "Libdl", "Pkg", "libblastrampoline_jll"]
uuid = "bea87d4a-7f5b-5778-9afe-8cc45184846c"
version = "5.10.1+0"

[[deps.TOML]]
deps = ["Dates"]
uuid = "fa267f1f-6049-4f14-aa54-33bafae1ed76"
version = "1.0.3"

[[deps.Tar]]
deps = ["ArgTools", "SHA"]
uuid = "a4e569a6-e804-4fa4-b0f3-eef7a1d5b13e"
version = "1.10.0"

[[deps.Test]]
deps = ["InteractiveUtils", "Logging", "Random", "Serialization"]
uuid = "8dfed614-e22c-5e08-85e1-65c5234f0b40"

[[deps.TimerOutputs]]
deps = ["ExprTools", "Printf"]
git-tree-sha1 = "f2fd3f288dfc6f507b0c3a2eb3bac009251e548b"
uuid = "a759f4b9-e2f1-59dc-863e-4aeb61b1ea8f"
version = "0.5.22"

[[deps.UUIDs]]
deps = ["Random", "SHA"]
uuid = "cf7118a7-6976-5b1a-9a39-7adc72f591a4"

[[deps.Unicode]]
uuid = "4ec0a83e-493e-50e2-b9ac-8f72acf5a8f5"

[[deps.ZipFile]]
deps = ["Libdl", "Printf", "Zlib_jll"]
git-tree-sha1 = "f492b7fe1698e623024e873244f10d89c95c340a"
uuid = "a5390f91-8eb1-5f08-bee0-b1d1ffed6cea"
version = "0.10.1"

[[deps.Zlib_jll]]
deps = ["Libdl"]
uuid = "83775a58-1f1d-513f-b197-d71354ab007a"
version = "1.2.13+0"

[[deps.libblastrampoline_jll]]
deps = ["Artifacts", "Libdl"]
uuid = "8e850b90-86db-534c-a0d3-1478176c7d93"
version = "5.2.0+0"

[[deps.nghttp2_jll]]
deps = ["Artifacts", "Libdl"]
uuid = "8e850ede-7688-5339-a07c-302acd2aaf8d"
version = "1.48.0+0"

[[deps.p7zip_jll]]
deps = ["Artifacts", "Libdl"]
uuid = "3f19e933-33d8-53b3-aaab-bd5110c3b7a0"
version = "17.4.0+0"

Expected behavior

I expect the factorization to stay on the GPU and the solution to also be on the GPU. Meaning
typeof(x)==typeof(b) and typeof(cholA) should be something GPU related.

Version info

Details on Julia:

Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 12 × Intel(R) Xeon(R) CPU @ 2.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, cascadelake)
  Threads: 1 on 12 virtual cores

Details on CUDA:

# please post the output of:
CUDA runtime 11.8, artifact installation
CUDA driver 11.8
NVIDIA driver 470.161.3, originally for CUDA 11.4

Libraries: 
- CUBLAS: 11.11.3
- CURAND: 10.3.0
- CUFFT: 10.9.0
- CUSOLVER: 11.4.1
- CUSPARSE: 11.7.5
- CUPTI: 18.0.0
- NVML: 11.0.0+470.161.3

Toolchain:
- Julia: 1.8.5
- LLVM: 13.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

1 device:
  0: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)```


**Additional context**

[https://discourse.julialang.org/t/complete-and-incomplete-sparse-cholesky-factorization/96918](Posted on discourse). The functionality was clearly available in previous versions of Julia when CuSolver, CuArrays, and CuSparse were separate. Searching the web reveals old documentation with options like csrlvchol! and ic0 which also did not work. My guess is that this functionality was intended to just be wrapped in `cholesky(A)`, but the GPU version was never fully implemented and instead happens with a CPU copy.

The text was updated successfully, but these errors were encountered:

maleadt · 2023-04-05T06:40:44Z

I believe this is the only mention of Cholesky in the entire package and it looks like it's coded to work this way (just copy to CPU and factorize).

That's incorrect; cholesky on dense inputs works fine:

julia> using CUDA, LinearAlgebra

julia> CUDA.allowscalar(false)

julia> A = CUDA.rand(10, 10);
julia> A = A*A'+I;

julia> cholesky(A)
Cholesky{Float32, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}
U factor:
10×10 UpperTriangular{Float32, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}:
 2.07792  1.04945  1.31206   1.20788    1.16905    1.25365   1.19039   0.792913    1.1288     1.07848
  ⋅       1.794    0.943456  0.393756   0.655787   0.66578   0.537175  0.213479    0.737071   0.155549
  ⋅        ⋅       1.36309   0.394237   0.370576   0.603764  0.570361  0.414203    0.541527   0.262899
  ⋅        ⋅        ⋅        1.38618   -0.0552797  0.39563   0.190204  0.00359829  0.141378   0.322178
  ⋅        ⋅        ⋅         ⋅         1.53461    0.564346  0.130046  0.114851    0.20222    0.332031
  ⋅        ⋅        ⋅         ⋅          ⋅         1.54097   0.301906  0.191423    0.368203   0.00617357
  ⋅        ⋅        ⋅         ⋅          ⋅          ⋅        1.21205   0.350676    0.424222   0.0375677
  ⋅        ⋅        ⋅         ⋅          ⋅          ⋅         ⋅        1.32711     0.19541   -0.0467459
  ⋅        ⋅        ⋅         ⋅          ⋅          ⋅         ⋅         ⋅          1.36569   -0.0209019
  ⋅        ⋅        ⋅         ⋅          ⋅          ⋅         ⋅         ⋅           ⋅         1.1683

A = CuSparseMatrixCSR(Q)
cholA = cholesky(A)

What CUSOLVER/CUSPARSE API calls do you expect this to use?

junyuan-chen · 2023-04-09T17:32:32Z

There is a wrapper for solving a sparse system with Cholesky factorization here:

CUDA.jl/lib/cusolver/sparse.jl

Lines 102 to 135 in 5c51766

    
           for (fname, elty, relty) in ((:cusolverSpScsrlsvchol, :Float32, :Float32), 
        
                                        (:cusolverSpDcsrlsvchol, :Float64, :Float64), 
        
                                        (:cusolverSpCcsrlsvchol, :ComplexF32, :Float32), 
        
                                        (:cusolverSpZcsrlsvchol, :ComplexF64, :Float64)) 
        
               @eval begin 
        
                   function csrlsvchol!(A::CuSparseMatrixCSR{$elty}, 
        
                                        b::CuVector{$elty}, 
        
                                        x::CuVector{$elty}, 
        
                                        tol::$relty, 
        
                                        reorder::Cint, 
        
                                        inda::Char) 
        
                       n      = size(A,1) 
        
                       if size(A,2) != n 
        
                           throw(DimensionMismatch("Cholesky factorization is only possible for square matrices!")) 
        
                       end 
        
                       if size(A,2) != length(b) 
        
                           throw(DimensionMismatch("second dimension of A, $(size(A,2)), must match the length of b, $(length(b))")) 
        
                       end 
        
                       if length(x) != length(b) 
        
                           throw(DimensionMismatch("length of x, $(length(x)), must match the length of b, $(length(b))")) 
        
                       end 
        
                       desca = CuMatrixDescriptor('G', 'L', 'N', inda) 
        
                       singularity = zeros(Cint,1) 
        
                       $fname(sparse_handle(), n, A.nnz, desca, A.nzVal, A.rowPtr, A.colVal, b, tol, reorder, x, singularity) 
        
                       if singularity[1] != -1 
        
                           throw(SingularException(singularity[1])) 
        
                       end 
        
                       x 
        
                   end 
        
               end 
        
           end

It seems that this solver does what cholesky! and ldiv! would do together and hence the factorization part cannot be wrapped by the Julia interface separately.

Also, according to the CUDA documentation here, there is no such thing like cholesky! for sparse matrices from their high-level interface.

shakedregev · 2023-04-11T18:10:08Z

Thanks, but this is not useful in my case. I'm not sure why, but NVIDIA removed the ability to keep your factorizations. The move is good for simplicity when you're only solving one system, but not in many applications such as optimization problems, or conjugate gradient with a preconditioner. Almost anything where you're iterating. Back when I wrote in C++, I just went and used older CUDA versions and they worked fine for me. The newer ones did not have what I needed. It seems (perhaps) that CUDA.jl followed the same move by NVIDIA.

I am solving a system Ax=b, where A is dense and is just given as an operator (product and sum of sparse matrices). I am forming a sparse approximation of A called M which I want to factorize and solve systems with.
I need to be able to keep the factorization to solve multiple systems (because I am using this as a preconditioner in conjugate gradient). I also need the symbolic information, because I have multiple systems that look like Ax=b but the sparsity structure of M is constant.
Basically, I need to be able to do three separate things

Take the symbolic factorization of a csr matrix
Given this symbolic factorization, compute a numeric factorization
Solve systems with said numeric factorization.

This may or may not be useful, but for a matrix M with csr representation (rows, cols, vals), the C++ style CUDA code for what I want to do is

// Symbolic factorization - happens once
  csrcholInfo_t info = NULL;
  cusolverSpCreateCsrcholInfo(&info);
  cusolverSpXcsrcholAnalysis(handle_cusolver, M->n, nnzM, descrM, rows, cols, info);
  size_t internalDataInBytes, workspaceInBytes;
  cusolverSpDcsrcholBufferInfo(handle_cusolver, M->n, nnzM, descrM, vals, rows,
   cols, info, &internalDataInBytes, &workspaceInBytes);
  void* buffer_gpu = NULL;
  cudaMalloc(&buffer_gpu, sizeof(char) * workspaceInBytes);
  // Numerical factorization - happens for every different `Ax=b` system for the different `M`
  cusolverSpDcsrcholFactor(
    handle_cusolver, M->n, nnzM, descrM, vals, rows, cols, info, buffer_gpu);
  cusolverSpDcsrcholZeroPivot(handle_cusolver, info, tol, &singularity);
// Solve that happens every conjugate gradient iteration
`cusolverSpDcsrcholSolve(handle_cusolver, M->n, x, b, info, buffer_gpu);`

shakedregev · 2023-04-11T18:39:02Z

Actually, an incomplete factorization would probably suffice for my purposes, but ic02 returns a matrix that I cannot make heads or tails of. It's not lower triangular or symmetric, so not sure how it incorporates the LL^T factors

Although still I would argue that a full factorization should be doable. In many cases a full Cholesky factorization stays sparse.

shakedregev · 2023-04-19T23:54:41Z

@maleadt would you be willing to develop this functionality together so it can be part of the library? I can take care of the linear algebra, I just don't have experience with developing backend Julia functions (I do have experience with CUDA though)

maleadt · 2023-04-24T11:56:34Z

I'm totally unfamiliar with what you're trying to accomplish, so I can only offer high-level guidance. Since you know which API calls you want to perform; what we normally do first, is build a slightly more generic wrapper around these APIs, e.g.,

CUDA.jl/lib/cusolver/sparse.jl

Lines 101 to 135 in b3c6be4

    
           #csrlsvchol 
        
           for (fname, elty, relty) in ((:cusolverSpScsrlsvchol, :Float32, :Float32), 
        
                                        (:cusolverSpDcsrlsvchol, :Float64, :Float64), 
        
                                        (:cusolverSpCcsrlsvchol, :ComplexF32, :Float32), 
        
                                        (:cusolverSpZcsrlsvchol, :ComplexF64, :Float64)) 
        
               @eval begin 
        
                   function csrlsvchol!(A::CuSparseMatrixCSR{$elty}, 
        
                                        b::CuVector{$elty}, 
        
                                        x::CuVector{$elty}, 
        
                                        tol::$relty, 
        
                                        reorder::Cint, 
        
                                        inda::Char) 
        
                       n      = size(A,1) 
        
                       if size(A,2) != n 
        
                           throw(DimensionMismatch("Cholesky factorization is only possible for square matrices!")) 
        
                       end 
        
                       if size(A,2) != length(b) 
        
                           throw(DimensionMismatch("second dimension of A, $(size(A,2)), must match the length of b, $(length(b))")) 
        
                       end 
        
                       if length(x) != length(b) 
        
                           throw(DimensionMismatch("length of x, $(length(x)), must match the length of b, $(length(b))")) 
        
                       end 
        
                       desca = CuMatrixDescriptor('G', 'L', 'N', inda) 
        
                       singularity = zeros(Cint,1) 
        
                       $fname(sparse_handle(), n, A.nnz, desca, A.nzVal, A.rowPtr, A.colVal, b, tol, reorder, x, singularity) 
        
                       if singularity[1] != -1 
        
                           throw(SingularException(singularity[1])) 
        
                       end 
        
                       x 
        
                   end 
        
               end 
        
           end

Then, we use these wrappers to write methods that integrate with interfaces/stdlibs/packages like LinearAlgebra.jl, e.g.,

CUDA.jl/lib/cusolver/linalg.jl

Lines 481 to 485 in b3c6be4

    
           # LinearAlgebra.jl defines a comparable method with these restrictions on T, so we need 
        
           #   to define with the same type parameters to resolve method ambiguity for these cases 
        
           function LinearAlgebra.ldiv!(F::LU{T,<:StridedCuMatrix{T}}, B::CuVecOrMat{T}) where {T <: Union{Float32, Float64, ComplexF32, ComplexF64}} 
        
               return getrs!('N', F.factors, F.ipiv, B) 
        
           end

loonatick-src · 2023-08-03T10:52:05Z

Thanks, but this is not useful in my case. I'm not sure why, but NVIDIA removed the ability to keep your factorizations. The move is good for simplicity when you're only solving one system, but not in many applications such as optimization problems, or conjugate gradient with a preconditioner. Almost anything where you're iterating. Back when I wrote in C++, I just went and used older CUDA versions and they worked fine for me. The newer ones did not have what I needed. It seems (perhaps) that CUDA.jl followed the same move by NVIDIA.

I am solving a system Ax=b, where A is dense and is just given as an operator (product and sum of sparse matrices). I am forming a sparse approximation of A called M which I want to factorize and solve systems with. I need to be able to keep the factorization to solve multiple systems (because I am using this as a preconditioner in conjugate gradient). I also need the symbolic information, because I have multiple systems that look like Ax=b but the sparsity structure of M is constant. Basically, I need to be able to do three separate things
1. Take the symbolic factorization of a csr matrix

2. Given this symbolic factorization, compute a numeric factorization

3. Solve systems with said numeric factorization.
This may or may not be useful, but for a matrix M with csr representation (rows, cols, vals), the C++ style CUDA code for what I want to do is
// Symbolic factorization - happens once
  csrcholInfo_t info = NULL;
  cusolverSpCreateCsrcholInfo(&info);
  cusolverSpXcsrcholAnalysis(handle_cusolver, M->n, nnzM, descrM, rows, cols, info);
  size_t internalDataInBytes, workspaceInBytes;
  cusolverSpDcsrcholBufferInfo(handle_cusolver, M->n, nnzM, descrM, vals, rows,
   cols, info, &internalDataInBytes, &workspaceInBytes);
  void* buffer_gpu = NULL;
  cudaMalloc(&buffer_gpu, sizeof(char) * workspaceInBytes);
  // Numerical factorization - happens for every different `Ax=b` system for the different `M`
  cusolverSpDcsrcholFactor(
    handle_cusolver, M->n, nnzM, descrM, vals, rows, cols, info, buffer_gpu);
  cusolverSpDcsrcholZeroPivot(handle_cusolver, info, tol, &singularity);
// Solve that happens every conjugate gradient iteration
`cusolverSpDcsrcholSolve(handle_cusolver, M->n, x, b, info, buffer_gpu);`

@shakedregev Using cusolverSpDcsrcholFactor etc. for exposing sparse cholesky factorizations and solves might be problematic since they appear to be undocumented, low-level API [1,2], and I cannot tell if they will be stable or maintained.

It might be likely they won't be supported since the CUDA component of CHOLMOD from SuiteSparse seems to be the officially supported sparse Cholesky library for CUDA. I guess this would require depending on SuiteSparse.

@maleadt I would like to contribute, but I imagine this particular direction would be more involved because of the dependence of SuiteSparse. Would you be open to contributions in this direction, and offering some high-level guidance?

maleadt · 2023-08-08T10:31:54Z

It might be likely they won't be supported since the CUDA component of CHOLMOD from SuiteSparse seems to be the officially supported sparse Cholesky library for CUDA. I guess this would require depending on SuiteSparse.

@maleadt I would like to contribute, but I imagine this particular direction would be more involved because of the dependence of SuiteSparse. Would you be open to contributions in this direction, and offering some high-level guidance?

Sure, happy to offer some guidance. I'd think that this functionality would need to be a part of SuiteSparse.jl, ideally as a package extension that gets loaded when CUDA.jl is available. We've already built some artifacts, https://github.com/JuliaPackaging/Yggdrasil/tree/master/S/SuiteSparse/SuiteSparse_GPU%405, but they aren't used.

amontoison · 2023-10-30T09:20:39Z

@shakedregev
I interfaced the low-level API for the sparse QR and sparse Cholesky decompositions in #2121.

shakedregev · 2023-10-30T22:44:57Z

@amontoison - Thank you!

maleadt · 2023-10-31T08:22:40Z

Can this be closed then, or is there more to this issue?

loonatick-src · 2023-10-31T14:59:01Z

CHOLMOD support for CUDA is very likely to be implemented as a package extension to the standard library / SuiteSparse (JuliaSparse/SparseArrays.jl#443). So, I think there should not be more to this issue and can be closed as completed.

shakedregev added the bug Something isn't working label Apr 4, 2023

maleadt added enhancement New feature or request help wanted Extra attention is needed cuda libraries Stuff about CUDA library wrappers. and removed bug Something isn't working labels Apr 5, 2023

maleadt changed the title ~~Cholesky factorization of CuSparseMatrixCSR in CUDA.jl happening on CPU~~ Support Cholesky factorization of CuSparseMatrixCSR Apr 5, 2023

loonatick-src mentioned this issue Sep 20, 2023

GPU component of CHOLMOD as package extension JuliaSparse/SparseArrays.jl#443

Open

maleadt closed this as completed Oct 31, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support Cholesky factorization of CuSparseMatrixCSR #1855

Support Cholesky factorization of CuSparseMatrixCSR #1855

shakedregev commented Apr 4, 2023 •

edited

Loading

maleadt commented Apr 5, 2023

junyuan-chen commented Apr 9, 2023 •

edited

Loading

shakedregev commented Apr 11, 2023 •

edited

Loading

shakedregev commented Apr 11, 2023 •

edited

Loading

shakedregev commented Apr 19, 2023

maleadt commented Apr 24, 2023

loonatick-src commented Aug 3, 2023 •

edited

Loading

maleadt commented Aug 8, 2023

amontoison commented Oct 30, 2023

shakedregev commented Oct 30, 2023

maleadt commented Oct 31, 2023

loonatick-src commented Oct 31, 2023

Support Cholesky factorization of CuSparseMatrixCSR #1855

Support Cholesky factorization of CuSparseMatrixCSR #1855

Comments

shakedregev commented Apr 4, 2023 • edited Loading

maleadt commented Apr 5, 2023

junyuan-chen commented Apr 9, 2023 • edited Loading

shakedregev commented Apr 11, 2023 • edited Loading

shakedregev commented Apr 11, 2023 • edited Loading

shakedregev commented Apr 19, 2023

maleadt commented Apr 24, 2023

loonatick-src commented Aug 3, 2023 • edited Loading

maleadt commented Aug 8, 2023

amontoison commented Oct 30, 2023

shakedregev commented Oct 30, 2023

maleadt commented Oct 31, 2023

loonatick-src commented Oct 31, 2023

shakedregev commented Apr 4, 2023 •

edited

Loading

junyuan-chen commented Apr 9, 2023 •

edited

Loading

shakedregev commented Apr 11, 2023 •

edited

Loading

shakedregev commented Apr 11, 2023 •

edited

Loading

loonatick-src commented Aug 3, 2023 •

edited

Loading