make extract_gradient[_chunk]! GPU compatible #619

marius311 · 2022-12-20T20:14:55Z

Fixes this:

ForwardDiff.gradient(sum, cu(rand(10))) # Error: scalar indexing

My adhoc tests suggest no performance impact on 1.8.3. for result::Vector, however, there is an extra (2 allocations: 80 bytes) for result::Matrix I think due to a reshape/view thing.

I'm not sure if there's some other form of the code that works without scalar indexing and without this allocation, but open to suggestions. Another alternative is to write some gluecode here or in CUDA.jl that handles CuArray specifically, but that seem more brittle.

I'd vote this solution is OK (pending other feedback) for the benefit of CUDA compatibility with generic code, but thats just IMO.

codecov-commenter · 2022-12-20T20:28:10Z

Codecov Report

Base: 86.83% // Head: 89.74% // Increases project coverage by +2.90% 🎉

Coverage data is based on head (9dc9769) compared to base (6a19554).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #619      +/-   ##
==========================================
+ Coverage   86.83%   89.74%   +2.90%     
==========================================
  Files           9       10       +1     
  Lines         889      965      +76     
==========================================
+ Hits          772      866      +94     
+ Misses        117       99      -18

Impacted Files	Coverage Δ
src/gradient.jl	`98.92% <100.00%> (+0.06%)`	⬆️
src/apiutils.jl	`100.00% <0.00%> (ø)`
src/ForwardDiff.jl	`100.00% <0.00%> (ø)`
src/jacobian.jl	`99.34% <0.00%> (+0.03%)`	⬆️
src/derivative.jl	`94.87% <0.00%> (+0.13%)`	⬆️
src/partials.jl	`84.34% <0.00%> (+0.13%)`	⬆️
src/config.jl	`87.27% <0.00%> (+0.23%)`	⬆️
src/dual.jl	`82.77% <0.00%> (+3.01%)`	⬆️
src/prelude.jl	`93.93% <0.00%> (+56.43%)`	⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

mcabbott

This seems surely harmless.

Maybe it would be worth adding some tests, using JLArrays so that they can run on githun?

src/gradient.jl

marius311 · 2022-12-20T22:24:41Z

Added a barebones set of JLArray tests.

src/gradient.jl

marius311 · 2022-12-22T08:24:41Z

Ok, found the magic code which gives no allocation or speed regressions vs. current master for Arrays of any dimensions and which works with CuArrays, its this:

map!(view(Base.ReshapedArray(result, (length(result),), ()), index:index+chunksize-1), 1:chunksize) do i
    @inbounds partials(T, dual, i)
end

Trick is indeed as @mcabbott you suggested to use a ReshapedArray which unlike reshape is non-allocating for Arrays (TIL). And CUDA.jl is smart enough to dispatch on view(ReshapedArray(CuArray)) and do its thing.

Squashed into a single commit. Ready on my end.

marius311 · 2023-01-03T07:28:21Z

Don't want to have this lost after the holidays, could this get looked at again?

devmotion

To me this seems still mainly a CUDA issue. Changing the definitions to copyto! operating on Tuples seems the most direct and natural implementation to me, and one that I would expect to work for any array. Whereas to me the code in this PR seems to be a bit non-intuitive and solely due to CUDA. So I would suggest

using copyto! with Tuples (as suggested in more detail above)
adding definitions of copyto! with Tuple to CUDA

marius311 · 2023-01-05T21:03:45Z

adding definitions of copyto! with Tuple to CUDA

That seems like a strech to me with CUDA tbh. Although these things aren't specified perfectly, copyto! seems aimed to be an array interface, not really for tuples. I would expect such a PR not to be accepted in CUDA, but we can try.

In the meantime, this PR has no regressions but enables GPU, so can it still be merged? And we can revist later if we get the CUDA interface?

marius311 · 2023-01-24T19:12:43Z

Here's an issue with the copyto! option. How do you implement extract_gradient_chunk!(::Type{T}, result::AbstractArray, y::Real, index, chunksize) so thats its inferred and allocation free for Arrays? Closest I can come up with is:

copyto!(result, index, ntuple(i->0, Val(chunksize)), 1, chunksize)

which is not inferred, although the Julia compiler must work some magic bc its allocation free. But its 2X slower than my solution in this PR.

So in lieu of a solution, it seems even if we get copyto!(::CuArray, ::Tuple) (which fwiw was suggested on Slack that it might be possible), its not enough to write the code here generically.

marius311 · 2023-01-25T20:07:05Z

I should also note that one nice thing about the copyto! option is that it would allow the specialized copyto!(::CuArray, ::Tuple) to pass the tuple data into the kernel by-reference as opposed by-value which is whats currently happening (which has a 4K memory limit, as pointed out on Slack). I'm not sure if that's a real worry except for chunk sizes that are so large theyd be unrealistic, but wanted to point that out.

But in any case, I can't fiure out how to do it with only copyto! while avoiding regressions, c.f. the comment just above.

devmotion · 2023-01-25T20:24:53Z

How do you implement extract_gradient_chunk!(::Type{T}, result::AbstractArray, y::Real, index, chunksize) so thats its inferred and allocation free for Arrays?

With fill! as done in the definition of extract_gradient! for y::Real, I would assume?

marius311 · 2023-01-25T20:30:52Z

Thanks, fill! doesn't have a variant with an offset / chunk size though unless I'm mistaken?

devmotion · 2023-01-25T20:33:14Z

Thanks, fill! doesn't have a variant with an offset / chunk size though unless I'm mistaken?

No, not that I know of. But couldn't you fill! a view?

I still think that it would be nice if we wouldn't have to use slightly unorthodox implementations, just to be able to add implicit support for another package.

To mention the obvious, on Julia >= 1.9 yet another alternative would be to add e.g. GPUArraysCore (I guess that might be sufficient?) as a weak dependency and add a special method for GPU arrays there. The disadvantages would be that it would not fix Julia < 1.9.

marius311 · 2023-01-25T20:33:59Z

No, not that I know of. But couldn't you fill! a view?

No, thats allocating for multidimensional Arrays, which is the entire issue that led to the solution currently in this PR.

marius311 · 2023-01-25T20:36:44Z

I still think that it would be nice if we wouldn't have to use slightly unorthodox implementations, just to be able to add implicit support for another package.

Yea, I'm sympathetic to that, also bc the by-value thing, but ultimately although the code looks a little wierd, in the case of Array its purely relying on the get/set index interface, and in the case of CuArrays, on the map! interface, which are definitely very orthodox.

devmotion · 2023-01-25T20:44:56Z

the map! interface, which are definitely very orthodox.

Oh, with unorthodox I was referring rather to the fact that currently the PR removes a simple copyto! statement and adds a workaround based on ReshapedArray.

marius311 · 2023-01-25T20:53:50Z

No I get that. All I'm saying is its unorthodox in terms of the usage of ReshapedArray sure, but in terms of the underlying native code, for Arrays, its compiled to nearly identical code as before this PR, since either this or the old for-loop all ultimately end up at setindex!(::Array, ..).

devmotion · 2023-01-25T20:55:05Z

No, thats allocating for multidimensional Arrays, which is the entire issue that led to the solution currently in this PR.

We could use fill! + Base.ReshapedArray 😄 That seems to be faster than the current for-loop:

julia> function h2(::Type{T}, result, x, index, chunksize) where {T}
           fill!(view(Base.ReshapedArray(result, (length(result),), ()), index:(index + chunksize - 1)), zero(x))
           return result
       end
h2 (generic function with 1 method)

julia> @btime h2(Float64, $(rand(100, 2)), 0f0, 20, 15);
  2.570 ns (0 allocations: 0 bytes)

julia> function h3(::Type{T}, result, x, index, chunksize) where {T}
           offset = index - 1
           for i in 1:chunksize
               result[i + offset] = zero(x)
           end
           return result
       end
h3 (generic function with 1 method)

julia> @btime h3(Float64, $(rand(100, 2)), 0f0, 20, 15);
  5.338 ns (0 allocations: 0 bytes)

marius311 · 2023-01-25T20:55:46Z

Sorry I'm confused, if you're ok with ReshapedArray there, whats the issue with the current PR?

devmotion · 2023-01-25T21:00:05Z

Well, I still think that 1) replacing the copyto! and dispatching to the chunk size-based implementation and 2) the map!+ReshapedArray are both very unnatural workarounds for a very specific package and use case. But of course I wanted to benchmark Base.ReshapedArray for the example. BTW, as a comparison the code in the PR seems to perform worse (than both examples):

julia> function h4(::Type{T}, result, x, index, chunksize) where {T}
           map!(view(Base.ReshapedArray(result, (length(result),), ()), index:(index + chunksize - 1)), 1:chunksize) do _
               return zero(x)
           end
           return result
       end
h4 (generic function with 1 method)

julia> @btime h4(Float64, $(rand(100, 2)), 0f0, 20, 15);
  10.412 ns (0 allocations: 0 bytes)

(I'm not too surprised that it's easier for the compiler to optimize the for loop than the map!.)

devmotion · 2023-01-25T21:04:14Z

So it seems the PR actually introduces a regression in its current form?

marius311 · 2023-01-25T21:30:01Z

I had benchmarked some full gradient calls, and there I saw essentially no difference (presumably extracting the gradient is not a dominant part of the computation except for trivial functions). But you're right it looks like the extract_gradient_chunk! code alone is a little slower for y::Real, but also try the case y::Dual, on my machine its only like 30% slower (do the exact code in the PR):

# PR
julia> @btime ForwardDiff.extract_gradient_chunk!(Nothing, $(rand(100, 2)), $(ForwardDiff.Dual(rand(16)...)), 20, 15);
  10.052 ns (0 allocations: 0 bytes)

# master
julia> @btime ForwardDiff.extract_gradient_chunk!(Nothing, $(rand(100, 2)), $(ForwardDiff.Dual(rand(16)...)), 20, 15);
  8.530 ns (0 allocations: 0 bytes)

However, copyto! is like 5X slower than that on my machine, so thats not actually an alternative.

So at the moment I guess it boils down to a small speed regression in extract_gradient_chunk which should negligibly impact realistic gradient calls in exchange for having this package work on GPU.

devmotion

OK, so what do you think about adding an implementation of extract_gradient_chunk!(::Type{T}, result, dual::Real, index, chunksize) using fill! with ReshapedArray which seems to be fast? The PR already uses ReshapedArray anyway, so if we go with it, we might as well just avoid the regression and use ReshapedArray once more.

And afterwards maybe check that there are really no regressions left?

devmotion · 2023-01-26T00:56:09Z

src/gradient.jl

-    offset = index - 1
-    for i in 1:chunksize
-        result[i + offset] = partials(T, dual, i)
+    map!(view(Base.ReshapedArray(result, (length(result),), ()), index:index+chunksize-1), 1:chunksize) do i


I was looking around a bit and also browsing old julia issues - maybe we just have to use ReshapedArray for optimal performance, even though it's unexported.

In any case, I think it would be good to add some explanations for why the function is implemented in this way:

Suggested change

map!(view(Base.ReshapedArray(result, (length(result),), ()), index:index+chunksize-1), 1:chunksize) do i

# `reshape` without allocations: https://github.com/JuliaLang/julia/issues/36313

out = view(Base.ReshapedArray(result, (length(result),), ()), index:(index+chunksize-1)), 1:chunksize)

# use `map!` instead of a for-loop for GPU compatibility: #619

map!(out) do i

devmotion · 2023-01-26T00:59:10Z

src/gradient.jl

-    for i in 1:chunksize
-        result[i + offset] = partials(T, dual, i)
+    map!(view(Base.ReshapedArray(result, (length(result),), ()), index:index+chunksize-1), 1:chunksize) do i
+        @inbounds partials(T, dual, i)


Is this really safe? The current implementation does not use @inbounds:

Suggested change

@inbounds partials(T, dual, i)

return partials(T, dual, i)

devmotion · 2023-01-26T01:05:09Z

src/gradient.jl

+extract_gradient!(::Type{T}, result::AbstractArray, dual::Dual) where {T} =
+    extract_gradient_chunk!(T, result, dual, 1, npartials(dual))


I'm still somewhat unhappy about this change, intuitively it feels that it should be easier for base and packages to optimize copyto!(::AbstractArray, ::Tuple) than the chunk-based version. But if benchmarks don't show any regression, I guess it seems OK - and we don't have to wait for CUDA to add the implementation.

vpuri3 · 2023-11-01T20:28:01Z

@marius311 is this still being worked on?

vpuri3 · 2023-11-01T20:41:21Z

Was this not merged earlier due to a regression on presently working cases? If so, then we can move this implementation to an extension package ForwardDiffGPUArraysExt which defines ForwardDiff.extract_gradient_chunk!(::Type{T}, ::AbstractGPUArray, ...)

marius311 · 2023-11-01T21:47:53Z

My summary from what I remember is what I said above:

at the moment I guess it boils down to a small speed regression in extract_gradient_chunk which should negligibly impact realistic gradient calls in exchange for having this package work on GPU

I unfortunately ran out of time to dedicate to responding to the requests here but anyone is absolutely welcome to finish it off or close it and do your extension package solution.

vpuri3 · 2023-11-03T20:51:27Z

would that be a workable solution in your opinion, @devmotion?

vpuri3 · 2023-11-15T14:20:38Z

@devmotion

vpuri3 · 2023-11-16T18:03:45Z

@mcabbott what do you think about moving this change to a GPUArrays ext?

mcabbott reviewed Dec 20, 2022

View reviewed changes

src/gradient.jl Outdated Show resolved Hide resolved

devmotion reviewed Dec 21, 2022

View reviewed changes

src/gradient.jl Outdated Show resolved Hide resolved

make extract_gradient[_chunk]! GPU compatible

9dc9769

marius311 force-pushed the cuda_extract branch from 0f0d0d9 to 9dc9769 Compare December 22, 2022 08:14

marius311 requested a review from devmotion January 5, 2023 20:33

devmotion reviewed Jan 5, 2023

View reviewed changes

devmotion reviewed Jan 26, 2023

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

make extract_gradient[_chunk]! GPU compatible #619

make extract_gradient[_chunk]! GPU compatible #619

marius311 commented Dec 20, 2022

codecov-commenter commented Dec 20, 2022 •

edited

Loading

mcabbott left a comment

marius311 commented Dec 20, 2022

marius311 commented Dec 22, 2022 •

edited

Loading

marius311 commented Jan 3, 2023

devmotion left a comment

marius311 commented Jan 5, 2023

marius311 commented Jan 24, 2023 •

edited

Loading

marius311 commented Jan 25, 2023 •

edited

Loading

devmotion commented Jan 25, 2023

marius311 commented Jan 25, 2023

devmotion commented Jan 25, 2023

marius311 commented Jan 25, 2023

marius311 commented Jan 25, 2023

devmotion commented Jan 25, 2023

marius311 commented Jan 25, 2023 •

edited

Loading

devmotion commented Jan 25, 2023

marius311 commented Jan 25, 2023 •

edited

Loading

devmotion commented Jan 25, 2023 •

edited

Loading

devmotion commented Jan 25, 2023

marius311 commented Jan 25, 2023

devmotion left a comment

devmotion Jan 26, 2023

devmotion Jan 26, 2023

devmotion Jan 26, 2023

vpuri3 commented Nov 1, 2023

vpuri3 commented Nov 1, 2023

marius311 commented Nov 1, 2023

vpuri3 commented Nov 3, 2023

vpuri3 commented Nov 15, 2023

vpuri3 commented Nov 16, 2023

-    map!(view(Base.ReshapedArray(result, (length(result),), ()), index:index+chunksize-1), 1:chunksize) do i
+    # `reshape` without allocations: https://github.com/JuliaLang/julia/issues/36313
+    out = view(Base.ReshapedArray(result, (length(result),), ()), index:(index+chunksize-1)), 1:chunksize)
+    # use `map!` instead of a for-loop for GPU compatibility: #619
+    map!(out) do i

		extract_gradient!(::Type{T}, result::AbstractArray, dual::Dual) where {T} =
		extract_gradient_chunk!(T, result, dual, 1, npartials(dual))

make extract_gradient[_chunk]! GPU compatible #619

Are you sure you want to change the base?

make extract_gradient[_chunk]! GPU compatible #619

Conversation

marius311 commented Dec 20, 2022

codecov-commenter commented Dec 20, 2022 • edited Loading

Codecov Report

mcabbott left a comment

Choose a reason for hiding this comment

marius311 commented Dec 20, 2022

marius311 commented Dec 22, 2022 • edited Loading

marius311 commented Jan 3, 2023

devmotion left a comment

Choose a reason for hiding this comment

marius311 commented Jan 5, 2023

marius311 commented Jan 24, 2023 • edited Loading

marius311 commented Jan 25, 2023 • edited Loading

devmotion commented Jan 25, 2023

marius311 commented Jan 25, 2023

devmotion commented Jan 25, 2023

marius311 commented Jan 25, 2023

marius311 commented Jan 25, 2023

devmotion commented Jan 25, 2023

marius311 commented Jan 25, 2023 • edited Loading

devmotion commented Jan 25, 2023

marius311 commented Jan 25, 2023 • edited Loading

devmotion commented Jan 25, 2023 • edited Loading

devmotion commented Jan 25, 2023

marius311 commented Jan 25, 2023

devmotion left a comment

Choose a reason for hiding this comment

devmotion Jan 26, 2023

Choose a reason for hiding this comment

devmotion Jan 26, 2023

Choose a reason for hiding this comment

devmotion Jan 26, 2023

Choose a reason for hiding this comment

vpuri3 commented Nov 1, 2023

vpuri3 commented Nov 1, 2023

marius311 commented Nov 1, 2023

vpuri3 commented Nov 3, 2023

vpuri3 commented Nov 15, 2023

vpuri3 commented Nov 16, 2023

codecov-commenter commented Dec 20, 2022 •

edited

Loading

marius311 commented Dec 22, 2022 •

edited

Loading

marius311 commented Jan 24, 2023 •

edited

Loading

marius311 commented Jan 25, 2023 •

edited

Loading

marius311 commented Jan 25, 2023 •

edited

Loading

marius311 commented Jan 25, 2023 •

edited

Loading

devmotion commented Jan 25, 2023 •

edited

Loading