make extract_gradient[_chunk]! GPU compatible #619

Open
wants to merge 1 commit into master

Conversation

@marius311

Fixes this:

ForwardDiff.gradient(sum, cu(rand(10))) # Error: scalar indexing

My ad hoc tests suggest no performance impact on 1.8.3 for result::Vector; however, there is an extra (2 allocations: 80 bytes) for result::Matrix, I think due to a reshape/view thing.

I'm not sure if there's some other form of the code that works without scalar indexing and without this allocation, but I'm open to suggestions. Another alternative is to write some glue code here or in CUDA.jl that handles CuArray specifically, but that seems more brittle.

I'd vote that this solution is OK (pending other feedback) for the benefit of CUDA compatibility with generic code, but that's just IMO.

@codecov-commenter

codecov-commenter commented Dec 20, 2022

Codecov Report

Base: 86.83% // Head: 89.74% // Increases project coverage by +2.90% 🎉

Coverage data is based on head (9dc9769) compared to base (6a19554).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #619      +/-   ##
==========================================
+ Coverage   86.83%   89.74%   +2.90%     
==========================================
  Files           9       10       +1     
  Lines         889      965      +76     
==========================================
+ Hits          772      866      +94     
+ Misses        117       99      -18     
Impacted Files Coverage Δ
src/gradient.jl 98.92% <100.00%> (+0.06%) ⬆️
src/apiutils.jl 100.00% <0.00%> (ø)
src/ForwardDiff.jl 100.00% <0.00%> (ø)
src/jacobian.jl 99.34% <0.00%> (+0.03%) ⬆️
src/derivative.jl 94.87% <0.00%> (+0.13%) ⬆️
src/partials.jl 84.34% <0.00%> (+0.13%) ⬆️
src/config.jl 87.27% <0.00%> (+0.23%) ⬆️
src/dual.jl 82.77% <0.00%> (+3.01%) ⬆️
src/prelude.jl 93.93% <0.00%> (+56.43%) ⬆️


☔ View full report at Codecov.

@mcabbott (Member) left a comment

This surely seems harmless.

Maybe it would be worth adding some tests using JLArrays, so that they can run on GitHub?

src/gradient.jl: outdated review comment (resolved)
@marius311 (Author)

Added a barebones set of JLArray tests.
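
For reference, a quick way to exercise this locally without a GPU is something along these lines (a rough sketch, not the PR's actual test code; JLArray is a CPU-backed array type that mimics GPU semantics):

using ForwardDiff, JLArrays, GPUArraysCore
GPUArraysCore.allowscalar(false)                  # make scalar indexing throw, as CUDA does
ForwardDiff.gradient(sum, jl(rand(Float32, 10)))  # scalar-indexing error on master; expected to work with this PR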

src/gradient.jl: outdated review comment (resolved)
@marius311 (Author)

marius311 commented Dec 22, 2022

OK, I found the magic code which gives no allocation or speed regressions vs. current master for Arrays of any dimension, and which works with CuArrays. It's this:

map!(view(Base.ReshapedArray(result, (length(result),), ()), index:index+chunksize-1), 1:chunksize) do i
    @inbounds partials(T, dual, i)
end

The trick is indeed, as @mcabbott suggested, to use a ReshapedArray, which unlike reshape is non-allocating for Arrays (TIL). And CUDA.jl is smart enough to dispatch on view(ReshapedArray(CuArray)) and do its thing.
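
(For context, a rough illustration of the reshape vs. Base.ReshapedArray difference for a Matrix; this is a sketch, not code from the PR. The wrapper never escapes the function, so only reshape's wrapper Array shows up as an allocation:)

function fill_chunk_reshape!(result, v, index, chunksize)
    fill!(view(reshape(result, length(result)), index:index+chunksize-1), v)
    return result
end

function fill_chunk_reshapedarray!(result, v, index, chunksize)
    fill!(view(Base.ReshapedArray(result, (length(result),), ()), index:index+chunksize-1), v)
    return result
end

A = rand(100, 2)
fill_chunk_reshape!(A, 0.0, 20, 15); fill_chunk_reshapedarray!(A, 0.0, 20, 15)  # compile first
@allocated fill_chunk_reshape!(A, 0.0, 20, 15)         # expected: a small allocation from reshape's wrapper Array
@allocated fill_chunk_reshapedarray!(A, 0.0, 20, 15)   # expected: 0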

Squashed into a single commit. Ready on my end.

@marius311 (Author)

I don't want this to get lost over the holidays - could this get looked at again?

@devmotion (Member) left a comment

To me this still seems mainly a CUDA issue. Changing the definitions to copyto! operating on Tuples seems like the most direct and natural implementation to me, and one that I would expect to work for any array, whereas the code in this PR seems a bit non-intuitive and exists solely because of CUDA. So I would suggest

  • using copyto! with Tuples (as suggested in more detail above; a rough sketch follows this list)
  • adding definitions of copyto! with Tuple to CUDA
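
For concreteness, a hypothetical sketch of those two bullets (this is not code from the PR or from CUDA.jl; copyto!(::CuArray, ::Tuple) does not exist yet):

using ForwardDiff: Dual, partials

# ForwardDiff side: hand the partials to copyto! as a Tuple and let array packages specialize it.
extract_gradient_sketch!(::Type{T}, result::AbstractArray, dual::Dual) where {T} =
    copyto!(result, partials(T, dual).values)  # .values is the NTuple of partial derivatives

# CUDA side: CUDA.jl would need something like Base.copyto!(dest::CuArray, src::Tuple),
# e.g. uploading the tuple and copying on the device, so that the generic fallback's
# scalar indexing is never hit.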

@marius311 (Author)

adding definitions of copyto! with Tuple to CUDA

That seems like a stretch to me with CUDA, tbh. Although these things aren't specified perfectly, copyto! seems intended as an array interface, not really one for tuples. I would expect such a PR not to be accepted in CUDA, but we can try.

In the meantime, this PR has no regressions but enables GPU support, so can it still be merged? And we can revisit later if we get the CUDA interface?

@marius311 (Author)

marius311 commented Jan 24, 2023

Here's an issue with the copyto! option. How do you implement extract_gradient_chunk!(::Type{T}, result::AbstractArray, y::Real, index, chunksize) so that it's inferred and allocation-free for Arrays? The closest I can come up with is:

copyto!(result, index, ntuple(i->0, Val(chunksize)), 1, chunksize)

which is not inferred, although the Julia compiler must work some magic because it's allocation-free. But it's 2x slower than my solution in this PR.

So, in lieu of a solution to that, it seems that even if we get copyto!(::CuArray, ::Tuple) (which, FWIW, was suggested on Slack as possibly doable), it's not enough to write the code here generically.

@marius311 (Author)

marius311 commented Jan 25, 2023

I should also note that one nice thing about the copyto! option is that it would allow the specialized copyto!(::CuArray, ::Tuple) to pass the tuple data into the kernel by reference, as opposed to by value, which is what currently happens (and which has a 4KB memory limit, as pointed out on Slack). I'm not sure that's a real worry except for chunk sizes so large they'd be unrealistic, but I wanted to point it out.

But in any case, I can't figure out how to do it with only copyto! while avoiding regressions, cf. the comment just above.

@devmotion (Member)

How do you implement extract_gradient_chunk!(::Type{T}, result::AbstractArray, y::Real, index, chunksize) so that it's inferred and allocation-free for Arrays?

With fill! as done in the definition of extract_gradient! for y::Real, I would assume?
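
(The existing definition being referred to is, roughly, the following; quoted from memory for context, so it may not match the current source exactly:)

extract_gradient!(::Type{T}, result::AbstractArray, y::Real) where {T} = fill!(result, zero(y))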

@marius311 (Author)

Thanks, fill! doesn't have a variant with an offset / chunk size though unless I'm mistaken?

@devmotion (Member)

Thanks, fill! doesn't have a variant with an offset / chunk size though unless I'm mistaken?

No, not that I know of. But couldn't you fill! a view?

I still think that it would be nice if we wouldn't have to use slightly unorthodox implementations, just to be able to add implicit support for another package.

To mention the obvious, on Julia >= 1.9 yet another alternative would be to add e.g. GPUArraysCore (I guess that might be sufficient?) as a weak dependency and add a special method for GPU arrays there. The disadvantage would be that it would not fix Julia < 1.9.

@marius311 (Author)

No, not that I know of. But couldn't you fill! a view?

No, that's allocating for multidimensional Arrays, which is the entire issue that led to the solution currently in this PR.

@marius311 (Author)

I still think that it would be nice if we wouldn't have to use slightly unorthodox implementations, just to be able to add implicit support for another package.

Yeah, I'm sympathetic to that, also because of the by-value thing, but ultimately, although the code looks a little weird, in the case of Array it's purely relying on the getindex/setindex! interface, and in the case of CuArrays on the map! interface, which are definitely very orthodox.

@devmotion (Member)

the map! interface, which are definitely very orthodox.

Oh, by "unorthodox" I was rather referring to the fact that the PR currently removes a simple copyto! statement and adds a workaround based on ReshapedArray.

@marius311 (Author)

marius311 commented Jan 25, 2023

No, I get that. All I'm saying is that it's unorthodox in terms of the usage of ReshapedArray, sure, but in terms of the underlying native code for Arrays it compiles to nearly identical code as before this PR, since both this and the old for-loop ultimately end up at setindex!(::Array, ..).

@devmotion (Member)

No, that's allocating for multidimensional Arrays, which is the entire issue that led to the solution currently in this PR.

We could use fill! + Base.ReshapedArray 😄 That seems to be faster than the current for-loop:

julia> function h2(::Type{T}, result, x, index, chunksize) where {T}
           fill!(view(Base.ReshapedArray(result, (length(result),), ()), index:(index + chunksize - 1)), zero(x))
           return result
       end
h2 (generic function with 1 method)

julia> @btime h2(Float64, $(rand(100, 2)), 0f0, 20, 15);
  2.570 ns (0 allocations: 0 bytes)

julia> function h3(::Type{T}, result, x, index, chunksize) where {T}
           offset = index - 1
           for i in 1:chunksize
               result[i + offset] = zero(x)
           end
           return result
       end
h3 (generic function with 1 method)

julia> @btime h3(Float64, $(rand(100, 2)), 0f0, 20, 15);
  5.338 ns (0 allocations: 0 bytes)

@marius311 (Author)

marius311 commented Jan 25, 2023

Sorry, I'm confused: if you're OK with ReshapedArray there, what's the issue with the current PR?

@devmotion (Member)

devmotion commented Jan 25, 2023

Well, I still think that 1) replacing the copyto! and dispatching to the chunk-size-based implementation and 2) the map! + ReshapedArray are both very unnatural workarounds for a very specific package and use case. But of course I did want to benchmark Base.ReshapedArray for the example. BTW, as a comparison, the code in the PR seems to perform worse (than both examples):

julia> function h4(::Type{T}, result, x, index, chunksize) where {T}
           map!(view(Base.ReshapedArray(result, (length(result),), ()), index:(index + chunksize - 1)), 1:chunksize) do _
               return zero(x)
           end
           return result
       end
h4 (generic function with 1 method)

julia> @btime h4(Float64, $(rand(100, 2)), 0f0, 20, 15);
  10.412 ns (0 allocations: 0 bytes)

(I'm not too surprised that it's easier for the compiler to optimize the for loop than the map!.)

@devmotion (Member)

So it seems the PR actually introduces a regression in its current form?

@marius311 (Author)

I had benchmarked some full gradient calls, and there I saw essentially no difference (presumably extracting the gradient is not a dominant part of the computation except for trivial functions). But you're right, it looks like the extract_gradient_chunk! code alone is a little slower for y::Real. But also try the case y::Dual; on my machine it's only about 30% slower (using the exact code in the PR):

# PR
julia> @btime ForwardDiff.extract_gradient_chunk!(Nothing, $(rand(100, 2)), $(ForwardDiff.Dual(rand(16)...)), 20, 15);
  10.052 ns (0 allocations: 0 bytes)

# master
julia> @btime ForwardDiff.extract_gradient_chunk!(Nothing, $(rand(100, 2)), $(ForwardDiff.Dual(rand(16)...)), 20, 15);
  8.530 ns (0 allocations: 0 bytes)

However, copyto! is about 5x slower than that on my machine, so that's not actually an alternative.

So at the moment I guess it boils down to a small speed regression in extract_gradient_chunk which should negligibly impact realistic gradient calls in exchange for having this package work on GPU.

@devmotion (Member) left a comment

OK, so what do you think about adding an implementation of extract_gradient_chunk!(::Type{T}, result, dual::Real, index, chunksize) using fill! with ReshapedArray, which seems to be fast? The PR already uses ReshapedArray anyway, so if we go with it, we might as well just avoid the regression and use ReshapedArray once more.

And afterwards maybe check that there are really no regressions left?
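
Concretely, such a method might look roughly like this (a sketch based on the h2 benchmark above, not code that is currently in the PR):

function extract_gradient_chunk!(::Type{T}, result, dual::Real, index, chunksize) where {T}
    # all partials of a plain Real are zero; fill the chunk through a non-allocating flat view
    fill!(view(Base.ReshapedArray(result, (length(result),), ()), index:(index + chunksize - 1)), zero(dual))
    return result
end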

-    offset = index - 1
-    for i in 1:chunksize
-        result[i + offset] = partials(T, dual, i)
+    map!(view(Base.ReshapedArray(result, (length(result),), ()), index:index+chunksize-1), 1:chunksize) do i
A Member left a comment:

I was looking around a bit and also browsing old Julia issues - maybe we just have to use ReshapedArray for optimal performance, even though it's unexported.

In any case, I think it would be good to add some explanation of why the function is implemented this way:

Suggested change
map!(view(Base.ReshapedArray(result, (length(result),), ()), index:index+chunksize-1), 1:chunksize) do i
# `reshape` without allocations: https://github.com/JuliaLang/julia/issues/36313
out = view(Base.ReshapedArray(result, (length(result),), ()), index:(index+chunksize-1))
# use `map!` instead of a for-loop for GPU compatibility: #619
map!(out, 1:chunksize) do i

-    for i in 1:chunksize
-        result[i + offset] = partials(T, dual, i)
+    map!(view(Base.ReshapedArray(result, (length(result),), ()), index:index+chunksize-1), 1:chunksize) do i
+        @inbounds partials(T, dual, i)
A Member left a comment:

Is this really safe? The current implementation does not use @inbounds:

Suggested change
@inbounds partials(T, dual, i)
return partials(T, dual, i)

Comment on lines +83 to +84
extract_gradient!(::Type{T}, result::AbstractArray, dual::Dual) where {T} =
extract_gradient_chunk!(T, result, dual, 1, npartials(dual))
A Member left a comment:

I'm still somewhat unhappy about this change; intuitively, it feels like it should be easier for Base and packages to optimize copyto!(::AbstractArray, ::Tuple) than the chunk-based version. But if benchmarks don't show any regression, I guess it seems OK - and we don't have to wait for CUDA to add the implementation.

@vpuri3

vpuri3 commented Nov 1, 2023

@marius311 is this still being worked on?

@vpuri3

vpuri3 commented Nov 1, 2023

Was this not merged earlier due to a regression on presently working cases? If so, then we can move this implementation to an extension package, ForwardDiffGPUArraysExt, which defines ForwardDiff.extract_gradient_chunk!(::Type{T}, ::AbstractGPUArray, ...).
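
A rough sketch of what such an extension could look like (hypothetical; the module name, the use of GPUArraysCore as the weak dependency, and reusing this PR's map!-based approach are all assumptions):

# ext/ForwardDiffGPUArraysCoreExt.jl, with GPUArraysCore listed under [weakdeps] and
# [extensions] in ForwardDiff's Project.toml (Julia >= 1.9 only)
module ForwardDiffGPUArraysCoreExt

using ForwardDiff, GPUArraysCore
using ForwardDiff: partials

function ForwardDiff.extract_gradient_chunk!(::Type{T}, result::GPUArraysCore.AbstractGPUArray,
                                             dual, index, chunksize) where {T}
    # reshaping a GPU array does not copy, and map! keeps the work on the device
    map!(view(vec(result), index:index+chunksize-1), 1:chunksize) do i
        partials(T, dual, i)
    end
    return result
end

end # module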

@marius311 (Author)

My summary from what I remember is what I said above:

at the moment I guess it boils down to a small speed regression in extract_gradient_chunk which should negligibly impact realistic gradient calls in exchange for having this package work on GPU

I unfortunately ran out of time to dedicate to responding to the requests here, but anyone is absolutely welcome to finish it off, or to close it and do your extension package solution.

@vpuri3

vpuri3 commented Nov 3, 2023

Would that be a workable solution in your opinion, @devmotion?

@vpuri3

vpuri3 commented Nov 15, 2023

@devmotion

@vpuri3

vpuri3 commented Nov 16, 2023

@mcabbott what do you think about moving this change to a GPUArrays ext?
