Automatic Differentiation tools #1
I should note that this issue might be more appropriate in ObjectiveFunctions.jl, but we can have the discussion here.
JuMP (via ReverseDiffSparse) is probably the fastest and most reliable reverse-mode implementation in Julia at this point. It's also the only AD tool I'm aware of that does sparse Hessians. The downside is that you have to rewrite your functions into JuMP syntax. Your decision if it's worth the tradeoff.
Thanks for the comment @mlubin. Do you think that's true for all types of functions? Dense and sparse? What would you say are the biggest limitations of the package (aside from syntax)?
Functions with dense derivatives work okay as well. If you're just asking for the gradient then you'll get it back at a small constant times the cost of evaluating the function, as reverse mode promises. |
From your description of what you would like a package to do, it is probably not ready; the API is rather primitive, though it has good performance.
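For concreteness, here is a rough sketch of the "rewrite into JuMP syntax" workflow and of querying derivatives from a JuMP model, loosely following the "querying derivatives" section of the JuMP 0.x-era documentation. Treat it as illustrative; the exact calls may differ between JuMP/MathProgBase versions.

```julia
# Sketch only: derivative-query calls follow my reading of the JuMP nonlinear
# docs of the time and may differ across versions.
using JuMP, MathProgBase

m = Model()
@variable(m, x)
@variable(m, y)
@NLobjective(m, Min, (1 - x)^2 + 100 * (y - x^2)^2)

d = JuMP.NLPEvaluator(m)
MathProgBase.initialize(d, [:Grad])          # request gradient evaluations
g = zeros(2)
MathProgBase.eval_grad_f(d, g, [0.5, 0.5])   # reverse-mode gradient (ReverseDiffSparse under the hood)
```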
There is also this: https://github.com/jrevels/ReverseDiffPrototype.jl |
Since @datnamer brought it up: Yeah, I'm working on a reverse-mode AD package for native Julia code. It's pretty feature complete as it is, and I'd say the performance is really promising, but there is still a good bit of testing/documentation/optimization to do before it's release-ready.
In general, forward-mode is faster than reverse-mode for differentiating functions whose domains contain fewer dimensions than their range, and is faster in practice for a lot of elementary scalar functions. It's also useful in cases where reverse-mode would require more memory than is feasible. Furthermore, it can be leveraged as a component of reverse-mode implementations (both JuMP and ReverseDiffPrototype perform intermediary forward-mode AD during their forward pass).
This is a necessity if you need to be able to differentiate arbitrary code, in which the graph structure could be dependent on input values. With https://github.com/jrevels/ReverseDiffPrototype.jl, I plan on eventually allowing users to easily specify that their functions are data-independent so that the graph can be cached (and further down the line, unrolled and compiled to the GPU).
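For reference, a minimal forward-mode example with ForwardDiff.jl; the toy function and inputs here are just illustrative.

```julia
using ForwardDiff

f(x) = sum(sin, x) + prod(x)   # scalar-valued toy function
x = rand(3)

ForwardDiff.gradient(f, x)     # forward mode: cost grows with the number of inputs
ForwardDiff.hessian(f, x)      # second derivatives via nested dual numbers
```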
also relevant: https://github.com/dfdx/Espresso.jl |
Thanks Christof. Espresso is really similar to what I was building in …
@jrevels "Toy problems" sounds worse than I intended. I mean more that I'd like a solution that can be scaled to complex deep learning models with many free parameters. One would presumably have a complex model where the learnable parameters are either SubArrays of a large parameter vector, or the main parameter vector is a CatView. I want super fast ways of populating a gradient vector that corresponds to the partial derivatives wrt each param, which could then be used by some solver/learner to update. For many problems, the gradient formula doesn't change from one iteration to the next. If this is as fast as possible, I'll probably be happy. It seems as though your prototype and Espresso are attempting to tackle this, which is great news. If it helps, this is the sort of design that I have in my head. Method/macro names can be changed... I'm just trying to illustrate concepts:
size(x) == (10,)
size(w) == (5, 10)
size(b) == (5,) The idea is that the code for the forward and backward passes depends on the types/dimensions... not on the actual sizes involved. This implies we could build a "computation graph blueprint" for both the forward and backward passes, and cache/reuse it for anything with the same dimensions (it's the same code whether the matrix is 5x10 or 10x10). One could chain together blueprints to make more complex blueprints, and many/most of the dimension parameters could be inferred/set by incoming and outgoing nodes. This design seems nice because you could construct new graphs and subgraphs, add to or delete from existing graphs, or swap out sections of graphs, all without recomputing the gradient function or recompiling anything. I get the feeling that you can only accomplish this design by working with raw AST to generate specialized types and methods. It seems Espresso might have some of the needed toolset... I'll need to review the other options in more detail to have a better opinion. Thanks for all the comments, and the great packages. I'd love to hear thoughts on this design (and please be brutal as long as you keep it constructive ;) If you want to see some of my initial experiments, check out this |
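To make the "computation graph blueprint" idea above a bit more concrete, here is a hypothetical sketch (affine_blueprint is a made-up name, not an API from any of the packages discussed): the forward/backward code is written once per structural signature and reused for any concrete sizes.

```julia
# Hypothetical sketch: one cached forward/backward pair for y = w*x + b.
# The same code applies whether w is 5x10 or 10x10; nothing is regenerated.
function affine_blueprint()
    forward(w, x, b) = w * x + b
    function backward(dy, w, x, b)   # dy = sensitivity of the output y
        dw = dy * x'                 # same shape as w
        dx = w' * dy                 # same shape as x
        db = copy(dy)                # same shape as b
        return dw, dx, db
    end
    return forward, backward
end

fwd, bwd = affine_blueprint()
x, w, b = rand(10), rand(5, 10), rand(5)
y = fwd(w, x, b)
dw, dx, db = bwd(ones(5), w, x, b)   # reuse the same pair for other sizes
```

A real implementation would presumably generate such closures from the AST and key a cache on argument types/dimensionality, which is roughly what the blueprint chaining described above would require.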
I think it would be smart to attempt to use the great tools already created if they can fit our needs, or improve them to fit our needs. Here's a summary of what I hope to see in JuliaML, and my first impressions of the options.
I want:
AutoGrad.jl (@denizyuret)
I get the impression that this is very flexible and full-featured, but the speed worries me. It seems that the function g in g = grad(f) actually contains a forward pass, graph construction, and a backward pass all in one. Clearly this is not optimal if the graph structure is unchanged in each iteration. If there's a clean way to separate graph construction from a backward gradient calculation, then this package might be really nice. I don't know enough about it to comment on the flexibility of the internal representation, though.
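For reference, minimal AutoGrad.jl usage (roughly as in its README) looks like the following; note that each call to g re-records the tape and runs the backward pass, which is the concern above.

```julia
using AutoGrad

f(w) = sum(w .* w)
g = grad(f)              # gradient with respect to the first argument
g([1.0, 2.0, 3.0])       # -> [2.0, 4.0, 6.0]; forward pass + tape + backward pass on every call
```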
ForwardDiff (@jrevels @lbenet)
Doesn't seem practical for anything beyond toy problems. Unless I'm missing something?
ReverseDiffSource
My gut is that it will be possible to evolve, manipulate, and plot the underlying graph, as it's explicitly built (as opposed to using traces). I worry about the type stability... I think I remember reading about looping through Vector{Any} objects in hot loops.
ReverseDiffOverload
Last commit was 2 years ago.
ReverseDiffSparse (@mlubin)
I don't know much about this yet.
Please add your thoughts and opinions.