Automatic Differentiation tools #1

Open
tbreloff opened this issue Aug 29, 2016 · 10 comments

Comments

@tbreloff
Member

I think it would be smart to use the great tools already created if they fit our needs, or to improve them so they do. Here's a summary of what I hope to see in JuliaML, and my first impressions of the options.

I want (a rough sketch of this follows the list):

  • simple macros/methods to take a complex function and produce a type-stable and fast gradient calculation of all free/learnable parameters
  • simple definition of new "primitives": losses, penalties, transformations, etc for inclusion in AD
  • clean underlying internals (computation graph) so that we can:
    • parallelize
    • plot
    • evolve the structure dynamically
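
To make that wish list concrete, here is a purely hypothetical sketch of the kind of API I have in mind; none of these names (@primitive, @primitive_deriv, gradient_blueprint, grad!) exist yet:

    # Purely hypothetical sketch -- every name here is made up for illustration.

    # Define a new primitive (a loss) together with its derivative, so the AD
    # machinery can include it when building gradients:
    @primitive       myloss(ŷ, y) = abs2(ŷ - y)
    @primitive_deriv myloss(ŷ, y) = 2(ŷ - y)      # derivative wrt ŷ

    # Build the gradient computation once (type-stable), then reuse it in a hot
    # loop without re-deriving or recompiling anything:
    t  = Affine(10, 5)
    ∇t = gradient_blueprint(myloss, t)
    for (x, y) in batches
        grad!(∇t, x, y)    # fast, in-place gradient of all learnable parameters
    end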

AutoGrad.jl (@denizyuret)

I get the impression that this is very flexible and feature-rich, but the speed worries me. It seems that the function g in g = grad(f) actually contains a forward pass, graph construction, and a backward pass all in one. Clearly this is not optimal if the graph structure is unchanged from one iteration to the next. If there's a clean way to separate graph construction from a backward gradient calculation, then this package might be really nice. I don't know enough about it to comment on the flexibility of the internal representation, though.
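
For reference, the usage pattern I'm talking about is roughly this (my reading of the AutoGrad.jl README):

    using AutoGrad

    f(w) = sum(w .* w)     # some scalar loss of the parameters
    g = grad(f)            # returns a function computing the gradient wrt w

    w  = randn(5)
    ∇w = g(w)              # each call re-records the graph: forward pass,
                           # graph construction, and backward pass together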

ForwardDiff (@jrevels @lbenet)

Doesn't seem practical for anything beyond toy problems. Unless I'm missing something?
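
For context, this is the kind of usage I'm thinking of; one gradient call costs on the order of the input dimension divided by the chunk size, which is what worries me for models with many parameters:

    using ForwardDiff

    f(w) = sum(w .* w)
    w = randn(10_000)

    # Cost grows with length(w), since forward mode propagates one
    # (chunk of) directional derivative(s) per pass through f:
    ∇w = ForwardDiff.gradient(f, w)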

ReverseDiffSource

My gut is that it will be possible to evolve, manipulate, and plot the underlying graph, as it's explicitly built (as opposed to using traces). I worry about the type stability... I think I remember reading about looping through Vector{Any} objects in hot loops.

ReverseDiffOverload

Last commit was 2 years ago

ReverseDiffSparse (@mlubin)

I don't know much about this yet


Please add your thoughts and opinions.

@tbreloff
Member Author

I should note that this issue might be more appropriate in ObjectiveFunctions.jl, but we can have the discussion here.

@mlubin

mlubin commented Aug 29, 2016

JuMP (via ReverseDiffSparse) is probably the fastest and most reliable reverse-mode implementation in Julia at this point. It's also the only AD tool I'm aware of that does sparse Hessians. The downside is that you have to rewrite your functions into JuMP syntax. Your decision whether it's worth the tradeoff.
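
For anyone who hasn't used it, the rewrite in question looks roughly like this (a sketch from memory of the current JuMP macro syntax; details may be off):

    using JuMP, Ipopt

    m = Model(solver=IpoptSolver())
    @variable(m, x)
    @variable(m, y)

    # The objective must be written with JuMP's macros so that
    # ReverseDiffSparse can build its expression graph (and sparse Hessian):
    @NLobjective(m, Min, (1 - x)^2 + 100 * (y - x^2)^2)

    solve(m)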

@tbreloff
Member Author

Thanks for the comment @mlubin. Do you think that's true for all types of functions? Dense and Sparse? What would you say are the biggest limitations of the package (aside from syntax)?

@mlubin

mlubin commented Aug 29, 2016

Functions with dense derivatives work okay as well. If you're just asking for the gradient then you'll get it back at a small constant times the cost of evaluating the function, as reverse mode promises.
@jeff-regier recently did some tests on a pretty substantial code base and found that JuMP's reverse mode was about 10x slower than manually coded derivatives.

@lbenet

lbenet commented Aug 29, 2016

TaylorSeries.jl does compute Jacobians and Hessians as well. That said, this is not the primary purpose of the package, though I wouldn't classify it (or ForwardDiff.jl) as unable to handle anything beyond toy models. TaylorSeries.jl is intended to compute higher-order derivatives of functions of one or many independent variables through recursion formulas, e.g., to make accurate integrations of ODEs.

From your description of what you would like a package to do, it is probably not ready; the API is rather primitive, though it has good performance.

@datnamer

There is also this: https://github.com/jrevels/ReverseDiffPrototype.jl

@jrevels

jrevels commented Aug 29, 2016

Since @datnamer brought it up: Yeah, I'm working on a reverse-mode AD package for native Julia code. It's pretty feature-complete as it is, and I'd say the performance is really promising, but there is still a good bit of testing/documentation/optimization to do before it's release-ready.

[Forward-mode AD] doesn't seem practical for anything beyond toy problems. Unless I'm missing something?

In general, forward-mode is faster than reverse-mode for differentiating functions whose domains contain fewer dimensions than their range, and is faster in practice for a lot of elementary scalar functions. It's also useful in cases where reverse-mode would require more memory than is feasible. Furthermore, it can be leveraged as a component of reverse-mode implementations (both JuMP and ReverseDiffPrototype perform intermediary forward-mode AD during their forward pass).
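
To make the dimension argument concrete, here is a small illustrative case where forward mode is the natural choice because the function has far more outputs than inputs:

    using ForwardDiff

    # f maps R^2 -> R^100: forward mode needs ~2 directional passes,
    # whereas reverse mode would need ~100 backward passes.
    f(x) = [sin(i * x[1]) + cos(i * x[2]) for i in 1:100]

    J = ForwardDiff.jacobian(f, [0.3, 0.7])    # 100×2 Jacobian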

It seems that the function g in g = grad(f) actually contains a forward pass, graph construction, and a backward pass all in one. Clearly this is not optimal if the graph structure is unchanged from one iteration to the next.

This is a necessity if you need to be able to differentiate arbitrary code, in which the graph structure could depend on input values. With https://github.com/jrevels/ReverseDiffPrototype.jl, I plan on eventually allowing users to easily specify that their functions are data-independent so that the graph can be cached (and further down the line, unrolled and compiled to the GPU).
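
In user-facing terms that could look something like the sketch below; the names (record_tape, gradient!) are invented for illustration and are not the actual ReverseDiffPrototype API:

    # Hypothetical sketch only -- names invented for illustration.
    tape = record_tape(loss, w0)    # trace once; the user asserts that the
                                    # graph structure is data-independent
    for (x, y) in batches
        # replay forward/backward over the cached tape without re-tracing:
        ∇w = gradient!(tape, w, x, y)
    end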

@Evizero
Member

Evizero commented Aug 29, 2016

also relevant: https://github.com/dfdx/Espresso.jl

@tbreloff
Member Author

Thanks Christof. Espresso is really similar to what I was building in Transformations. I have a lot to review!!

@tbreloff
Member Author

@jrevels "Toy problems" sounds worse than I intended. I mean more that I'd like a solution that can be scaled to complex deep learning models with many free parameters. One would presumably have a complex model where the learnable parameters are either SubArrays of a large parameter vector, or the main parameter vector is a CatView. I want super fast ways of populating a gradient vector that corresponds to the partial derivatives wrt each param, which could then be used by some solver/learner to update.

For many problems, the gradient formula doesn't change from one iteration to the next. If this is as fast as possible, I'll probably be happy. It seems as though your prototype and Espresso are attempting to tackle this, which is great news.

If it helps, this is the sort of design that I have in my head. Method/macro names can be changed... I'm just trying to illustrate concepts:

  • Each op/method/transformation has an input dimension I and an output dimension O. These are fixed relative to each other, and are parameters of the type which is generated.
  • The @op macro creates a new type and constructor(s) which will build instances of this type. So an Affine(10,5) is actually constructing an Affine{1,1}(10, 5) where:
    size(x) == (10,)
    size(w) == (5, 10)
    size(b) == (5,)

The idea is that the code for the forward and backward passes depends on the types/dimensions... not on the actual sizes involved. This implies we could build a "computation graph blueprint" for both the forward and backward passes, and cache/reuse it for anything with the same dimensions (it's the same code whether the matrix is 5x10 or 10x10). One could chain together blueprints to make more complex blueprints, and many/most of the dimension parameters could be inferred/set by incoming and outgoing nodes.
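
Here's a minimal, hypothetical sketch of the dimensions-as-type-parameters idea, just to make it concrete:

    # Hypothetical sketch: input/output dimensionality (not size) as type parameters.
    immutable Affine{I,O}        # I = ndims of input, O = ndims of output
        nin::Int                 # e.g. 10
        nout::Int                # e.g. 5
    end
    Affine(nin, nout) = Affine{1,1}(nin, nout)

    # The forward/backward code depends only on {I,O}, so the same compiled
    # "blueprint" can be reused whether the weight matrix is 5x10 or 10x10:
    forward{I,O}(t::Affine{I,O}, w, b, x)    = w * x + b
    backward{I,O}(t::Affine{I,O}, w, x, ∇y)  = (∇y * x', w' * ∇y, ∇y)   # (∇w, ∇x, ∇b)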

This design seems nice because you could construct new graphs and subgraphs, add to or delete from existing graphs, or swap out sections of graphs, all without recomputing the gradient function or recompiling anything.

I get the feeling that you can only accomplish this design by working with raw AST to generate specialized types and methods. It seems Espresso might have some of the needed toolset... I'll need to review the other options in more detail to have a better opinion.

Thanks for all the comments, and the great packages. I'd love to hear thoughts on this design (and please be brutal as long as you keep it constructive ;) If you want to see some of my initial experiments, check out this
