removing old deprecations and Optimise module #2412

Open
wants to merge 12 commits into
base: master
19 changes: 13 additions & 6 deletions docs/src/destructure.md
@@ -49,20 +49,27 @@ julia> Flux.destructure(grad) # acts on non-models, too
(Float32[10.339018, 11.379145, 22.845667, -29.565302, -37.644184], Restructure(Tuple, ..., 5))
```

!!! compat "Flux ≤ 0.12"
    Old versions of Flux had an entirely different implementation of `destructure`, which
    had many bugs (and almost no tests). Many comments online still refer to that now-deleted
    function, or to memories of it.
In order to collect all parameters of a model into a list instead, you can use the `trainables` function:

```julia
julia> Flux.trainables(model)
4-element Vector{AbstractArray}:
 [0.863101 1.2454957]
 [0.0]
 [1.290355429422727;;]
 [0.0]
```
Any mutation of the elements of the resulting list will affect the model's parameters.
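
For instance, here is a minimal sketch (assuming the same two-layer `model` as above) showing that the returned arrays alias the model's own parameters rather than copying them:

```julia
ps = Flux.trainables(model)

ps[1] .= 0        # mutate the first parameter array in place ...
model[1].weight   # ... and the change is visible inside the model
```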

### All Parameters

The function `destructure` now lives in [`Optimisers.jl`](https://github.com/FluxML/Optimisers.jl).
(Be warned this package is unrelated to the `Flux.Optimisers` sub-module! The confusion is temporary.)
The functions `destructure` and `trainables` live in [`Optimisers.jl`](https://github.com/FluxML/Optimisers.jl).


```@docs
Optimisers.destructure
Optimisers.trainable
Optimisers.trainables
Optimisers.isnumeric
```

76 changes: 21 additions & 55 deletions docs/src/models/advanced.md
@@ -26,7 +26,7 @@ Notice that we parameterized the type of the `chain` field. This is necessary for
You can then use the model like:

```julia
chain = Chain(Dense(10, 10))
chain = Chain(Dense(10 => 10))
model = CustomModel(chain)
model(rand(10))
```
@@ -40,33 +40,37 @@ Taking reference from our example `Affine` layer from the [basics](@ref man-basi
By default all the fields in the `Affine` type are collected as its parameters. However, in some cases it may be desirable to hold other metadata in our "layers" that is not needed for training, and should therefore be ignored when the parameters are collected. With Flux, the way to mark some fields of our layer as trainable is through overloading the `trainable` function:

```julia-repl
julia> @layer Affine
julia> struct Affine
         W
         b
       end

julia> Affine(in::Int, out::Int) = Affine(randn(out, in), randn(out));

julia> (m::Affine)(x) = m.W * x .+ m.b;

julia> Flux.@layer Affine

julia> a = Affine(Float32[1 2; 3 4; 5 6], Float32[7, 8, 9])
Affine(Float32[1.0 2.0; 3.0 4.0; 5.0 6.0], Float32[7.0, 8.0, 9.0])

julia> Flux.params(a) # default behavior
Params([Float32[1.0 2.0; 3.0 4.0; 5.0 6.0], Float32[7.0, 8.0, 9.0]])
julia> Flux.trainable(a) # default behavior
(W = Float32[1.0 2.0; 3.0 4.0; 5.0 6.0], b = Float32[7.0, 8.0, 9.0])

julia> Flux.trainable(a::Affine) = (; W = a.W) # returns a NamedTuple using the field's name

julia> Flux.params(a)
Params([Float32[1.0 2.0; 3.0 4.0; 5.0 6.0]])
julia> Flux.trainable(a)
(W = Float32[1.0 2.0; 3.0 4.0; 5.0 6.0],)
```

Only the fields returned by `trainable` will be collected as trainable parameters of the layer when calling `Flux.params`, and only these fields will be seen by `Flux.setup` and `Flux.update!` for training. But all fields will be seen by `gpu` and similar functions, for example:
Only the fields returned by `trainable` will be seen by `Flux.setup` and `Flux.update!` for training. But all fields will be seen by `gpu` and similar functions, for example:

```julia-repl
julia> a |> f16
Affine(Float16[1.0 2.0; 3.0 4.0; 5.0 6.0], Float16[7.0, 8.0, 9.0])
```
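
To sketch the training side (continuing from the REPL session above, with `Descent` purely as an example rule and a made-up loss), only `a.W` receives updates, because `a.b` is no longer listed by `trainable`:

```julia
opt_state = Flux.setup(Descent(0.1), a)

grad = Flux.gradient(m -> sum(abs2, m(Float32[1, 1])), a)[1]

Flux.update!(opt_state, a, grad)  # changes a.W only; a.b stays exactly as before
```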

Note that there is no need to overload `trainable` to hide fields which do not contain trainable parameters. (For example, activation functions, or Boolean flags.) These are always ignored by `params` and by training:

```julia-repl
julia> Flux.params(Affine(true, [10, 11, 12.0]))
Params([])
```
Note that there is no need to overload `trainable` to hide fields which do not contain numerical arrays (for example, activation functions or Boolean flags). These are always ignored by training.
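
For example, a quick sketch (the layer and field names here are made up for illustration):

```julia
using Flux

struct MyScale
    scale          # a numerical array, picked up for training
    act            # an activation function, ignored
    frozen::Bool   # a flag, also ignored
end

Flux.@layer MyScale

m = MyScale(Float32[1, 2, 3], relu, false)

Flux.trainables(m)  # 1-element Vector, containing only `scale`
```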

The exact same method of `trainable` can also be defined using the macro, for convenience:

Expand All @@ -76,52 +80,14 @@ Flux.@layer Affine trainable=(W,)

There is a second, more severe, kind of restriction possible. This is not recommended, but is included here for completeness. Calling `Functors.@functor Affine (W,)` means that no exploration of the model will ever visit the other fields: they will not be moved to the GPU by [`gpu`](@ref), and their precision will not be changed by `f32`. This requires the `struct` to have a corresponding constructor that accepts only `W` as an argument.
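
As a sketch of what this restriction looks like (the struct here is hypothetical, and Functors must be loaded):

```julia
using Flux, Functors

struct AffineRestricted
    W
    b
end

# The constructor taking only `W` is required, so that the restricted
# functor can rebuild the struct after visiting `W` alone:
AffineRestricted(W) = AffineRestricted(W, zeros(size(W, 1)))

Functors.@functor AffineRestricted (W,)

m = AffineRestricted(rand(Float32, 3, 2), zeros(Float32, 3))
m64 = m |> f64   # converts `W` only; `b` is rebuilt by the one-argument constructor
```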


## Freezing Layer Parameters

When it is desired to not include all the model parameters (e.g. for transfer learning), we can simply not pass those layers into our call to `params`.

!!! compat "Flux ≤ 0.14"
    The mechanism described here is for Flux's old "implicit" training style.
    When upgrading for Flux 0.15, it should be replaced by [`freeze!`](@ref Flux.freeze!) and `thaw!`.
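
    For reference, a rough sketch of the new explicit style on the model `m` defined below (any optimiser works in place of `Adam`):

    ```julia
    opt_state = Flux.setup(Adam(), m)

    Flux.freeze!(opt_state.layers[1])  # exclude the first Dense layer from updates
    Flux.freeze!(opt_state.layers[2])  # and the second

    # ... train with `opt_state` as usual ...

    Flux.thaw!(opt_state)              # re-enable training for every layer
    ```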

Consider a simple multi-layer perceptron model where we want to avoid optimising the first two `Dense` layers. We can obtain
this using the slicing features `Chain` provides:

```julia
m = Chain(
      Dense(784 => 64, relu),
      Dense(64 => 64, relu),
      Dense(64 => 10)
    );

ps = Flux.params(m[3:end])
```

The `Zygote.Params` object `ps` now holds a reference to only the parameters of the layers passed to it.

During training, the gradients will only be computed for (and applied to) the last `Dense` layer, therefore only that would have its parameters changed.

`Flux.params` also takes multiple inputs to make it easy to collect parameters from heterogeneous models with a single call. A simple demonstration would be if we wanted to omit optimising the second `Dense` layer in the previous example. It would look something like this:

```julia
Flux.params(m[1], m[3:end])
```

Sometimes, finer-grained control is needed. We can freeze a specific parameter of a specific layer
which has already entered a `Params` object `ps`, by simply deleting it from `ps`:

```julia
ps = Flux.params(m)
delete!(ps, m[2].bias)
```

## Custom multiple input or output layer

Sometimes a model needs to receive several separate inputs at once or produce several separate outputs at once. In other words, there are multiple paths within this high-level layer, each processing a different input or producing a different output. A simple example of this in machine learning literature is the [inception module](https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Szegedy_Rethinking_the_Inception_CVPR_2016_paper.pdf).

Naively, we could have a struct that stores the weights along each path and implement the joining/splitting in the forward pass function. But that would mean a new struct any time the operations along each path change. Instead, this guide will show you how to construct a high-level layer (like [`Chain`](@ref)) that is made of multiple sub-layers for each path.
We could have a struct that stores the weights along each path and implement the joining/splitting in the forward pass function. That would mean a new struct for each different block,
e.g. one would have a `TransformerBlock` struct for a transformer block, and a `ResNetBlock` struct for a ResNet block, each block being composed of smaller sub-blocks. This is often the simplest and cleanest way to implement complex models.

This guide instead will show you how to construct a high-level layer (like [`Chain`](@ref)) that is made of multiple sub-layers for each path.

### Multiple inputs: a custom `Join` layer

76 changes: 24 additions & 52 deletions docs/src/models/basics.md
@@ -74,68 +74,40 @@ julia> Flux.withgradient(g, nt)
(val = 1, grad = ((a = [0.0, 2.0], b = [-0.0, -2.0], c = nothing),))
```

!!! note "Implicit gradients"
Flux used to handle many parameters in a different way, using the [`params`](@ref Flux.params) function.
This uses a method of `gradient` which takes a zero-argument function, and returns a dictionary
through which the resulting gradients can be looked up:

```jldoctest basics
julia> x = [2, 1];

julia> y = [2, 0];

julia> gs = gradient(Flux.params(x, y)) do
f(x, y)
end
Grads(...)

julia> gs[x]
2-element Vector{Float64}:
0.0
2.0

julia> gs[y]
2-element Vector{Float64}:
-0.0
-2.0
```


## Building Simple Models

Consider a simple linear regression, which tries to predict an output array `y` from an input `x`.

```julia
W = rand(2, 5)
b = rand(2)

predict(x) = W*x .+ b
predict(W, b, x) = W*x .+ b

function loss(x, y)
ŷ = predict(x)
function loss(W, b, x, y)
ŷ = predict(W, b, x)
sum((y .- ŷ).^2)
end

x, y = rand(5), rand(2) # Dummy data
loss(x, y) # ~ 3
W = rand(2, 5)
b = rand(2)

loss(W, b, x, y) # ~ 3
```

To improve the prediction we can take the gradients of the loss with respect to `W` and `b` and perform gradient descent.

```julia
using Flux

gs = gradient(() -> loss(x, y), Flux.params(W, b))
dW, db = gradient((W, b) -> loss(W, b, x, y), W, b)
```

Now that we have gradients, we can pull them out and update `W` to train the model.

```julia
W̄ = gs[W]
W .-= 0.1 .* dW

W .-= 0.1 .* W̄

loss(x, y) # ~ 2.5
loss(W, b, x, y) # ~ 2.5
```

The loss has decreased a little, meaning that our prediction `ŷ` is closer to the target `y`. If we have some data we can already try [training the model](../training/training.md).
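
Repeating this step many times is all that plain gradient descent does. As a rough sketch, re-using the functions defined above:

```julia
for step in 1:100
    dW, db = gradient((W, b) -> loss(W, b, x, y), W, b)
    W .-= 0.1 .* dW
    b .-= 0.1 .* db
end

loss(W, b, x, y)  # now much smaller than before
```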
@@ -144,7 +116,7 @@ All deep learning in Flux, however complex, is a simple generalisation of this e

## Building Layers

It's common to create more complex models than the linear regression above. For example, we might want to have two linear layers with a nonlinearity like [sigmoid](https://en.wikipedia.org/wiki/Sigmoid_function) (`σ`) in between them. In the above style we could write this as:
It's common to create more complex models than the linear regression above. For example, we might want to have two linear layers with a nonlinearity like [sigmoid](https://en.wikipedia.org/wiki/Sigmoid_function) in between them. We could write this as:

```julia
using Flux
@@ -157,7 +129,7 @@ W2 = rand(2, 3)
b2 = rand(2)
layer2(x) = W2 * x .+ b2

model(x) = layer2(σ.(layer1(x)))
model(x) = layer2(sigmoid.(layer1(x)))

model(rand(5)) # => 2-element vector
```
@@ -174,7 +146,7 @@ end
linear1 = linear(5, 3) # we can access linear1.W etc
linear2 = linear(3, 2)

model(x) = linear2(σ.(linear1(x)))
model(x) = linear2(sigmoid.(linear1(x)))

model(rand(5)) # => 2-element vector
```
@@ -188,7 +160,7 @@ struct Affine
end

Affine(in::Integer, out::Integer) =
Affine(randn(out, in), randn(out))
Affine(randn(out, in), zeros(out))

# Overload call, so the object can be used as a function
(m::Affine)(x) = m.W * x .+ m.b
@@ -198,16 +170,16 @@ a = Affine(10, 5)
a(rand(10)) # => 5-element vector
```

Congratulations! You just built the `Dense` layer that comes with Flux. Flux has many interesting layers available, but they're all things you could have built yourself very easily.
Congratulations! You just built the [`Dense`](@ref) layer that comes with Flux. Flux has many interesting layers available, but they're all things you could have built yourself very easily.

(There is one small difference with `Dense` – for convenience it also takes an activation function, like `Dense(10 => 5, σ)`.)
(There is one small difference with `Dense` – for convenience it also takes an activation function, like `Dense(10 => 5, sigmoid)`.)

## Stacking It Up

It's pretty common to write models that look something like:

```julia
layer1 = Dense(10 => 5, σ)
layer1 = Dense(10 => 5, relu)
# ...
model(x) = layer3(layer2(layer1(x)))
```
@@ -217,7 +189,7 @@ For long chains, it might be a bit more intuitive to have a list of layers, like
```julia
using Flux

layers = [Dense(10 => 5, σ), Dense(5 => 2), softmax]
layers = [Dense(10 => 5, relu), Dense(5 => 2), softmax]

model(x) = foldl((x, m) -> m(x), layers, init = x)

Expand All @@ -228,7 +200,7 @@ Handily, this is also provided for in Flux:

```julia
model2 = Chain(
Dense(10 => 5, σ),
Dense(10 => 5, relu),
Dense(5 => 2),
softmax)

@@ -255,22 +227,22 @@ m(5) # => 26

## Layer Helpers

There is still one problem with this `Affine` layer: Flux does not know to look inside it. This means that [`Flux.train!`](@ref) won't see its parameters, nor will [`gpu`](@ref) be able to move them to your GPU. These features are enabled by the [`@layer`](@ref Flux.@layer) macro:
There is still one problem with this `Affine` layer: Flux does not know to look inside it. This means that [`Flux.train!`](@ref Flux.train!) won't see its parameters, nor will [`gpu`](@ref) be able to move them to your GPU. These features are enabled by the [`@layer`](@ref Flux.@layer) macro:

```julia
Flux.@layer Affine
```

Finally, most Flux layers make bias optional, and allow you to supply the function used for generating random weights. We can easily add these refinements to the `Affine` layer as follows, using the helper function [`create_bias`](@ref Flux.create_bias):

```
function Affine((in, out)::Pair; bias=true, init=Flux.randn32)
```julia
function Affine((in, out)::Pair; bias=true, init=glorot_uniform)
  W = init(out, in)
  b = Flux.create_bias(W, bias, out)
  Affine(W, b)
  return Affine(W, b)
end

Affine(3 => 1, bias=false, init=ones) |> gpu
Affine(3 => 1, bias=false) |> gpu
```

```@docs