Activation Functions Overview
Activation Functions used in (Deep) Neural Networks and their properties:
Note that there is no solid theory of how activation functions should be chosen. There are infinitely many possible activation functions, and which one is best for a given problem is unknown.
However, Glorot and Bengio (2010) investigated the effect of various activation functions with respect to the Vanishing Gradient problem and drew the following conclusions:
- The more classical neural networks with sigmoid or hyperbolic tangent units and standard initialization fare rather poorly, converging more slowly and apparently towards ultimately poorer local minima.
- Sigmoid activations (not symmetric around 0) should be avoided when initializing from small random weights, because they yield poor learning dynamics, with initial saturation of the top hidden layer.
- The softsign networks seem to be more robust to the initialization procedure than the tanh networks, presumably because of their gentler non-linearity.
- For tanh networks, the proposed normalized initialization can be quite helpful, presumably because the layer-to-layer transformations maintain magnitudes of activations (flowing upward) and gradients (flowing backward).
- Other methods can alleviate discrepancies between layers during learning, e.g., exploiting second-order information to set the learning rate separately for each parameter.
Step Function: sharp step from 0 to 1. Used by the Perceptron. Not useful for non-binary problems.
Sigmoid Function: the most basic smooth activation; a soft (smooth) version of the step function.
Problem: learning slowdown when output neurons saturate (i.e. are close to 0 or 1) while training with the quadratic cost, because the derivative of the sigmoid becomes small when the output is near 0 or 1, which means small weight updates even when the error is large. Using the Cross-Entropy Cost Function helps; see http://neuralnetworksanddeeplearning.com/chap3.html#the_cross-entropy_cost_function
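To make the slowdown concrete, here is a minimal NumPy sketch (illustrative values only) comparing the output-layer gradient of a single sigmoid neuron under the quadratic cost and under the cross-entropy cost. The quadratic-cost gradient carries a factor σ'(z) that vanishes when the neuron saturates; the cross-entropy gradient does not.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A single sigmoid output neuron that is badly wrong AND saturated:
# target y = 0, but the weighted input z is large, so the output a ~ 1.
y, z = 0.0, 5.0
a = sigmoid(z)
sigma_prime = a * (1 - a)                    # derivative of the sigmoid at z

# Gradients of the cost w.r.t. z (single example, output layer only):
grad_quadratic     = (a - y) * sigma_prime   # d/dz of 0.5 * (a - y)^2
grad_cross_entropy = (a - y)                 # d/dz of -[y*ln(a) + (1-y)*ln(1-a)]

print(f"a = {a:.4f}  (the error is large)")
print(f"quadratic cost gradient:     {grad_quadratic:.4f}  (tiny -> slow learning)")
print(f"cross-entropy cost gradient: {grad_cross_entropy:.4f}  (large -> fast learning)")
```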
TanH Function: like the Sigmoid, but with output range [-1,1] instead of [0,1].
This means that target/output data has to be re-scaled accordingly to [-1,1].
used when?
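As a minimal sketch (the helper names are just illustrative), the re-scaling between [0,1] targets and the [-1,1] tanh output range could look like this:

```python
import numpy as np

def rescale_01_to_pm1(y):
    """Map targets from [0, 1] to [-1, 1] to match the tanh output range."""
    return 2.0 * y - 1.0

def rescale_pm1_to_01(a):
    """Map tanh outputs from [-1, 1] back to [0, 1]."""
    return (a + 1.0) / 2.0

y = np.array([0.0, 0.25, 1.0])
print(rescale_01_to_pm1(y))                         # [-1.  -0.5  1. ]
print(rescale_pm1_to_01(np.tanh([-3.0, 0.0, 3.0]))) # back in [0, 1]
```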
TL: The softmax function is a "transfer" function that "balances" (adjusts) the activations of all neurons in a layer, based on their weighted inputs, so that the activations sum up to 1.
The output from the softmax layer can be thought of as a probability distribution.
The fact that a softmax layer outputs a probability distribution is rather pleasing. In many problems it's convenient to be able to interpret the output activation a^L_j as the network's estimate of the probability that the correct output is j. So, for instance, in the MNIST classification problem, we can interpret a^L_j as the network's estimated probability that the correct digit classification is j.
By contrast, if the output layer was a sigmoid layer, then we certainly couldn't assume that the activations formed a probability distribution.
see http://neuralnetworksanddeeplearning.com/chap3.html#softmax to try out the impact of the Softmax function with sliders
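A minimal NumPy sketch of the softmax (with the usual max-subtraction for numerical stability; the function name is just illustrative):

```python
import numpy as np

def softmax(z):
    """Turn a vector of weighted inputs z into a probability distribution."""
    z = z - np.max(z)          # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

a = softmax(np.array([2.0, 1.0, 0.1]))
print(a)           # approx. [0.659 0.242 0.099]
print(a.sum())     # 1.0 -- the activations sum to one
```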
A unit employing the rectifier is also called a rectified linear unit (ReLU).
Rectifier Function: f(z) = max(0,z) , more precisely max(0,w⋅x+b), with input x, weight vector w, and bias b
(blue: Rectifier, green: Softplus (see below))
This activation function has been argued to be more biologically plausible than the widely used logistic sigmoid and its more practical counterpart, the hyperbolic tangent.
The rectifier is, as of 2015, the most popular activation function for deep neural networks.
https://en.wikipedia.org/wiki/Rectifier_(neural_networks)
In 2011, the use of the rectifier as a non-linearity was shown for the first time to enable training deep supervised neural networks without requiring unsupervised pre-training. Rectified linear units, compared to the sigmoid or similar activation functions, allow for faster and more effective training of deep neural architectures on large and complex datasets. The common trait is that they implement local competition between small groups of units within a layer (max(x,0) can be interpreted as competition with a fixed value of 0), so that only part of the network is activated for any given input pattern.
Advantages
- Biological plausibility: One-sided, compared to the antisymmetry of tanh.
- Sparse activation: For example, in a randomly initialized network, only about 50% of hidden units are activated (i.e. have a non-zero output); see the sketch after this list.
- Efficient gradient propagation: No vanishing gradient problem or exploding effect.
- Efficient computation: Only comparison, addition and multiplication.
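A minimal NumPy sketch of a ReLU layer (names and shapes are purely illustrative): each unit computes max(0, w⋅x + b), and with random initialization roughly half of the units end up inactive (zero output).

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)
x = rng.standard_normal(100)             # input vector
W = rng.standard_normal((50, 100))       # weights of a hidden layer with 50 units
b = np.zeros(50)                         # biases

a = relu(W @ x + b)                      # per unit: max(0, w.x + b)
print(f"active units: {np.count_nonzero(a)} / {a.size}")   # roughly half
```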
These newer variations of the ReLU provide some advantages over the original ReLU (a small sketch of each follows the list):
- Leaky ReLUs allow a small, non-zero gradient when the unit is not active: f(x) = x for x > 0, and f(x) = 0.01x (more generally αx, with a small constant α) otherwise.
- ELU
- SELU
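A rough NumPy sketch of these variants, using the standard textbook formulas (the SELU α and scale constants are the published self-normalizing values, quoted here only for illustration):

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    """Small, non-zero slope alpha for negative inputs instead of a hard zero."""
    return np.where(z > 0, z, alpha * z)

def elu(z, alpha=1.0):
    """Exponential Linear Unit: smooth, saturates towards -alpha for very negative inputs."""
    return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

def selu(z, alpha=1.6732632423543772, scale=1.0507009873554805):
    """Scaled ELU; the constants are chosen so that activations self-normalize."""
    return scale * np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(leaky_relu(z))
print(elu(z))
print(selu(z))
```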
A smooth approximation to the rectifier is the analytic function
f(x) = ln(1 + e^x)
which is called the softplus function.
See green line in ReLU image above.
The derivative of softplus is f'(x) = 1 / (1 + e^{-x}), i.e. the logistic (sigmoid) function.
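A small sketch (illustrative only) that checks this derivative claim numerically with a finite difference:

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))           # ln(1 + e^x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-4, 4, 9)
eps = 1e-6
finite_diff = (softplus(x + eps) - softplus(x - eps)) / (2 * eps)
print(np.allclose(finite_diff, sigmoid(x), atol=1e-6))   # True: f'(x) is the sigmoid
```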
Activation functions are closely related to the cost function used!
E.g. a sigmoid has the learning-slowdown problem when a quadratic cost function is used, but not when using the cross-entropy cost. Also, a softmax output layer with log-likelihood cost is quite similar to a sigmoid output layer with cross-entropy cost.
Given this similarity, should you use a sigmoid output layer and cross-entropy, or a softmax output layer and log-likelihood? In fact, in many situations both approaches work well.
As a more general point of principle, softmax plus log-likelihood is worth using whenever you want to interpret the output activations as probabilities. That's not always a concern, but can be useful with classification problems (like MNIST) involving disjoint classes.
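To make the similarity concrete, here is a small sketch (single training example, illustrative values) showing that both pairings give an output-layer gradient of the same simple form, a - y, with respect to the weighted inputs z; the softmax/log-likelihood case is verified with a finite difference.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([1.5, -0.3, 0.2])    # weighted inputs of the output layer
y = np.array([0.0, 1.0, 0.0])     # one-hot target

# Analytic output-layer gradients w.r.t. z -- both have the form a - y:
grad_softmax_loglik = softmax(z) - y     # softmax outputs + log-likelihood cost
grad_sigmoid_xent   = sigmoid(z) - y     # sigmoid outputs + cross-entropy cost

# Finite-difference check of the softmax/log-likelihood case:
def loglik_cost(z, y):
    return -np.log(softmax(z)[np.argmax(y)])

eps = 1e-6
numeric = np.array([(loglik_cost(z + eps * np.eye(3)[j], y) -
                     loglik_cost(z - eps * np.eye(3)[j], y)) / (2 * eps)
                    for j in range(3)])

print(grad_softmax_loglik)
print(grad_sigmoid_xent)
print(np.allclose(numeric, grad_softmax_loglik, atol=1e-6))   # True
```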