XNOR Nets: ImageNet Classification Using Binary Convolutional Neural Networks
The two models presented:
In Binary-Weight-Networks, the (convolution) filters are approximated with binary values, resulting in a 32× memory saving.
In XNOR-Networks, both the filters and the input to convolutional layers are binary. ... This results in 58× faster convolutional operations...
Implications:
XNOR-Nets offer the possibility of running state-of-the-art networks on CPUs (rather than GPUs) in real-time.
For future discussions we use the following mathematical notation for a CNN layer:
$$ \mathcal{I}_{l\,(l=1,\dots,L)} = \mathbf{I}\in \mathbb{R}^{c \times w_{\text{in}} \times h_{\text{in}}}$$
$$ \mathcal{W}_{lk(k=1,...,K^l)}=\mathbf{W} \in \mathbb{R} ^{c \times w \times h} $$
$$ \ast\text{ : convolution} $$
$$ \oplus\text{ : convolution without multiplication} $$
$$ \otimes \text{ : convolution with XNOR and bitcount} $$
$$ \odot \text{ : elementwise multiplication} $$
In binary convolutional networks, we estimate the convolution filter weight as $$ \mathbf{W}\approx\alpha \mathbf{B}, \quad \alpha\in\mathbb{R}^{+},\ \mathbf{B}\in\{+1,-1\}^{c \times w \times h} $$
To find an optimal estimation for $$ \mathbf{W}\approx\alpha \mathbf{B} $$ we solve the following problem:
$$ \alpha^{*},\mathbf{B}^{*} =\underset{\alpha, \mathbf{B}}{\text{argmin}}\; J(\mathbf{B},\alpha), \quad J(\mathbf{B},\alpha)=\Vert \mathbf{W}-\alpha\mathbf{B} \Vert^{2} $$
Going straight to the answer, the optimal solution is:
$$ \mathbf{B}^{*}=\text{sign}(\mathbf{W}),\quad \alpha^{*}=\frac{1}{n}\Vert\mathbf{W}\Vert_{\ell 1} $$
where $n = c \times w \times h$.
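The paper's closed-form solution, $\mathbf{B}^{*}=\text{sign}(\mathbf{W})$ and $\alpha^{*}$ the mean of $\vert\mathbf{W}\vert$, can be checked with a minimal NumPy sketch. The filter shape and random values below are purely illustrative:

```python
import numpy as np

# Hypothetical filter weights (c x w x h); values are illustrative only.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4, 4))

# Closed-form solution for the binary approximation W ~ alpha * B:
B = np.sign(W)            # binary filter, entries in {+1, -1}
alpha = np.abs(W).mean()  # scaling factor: (1/n) * ||W||_l1

# alpha * B is the best binary approximation of W in the L2 sense.
approx_error = np.linalg.norm(W - alpha * B)

# Any other scaling of the same B gives an error at least as large:
worse_error = np.linalg.norm(W - 0.5 * alpha * B)
assert approx_error <= worse_error
```

Because the objective is quadratic in $\alpha$ for a fixed $\mathbf{B}$, perturbing the scaling factor away from the mean of absolute values can only increase the reconstruction error.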
The gradients are computed as follows:
$$ \frac{\partial C}{\partial W_i}=\frac{\partial C}{\partial \widetilde{W}_i}\left(\frac{1}{n}+\frac{\partial\,\text{sign}}{\partial W_i}\,\alpha\right) $$
where $C$ is the loss, $\widetilde{\mathbf{W}}=\alpha\mathbf{B}$ is the estimated weight, and $\frac{\partial\,\text{sign}}{\partial W_i}$ is taken with a straight-through estimator. The gradient values are kept as real values; they cannot be binarized due to excessive information loss. Optimization is done by either SGD with momentum or Adam.
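A NumPy sketch of this backward step, assuming an illustrative weight vector and an incoming gradient `g` with respect to the binarized weights (both made up for the example):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.uniform(-1.5, 1.5, size=(8,))  # real-valued weights (illustrative)
n = W.size

alpha = np.abs(W).mean()
W_tilde = alpha * np.sign(W)  # binarized weights used in the forward pass

# Suppose the loss gradient w.r.t. the binarized weights is g (illustrative):
g = rng.standard_normal(n)

# Straight-through estimator: d sign(W)/dW ~= 1 where |W| <= 1, else 0.
ste = (np.abs(W) <= 1.0).astype(W.dtype)

# Gradient w.r.t. the real-valued weights:
# dC/dW = dC/dW_tilde * (1/n + d sign/dW * alpha)
grad_W = g * (1.0 / n + ste * alpha)
```

Note that `grad_W` stays real-valued, matching the remark above that the gradients cannot themselves be binarized.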
Convolutions are a set of dot products between a submatrix of the input and a filter. Thus we attempt to express dot products in terms of binary operations.
For vectors $\mathbf{X}, \mathbf{W}\in\mathbb{R}^{n}$, we approximate the dot product as $\mathbf{X}^{\top}\mathbf{W}\approx \beta\mathbf{H}^{\top}\alpha\mathbf{B}$ with $\mathbf{H},\mathbf{B}\in\{+1,-1\}^{n}$.
We solve the following optimization problem:
$$\alpha^{*}, \mathbf{H}^{*}, \beta^{*}, \mathbf{B}^{*}=\underset{\alpha, \mathbf{H}, \beta, \mathbf{B}}{\text{argmin}} \Vert \mathbf{X} \odot \mathbf{W} - \beta \alpha \mathbf{H} \odot \mathbf{B} \Vert$$
Going straight to the answer:
$$ \mathbf{H}^{*}=\text{sign}(\mathbf{X}),\quad \mathbf{B}^{*}=\text{sign}(\mathbf{W}),\quad \beta^{*}=\frac{1}{n}\Vert\mathbf{X}\Vert_{\ell 1},\quad \alpha^{*}=\frac{1}{n}\Vert\mathbf{W}\Vert_{\ell 1} $$
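The binary dot product $\mathbf{H}^{\top}\mathbf{B}$ is what XNOR and bitcount replace: encoding $+1$ as bit 1 and $-1$ as bit 0, matching bits contribute $+1$ and differing bits $-1$, so $\mathbf{H}^{\top}\mathbf{B} = 2\,\text{popcount}(\text{xnor}(h,b)) - n$. A small sketch with illustrative random vectors:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 64
X = rng.standard_normal(n)
W = rng.standard_normal(n)

H, B = np.sign(X), np.sign(W)
beta, alpha = np.abs(X).mean(), np.abs(W).mean()

# Pack a {+1, -1} vector into an integer bitmask: +1 -> bit 1, -1 -> bit 0.
def pack(v):
    bits = 0
    for i, x in enumerate(v):
        if x > 0:
            bits |= 1 << i
    return bits

h, b = pack(H), pack(B)

# XNOR + bitcount replaces the floating-point dot product of H and B.
mask = (1 << n) - 1
matches = bin(~(h ^ b) & mask).count("1")
binary_dot = 2 * matches - n
assert binary_dot == int(H @ B)

# Scaling recovers an approximation of the real dot product:
approx = beta * alpha * binary_dot  # ~= X @ W
```

On real hardware the packed words live in machine registers, which is where the speedup over floating-point multiply-accumulate comes from.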
Calculating $\beta$ separately for every sub-tensor of the input is redundant, since the sub-tensors overlap. Instead, we first compute
$$ \mathbf{A}=\frac{\sum_{i}\vert \mathbf{I}_{:,:,i}\vert}{c} $$
which is an average over absolute values of $\mathbf{I}$ across channels, then convolve it with a filter $\mathbf{k}\in\mathbb{R}^{w\times h}$, $k_{ij}=\frac{1}{wh}$, to obtain $\mathbf{K}=\mathbf{A}\ast\mathbf{k}$, which contains the $\beta$ for all sub-tensors. This yields the approximation
$$ \mathbf{I}\ast\mathbf{W}\approx(\text{sign}(\mathbf{I})\otimes\text{sign}(\mathbf{W}))\odot\mathbf{K}\alpha $$
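The scaling map $\mathbf{K}$ can be sketched directly in NumPy; the input shape and filter size below are assumptions for illustration, and the "valid" box-filter convolution is written out by hand to stay self-contained:

```python
import numpy as np

rng = np.random.default_rng(3)
c, h_in, w_in = 3, 6, 6       # input shape (illustrative)
h, w = 3, 3                   # filter spatial size (illustrative)
I = rng.standard_normal((c, h_in, w_in))

# A: average of absolute values across the channel dimension.
A = np.abs(I).mean(axis=0)    # shape (h_in, w_in)

# K = A * k, where k is an (h x w) filter with every entry 1/(h*w).
# Each K[i, j] is the beta for the sub-tensor at spatial position (i, j).
K = np.empty((h_in - h + 1, w_in - w + 1))
for i in range(K.shape[0]):
    for j in range(K.shape[1]):
        K[i, j] = A[i:i + h, j:j + w].mean()

I_sign = np.sign(I)           # the other ingredient of the approximation
```

One pass over `A` thus replaces recomputing the mean of absolute values for every overlapping sub-tensor.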
A CNN block in XNOR-Net has the following structure:
[Binary Normalization] - [Binary Activation] - [Binary Convolution] - [Pool]
The BinNorm layer normalizes the input batch by its mean and variance. The BinActiv layer calculates $\text{sign}(\mathbf{I})$ and $\mathbf{K}$.
The paper implemented AlexNet, ResNet, and a GoogLeNet variant (Darknet) with binary convolutions. This resulted in an accuracy decrease of a few percentage points, but overall the approach worked fairly well. Refer to the paper for details.
Binary convolutions are not entirely binary; the gradients must be kept as real values. It would be fascinating if even the gradients could be binarized.
2019 Deepest Season 5 (2018.12.29 ~ 2019.06.15)