Model 1: AlexNet : Image Classification
Paper: ImageNet Classification with Deep Convolutional Neural Networks | Talk: NIPS 2012 | Slides: link
2012 ILSVRC (ImageNet Large-Scale Visual Recognition Challenge) Winner.
It is a benchmark competition where teams across the world compete to classify, localize, and detect objects in images across 1000 categories, drawn from the ImageNet dataset. The ImageNet dataset holds 15M images in 22K categories, but for this contest 1.2M images in 1K categories were used. The goal here was classification, i.e. make 5 guesses about the label of an image. Team "SuperVision" (AlexNet) achieved a top-5 test error rate of 15.4% (the next best entry achieved an error of 26.2%), more than 10.8 percentage points ahead of the runner-up. This was a huge success. Check the ILSVRC 2012 results.
This paper is important because it stands as a stepping stone for CNNs in the Computer Vision community. It was record-breaking, new and exciting.
AlexNet is a Convolutional Neural Network architecture introduced in 2012 by Alex Krizhevsky, Ilya Sutskever and Geoffrey Hinton. It has 7 hidden weight layers and contains ≈ 650,000 neurons, ≈ 60,000,000 parameters and ≈ 630,000,000 connections. In simple terms, it is a model to correctly classify images. Later, in 2014, Alex once again showed a unique way to parallelize CNNs in his paper, "One weird trick for parallelizing convolutional neural networks".
AlexNet contains only 8 weight layers: the first 5 are convolutional layers, followed by 3 fully connected layers, with max-pooling, normalization and dropout layers in between. A simple skeleton looks like:
But wait,
What are Convolutional, Fully Connected, Max-pooling (P), Dropout and Normalization (N) layers? Read it here, where I have explained everything in detail, or you can also read CS231n's notes on CNNs.
The network has a very similar architecture to LeNet, but is deeper, bigger, and features convolutional layers stacked directly on top of each other (previously it was common to have only a single CONV layer, always immediately followed by a POOL layer). The architecture can be summarized as:
( Image ) -> CONV1 -> P1 -> N1 -> CONV2 -> P2 -> N2 -> CONV3 -> CONV4 -> CONV5 -> P3 -> FC6 -> FC7 -> FC8 -> ( Label )
But why does the architecture diagram in the paper look so scary?
It is because the figure shows the training setup as well; training was done on 2 GPUs. One GPU runs the layer parts at the top of the figure while the other runs the layer parts at the bottom. The GPUs communicate only at certain layers. The communication overhead is kept low, and this helps to achieve good overall performance. You can check this slide for future reference. Also, these comparisons are handy.
Input Image size : 227 x 227 x 3
(the paper says 224 x 224, but the arithmetic only works out with a 227 x 227 input, so 227 is used here)
→ CONV1
Output (from Conv1): 55 x 55 x 96 // 55 = (227 - 11)/4 + 1 = (image size - filter size)/stride + 1 (see the helper below)
First layer: Conv1 has 96 11x11x3 filters at stride 4, pad 0
Output (from Pool1): 27 x 27 x 96
Max Pool 1 has a 3 x 3 filter applied at stride 2
Output (from Normalization Layer): 27 x 27 x 96
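The spatial sizes in this walkthrough all follow from the same formula. A minimal helper (plain Python, no framework needed, names are my own) that reproduces the numbers above:

```python
def conv_output_size(input_size, filter_size, stride, pad):
    """(input - filter + 2*pad) / stride + 1: the standard CONV/POOL output-size formula."""
    return (input_size - filter_size + 2 * pad) // stride + 1

# CONV1: 11x11 filters, stride 4, pad 0 on a 227x227 input
print(conv_output_size(227, 11, 4, 0))   # 55
# POOL1: 3x3 window, stride 2
print(conv_output_size(55, 3, 2, 0))     # 27
```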
→ CONV2
Output (from Conv2): 27 x 27 x 256
Second layer: Conv2 has 256 5x5x48 filters at stride 1, pad 2
Output (from Pool2): 13 x 13 x 256
Max Pool 2 has a 3 x 3 filter applied at stride 2
Output (from Normalization Layer): 13 x 13 x 256
→ CONV3
Output (from Conv3): 13 x 13 x 384
Third layer: Conv3 has 384 3x3x256 filters at stride 1, pad 1
→ CONV4
Output (from Conv4): 13 x 13 x 384
Fourth layer: Conv4 has 384 3x3x192 filters at stride 1, pad 1
→ CONV5
Output (from Conv5): 13 x 13 x 256
Fifth layer: Conv5 has 256 3x3x192 filters at stride 1, pad 1
Output (from Pool3): 6 x 6 x 256
Max Pool 3 has a 3 x 3 filter applied at stride 2
→ FC6
Fully Connected Layer 6: 4096 neurons
→ FC7
Fully Connected Layer 7: 4096 neurons
→ FC8
Fully Connected Layer 8: 1000 neurons (class scores)
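Putting the walkthrough together, here is a minimal single-stream Keras sketch of this layer stack (my own construction following the sizes above, not the paper's actual two-GPU implementation; the Local Response Normalization is approximated with a Lambda wrapper around tf.nn.local_response_normalization):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def lrn(x):
    # Local Response Normalization with the paper's hyperparameters (k=2, n=5, alpha=1e-4, beta=0.75)
    return tf.nn.local_response_normalization(x, depth_radius=2, bias=2.0, alpha=1e-4, beta=0.75)

def alexnet(num_classes=1000):
    return models.Sequential([
        layers.Input(shape=(227, 227, 3)),
        # CONV1 -> P1 -> N1
        layers.Conv2D(96, 11, strides=4, activation="relu"),                  # 55 x 55 x 96
        layers.MaxPooling2D(pool_size=3, strides=2),                          # 27 x 27 x 96
        layers.Lambda(lrn),
        # CONV2 -> P2 -> N2
        layers.Conv2D(256, 5, strides=1, padding="same", activation="relu"),  # 27 x 27 x 256
        layers.MaxPooling2D(pool_size=3, strides=2),                          # 13 x 13 x 256
        layers.Lambda(lrn),
        # CONV3 -> CONV4 -> CONV5 -> P3
        layers.Conv2D(384, 3, strides=1, padding="same", activation="relu"),  # 13 x 13 x 384
        layers.Conv2D(384, 3, strides=1, padding="same", activation="relu"),  # 13 x 13 x 384
        layers.Conv2D(256, 3, strides=1, padding="same", activation="relu"),  # 13 x 13 x 256
        layers.MaxPooling2D(pool_size=3, strides=2),                          # 6 x 6 x 256
        # FC6 -> FC7 -> FC8
        layers.Flatten(),
        layers.Dense(4096, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(4096, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ], name="alexnet")

model = alexnet()
model.summary()   # about 62M parameters for this single-stream version;
                  # the paper's grouped two-GPU version has roughly 60M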
• Uses ReLU (Rectified Linear Unit) for the non-linear part, instead of the Tanh or Sigmoid functions that were the earlier standard for traditional neural networks.
• The ReLU non-linearity is applied to the output of every convolutional and fully connected layer. Over-fitting is reduced by using a dropout layer after the first two FC layers (FC6 and FC7).
• Rectified Linear Units (one of the first large-scale uses), overlapping pooling, and the dropout (0.5) trick to avoid overfitting.
• Layer 1 (convolutional): 55*55*96 = 290,400 neurons, each looking at 11*11*3 = 363 weights plus 1 bias, i.e. 290,400 * 364 = 105,705,600 connections in the first layer of AlexNet alone! (Thanks to weight sharing, CONV1 has only 96 * 364 = 34,944 learnable parameters; the short check below works this out.)
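A quick back-of-the-envelope check of those numbers (plain Python):

```python
# CONV1: 96 filters of 11x11x3 producing a 55x55x96 output volume
neurons     = 55 * 55 * 96            # 290,400 output neurons
per_neuron  = 11 * 11 * 3 + 1         # 363 weights + 1 bias = 364
connections = neurons * per_neuron    # 105,705,600 connections
# ...but the 96 filters are shared across all spatial positions,
# so the number of learnable parameters in CONV1 is tiny by comparison:
parameters  = 96 * per_neuron         # 34,944
print(connections, parameters)
```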
• Training on multiple GPUs (2 NVIDIA GTX 580 3 GB GPUs) for 5-6 days. The two-GPU net reduces the top-1 and top-5 error rates by 1.7% and 1.2% respectively, compared with a net with half as many kernels trained on one GPU (a multi-GPU sketch follows below).
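The paper splits the model itself across the two GPUs (model parallelism). A common present-day substitute is data parallelism; a minimal sketch with tf.distribute.MirroredStrategy, reusing the hypothetical alexnet() builder from above (train_ds is a placeholder dataset):

```python
import tensorflow as tf

# Data parallelism across all visible GPUs: each replica holds a full copy of the model,
# unlike the paper, which split the layers themselves across two GTX 580s.
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = alexnet()                      # the builder sketched earlier
    model.compile(optimizer="sgd",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
# model.fit(train_ds, epochs=90)           # train_ds: a tf.data.Dataset of (image, label) batches
```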
• Local Response Normalization: response normalization reduces the top-1 and top-5 error rates by 1.4% and 1.2%, respectively (a small sketch follows below).
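For reference, a minimal sketch of that normalization with the paper's hyperparameters (k = 2, n = 5, alpha = 1e-4, beta = 0.75), using TensorFlow's built-in op (depth_radius is half the window n):

```python
import tensorflow as tf

def local_response_norm(x):
    # b_i = a_i / (k + alpha * sum of a_j^2 over n neighbouring channels) ** beta
    return tf.nn.local_response_normalization(
        x, depth_radius=2, bias=2.0, alpha=1e-4, beta=0.75)

x = tf.random.normal((1, 27, 27, 96))    # e.g. the CONV1 -> POOL1 output volume
print(local_response_norm(x).shape)      # (1, 27, 27, 96): the shape is unchanged
```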
• Overlapping pooling (stride s smaller than pooling window z; here s = 2, z = 3): compared to the non-overlapping scheme s = 2, z = 2, the top-1 and top-5 error rates decrease by 0.4% and 0.3%, respectively. The paper also observes that overlapping pooling makes the model slightly harder to overfit.
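A tiny shape comparison of the two pooling schemes on the CONV1 output volume (a sketch of my own, not the paper's code):

```python
import tensorflow as tf

x = tf.random.normal((1, 55, 55, 96))                               # CONV1 output
overlapping = tf.keras.layers.MaxPooling2D(pool_size=3, strides=2)  # z = 3, s = 2
plain       = tf.keras.layers.MaxPooling2D(pool_size=2, strides=2)  # z = 2, s = 2
print(overlapping(x).shape)   # (1, 27, 27, 96)
print(plain(x).shape)         # (1, 27, 27, 96): same output size, but in the first case
                              # each 3x3 window overlaps its neighbours
```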
• Reducing overfitting: heavy data augmentation! (see the sketch below)
- 60 million parameters, 650,000 neurons (overfits a lot)
- Random 224x224 crops from the 256x256 training images (and their horizontal reflections)
- At test time, average the predictions on 10 patches (the four corner crops and the center crop, plus their horizontal reflections)
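A hedged sketch of that augmentation with tf.image (the function names here are my own; only the 256 to 224 random crop, the horizontal flips and the 10-patch averaging come from the paper):

```python
import tensorflow as tf

def augment(image):
    """Training-time augmentation: random 224x224 crop + random horizontal flip."""
    image = tf.image.random_crop(image, size=[224, 224, 3])
    image = tf.image.random_flip_left_right(image)
    return image

def ten_crop(image):
    """Test-time: 4 corner crops + center crop, plus their horizontal reflections."""
    crops = [
        image[:224, :224],  image[:224, -224:],      # top-left, top-right
        image[-224:, :224], image[-224:, -224:],     # bottom-left, bottom-right
        image[16:240, 16:240],                       # center crop of a 256x256 input
    ]
    crops = crops + [tf.image.flip_left_right(c) for c in crops]
    return tf.stack(crops)                           # (10, 224, 224, 3); average the predictions

example = tf.random.uniform((256, 256, 3))
print(augment(example).shape, ten_crop(example).shape)   # (224, 224, 3) (10, 224, 224, 3)
```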
• Reducing overfitting: dropout (a quick demo follows below)
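A two-line illustration of the dropout trick (Keras applies it only when training=True and rescales the survivors by 1/(1 - p), so nothing special is needed at test time):

```python
import tensorflow as tf

x = tf.ones((1, 8))
drop = tf.keras.layers.Dropout(0.5)
print(drop(x, training=True).numpy())    # roughly half the units zeroed, survivors scaled to 2.0
print(drop(x, training=False).numpy())   # identity at inference time
```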
• Stochastic Gradient Descent (SGD) learning with a batch size of 128, momentum 0.9 and weight decay 0.0005.
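These settings translate to Keras roughly as follows (a sketch reusing the hypothetical alexnet() builder from above; the paper's 0.0005 weight decay would be added per layer via an L2 kernel regularizer, not shown here):

```python
import tensorflow as tf

# Initial learning rate 0.01, divided by 10 whenever the validation error plateaus (done manually)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)

model = alexnet()                                    # the builder sketched earlier
model.compile(optimizer=optimizer,
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, batch_size=128, epochs=90)   # placeholder data
```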
• CONV1 learns 96 convolutional kernels of size 11 x 11 x 3:
- the top 48 kernels, learned on GPU 1, are largely color-agnostic
- the bottom 48 kernels, learned on GPU 2, are largely color-specific
• The kernels of CONV2, CONV4 and CONV5 are connected only to the feature maps residing on the same GPU.
• CONV3 and the fully connected layers are connected to all feature maps in the preceding layer, across both GPUs (a grouped-connectivity sketch follows below).
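To mimic the restricted connectivity of CONV2/CONV4/CONV5 (each half only sees the feature maps produced on its own GPU), the channels can be split into two groups and convolved separately; a minimal functional-API sketch of my own, equivalent in connectivity to a 2-group convolution rather than the paper's actual two-GPU code:

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = layers.Input(shape=(27, 27, 96))                        # e.g. the normalized CONV1 output
top, bottom = tf.split(inputs, num_or_size_splits=2, axis=-1)    # 48 maps per "GPU"
top    = layers.Conv2D(128, 5, padding="same", activation="relu")(top)      # sees only its 48 maps
bottom = layers.Conv2D(128, 5, padding="same", activation="relu")(bottom)   # sees only its 48 maps
conv2  = layers.Concatenate(axis=-1)([top, bottom])              # 27 x 27 x 256, like CONV2
model = tf.keras.Model(inputs, conv2)
model.summary()   # each branch has 128 * (5*5*48 + 1) = 153,728 parameters
```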
• In the paper, they note that depth is really important: removing a single convolutional layer degrades performance.
Net | Backend | Weights |
---|---|---|
AlexNet | Tensorflow | Weights |
AlexNet | Caffe | Weights |
Let's build AlexNet in Keras ('tf' backend) and test it on the COCO dataset. We will develop all three methods and train the model with each. The three methods are:
1. Train from Scratch ( End2End )
2. Transfer Learning
3. Feature Extraction
I have explained here what the three methods mean and how each is done; a hedged sketch of methods 2 and 3 follows below.
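As a preview, here is a minimal sketch of transfer learning and feature extraction, reusing the hypothetical alexnet() builder from above (Keras does not ship pretrained AlexNet weights, so "alexnet_weights.h5" is a placeholder for a converted checkpoint, and the 80-class head is just an example for COCO's object categories):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

base = alexnet(num_classes=1000)                 # the builder sketched earlier
base.load_weights("alexnet_weights.h5")          # placeholder: a converted pretrained checkpoint

# 2. Transfer learning: freeze the pretrained "body", retrain a new classifier head
body = models.Model(base.input, base.layers[-3].output)           # FC7's 4096-d output
body.trainable = False
new_head = layers.Dense(80, activation="softmax")(body.output)    # e.g. 80 COCO categories
transfer_model = models.Model(body.input, new_head)

# 3. Feature extraction: run images through the frozen body and keep the 4096-d FC7 features
features = body.predict(tf.random.uniform((4, 227, 227, 3)))      # (4, 4096)
```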
• Approach: will update soon. Training ...
RESULTS:
CS231n : Lecture 9 | CNN Architectures - AlexNet
If you find something amiss, don't forget to share it with us. Create an issue and let us know.