This repository contains Homework 1 for the Artificial Intelligence Safety course at Zhejiang University.
Instructor: Jie Song
Requirements:
- Creating your own GitHub account.
- Implementing your own deep neural network (in PyTorch, PaddlePaddle, ...).
- Training it on CIFAR10.
- Tuning a hyper-parameter and analyzing its effects on performance.
- Writing a README.md to report your findings.
The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.
The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.
Environment:
- CPU: E5-2630L v3
- GPU: RTX3060
- RAM: 30GB
- VRAM: 12GB
Dependencies:
- python3
- pytorch
- numpy
- matplotlib
- tqdm
I use the model from the PyTorch tutorial for CIFAR-10. It contains two convolutional layers, one pooling layer (applied after each convolution), and three fully-connected layers.
```python
import torch
import torch.nn as nn


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(nn.functional.relu(self.conv1(x)))
        x = self.pool(nn.functional.relu(self.conv2(x)))
        x = torch.flatten(x, 1)  # flatten all dimensions except batch
        x = nn.functional.relu(self.fc1(x))
        x = nn.functional.relu(self.fc2(x))
        x = self.fc3(x)
        return x
```
Cross-entropy is chosen as the loss function for model training.
First I used SGD (stochastic gradient descent) as the optimizer, with the learning rate set to 0.01 and the batch size set to 128.
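A minimal sketch of this setup, condensed from the notebook (the epoch count and data-loading details here are assumptions):

```python
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms

train_set = torchvision.datasets.CIFAR10(root='./data', train=True, download=True,
                                         transform=transforms.ToTensor())
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128,
                                           shuffle=True, num_workers=2)

model = Net()
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

for epoch in range(100):  # the exact epoch count is an assumption
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), labels)
        loss.backward()
        optimizer.step()
```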
The test set is referred to as the validation set here.
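Acc@1 (top-1 accuracy) is the fraction of test images whose highest-scoring class matches the label. A sketch of the evaluation loop (`test_loader` is assumed to be built like `train_loader`, but over the test split):

```python
model.eval()
correct, total = 0, 0
with torch.no_grad():
    for inputs, labels in test_loader:
        preds = model(inputs).argmax(dim=1)  # class with the highest logit
        correct += (preds == labels).sum().item()
        total += labels.size(0)
print(f'acc@1: {100 * correct / total:.2f}%')
```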
The acc@1 on the test set is 67.04%.
The learning curve is shown below.
As the number of epochs increases, the training loss decreases continuously, but the test loss first decreases and then increases. Performance is best at around epoch 70: before that the model is under-fitting, and after that it over-fits. As training goes on, the effective complexity of the model increases, so over-fitting easily occurs when the model is over-trained.
The number of epochs is thus a hyper-parameter in its own right; too many epochs can lead to over-fitting. Methods such as early stopping and weight decay can be adopted to avoid this, as sketched below.
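For instance, early stopping might look like the following sketch (the patience value and the `train_one_epoch`/`evaluate` helpers are hypothetical):

```python
best_loss, patience, bad_epochs = float('inf'), 10, 0
for epoch in range(200):  # generous upper bound on epochs
    train_one_epoch(model, train_loader, optimizer, criterion)  # hypothetical helper
    val_loss = evaluate(model, test_loader, criterion)          # hypothetical helper
    if val_loss < best_loss:
        best_loss, bad_epochs = val_loss, 0
        torch.save(model.state_dict(), 'best.pt')  # keep the best checkpoint
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # no improvement for `patience` epochs
```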
The initial learning rate was set to a very small value (1e-5) and then increased exponentially (by a factor of 1.05) at each step. I trained for 236 steps (1.05^236 × 1e-5 ≈ 1.0), and the training loss first decreases and then shoots up, tracing a hook-shaped curve.
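A sketch of this range test using PyTorch's `ExponentialLR` scheduler (names follow the training setup above):

```python
optimizer = optim.SGD(model.parameters(), lr=1e-5)
scheduler = optim.lr_scheduler.ExponentialLR(optimizer, gamma=1.05)

lrs, losses = [], []
for step, (inputs, labels) in zip(range(236), train_loader):
    optimizer.zero_grad()
    loss = criterion(model(inputs), labels)
    loss.backward()
    optimizer.step()
    lrs.append(scheduler.get_last_lr()[0])
    losses.append(loss.item())
    scheduler.step()  # multiply the learning rate by 1.05
```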
I chose a learning rate of 0.1 and started training again (it is usually best to pick the learning rate at which the loss decreases fastest).
The acc@1 on the test set is 61.62%. The learning curve is shown below.
Compared with the run at a learning rate of 0.01, convergence is much faster: the optimum is reached within about 10 epochs. However, the final accuracy is lower.
A small learning rate slows down convergence, while a large learning rate can make it hard for the model to reach the optimum.
Momentum is a refinement of gradient descent: each update step is a decaying sum of all past gradients, which helps the optimizer escape critical points (saddle points or local minima) and accelerates convergence. Weight decay is L2 regularization, which reduces over-fitting.
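For reference, one SGD step with both options looks roughly like this (a simplified view of PyTorch's update rule, with dampening and Nesterov momentum disabled):

```python
# g = grad + weight_decay * p   # weight decay: L2 penalty folded into the gradient
# v = momentum * v + g          # velocity: a decaying sum of past gradients
# p = p - lr * v                # the update follows the velocity, not just g
```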
I updated the optimizer parameters (`momentum` and `weight_decay` here are also hyper-parameters) and started training again.
```python
optimizer = optim.SGD(model.parameters(), lr=LEARNING_RATE,
                      momentum=0.9, weight_decay=5e-4)
```
The results are as follows; the acc@1 on the test set is 61.45%.
Compared with the previous results, convergence is faster (thanks to momentum), and over-fitting is alleviated (thanks to weight decay).
Then I changed the optimizer to Adam, which combines RMSprop and momentum.
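Switching is a one-line change (the learning rate below is an assumption; 1e-3 is Adam's common default):

```python
optimizer = optim.Adam(model.parameters(), lr=1e-3)
```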
The acc@1 on the test set is 65.60%.
I also tried different batch sizes.
| Batch size | Acc@1 | Time |
|---|---|---|
| 32 | 64.77% | 1389s |
| 64 | 64.76% | 845s |
| 128 | 66.37% | 552s |
| 256 | 64.54% | 491s |
In theory, a smaller batch size gives noisier gradients, which can aid optimization and generalization, while a larger batch size shortens each epoch and speeds up training.
In this case, the batch size has little effect on test accuracy; a possible reason is that all runs converged to a similar optimum.
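The sweep itself is a small loop over data loaders; a sketch (the `train_model` helper is hypothetical, and each run re-initializes the model so timings are comparable):

```python
import time

for bs in [32, 64, 128, 256]:
    loader = torch.utils.data.DataLoader(train_set, batch_size=bs,
                                         shuffle=True, num_workers=2)
    net = Net()  # fresh model for each run
    opt = optim.Adam(net.parameters(), lr=1e-3)
    start = time.time()
    train_model(net, loader, opt, criterion)  # hypothetical helper: full training run
    print(f'batch size {bs}: {time.time() - start:.0f}s')
```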
Overall, hyper-parameter tuning brings only modest gains here. Both training and test losses remain relatively high, which points to model bias: the model itself is not expressive enough.
Next, I used some more complex models.
First, I implemented my own CNN, mainly by increasing the output channels of the convolutional layers and adding batch normalization.
```python
class MyCNN(nn.Module):
    def __init__(self):
        super(MyCNN, self).__init__()
        # The arguments for commonly used modules:
        # torch.nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding)
        # torch.nn.MaxPool2d(kernel_size, stride, padding)
        # input image size: [3, 32, 32]
        self.cnn_layers = nn.Sequential(
            nn.Conv2d(3, 64, 3, 1, 1),    # input: 32*32*3, output: 32*32*64
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2, 2, 0),        # input: 32*32*64, output: 16*16*64
            nn.Conv2d(64, 128, 3, 1, 1),  # input: 16*16*64, output: 16*16*128
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.MaxPool2d(2, 2, 0),        # input: 16*16*128, output: 8*8*128
        )
        self.fc_layers = nn.Sequential(
            nn.Linear(128 * 8 * 8, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, 10)
        )

    def forward(self, x):
        # input (x): [batch_size, 3, 32, 32]
        # output: [batch_size, 10]
        # Extract features with the convolutional layers.
        x = self.cnn_layers(x)
        # The extracted feature map must be flattened before the fully-connected layers.
        x = x.flatten(1)
        # The fully-connected layers transform the features into the final logits.
        x = self.fc_layers(x)
        return x
```
The acc@1 on the test set is 76.37%, better than before.
I continued to increase the complexity of the model.
```python
class MyCNN(nn.Module):
    def __init__(self):
        super(MyCNN, self).__init__()
        # The arguments for commonly used modules:
        # torch.nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding)
        # torch.nn.MaxPool2d(kernel_size, stride, padding)
        # input image size: [3, 32, 32]
        self.cnn_layers = nn.Sequential(
            nn.Conv2d(3, 64, 3, 1, 1),    # input: 32*32*3, output: 32*32*64
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2, 2, 0),        # input: 32*32*64, output: 16*16*64
            nn.Conv2d(64, 128, 3, 1, 1),  # input: 16*16*64, output: 16*16*128
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.MaxPool2d(2, 2, 0),        # input: 16*16*128, output: 8*8*128
            nn.Conv2d(128, 256, 3, 1, 1), # input: 8*8*128, output: 8*8*256
            nn.BatchNorm2d(256),
            nn.ReLU(),
            nn.MaxPool2d(2, 2, 0),        # output: 4*4*256
        )
        self.fc_layers = nn.Sequential(
            nn.Linear(256 * 4 * 4, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, 10)
        )

    def forward(self, x):
        # input (x): [batch_size, 3, 32, 32]
        # output: [batch_size, 10]
        # Extract features with the convolutional layers.
        x = self.cnn_layers(x)
        # The extracted feature map must be flattened before the fully-connected layers.
        x = x.flatten(1)
        # The fully-connected layers transform the features into the final logits.
        x = self.fc_layers(x)
        return x
```
The accuracy improved to 80.66%.
I also tried other CNN models.
- ResNet-18
- ResNet-50
- ResNet-152
- VGG-19 (failed)
- VGG-19 with batch normalization
- Densenet-121
- Densenet-201
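I assume these were instantiated from `torchvision.models` with the classifier head sized for CIFAR-10's 10 classes; a sketch:

```python
import torchvision.models as models

model = models.resnet18(num_classes=10)
# Alternatives tried above:
# models.resnet50(num_classes=10), models.resnet152(num_classes=10)
# models.vgg19(num_classes=10), models.vgg19_bn(num_classes=10)
# models.densenet121(num_classes=10), models.densenet201(num_classes=10)
```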
| Model | Acc@1 | Time |
|---|---|---|
| My CNN | 80.66% | 679s |
| ResNet-18 | 76.64% | 1766s |
| ResNet-50 | 78.59% | 3859s |
| ResNet-152 | 77.95% | 8140s |
| VGG-19 with BN | 85.43% | 4990s |
| Densenet-121 | 80.03% | 7273s |
| Densenet-201 | 79.73% | 10902s |
I didn't retune the optimizer or its hyper-parameters for each model, so the comparison is only indicative.
The loss on the training data is small but the loss on the test data is large, which is probably over-fitting.
I used data augmentation to alleviate this problem.
```python
from torchvision import transforms

transform_train = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.4914, 0.4822, 0.4465],
                         std=[0.2023, 0.1994, 0.2010]),
])

transform_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.4914, 0.4822, 0.4465],
                         std=[0.2023, 0.1994, 0.2010]),
])
```
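These transforms are passed to the CIFAR-10 datasets when they are loaded (the `root` path is an assumption):

```python
from torchvision import datasets

train_set = datasets.CIFAR10(root='./data', train=True, download=True,
                             transform=transform_train)
test_set = datasets.CIFAR10(root='./data', train=False, download=True,
                            transform=transform_test)
```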
I then retrained the same models on the augmented data:
- My CNN
- ResNet-18
- ResNet-50
- ResNet-152
- VGG-19 (failed)
- VGG-19 with batch normalization
- Densenet-121
- Densenet-201
| Model | Acc@1 | Time |
|---|---|---|
| My CNN | 86.81% | 692s |
| ResNet-18 | 84.90% | 3597s |
| ResNet-50 | 85.67% | 7803s |
| ResNet-152 | 86.02% | 19065s |
| VGG-19 with BN | 89.49% | 10036s |
| Densenet-121 | 85.86% | 13772s |
| Densenet-201 | 86.44% | 22602s |
As before, I didn't retune the optimizer per model, so the comparison is only indicative.
We can see that acc@1 on the test data improves across the board; data augmentation does alleviate the problem of over-fitting.
For more details, see `ZJU Artificial Intelligence Safety HW01.ipynb`.